Parallel performance

liu · August 29, 2006, 22:02

I have a case of about 1M cells. I run it in parallel on 32 partitions (16 nodes x 2 cores = 32).
I want to simulate a process which takes about 1800 s in reality.
I ran the case on the supercomputer for 12 hours. It only simulated about 600 s.

One thing I noticed is the time information in the log file:
at time step n:
ExecutionTime = 25185.6 s ClockTime = 43755 s
at time step n+1:
ExecutionTime = 25194.1 s ClockTime = 43769 s

The ClockTime is almost twice the ExecutionTime. Does ExecutionTime mean CPU time, and ClockTime mean CPU time plus communication time between nodes? Does it mean the program spent a lot of time just waiting for data transfer?

liu · August 29, 2006, 22:07

So time step n+1 itself took:
ExecutionTime = 8.5 s ClockTime = 14 s
So about 5.5 s is spent on communication?
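
The per-step figures above come from subtracting consecutive cumulative log entries. A minimal sketch of that bookkeeping follows, assuming ExecutionTime is cumulative CPU time and ClockTime is cumulative wall-clock time, as the thread takes them to be. The function names are made up for illustration; only the log line format comes from the post.

```python
import re

# Timing line printed at each time step, e.g.
# "ExecutionTime = 25185.6 s ClockTime = 43755 s"
TIMING = re.compile(
    r"ExecutionTime\s*=\s*([\d.]+)\s*s\s+ClockTime\s*=\s*([\d.]+)\s*s"
)

def parse_timings(log_text):
    """Return all (execution_time, clock_time) pairs found in the log text."""
    return [(float(e), float(c)) for e, c in TIMING.findall(log_text)]

def step_costs(timings):
    """Per-step (cpu_s, wall_s, cpu/wall) from consecutive cumulative entries."""
    costs = []
    for (e0, c0), (e1, c1) in zip(timings, timings[1:]):
        cpu, wall = e1 - e0, c1 - c0
        ratio = cpu / wall if wall > 0 else float("nan")
        costs.append((cpu, wall, ratio))
    return costs

# The two log lines quoted in the first post.
log = """
ExecutionTime = 25185.6 s ClockTime = 43755 s
ExecutionTime = 25194.1 s ClockTime = 43769 s
"""

for cpu, wall, ratio in step_costs(parse_timings(log)):
    print(f"cpu={cpu:.1f}s wall={wall:.1f}s waiting={wall - cpu:.1f}s ratio={ratio:.2f}")
```

For the two lines quoted, this gives roughly 8.5 s of CPU time in 14 s of wall time, a ratio of about 0.61. Running it over the whole log shows whether the ratio is steady or degrades over the run.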

mattijs · August 30, 2006, 03:16

The 'missing' time is probably spent waiting for communication. This is due to imperfect load balancing plus the pure communication time and latency. What interconnect do you have?

olwi · August 30, 2006, 07:00

It's a common problem: you need a lot of cells in each partition for the CPUs to spend more time iterating than waiting... From my experience using Fluent on a 16-node (32-CPU) cluster, you should have 100,000 to 200,000 cells in each partition to get decent parallel efficiency. I'm rather surprised it went so "well" for you! (We have ordinary Gigabit Ethernet; with a high-speed interconnect it would be better, but you still lose a lot.)

/Ola
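
To put numbers on this rule of thumb for the original 1M-cell case, here is a rough sketch. The 100,000 to 200,000 range is Ola's experience with Fluent on Gigabit Ethernet, not a hard limit, and the helper names are hypothetical.

```python
def cells_per_partition(total_cells, partitions):
    """Average number of cells each partition has to work on."""
    return total_cells / partitions

def max_partitions(total_cells, min_cells_per_partition):
    """Largest partition count that still keeps the per-partition minimum."""
    return max(1, total_cells // min_cells_per_partition)

total = 1_000_000  # case size from the first post

print(cells_per_partition(total, 32))   # 31250.0 cells per partition on 32 cores
print(max_partitions(total, 100_000))   # 10 partitions at the lower bound
print(max_partitions(total, 200_000))   # 5 partitions at the upper bound
```

By this estimate the 32-way run has about 31,000 cells per partition, well under the suggested range, which fits the low ExecutionTime/ClockTime ratio reported above.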

hsieh · October 13, 2006, 14:34

Hi, Xiaofeng,

What computer were you running the parallel case on?
I recently tested my cluster (2 dual-CPU workstations + 2 dual-core workstations). When I used all CPUs/cores, that is a total of 8, I got 45% - 50% efficiency (executionTime/ClockTime). When I used only 1 CPU (or 1 core) from each workstation, I got 65% - 70% efficiency. However, executionTime in the 4-CPU run is longer than in the 8-CPU/core case. So, in real time, the 8-CPU run is still "slightly" faster than the 4-CPU run.

I was told that even 70% efficiency is not good. Each workstation has one gigabit NIC connected to a Linksys gigabit switch (SD2008) which supports non-blocking operation and jumbo frames. I might want to try out GAMMA. But is there a way to improve efficiency without GAMMA? What is the typical parallel efficiency people get? Any suggestions?

pei

hsieh · October 14, 2006, 07:07

Hi,

I was looking at the benchmark results posted on the OpenFOAM wiki. I noticed that for the interFoam case (case #4), when run on the Waltons cluster, the 3-CPU run and the 4-CPU run were actually 50% slower than the serial run (1 CPU). Is this real?

pei

gschaider · October 15, 2006, 16:06

Hi Pei!
(about case #4 on the Wiki)
Yep. I'm afraid so. The case is just too small (18 MB according to the table at the top; I don't know how many cells right now). If you look at the other small cases on that machine, they don't scale that well either. (The network on that machine is partly to blame, but not totally.)

hsieh · October 16, 2006, 15:36

Hi, Bernhard,

How is memory determined? The case I am testing is about 1,158,000 hex cells. I am trying to find out what could be the cause(s) of the low executionTime/ClockTime ratio.

I ran a case on a dual-core AMD workstation, and the executionTime/ClockTime ratio is about 1. However, the speed-up is only about 1.3. This could be due to both cores accessing the same memory bus. I am hoping to improve the executionTime/ClockTime ratio. Any suggestions?

Pei
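
Two different measures are being used in this thread: the executionTime/ClockTime ratio of a single run, and the speed-up relative to a serial run. A small sketch of how they differ follows. The function names are illustrative, and the definitions are the usual textbook ones rather than anything specific to OpenFOAM.

```python
def cpu_to_wall_ratio(execution_time, clock_time):
    """Share of wall-clock time the process spent on the CPU (one run)."""
    return execution_time / clock_time

def speedup(serial_wall, parallel_wall):
    """How many times faster the parallel run finishes than the serial run."""
    return serial_wall / parallel_wall

def parallel_efficiency(speedup_value, n_procs):
    """Speed-up per processor; 1.0 would be ideal scaling."""
    return speedup_value / n_procs

# Dual-core workstation from the post: ratio about 1, speed-up about 1.3.
print(parallel_efficiency(1.3, 2))     # 0.65

# Cluster step from the first post: 8.5 s CPU in 14 s wall.
print(round(cpu_to_wall_ratio(8.5, 14.0), 2))
```

A ratio near 1 with a speed-up of only 1.3 means the cores are busy but each is getting less done than in the serial run, which is consistent with (though not proof of) the shared memory bus explanation above.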

gschaider · October 17, 2006, 10:04

Hi Pei!

@memory usage: For the benchmark cases the memory usage was "measured" by getting the amount of resident memory every 5 seconds from the operating system and reporting the maximum value that occurred during the benchmark.

In general I think the rule of thumb is approx. 800 bytes/cell (double precision). More if you use additional models.

Bernhard
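
Applied to the 1,158,000-cell case mentioned above, the 800 bytes/cell rule of thumb can be worked out as below. This is a back-of-the-envelope sketch only: the constant is Bernhard's approximate figure for double precision, and extra models would push it higher.

```python
BYTES_PER_CELL = 800  # approximate rule of thumb from the post (double precision)

def estimate_memory_bytes(n_cells, bytes_per_cell=BYTES_PER_CELL):
    """Rough total memory footprint of a case, in bytes."""
    return n_cells * bytes_per_cell

total = estimate_memory_bytes(1_158_000)
print(f"total: {total / 1e6:.1f} MB")          # total: 926.4 MB

# Split evenly over 8 processes, ignoring any per-process overhead.
print(f"per process: {total / 8 / 1e6:.1f} MB")  # per process: 115.8 MB
```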

