Problem with parallelization speedup using many CPUs

#1 | January 19, 2009, 06:01 | Martin Aunskjaer (aunola), Member, Denmark
It seems you have a bandwidth problem on your interconnect. One possible explanation is that your case is too small to run efficiently in parallel on more than 2 CPUs. Small cases require low-latency interconnects because of the heavy network traffic. I am not familiar with the benchmark case, but my suggestion is to make it larger. Keep us posted.

#2 | January 19, 2009, 06:13 | Andreas Håkansson (andreas_haakansson), New Member, Lund, Sweden
Thanks for the reply, I will try increasing the size of the case and report what happens.

/Andreas

#3 | January 19, 2009, 06:22 | Velan (velan), Member, India
Hi

Why not try a smaller case, with grid sizes of
* 16*16*16
* 16*128*128

Decompose the grid in the x direction and, if possible, compare the results like this:

nproc  realtime
1      ?? secs
2      ??
4      ??

and do the same for the second case (128 cells in the x direction).

I have seen this issue (bandwidth) differ between AMD and Intel; I will post the results later.

#4 | January 19, 2009, 12:32 | Andreas Håkansson (andreas_haakansson), New Member, Lund, Sweden
Martin: I have now tried expanding the case from its original 12 000 cells to about 300 000 cells. I ran it for 2 h on 1, 2 and 4 CPUs to see what happens.
My results are:
1 CPU Clock=7222 Execution=7207.49 Speedup 1.00
2 CPU Clock=7187 Execution=3610.08 Speedup 1.00
4 CPU Clock=7187 Execution=1804.03 Speedup 0.99

This means that the execution/clock time ratio now drops even earlier (already at 2 CPUs) and I get no speedup even between 1 and 2 CPUs. When I try this on my PC there is a clear speedup of between 1.5 and 2.
I also wonder whether this is the result of slow networking; can it be fixed, or do I need to use another cluster?

Velan: Thanks for your reply, but I am not sure exactly what you mean. As I understand your post, you want me to check a very small case to see whether the problem has to do with low memory or something like that. I will try following your instructions and post the results tomorrow.

/Andreas

#5 | January 20, 2009, 06:53 | Andreas Håkansson (andreas_haakansson), New Member, Lund, Sweden
I have now asked around at the cluster support. The network uses Gigabit Ethernet (1000BASE-T) with a theoretical speed of 1000 Mbit/s, which according to support should mean that it transfers data at 125 MB/s between processors IF the switch is not overloaded. The time to initiate communication between nodes (latency) is 0.35 ms.
Do these numbers mean anything to anyone? My main concern is whether the cluster is fast enough...
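
One rough way to read the two figures quoted above is a simple cost model, message time = latency + size / bandwidth. The sketch below only uses the 0.35 ms and 125 MB/s values from the post; the patch size and the "one 8-byte value per face" payload are assumptions made for illustration, not something measured in this thread.

```python
# Figures quoted by cluster support in the post above.
LATENCY_S = 0.35e-3      # time to initiate a message: 0.35 ms
BANDWIDTH_BPS = 125e6    # 1000 Mbit/s = 125 MB/s, best case


def message_time(n_bytes):
    """Estimated delivery time of one message: fixed latency plus transfer time."""
    return LATENCY_S + n_bytes / BANDWIDTH_BPS


def crossover_bytes():
    """Message size at which the transfer time equals the latency."""
    return LATENCY_S * BANDWIDTH_BPS


# Assumption: one 8-byte value per face on a 16 x 16 processor patch.
patch_bytes = 16 * 16 * 8

print(f"crossover size: {crossover_bytes():.0f} bytes")
print(f"patch message : {patch_bytes} bytes, {message_time(patch_bytes) * 1e6:.1f} us")
print(f"latency share : {LATENCY_S / message_time(patch_bytes):.1%}")
```

Under this model any message much smaller than the crossover size (roughly 44 kB with these numbers) costs about the same as an empty one, so the number of messages matters more than their size.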

I have also tried the case suggested by Velan. The time simulated in a 2 h run decreases (i.e. the speedup is lower than 1) when going from 1 to 2 or 4 CPUs, even for this small case. The ratio between execution time and clock time also drops below 50% for 2 CPUs and becomes even lower for 4 CPUs.

For an oodles case with 16x16x16 cells:
CPUs  RelSimTime  ExecutionTime  ClockTime
1     1.000       6512           7226
2     0.733       3255           7249
4     0.532       1653           7235


Thanks in advance for any reply
Andreas

#6 | January 20, 2009, 07:04 | Christian Winkler (caw), Member, Mannheim, Germany
Some comments on scaling behavior:

Intel Harpertown and Woodcrest CPUs (quad-core Xeons) have a significant memory bottleneck between the cores and RAM. I work on a cluster with dual quad-core nodes connected by InfiniBand.
Result: only use half of the cores per node! If I use all cores, the ExecutionTime doubles! Otherwise scaling is good enough.

Regarding your problem: bandwidth is not that important for CFD (but it does not hurt); latency is the key.

You might want to use a low-latency MPI implementation for Gigabit Ethernet: Gamma
http://www.disi.unige.it/project/gam...mma/index.html

I haven't tried it myself, but they have performance data for OF 1.4 which look quite nice.

Have a look at OpenFOAM-1.5.x/src/Pstream/gamma

Best regards
Christian

#7 | January 20, 2009, 07:17 | Martin Aunskjaer (aunola), Member, Denmark
Can this be all latency induced? It appears that for N>1 CPUs, each executes for about the time one would expect but sits doing nothing for (N-1)/N of the total wall clock time.

As I understand it, this is a cluster of single-core, single-CPU nodes with a Gbit interconnect. This is exactly what I'm planning to invest in for OF, albeit for turbFoam. However, I suspect turbFoam and oodles only differ in the turbulence models, so this result is not at all encouraging.

I have no further ideas at this time, other than maybe trying other cases that use other sparse matrix solvers.
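
The "(N-1)/N idle" observation can be checked directly against the numbers Andreas posted for the 300 000-cell case. This is only a sketch of that arithmetic; the function and variable names are made up for the example.

```python
# (CPUs, ClockTime [s], ExecutionTime [s]) as posted for the 300 000-cell case.
runs = [(1, 7222, 7207.49), (2, 7187, 3610.08), (4, 7187, 1804.03)]


def busy_fraction(clock_s, execution_s):
    """Share of the wall clock time that a process spent computing."""
    return execution_s / clock_s


def report(runs):
    """Return (CPUs, measured busy fraction, 1/N) for each run."""
    return [(n, busy_fraction(clock_s, exec_s), 1.0 / n) for n, clock_s, exec_s in runs]


for n, busy, one_over_n in report(runs):
    print(f"{n} CPU(s): busy {busy:.3f}, 1/N = {one_over_n:.3f}, idle {1 - busy:.3f}")
```

A busy fraction that tracks 1/N this closely is what one would see if the processes were taking turns rather than waiting for messages of varying size, which is worth keeping in mind when reading the later replies.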

#8 | January 20, 2009, 09:25 | Andreas Håkansson (andreas_haakansson), New Member, Lund, Sweden
Thanks for your answers, I will look into the latency.
Martin: Yes, the cluster consists of 200 single-CPU AMD Opteron 148 nodes. I will report on any progress.

Regards
Andreas

#9 | January 20, 2009, 12:27 | Martin Aunskjaer (aunola), Member, Denmark
Having thought a bit about it, the fact that you see no difference using floatTransfer might indicate that this is not a bandwidth problem; rather it might indeed be a latency problem. You might want to examine your network performance.

Also, have a look at these threads for possible further assistance:

http://www.cfd-online.com/OpenFOAM_D...es/1/2970.html

http://www.cfd-online.com/OpenFOAM_D...es/1/5473.html

(posts from Sep 27 and onwards in the latter).

#10 | January 21, 2009, 10:32 | Carsten Thorenz (carsten), Member, Germany
Dear Andreas,

maybe I didn't understand your table:

> My results are then:
> 1 CPU Clock=7222 Execution=7207.49 Speedup 1.00
> 2 CPU Clock=7187 Execution=3610.08 Speedup 1.00
> 4 CPU Clock=7187 Execution=1804.03 Speedup 0.99

But this looks like a rather perfect speed-up to me?! 7207.49 s on 1 CPU, 3610.08 s on 2 CPUs, 1804.03 s on 4 CPUs. Perfect. Dumb question: are you sure that you're interpreting the numbers correctly?

Bye,

Carsten

#11 | January 21, 2009, 11:21 | Andreas Håkansson (andreas_haakansson), New Member, Lund, Sweden
@Martin:
Thanks for the info. I will continue to check out what I can do about the MPI and latency.

@Carsten:
Sorry, maybe my table was not very clear. What I did was to run each case for 2 h (giving a ClockTime of approx. 7200 s for all cases). Still, the 4 CPU case spends only 1804.03 s computing, while the 1 CPU case spends almost the whole 2 h computing. The remaining time is probably spent waiting for communication between the nodes. What I mean by the last column is that the different cases reach almost the same simulation time at the end of the 2 h.
So, as I see it, I would have a very good speedup if the nodes didn't spend so much time waiting, i.e. if ExecutionTime were equal to ClockTime, but it is not.
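
For anyone wanting to pull the two numbers out of a solver log instead of reading them by eye, here is a minimal sketch. It assumes the solver prints lines of the form "ExecutionTime = ... s  ClockTime = ... s", which is the usual OpenFOAM solver output but is not shown in this thread; the sample text is made up to carry the posted 2 CPU figures.

```python
import re

# Assumed line format; adjust the pattern if your solver prints it differently.
TIMES = re.compile(r"ExecutionTime\s*=\s*([0-9.eE+-]+)\s*s\s+ClockTime\s*=\s*([0-9.eE+-]+)\s*s")


def last_times(log_text):
    """Return (ExecutionTime, ClockTime) from the last matching line, or None."""
    matches = TIMES.findall(log_text)
    if not matches:
        return None
    execution_s, clock_s = matches[-1]
    return float(execution_s), float(clock_s)


# Hypothetical log excerpt, for illustration only.
sample = """\
Time = 0.001
ExecutionTime = 1.52 s  ClockTime = 3 s
Time = 0.002
ExecutionTime = 3610.08 s  ClockTime = 7187 s
"""

execution_s, clock_s = last_times(sample)
print(f"ExecutionTime/ClockTime = {execution_s / clock_s:.3f}")
```

A ratio close to 1 means the process was computing nearly all the time; a ratio close to 1/N for N processes is the pattern discussed in this thread.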

Regards
Andreas

#12 | January 21, 2009, 14:12 | Niels Gjoel Jacobsen (ngj), Senior Member, Rotterdam, The Netherlands
Hi Andreas

Somewhere on the Forum (I cannot recall where), I read that you need O(1e4) cells per processor to get reasonable results; otherwise the transfer of BCs from processor to processor eats all your time.

I have a rather pragmatic way of looking at it:

surface area / volume, i.e. the number of processor patch faces divided by the number of cells on each processor. If this is large (and the effort in solving the Poisson eq. is small), then you must expect a rather long standstill when running on multiple processors.

So I suggest you run the exact same case with 32 * 32 * 32 cells. For the 2 processor case you would get half the "surface area / volume", and thus hopefully better scaling.
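
The "surface area / volume" measure is easy to tabulate for the cubes mentioned here. The sketch below assumes a simple decomposition into equal slabs along x (as Velan suggested earlier) and a processor count that divides the grid size; the function name is invented for the example.

```python
def patch_faces_per_cell(n, nproc):
    """Worst-case processor-patch faces per cell for an n*n*n cube
    cut into nproc equal slabs along x (assumes nproc divides n)."""
    if nproc < 2:
        return 0.0  # a serial run has no processor patches
    cells_per_proc = n ** 3 / nproc
    # An interior slab has two neighbours; with only 2 slabs each has one.
    patches = 1 if nproc == 2 else 2
    return patches * n * n / cells_per_proc


for n in (16, 32):
    for nproc in (2, 4):
        ratio = patch_faces_per_cell(n, nproc)
        print(f"{n}^3 cells, {nproc} procs: {ratio:.4f} patch faces per cell")
```

For 2 processors the ratio works out to 2/n, so going from 16 to 32 cells per side halves it, as stated above.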

I am not at all an expert, but I hope it is helpful.

Best regards,

Niels

#13 | January 21, 2009, 17:09 | Ola Widlund (olwi), Member, Sweden
Hi,

I have no benchmark with OF, but we run Fluent (similar algorithm and MPI implementation) on a cluster with 50 nodes, 2 AMD Opterons per node, and Gbit Ethernet. Our rule of thumb is to have a minimum of 200-300 thousand cells per CPU, otherwise latency will kill us...

/Ola

#14 | January 22, 2009, 03:54 | Carsten Thorenz (carsten), Member, Germany
Hi Andreas, I don't want to sound stubborn, but I think you are misinterpreting your results, or your set-up is wrong, or I still don't understand your set-up.

What you're saying is that for a mesh of 16x16x16=4096 cells you have a speedup of

CPUs RelSimTime ExecutionTime ClockTime
1 1.000 6512 7226
2 0.733 3255 7249
4 0.532 1653 7235

Then, for 300000 cells you have a speedup of
1 CPU Clock=7222 Execution=7207.49 Speedup 1.00
2 CPU Clock=7187 Execution=3610.08 Speedup 1.00
4 CPU Clock=7187 Execution=1804.03 Speedup 0.99

By your interpretation, this means that the speed-up is worse for larger grids. In my experience this is very, very improbable.

So, I would try the following:

- use a test case that is big enough (1e6 cells if you have enough RAM; this reduces the impact of latency)
- adjust the test case so that it produces as little result output as possible (I/O can slow everything down)
- adjust "endTime" of the test case so that it runs ~30 min on 1 CPU (what did you mean by "giving ClockTime"?)
- execute all runs with the "time" command in front of the foam solver and in front of mpirun
(e.g. "time mpirun -np 4 -hostfile mymachines time /pathtofoam/mysolver -case mycase -parallel"; the syntax may differ for you)
- decompose and run it for 2 and 4 CPUs
- post the results

Maybe you can log in on each of your client nodes while the job is running and execute "top" on it. Check how much time is spent on your job, on system, on idle, and on wait. Activating "Sleeping in Function" in "top" (hit f y in top) may help to identify the culprit. As Martin already stated, latency can be a big issue on gigabit Ethernet networks. But I wouldn't expect it to be this severe.

Bye,

Carsten

#15 | January 22, 2009, 04:20 | Christian Winkler (caw), Member, Mannheim, Germany
Hi,

I have the feeling that your job is running on a single node. That would explain everything!

In detail: you have four nodes with a single CPU core in each node. Now you start your job with 4 processes. The idea is that your system distributes those 4 processes to the 4 physical CPU cores. If this distribution is not done properly, all 4 processes are assigned to ONE CPU core.

Then you get the behavior you see: each process uses up a fourth of the whole clock time.

I made this mistake once myself. It would be a bug in your job submission script or in the batch scheduling system.

So you should check whether each node is running your job. Just log in during the run using ssh and run "top".

Best regards
Christian

#16 | January 22, 2009, 04:26 | Christian Winkler (caw), Member, Mannheim, Germany
Some more comments on cell numbers per process and speedup:

I have some cases that I partitioned down to as few as 10,000 cells per CPU, and even they show reasonable speedup with OF (also with Fluent, but not as good as with OF).

But the bigger the chunks of the mesh per CPU, the better the scale-up, as stated before.

Best regards
Christian

#17 | January 22, 2009, 06:06 | Andreas Håkansson (andreas_haakansson), New Member, Lund, Sweden
Hi again,

@Carsten: I think I understand now what you mean. In each table I have normalized the simulation time to that of 1 CPU for comparison; the clock time is posted in order to compare it with the execution time. Looking at the actual numbers, the small case has a much longer simulation time.
I have made a 1M cell case with low output. I then set the maximum wall time to 2 h for all cases when submitting to the cluster and see how far they get in simulation time.
I have not used the "time" command. What is it for? Also see below.

@Christian: Maybe you are right; I also recently got an email from another user who suggested this. It would most definitely explain my low efficiency. My log files say that more than one CPU is used:
bench_doc_1.o693801
bench_doc_2.o693802
(I only have 1 and 2 CPUs so far, but for them I still have the problem)
but I do not know if they are really sharing the load. Cluster support has told me that they do not allow access to individual nodes, so I cannot go there directly and use "top" as I do on my own computer. It is probably possible to get the information from the running program, but I do not know how. Any suggestions would be very welcome.

Regards
Andreas

#18 | January 22, 2009, 06:33 | Christian Winkler (caw), Member, Mannheim, Germany
Ok, your Job is running on a single CPU :-))

Have a look at your log:

First it says:
...
names of assigned nodes dn209 dn208 echo
dn209 is main node
...

And then from OF:
[0] Date : Jan 22 2009
[0] Time : 08:25:34
[0] Host : dn209
[0] PID : 20938
[1] Date : Jan 22 2009
[1] Time : 08:25:34
[1] Host : dn209
[1] PID : 20939
[1] Root : /disk/global1/andhak/OpenFOAM/andhak-1.4.1/run
[0] Root : /disk/global1/andhak/OpenFOAM/andhak-1.4.1/run
[0] Case : myBench2/oodles_pitzDaily_2
[0] Nprocs : 2
[0] Slaves :
[0] 1
[0] (
[0] dn209.20939
[0] )
[0]
[1] Case : myBench2/oodles_pitzDaily_2
[1] Nprocs : 2



The Host process is on dn209 and the slave process is on dn209 as well!!!, but should be on dn208, right?
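
The same check can be scripted, which helps when node access is not allowed and the log is all you have. This is a small sketch that groups the "[rank] Host : name" lines of the header shown above by host; the function name is invented for the example, and the excerpt is copied from the log in this post.

```python
import re
from collections import defaultdict

# Matches header lines such as "[1] Host : dn209".
HOST_LINE = re.compile(r"^\[(\d+)\]\s+Host\s*:\s*(\S+)", re.MULTILINE)


def ranks_by_host(log_text):
    """Map each host name to the list of MPI ranks that reported it."""
    hosts = defaultdict(list)
    for rank, host in HOST_LINE.findall(log_text):
        hosts[host].append(int(rank))
    return dict(hosts)


log_excerpt = """\
[0] Date : Jan 22 2009
[0] Time : 08:25:34
[0] Host : dn209
[0] PID : 20938
[1] Date : Jan 22 2009
[1] Time : 08:25:34
[1] Host : dn209
[1] PID : 20939
"""

placement = ranks_by_host(log_excerpt)
for host, ranks in sorted(placement.items()):
    print(f"{host}: ranks {ranks}")
if len(placement) == 1 and sum(len(r) for r in placement.values()) > 1:
    print("All ranks share one host; check the job script or the scheduler.")
```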

Could you mail your Job script?

Best regards
Christian

#19 | January 22, 2009, 06:42 | Andreas Håkansson (andreas_haakansson), New Member, Lund, Sweden
Great, I hope you are right; then I have hopes of fixing this!
My submit script is:
bench2.scr

Regards
Andreas

#20 | January 22, 2009, 07:02 | Christian Winkler (caw), Member, Mannheim, Germany
Hi Andreas,

Try this command:

mpirun -np $nrnodes -hostfile $PBS_NODEFILE $solver $PBS_O_WORKDIR $casename -parallel

In my experience the -hostfile option is not actually required with Open MPI, but you should try it anyway. You are using LAM, right?
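
Before changing the mpirun line it may be worth confirming what the scheduler actually handed to the job. The sketch below assumes the file named by $PBS_NODEFILE lists one host name per assigned slot, one per line, which is the usual PBS/Torque behavior; the fallback text simply reuses the node names from the log earlier in the thread and is not real scheduler output.

```python
import os
from collections import Counter


def slots_per_host(nodefile_text):
    """Count how many slots each host has in a node file (one host name per line)."""
    return Counter(line.strip() for line in nodefile_text.splitlines() if line.strip())


def read_nodefile(env=os.environ):
    """Return the contents of the file named by PBS_NODEFILE, or None outside a PBS job."""
    path = env.get("PBS_NODEFILE")
    if not path:
        return None
    with open(path) as handle:
        return handle.read()


text = read_nodefile()
if text is None:
    # Illustration only: node names taken from the log posted above.
    text = "dn209\ndn208\n"

for host, slots in sorted(slots_per_host(text).items()):
    print(f"{host}: {slots} slot(s)")
```

If this lists both nodes while the solver log still shows every rank on one host, the node list is not reaching mpirun, which points at the launch line rather than at the scheduler.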

Best regards
Christian