CFD Online Discussion Forums - Problem with parallelization speedup using many CPUbs

CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)

- OpenFOAM Running, Solving & CFD (https://www.cfd-online.com/Forums/openfoam-solving/)

- - Problem with parallelization speedup using many CPUbs (https://www.cfd-online.com/Forums/openfoam-solving/58065-problem-parallelization-speedup-using-many-cpubs.html)

aunola

January 19, 2009 05:01

It seems you have a bandwidth

It seems you have a bandwidth problem on your interconnect. One possible explanation is that your case is too small for an efficient parallel run using more than 2 CPU's. Small cases require low latency interconnects due to heavy network traffic. I am not familiar with the benchmark case, but my sugesstion is to make it larger. Keep us posted.

andreas_haakansson

January 19, 2009 05:13

Thanks for the reply, I will t

Thanks for the reply, I will try increasing the size of the case and report what happens.

/Andreas

velan

January 19, 2009 05:22

Hi Why not you try smaller

Hi

Why not you try smaller case by grid size of
* 16*16*16
* 16*128*128

Decompose the grid in x direction, and if possible compare the results like

nproc realtime
1 ??secs
2 ??
4 ??

and samething for second case also(128cells in x-direction).

I found the problem between AMD vs INTEL on this issues(bandwidth), i will post the results later.

andreas_haakansson

January 19, 2009 11:32

Martin: I have now tried expan

Martin: I have now tried expanding the case from its original 12 000 cells to about 300 000 cells. I have run this in 2h on 1, 2 and 4 CPU's to see what happens.
My results are then:
1 CPU Clock=7222 Execution=7207.49 Speedup 1.00
2 CPU Clock=7187 Execution=3610.08 Speedup 1.00
4 CPU Clock=7187 Execution=1804.03 Speedup 0.99

This means that the execution/clock time ratio now drops even earlier (already at 2 CPU's) and I get no speedup even between 1 and 2 CPU's. When I try this on my PC there is a clear speedup between 1.5 and 2.
I also wonder if this is the result of slow networking; can it be fixed or do I need to use another cluster?

Velan: Thanks for your reply but i am not sure exactly what you mean. As I understand your posting you want me to check a very small case to see if the problem has to do with low memory or something like this, I will try following your instructions and post results tomorrow.

/Andreas

andreas_haakansson

January 20, 2009 05:53

I have now done some asking ar

I have now done some asking around at the cluster support. The network uses Gigabit Ethernet (1000Base-T) with a theoretical speed of 1000Mbit/s which according to support should imply that it transfers data at 125Mbtes/s between processors IF the switch is not overloaded. Time to initiate communication between nodes (latency) is 0.35 ms.
Does these numbers say anything to anyone? My main concern is if the cluster is fast enough...

I have also tried out the case suggested by Velan. The time simulated when run for 2h decreases (i.e. speedup lower then 1) when comparing 1 and 2 or 4 CPU's even for this small case. The ratio between execution time and clock time also decreases below 50% for 2 CPU's and becomes even lower for 4 CPU's.

For a oodles case with 16x16x16 cells:
CPUs RelSimTime ExecutionTime ClockTime
1 1.000 6512 7226
2 0.733 3255 7249
4 0.532 1653 7235

Thanks in advance for any reply
Andreas

caw	January 20, 2009 06:04

Some comments on scaling behav

Some comments on scaling behavior:

Intel harpertown and woodcrest cpus (quadcore xeons) have a significant memory bottleneck between cores and RAM. I work on a cluster with dual quadcore nodes connected by infiniband.
Result: only use half of the cores per node! If i use all cores the ExecutionTime doubles! Otherwise scaling is good enough.

Regarding your problem: bandwidth is not that important for cfd (but i does not hurt), latency is the key.

You might want to use a low latency mpi implementation for gigabit: Gamma
http://www.disi.unige.it/project/gam...mma/index.html

I havent tried it myself, but they have performance data for OF 1.4 which looks quite nice.

have a look at OpenFOAM-1.5.x/src/Pstream/gamma

Best regards
Christian

aunola

January 20, 2009 06:17

Can this be all latency induce

Can this be all latency induced? It appears that for N>1 CPUs, each executes for about the time one would expect but sits doing nothing for (N-1)/N of the total wall clock time.

As I understand it, this is a cluster of single-core, single-CPU nodes with a Gbit interconnect. This is exactly what I'm planning to invest in for OF, albeit for turbFoam. However, I suspect turbFoam and oodles only differ in the turbulence models, so this result is not at all encouraging.

I have no further ideas at this time, other than maybe trying other cases that use other sparse matrix solvers.

andreas_haakansson

January 20, 2009 08:25

Thanks for your answers, I wil

Thanks for your answers, I will check this with the latency.
Martin: Yes the cluster consists of 200 AMD Opteron 148 single-cpu nodes. I will report on any progress.

Regards
Andreas

aunola

January 20, 2009 11:27

Having thought a bit about it,

Having thought a bit about it, the fact that you see no difference using floatTransfer might indicate that this is not a bandwidth problem; rather it might indeed be a latency problem. You might want to examine your network performance.

Also, have a look at these threads for possible further assistance:

http://www.cfd-online.com/OpenFOAM_D...es/1/2970.html

http://www.cfd-online.com/OpenFOAM_D...es/1/5473.html

(posts from Sep 27 and onwards in the latter).

carsten

January 21, 2009 09:32

Dear Andreas, maybe I didn'

Dear Andreas,

maybe I didn't understand your table:

> My results are then:
> 1 CPU Clock=7222 Execution=7207.49 Speedup 1.00
> 2 CPU Clock=7187 Execution=3610.08 Speedup 1.00
> 4 CPU Clock=7187 Execution=1804.03 Speedup 0.99

But this looks like rather perfect speed-up for me?! 7207.4s on 1 CPU, 3610.08s on 2 CPUS, 1804.03s on 4 CPUs. Perfect. Dumb question: You're sure that you're interpreting the numbers correctly?

Bye,

Carsten

andreas_haakansson

January 21, 2009 10:21

@Martin: Thanks for the info.

@Martin:
Thanks for the info. I will continue to check out what I can do about the MPI and latency.

@Carsten:
Sorry, maybe my table was not very clear. What I did was to run each case for 2h (giving ClockTime approx= 7200 for all cases). Still the 4 CPU case only spends 1804.03s computing while the 1 CPU spends almost the whole 2h for computing. The remaining time is probably spent waiting for communication between the nodes. What I mean in the last column is that the different cases reaches almost the same simulation time at the end of the 2h.
So, as I see this I would have a very good speedup if the nodes didn't spend so much time waiting: if ExecutionTime where equal to clockTime, but they are not.

Regards
Andreas

ngj	January 21, 2009 13:12

Hi Andreas Somewhere on the

Hi Andreas

Somewhere on the Forum (cannot recall where), I read that you needed to have O(1e4) cells per processor to get reasonable results, otherwise transfer of BCs from processor to processor would eat all your time.

I have a rather pragmatic way of looking at it:

surface area / volume, i.e. number of processor patch faces divided by number of cells on each processor. If this is large (and the effort in solving the Poisson eq. is small), then you must expect a rather large still stand when running on multiple processors.

Thus I would like to suggest for you to run the exact same case just with 32 * 32 * 32. For the 2 processor case, you would get half of "surface area / volume", thus hopefully a better scaling.

I am not at all an expert, but I hope it is helpful.

Best regards,

Niels

olwi	January 21, 2009 16:09

Hi, I have no benchmark wit

Hi,

I have no benchmark with OF, but we run Fluent (similar algoritm and MPI impl.) on a cluster with 50 nodes, 2 AMD Opterons per node, and Gbit ethernet. Our rule of thumb is to have minimum 200-300 thousand cells per cpu, otherwise latency will kill us...

/Ola

carsten

January 22, 2009 02:54

Hi Andreas, I don't want to so

Hi Andreas, I don't want to sound stubborn, but I think you misinterprete your results or your set-up is wrong or I still don't understand your set-up.

What your're saying is that for a mesh of 16x16x16=4096 cells you have speedup of

CPUs RelSimTime ExecutionTime ClockTime
1 1.000 6512 7226
2 0.733 3255 7249
4 0.532 1653 7235

Then, for 300000 cells you have a speedup of
1 CPU Clock=7222 Execution=7207.49 Speedup 1.00
2 CPU Clock=7187 Execution=3610.08 Speedup 1.00
4 CPU Clock=7187 Execution=1804.03 Speedup 0.99

From your interpretation this means that the speed-up is worse for larger grids. This is very very improbable from my experience.

So, I would try the following:

- use a testcase that is big enough (1e6 cells if you have enough RAM, this reduces latency impact)
- adjust the testcase so that it does produce as little result output as possible (IO can slow everything down)
- adjust "endTime" of the testcase so that it runs ~30min on 1 CPU (what did you mean by "giving ClockTime"?)
- execute all runs with the "time" command in front of the foam-solver and in front of mpirun
(e.g. "time mpirun -np 4 -hostfile mymachines time /pathtofoam/mysolver -case mycase -parallel" , syntax may differ for you)
- decompose and run it for 2 and 4 CPUs.
- post the reults http://www.cfd-online.com/OpenFOAM_D...part/happy.gif

Maybe you can log in on each of your client nodes while the job is running and execute "top" on it. Check how much time is spend on your job, on system, on idle, on wait. Maybe activating "Sleeping in Function" in "top" may help (hit f y in top) to identify the culprit. As already stated by Martin, latency can be a big issue on gig-ethernet networks. But I wouldn't expect this to be so severe.

Bye,

Carsten

caw	January 22, 2009 03:20

Hi, i have the feeling that

Hi,

i have the feeling that your job is running on a single node. That would explain everything!

In detail: you have four nodes with a single CPU-Core in each node. Now you start your job mit 4 processes. The idea is, that your system distributes those 4 processes to the 4 physical CPU-Cores. If this distribution is not done properly, all 4 processes are assigned to ONE CPU-Core.

Then you get the behavior you see: each process uses up a fourth of the whole clocktime.

I made this mistake once myself. That would be a bug in you job submission script or in the batch scheduling system.

So you should check, if each node is running your job. Just log in during the run using ssh and do "top".

Best regards
Christian

caw	January 22, 2009 03:26

Some more comments on cell num

Some more comments on cell numbers per process and speedup:

I have some cases that i partitioned to have as low as 10.000 cells per CPU and even the show reasonable speedup with OF (also with Fluent, but not as good as OF).

But the bigger the chuncks of the mesh per CPU the better the scaleup, as stated before.

Best regards
Christian

andreas_haakansson

January 22, 2009 05:06

Hi again, @Carsten: I thin

Hi again,

@Carsten: I think I understand now what you mean. In each tabel I have standardized the simulation time to that of 1 CPU or comparison, clock time is posted in order to compare to execution time. If looking at actual numbers the small case have much longer simulation time.
I have made a 1M case with low output. Then I set the maximum wall time when submitting to the cluster at 2h for all cases and see how far they get in simulation time.
I have not used the "time" command. What is it for? Also see below.

@Christian: Maybe you are right, I also recently got an email from another user who suggested this. It would most definetly make sense of my low efficiency. My logg files say that more than one CPU is used:
http://www.cfd-online.com/OpenFOAM_D...hment_icon.gif bench_doc_1.o693801
http://www.cfd-online.com/OpenFOAM_D...hment_icon.gif bench_doc_2.o693802
(Only have 1 and 2 CPUs yet but for them I still have the problem)
but I do not know if they are really sharing the load. The cluster supports has told me that they do not allow access to single nodes so I can not go there directly and use "top" as I do one my own computer. But probably it would be possible to get the information from the running program, but I do not know how. Any suggestions would be very welcome.

Regards
Andreas

caw	January 22, 2009 05:33

Ok, your Job is running on a s

Ok, your Job is running on a single CPU :-))

Have a look at your log:

First it says:
...
names of assigned nodes dn209 dn208 echo
dn209 is main node
...

And then from OF:
[0] Date : Jan 22 2009
[0] Time : 08:25:34
[0] Host : dn209
[0] PID : 20938
[1] Date : Jan 22 2009
[1] Time : 08:25:34
[1] Host : dn209
[1] PID : 20939
[1] Root : /disk/global1/andhak/OpenFOAM/andhak-1.4.1/run
[0] Root : /disk/global1/andhak/OpenFOAM/andhak-1.4.1/run
[0] Case : myBench2/oodles_pitzDaily_2
[0] Nprocs : 2
[0] Slaves :
[0] 1
[0] (
[0] dn209.20939
[0] )
[0]
[1] Case : myBench2/oodles_pitzDaily_2
[1] Nprocs : 2

The Host process is on dn209 and the slave process is on dn209 as well!!!, but should be on dn208, right?

Could you mail your Job script?

Best regards
Christian

andreas_haakansson

January 22, 2009 05:42

Great, hope you are right, the

Great, hope you are right, then I have hopes to fix this!
My submitt script is:
http://www.cfd-online.com/OpenFOAM_D...hment_icon.gif bench2.scr

Regards
Andreas

caw	January 22, 2009 06:02

Hi Andreas, Try this comman

Hi Andreas,

Try this command:

mpirun -np $nrnodes -hostfile $PBS_NODEFILE $solver $PBS_O_WORKDIR $casename -parallel

In my experience the -hostfile option actually must not be used with Openmpi but you should try it anyways. You are using LAM, right?

Best regards
Christian

carsten

January 23, 2009 06:59

Hi Andreas! Yes, definitely

Hi Andreas!

Yes, definitely you're running on a single CPU. This explains the strange results.

About your job-script: I had similar problems when trying to run with the supplied OpenMPI and PBS. Do you use OpenMPI?

If yes, you should try to change your execution line to:

mpirun -hostfile m_nodes -np $nrnodes $solver $PBS_O_WORKDIR $casename -parallel

This delives the names of the "slaves" to mpirun.

Bye,

Carsten

All times are GMT -4. The time now is 23:52.