CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   OpenFOAM Running, Solving & CFD (https://www.cfd-online.com/Forums/openfoam-solving/)
-   -   Large case parallel efficiency (https://www.cfd-online.com/Forums/openfoam-solving/81051-large-case-parallel-efficiency.html)

lakeat October 14, 2010 11:08

Large case parallel efficiency
 
Hi foamers,

I'm going mad over my extremely slow parallel computing efficiency.
It is an unsteady 3D LES case: incompressible external flow.

When the mesh is around 2M cells it works fine; with 24 or 48 CPUs the performance looks decent. But when the mesh is around 9M cells and I try 96 or 128 CPUs, the simulation makes essentially no progress for a very long time (more than a week).

So, dear all: what are your ideas, suggestions, and experiences?

Any ideas and advice would be highly appreciated!

vinz October 14, 2010 11:14

The last computations I did, with about 8 million cells using a solver derived from rhoSimpleFoam, were running slower on 32 cores than on 16, so I decided to stick to 16.
However, I would be really happy to know how to improve this kind of behaviour...

Vincent

lakeat October 14, 2010 11:24

Quote:

Originally Posted by vinz (Post 279204)
The last computations I did, with about 8 million cells using a solver derived from rhoSimpleFoam, were running slower on 32 cores than on 16, so I decided to stick to 16.
However, I would be really happy to know how to improve this kind of behaviour...

Vincent

Yes, I ran into the same situation: increasing the number of CPUs decreased the speed, sometimes by a lot.

Canesin October 14, 2010 12:29

It is related to the need for communication...

Every new subdivision added to the domain means new synchronization is needed at the boundaries... The speed you gain is a compromise between the increase in computational power (more cores) and the increase in communication...
As you keep adding more cores, you need to know that the cost of communication will at some point overtake the gain in computational power...

What can be done ??

The first solution is to improve the communication: a better network, and bypassing the kernel. The kernel generates overhead in communication over TCP/IP, so you should use something like InfiniBand or Myrinet.

The second solution is to improve the locality of your problem: perhaps decrease the number of domain subdivisions per compute node, and then solve the linear system in parallel within each compute node...

Hope it helps..

Fábio C. Canesin

alberto October 15, 2010 00:06

Quote:

Originally Posted by lakeat (Post 279202)
Hi foamers,

I'm going mad over my extremely slow parallel computing efficiency.
It is an unsteady 3D LES case: incompressible external flow.

When the mesh is around 2M cells it works fine; with 24 or 48 CPUs the performance looks decent. But when the mesh is around 9M cells and I try 96 or 128 CPUs, the simulation makes essentially no progress for a very long time (more than a week).

So, dear all: what are your ideas, suggestions, and experiences?

Any ideas and advice would be highly appreciated!

It depends on a lot of factors, so at the very least you should say what method you used to decompose your case, what architecture you are running on (multicore? how are the CPUs/nodes laid out, i.e. how many cores per node?), and whether you are using the version of OpenMPI provided by OpenCFD or recompiled OpenFOAM against the system libraries. Also, what compiler did you use?

Best,

eugene October 18, 2010 07:52

We regularly run LES on large meshes with large numbers of CPUs with excellent speedup. Some things to keep in mind:

Beyond a certain number of CPUs, you need to move to InfiniBand or a similar interconnect. Gigabit Ethernet just won't hack it. Where the switch needs to occur depends on the case size, CPU speed and many other factors, but as a rule of thumb, I would say anything above 32 cores requires InfiniBand.

Decomposition matters. If you can use a simpler decomposition like hierarchical, do. Try to keep the number of processor boundaries to a minimum (within reason). I suggest you experiment with different decompositions like (16 2 1), (8 4 1), etc. It can make a really massive difference.
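
For reference, hierarchical decomposition is configured in system/decomposeParDict; a minimal sketch using the (16 2 1) example above on 32 cores (delta and order are the usual defaults):

Code:

numberOfSubdomains 32;

method          hierarchical;

hierarchicalCoeffs
{
    n           (16 2 1);
    delta       0.001;
    order       xyz;
}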

Check that the slow-down is not due to some kind of disk activity, NFS or similar bottleneck. If you have function objects or similar that read/write to disk a lot, or have your case on a slow disk, you might want to distribute your case so that each processor's data set/mesh is local to the node it is being used on. (Check the distributed keyword in decomposeParDict and the manual entry on decomposePar; a sketch follows below.)
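
For example, a minimal sketch of the relevant decomposeParDict entries (the scratch path is hypothetical, and the roots list needs one entry per slave processor):

Code:

distributed     yes;

roots
(
    "/local/scratch/case"
    "/local/scratch/case"
);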

If you have an InfiniBand network, you either have to relink Pstream against an MPI that supports the OFED hardware stack or recompile OpenMPI to support InfiniBand; otherwise your InfiniBand will be wasted.
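
A sketch of the OpenMPI rebuild (the prefix path is hypothetical and the exact flags depend on your OpenMPI version; --with-openib enables the OFED/InfiniBand transport in the 1.x series):

Code:

./configure --prefix=$HOME/openmpi-ib --with-openib
make -j4 && make install
# then point OpenFOAM's MPI settings (PINC/PLIBS) at this install
# and rebuild Pstream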

Hope this helps.

alberto October 18, 2010 14:49

Quote:

Originally Posted by eugene (Post 279610)
We regularly run LES on large meshes with large numbers of CPUs with excellent speedup.

Yes, same experience here with our LES on micro-reactors.

Quote:

Decomposition matters. If you can use a simpler decomposition like hierarchical, do.
Out of curiosity, what is your experience with scotch? We had interesting results with it too, but it would be helpful to know other experiences.

Best,

eugene October 18, 2010 16:44

Honestly, I haven't tried it. What I have read about scotch so far is that it produces decompositions similar to metis. For large numbers of CPUs, this kind of approach simply doesn't cut the mustard: you end up with too many processors connected to too many others, and parallel efficiency suffers.

Somewhere there is an optimum between the number of inter-processor connections and the number of processor faces. You can see this easily by comparing a hierarchical decomposition like (128 1 1) with (64 2 1) and (8 4 4). The best performance will be neither (128 1 1) nor (8 4 4): (128 1 1) has a very large (processor face)/cell ratio but the smallest number of (processor boundaries)/cell, while for most cases (8 4 4) will be at the other extreme. Both are a disaster in terms of scalability - I have seen (64 2 1) run twice as fast as (128 1 1), and (8 4 4) is even worse than (128 1 1). Extreme domain shapes probably influence the matrix solvers as well.

I must stress that this is all highly situational. If the number of CPUs is small, decomposition doesn't really matter. Cells/proc also affects scalability a lot.

Some kind of ultimate "self-optimising", hardware and algorithm aware decomposition would make a very cool Ph.D. project. At its simplest, you could just use dynamic load balancing techniques to optimise hierarchical decomposition coefficients at run time. Beyond this, you could look into profiling Pstream communication and developing decomposition methods that can be configured to perform best given a particular set of algorithms.
After working on the parallel hierarchical algorithm to allow snappyHexMesh to do dynamic load balancing, I was very interested in developing something like this. Unfortunately, it turned out to be rather difficult and there were more pressing matters to attend to. We can only hope that someone with more time, energy and bright ideas will come along to save us from the current crop of sub-optimal methods.

tehache October 19, 2010 07:31

There are tons of points... one other thing, perhaps trivial but not yet mentioned, that I just found out: MPICH on our cluster was not configured to use shared-memory communication, and was thus using the loopback device. I found that in some cases I can gain a lot of speed by switching to shared-memory communication. Don't know why, but a configuration without shared memory enabled seems to be the default in MPICH...
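
If I understand the build options correctly (an assumption on my part; check the documentation for your MPICH version), the difference is in the channel chosen at configure time:

Code:

# ch3:sock talks through sockets even between ranks on the same node;
# ch3:nemesis uses shared memory for intra-node messages
./configure --with-device=ch3:nemesis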

Simon Lapointe October 19, 2010 08:15

I've been running OpenFOAM on large meshes and large numbers of CPUs (up to 512) and the speedup was quite good. As has been mentioned earlier, an InfiniBand connection is necessary to achieve good performance on large parallel cases, and we've also found that linking Pstream against a system-compiled OpenMPI library supporting InfiniBand makes a huge difference.

Concerning the decomposition method, I've always used metis and obtained satisfactory results (my cases are mostly 3D airfoils). Eugene's post suggesting the use of hierarchical decomposition where possible seems interesting, and I might try it (along with scotch) in the near future.

I'm curious about the input of other members on this topic.

flavio_galeazzo October 21, 2010 05:01

My experience with large cases is very close to Simon's. I have run LES cases of up to 10 million nodes on up to 256 cores, with parallel efficiency around 85%, always using metis as the decomposition strategy. The machine has an InfiniBand interconnect, and I have compiled OpenFOAM against the system-compiled OpenMPI.

As for smaller cases, using grids of up to 2 million nodes on a Linux cluster with gigabit Ethernet, I got good scalability only up to 16-20 cores (4-5 machines).

alberto October 21, 2010 11:22

We have the same experience you had, Flavio, with our LES on micro-reactors (>= 10^6 cells) using metis/scotch (scotch actually seems slightly better, even if the difference with respect to metis is not dramatic).

Compiling against MPI libraries optimized for the architecture is key, of course.

andyj October 24, 2010 00:19

Hello
You might consider the Scalasca diagnostic toolset. I am unsure which HPC platforms are supported, but Cray XT and IBM Blue Gene are.
There is also KOJAK, the precursor to Scalasca, which runs on more systems. Both give exhaustive information on bottlenecks, problems and system performance, complete with screenshots/charts/logs.
http://www.fz-juelich.de/jsc/scalasca/overview/

Kojak:
http://www.fz-juelich.de/jsc/kojak/platforms/
Kojak Supported Platforms
•Instrumentation, Measurement, and Analysis
◦Linux IA-32, IA-64, and EM64T/x86_64 clusters with GNU, PGI, or Intel compilers
◦IBM Power3 / Power4 / Power5 / Power6 based clusters
◦SGI Mips based clusters (O2k, O3k)
◦SGI IA-64 based clusters (Altix)
◦SUN Solaris Sparc and x86 based clusters
◦DEC/HP Alpha based clusters
◦Generic UNIX workstation (clusters)
•Instrumentation and Measurement only
◦IBM BlueGene/L and BlueGene/P
◦Cray T3E, XD1 and X1, XT3, XT4
◦SiCortex
◦NEC SX
◦Hitachi SR-8000

I do not know anything about the learning curve or the install. It's at least worth a glance.



lakeat January 10, 2011 19:45

Good discussion, thank you all. I will experiment and keep you posted.

One of the major reminders for me is NFS write speed. I will try to distribute the data.

lakeat January 17, 2011 23:11

Quote:

Originally Posted by Simon Lapointe (Post 279773)
I've been running OpenFOAM on large meshes and large numbers of CPUs (up to 512) and the speedup was quite good. As has been mentioned earlier, an InfiniBand connection is necessary to achieve good performance on large parallel cases, and we've also found that linking Pstream against a system-compiled OpenMPI library supporting InfiniBand makes a huge difference.

Concerning the decomposition method, I've always used metis and obtained satisfactory results (my cases are mostly 3D airfoils). Eugene's post suggesting the use of hierarchical decomposition where possible seems interesting, and I might try it (along with scotch) in the near future.

I'm curious about the input of other members on this topic.

Dear Simon, could you tell me what you mean by "linking Pstream against the system compiled OpenMPI"?

Code:

PFLAGS = -DOMPI_SKIP_MPICXX
PINC   = -I$(MPI_ARCH_PATH)/include
PLIBS  = -L$(MPI_ARCH_PATH)/lib -lmpi

Is this setting ok, or what?

Thanks

alberto January 17, 2011 23:28

Also, take a look at the study presented at the Open Source CFD Conference 2010:

G. Shainer et al., OpenFOAM optimizations for Scale

It might give you some information of interest.

lakeat January 18, 2011 00:12

Gotcha, thanks. I am testing...

The problem, which I am not sure about, is this:
I see that in our high-performance center there are different nodes, and it seems not all of the compute nodes use InfiniBand. Some compute nodes are quite old. I am wondering if it is possible to apply for the NSF compute nodes at Illinois.


Also, concerning disk activity, I am not clear. My jobs are submitted via the SGE management system, and I do not have access rights to the compute nodes (i.e., ssh computing.node.XXX.edu does not work). So I am wondering: when you are using a job-management system like SGE, how do you set the "root" directories so that the data gets distributed?


See my PLIBS now,
Code:

[wei@opteron]$ echo $PLIBS
-pthread -L/afs/crc.edu/x86_64_linux/openmpi/1.3.2/gnu/lib -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl

Now, after re-setting PFLAGS, PINC, and PLIBS, I just ran wclean and recompiled the src/Pstream library. Is this okay?
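
Concretely, these are the commands I mean (assuming the standard wmake workflow):

Code:

cd $WM_PROJECT_DIR/src/Pstream
wclean all      # clean the existing builds
./Allwmake      # rebuild against the MPI given by the settings above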


Thanks,

lakeat January 18, 2011 01:01

Just noticed that they were using

Six-Core Intel X5670 @ 2.93 GHz CPUs

Memory: 24GB per node

OS: CentOS5U4, OFED 1.5.1 InfiniBand SW stack



While I'm kind of frustrated, since mine is:
32 HP DL160 G6 servers
Dual Quad-Core, 2.27 GHz L5520 Intel Nehalem nodes (8 cores per node, 256 total cores), 12 GB RAM each
or
393 HP DL165 G6 servers
Dual Six-Core 2.4 GHz AMD Opteron Model 2431 64/32 bit (12 cores per node, 4716 total cores), 12 GB RAM, 1 x 160 GB SATA Disk

So... yesterday's hardware.

lakeat March 4, 2011 14:21

Hello all,


Some findings and updates; I hope this can help non-professionals like me.

InfiniBand support is critical for speed: to run a mid-size case that needs many nodes, it is strongly advised to build the code against the InfiniBand libraries.


But I also have two more questions:
1. Usually, how many grid points do you guys allocate per CPU?
2. I am still not clear how you make hierarchical a better option than metis. Are you aware of any general rules, or do you have any experience showing it is clearly better than metis? If not, I'm going to stay with metis.

Thanks

flavio_galeazzo March 9, 2011 03:35

I have the same experience as you regarding InfiniBand, Daniel. It is crucial for getting good speed-up with more than 4-5 machines.

I normally allocate the nodes for a simulation aiming for 1 second per time step, which is a good value on the system I work with (similar to your "old" cluster). The number of grid points per node depends largely on the complexity of the solver: I can allocate more grid points with a less complex solver, say an incompressible LES, than with a complex reacting-flow solver.

lakeat March 16, 2011 09:52

One more question, concerning nCellsInCoarsestLevel: how should it be set?

sqrt(total number of cells), or 10~30, or what?
Is SAMG much better than GAMG? Any experience with that?
When will DICGaussSeidel be superior to GaussSeidel? Or are there even better options?

(Suppose this is an external incompressible flow with pisoFoam on a 10M-point hex mesh.)

Thanks

olivierG March 17, 2011 09:40

hello,

You can have a look at this interesting thread.
The answer is to keep a small value here (10-20).
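
For reference, this is the knob in question: a minimal sketch of a GAMG pressure-solver entry in system/fvSolution (the tolerances are just placeholders):

Code:

p
{
    solver          GAMG;
    tolerance       1e-06;
    relTol          0.01;
    smoother        GaussSeidel;
    cacheAgglomeration on;
    agglomerator    faceAreaPair;
    nCellsInCoarsestLevel 20;
    mergeLevels     1;
}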

Olivier

lakeat March 17, 2011 09:48

I have compared different nCellsInCoarsestLevel values for both a 5M-cell case and a 10M-cell case, and I could not draw any conclusion. For my 5M-cell case, 100 seems better than 10~20; for the 10M-cell case, I even found it good to set the value around 3000.
That's why I feel kind of confused...

arjun March 18, 2011 07:53

Quote:

Originally Posted by lakeat (Post 299874)
I have compared different nCellsInCoarsestLevel values for both a 5M-cell case and a 10M-cell case, and I could not draw any conclusion. For my 5M-cell case, 100 seems better than 10~20; for the 10M-cell case, I even found it good to set the value around 3000.
That's why I feel kind of confused...

A large number is usually (almost always) better.

Anyway, a good thing to come out of that thread is that I designed a new multigrid scheme that is very efficient. It exploits the fact that the size of the direct solver affects the solution rate.

Previously, bi-conjugate gradient (BiCG) preconditioned by AMG was the best combination I had tried. The timings with the new method are at least two times faster than the previous best. I will be testing it against W and F cycles in the coming days.

:-D

lakeat March 18, 2011 08:39

Quote:

Originally Posted by arjun (Post 300021)
A large number is usually (almost always) better.

Well, I found it is "not always": too large a number actually slows things down, so I am confused, not knowing what the optimal number is.
Quote:

Anyway, a good thing to come out of that thread is that I designed a new multigrid scheme that is very efficient. It exploits the fact that the size of the direct solver affects the solution rate.
Which one?
Quote:

Previously, bi-conjugate gradient (BiCG) preconditioned by AMG was the best combination I had tried.
You mean GAMG?
Quote:

The timings with the new method are at least two times faster than the previous best. I will be testing it against W and F cycles in the coming days.
I'd like to hear about that. I have heard of this technique for quite a while but am not sure how fast it is compared with GAMG, so could you please elaborate? Any experiences? Much appreciated.

lakeat March 18, 2011 08:49

Also, it seems scotch sometimes behaves better than metis... And I don't know how to tune the hierarchical method to find an optimal decomposition, so I give up.

arjun March 18, 2011 09:34

Quote:

Originally Posted by lakeat (Post 300032)
Well, I found it is "not always": too large a number actually slows things down, so I am confused, not knowing what the optimal number is.

Too large a direct-solver size takes more time. You said you tried 3000; for me, 3000 takes a lot of time, but "a lot" is relative: for a 10-million-cell case it may be negligible, while for a case with 10,000 cells it may be too much.

Quote:

Originally Posted by lakeat (Post 300032)
Which one?

Difficult to answer, since it is a new one.

Quote:

Originally Posted by lakeat (Post 300032)
You mean GAMG?


I'd like to hear about that. I have heard of this technique for quite a while but am not sure how fast it is compared with GAMG, so could you please elaborate? Any experiences? Much appreciated.

In 2007 there was a paper by Jasak where he tried BiCG with AMG as a preconditioner. (Or was it CG with AMG!!)

Anyway, the crux was that the solver where AMG used a W cycle was the fastest. Now, if you look at a W cycle, what it does is try to solve the problem at the coarser levels to better convergence than a V cycle does. With a direct solver, you do not iterate: you solve the coarse problem to machine precision.
In both cases you are essentially trying to do the same thing, that is, to solve the coarse problem to as high a convergence level as possible without taking much time.

What I improved is to replace the W cycle with a scheme that is more robust and better converging (on the coarser levels), and that takes similar or less time (but reaches much higher convergence).
So far I am able to see a speed-up, but I have yet to make serious tests and comparisons, especially against large sizes and W cycles.


PS: I am using my own C++ library, but the same scheme could be applied to any AMG implementation.

arjun March 18, 2011 09:38

Quote:

Originally Posted by lakeat (Post 300034)
Also, it seems scotch sometimes behaves better than metis... And I don't know how to tune the hierarchical method to find an optimal decomposition, so I give up.


Not related to OpenFOAM, but I wrote a small parallel data-exchange code that I am using. The good thing about it is that no matter how you partition, its parallel efficiency does not go down. So far, at least in my code, there is no partitioning issue. It is roughly 5 times faster than Fluent :-D (needless to say, I am happy and relaxing :-D).

lakeat March 18, 2011 09:48

Quote:

Too large a direct-solver size takes more time. You said you tried 3000; for me, 3000 takes a lot of time, but "a lot" is relative: for a 10-million-cell case it may be negligible, while for a case with 10,000 cells it may be too much.
Agreed; this is exactly what I have observed. I just hope there is a better way to find this number quickly, so we do not need to tune it over and over, because I think it is related to CPU capability, available RAM, simulation complexity, and also, just as you said, whether the cost is relatively negligible. So this confuses me; it would be nice if someone proposed a better and faster method for choosing this number. Please correct me if I am wrong.

Quote:

What I improved is to replace the W cycle with a scheme that is more robust and better converging (on the coarser levels), and that takes similar or less time (but reaches much higher convergence).
So far I am able to see a speed-up, but I have yet to make serious tests and comparisons, especially against large sizes and W cycles.

Interesting. When will you share your work with the public? :)
I also hope that the improvement is not just channel-flow friendly or box-turbulence friendly.
Thanks

lakeat March 18, 2011 09:58

Quote:

Originally Posted by flavio_galeazzo (Post 298554)
I normally allocate the nodes for a simulation aiming for 1 second per time step, which is a good value on the system I work with.

Hmmmmm, a serious question, because I have had no luck getting this "1 second per time step" after trying different combinations. So my question is this: theoretically, what is the best efficiency I can expect from parallel computing, given a single type of cluster architecture?


For my case I am not fully sure, but having tried different CPU counts, I found 4 s/time-step is the best I can get...

ANY "formula"?

lakeat March 18, 2011 09:59

Quote:

Originally Posted by arjun (Post 300047)
Not related to OpenFOAM, but I wrote a small parallel data-exchange code that I am using. The good thing about it is that no matter how you partition, its parallel efficiency does not go down. So far, at least in my code, there is no partitioning issue. It is roughly 5 times faster than Fluent :-D (needless to say, I am happy and relaxing :-D).

Wow, would you mind elaborating on this? lol

arjun March 18, 2011 15:34

Quote:

Originally Posted by lakeat (Post 300054)

Interesting. When will you share your work with the public? :)
I also hope that the improvement is not just channel-flow friendly or box-turbulence friendly.
Thanks

Nothing special: I replaced the W cycle by another BiCG solver that again uses a V-cycle AMG as preconditioner. It solves to very high convergence levels (a factor of 1e6), so now it is as if the direct-solver size were, say, 1 million cells or less. It is done on coarse enough levels that there is no penalty from the extra BiCG routine. In a way it is a poly-BiCG-AMG solver. It is now roughly 2 times faster than plain BiCG-AMG.

arjun March 18, 2011 15:42

Quote:

Originally Posted by lakeat (Post 300057)
Wow, would you mind elaborating on this? lol

I was writing an immersed-boundary code (SIMPLE algorithm), so it is a local-refinement type. I do not partition the mesh with metis or any other tool; I have written the grid generation so that it writes a mesh for each partition. This way one can handle meshes from, say, 100 million to 2-3 billion cells.
(I added a few things to the Fluent mesh format; if others follow it too, we could have a universal mesh format for parallel calculations. The additions are information related to the parallel interfaces.)

Anyway, the trick to efficiency is that when data is exchanged there shall be no collision, meaning that if a process is sending data to another process, no other process shall be sending data to it. The program takes care of this. That is why the data exchange remains highly efficient no matter how I partition.
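
Not arjun's code, of course, but for readers who want the flavour of such a schedule, here is a minimal sketch in plain MPI (assuming a power-of-two process count) of the classic collision-free pairing: in round r, partner = rank XOR r gives a perfect matching, so each process talks to exactly one partner at a time.

Code:

// Minimal sketch of a collision-free pairwise exchange (assumes the
// number of ranks is a power of two). Compile with an MPI C++ wrapper.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    std::vector<double> sendBuf(1024, rank), recvBuf(1024);

    for (int r = 1; r < p; ++r)
    {
        int partner = rank ^ r;  // unique pairing in round r

        // Each rank sends to and receives from exactly one partner,
        // so no rank is ever the target of two simultaneous sends.
        MPI_Sendrecv(sendBuf.data(), (int)sendBuf.size(), MPI_DOUBLE,
                     partner, r,
                     recvBuf.data(), (int)recvBuf.size(), MPI_DOUBLE,
                     partner, r,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}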

eugene April 7, 2011 05:51

Quote:

Originally Posted by lakeat (Post 297966)
Hello all,

2. I am still not clear how you make hierarchical a better option than metis. Are you aware of any general rules, or do you have any experience showing it is clearly better than metis? If not, I'm going to stay with metis.

Thanks

Daniel,

We have run some additional tests and found metis/scotch is in general better than hierarchical, which is different from what we thought previously. Hierarchical can be better under very specific circumstances where the cell distribution is favourable, but for cases with highly non-homogeneous cell density, metis/scotch has shown up to 20% better performance in some instances. The only problem for us at the moment is that our version of snappyHexMesh does not generally work with parMetis under 1.6 - we have not tried to debug this yet. I hope the parallel scotch in 1.8 will rectify this.

lakeat April 7, 2011 09:24

Haha, I tried hierarchical for many cases and got the impression that it is too tricky to use, so I gave up.

By the way, just curious: are you all able to make every time step run within one physical second, even when you are doing time-averaging operations?

flavio_galeazzo April 8, 2011 03:14

Quote:

Originally Posted by lakeat (Post 302653)
By the way, just curious: are you all able to make every time step run within one physical second, even when you are doing time-averaging operations?

I like to use the "1 second rule" for my simulations, but I can't follow it every time due to the available resources. But all my simulations have scaled well up to now, even with time averaging turned on, and I am confident that all of them could run within one physical second per time step. Of course, the system architecture has to be adequate. On a machine with InfiniBand, I got good speed-up (say more than 80%) for up to 256 processors on 64 nodes.

jdiorio May 17, 2011 14:30

A (more basic) question:

Does anyone have a recommended method to determine whether your system is performing optimally? I've run a simple case (the 3D cavity tutorial with 1M cells) repeatedly on various numbers of processors (N = 8, 16, ..., 64) and I get very erratic results. The run times (even when using the same number of processors) can vary dramatically (>4x). The jobs are submitted through an LSF scheduler and I've tried a few things to test this issue, like submitting the job sequentially (e.g. 3 times in a row) on the same set of processors, or running three instances of the job simultaneously on three different sets of N processors (if that makes sense). We added some code to Pstream to output the total amount of time each processor spent in MPI, and we found that MPI time can vary substantially across the cores, which we've been interpreting as some of the cores running slower. However, this slow-down isn't repeatable, and running the job on the same nodes doesn't reproduce the issue.

I apologize because my knowledge of MPI is essentially zero. Is this a hardware issue, or a problem with how the cluster is constructed? Or is it some setting in MPI? I can provide more information about the details (cluster architecture, etc.) but wanted to describe the problem in general terms first. Running OF-1.6 with the supplied openmpi-1.3.3. Any thoughts or avenues of investigation would be great.
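
In case it helps anyone reproduce this: the timing idea was nothing more than wall-clocking the MPI calls. This is not our actual Pstream patch, just a minimal standalone sketch of the same idea (the names are made up):

Code:

// Hypothetical standalone example, not the actual Pstream modification.
#include <mpi.h>
#include <cstdio>

static double mpiTime = 0.0;  // seconds accumulated inside MPI calls

// Wrap each communication call of interest like this:
void timedAllreduce(double* in, double* out, int n)
{
    double t0 = MPI_Wtime();
    MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    mpiTime += MPI_Wtime() - t0;
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double in = 1.0, out = 0.0;
    timedAllreduce(&in, &out, 1);

    // A large spread of these totals across ranks points to load
    // imbalance or slow cores/links rather than the solver itself.
    std::printf("rank %d: %.6f s in MPI\n", rank, mpiTime);

    MPI_Finalize();
    return 0;
}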

lakeat May 17, 2011 14:37

Quote:

Originally Posted by jdiorio (Post 308013)
A (more basic) question:

Does anyone have a recommended method to determine whether your system is performing optimally? I've run a simple case (the 3D cavity tutorial with 1M cells) repeatedly on various numbers of processors (N = 8, 16, ..., 64) and I get very erratic results. The run times (even when using the same number of processors) can vary dramatically (>4x). The jobs are submitted through an LSF scheduler and I've tried a few things to test this issue, like submitting the job sequentially (e.g. 3 times in a row) on the same set of processors, or running three instances of the job simultaneously on three different sets of N processors (if that makes sense). We added some code to Pstream to output the total amount of time each processor spent in MPI, and we found that MPI time can vary substantially across the cores, which we've been interpreting as some of the cores running slower. However, this slow-down isn't repeatable, and running the job on the same nodes doesn't reproduce the issue.

I apologize because my knowledge of MPI is essentially zero. Is this a hardware issue, or a problem with how the cluster is constructed? Or is it some setting in MPI? I can provide more information about the details (cluster architecture, etc.) but wanted to describe the problem in general terms first. Running OF-1.6 with the supplied openmpi-1.3.3. Any thoughts or avenues of investigation would be great.

Just want to double-check:
1. What do you mean by "I get very erratic results"? Are you saying that the results are completely different each time?
2. Are there other people using your compute nodes or memory? If so, that would cause differences. :)

jdiorio May 17, 2011 15:35

Thanks for the reply, Daniel.

1. The results from the simulation (i.e. flow field) are the same every time. The run time (amount of time to do the same number of iterations) varies greatly.

2. I'm aware of this, and agree that I can't really control for it. However, that's why I ran 3 jobs at the same time (i.e. the same exact simulation, just on different nodes): I figured these would be subject to the same network traffic at that time. Even these cases can be very different (~2x). Furthermore, I've gotten into the habit of checking the cluster load when I submit; I've submitted jobs when there was little to no load and had them take much longer than jobs submitted when network traffic was medium/high.

I'll add that I've also tried looking at how the job is distributed (i.e. 8 cores on 1 node, 1 core on 8 different nodes, etc.; note: each node has 4 dual-core processors) and I don't see a definite pattern - that is, it's not like running a job with N=64 on 8 x 8 is better than some other distribution...

flavio_galeazzo May 18, 2011 13:27

Hi jdiorio,

If you cannot guarantee that the nodes work only for you, it is no surprise that your computation time varies greatly. One time the machine is working for your job only, and another time it is splitting its resources between X jobs.

If you are using the LSF scheduler, it is possible to reserve the nodes for your job only. Then your results will be consistent.
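
For example (assuming a standard LSF setup; check your site's bsub options), exclusive execution can be requested like this:

Code:

# -x asks LSF for exclusive use of each allocated host
bsub -n 64 -x -o log.%J mpirun pisoFoam -parallel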

