CFD Online Discussion Forums

CFD Online Discussion Forums (http://www.cfd-online.com/Forums/)
-   OpenFOAM Running, Solving & CFD (http://www.cfd-online.com/Forums/openfoam-solving/)
-   -   Large case parallel efficiency (http://www.cfd-online.com/Forums/openfoam-solving/81051-large-case-parallel-efficiency.html)

lakeat October 14, 2010 11:08

Large case parallel efficiency
 
Hi foamers,

I feel mad about my extremely slow parallel computing efficiency.
They are unsteady 3D LES case. incompressible external flow.

When the grid number is around 2M, it works fine, I use 24 or 48 cpus and looks not bad, but when the grid number is around 9M, I try to use 128 cpu or 96 cpu, but the simulation just did not move for a quite long time, (more than a week).

so dear all, what's your idea, and what is your suggestion and your experience.

Any ideas and advice would be highly appreciated!

vinz October 14, 2010 11:14

The last computations I did with about 8 million cells using a solver derived from rhoSimpleFoam were running slower on 32cores than on 16 so I decided to stick to 16.
However, I would be really happy to know how to improve this kind of behaviour...

Vincent

lakeat October 14, 2010 11:24

Quote:

Originally Posted by vinz (Post 279204)
The last computations I did with about 8 million cells using a solver derived from rhoSimpleFoam were running slower on 32cores than on 16 so I decided to stick to 16.
However, I would be really happy to know how to improve this kind of behaviour...

Vincent

Yes, i met the same situation, increase cpus but the speed is decreased, even decreased a lot.

Canesin October 14, 2010 12:29

It is related with the needed for communication...

Every nem devision added to a domain means new synchronization is needed at boundaries ... The speed you earn is a compromise between increase in computational power (more cores) and increase in communication ...
In the case of adding every time more cores.. you need to know that the gain of communication will some time overtake the gain in computational power..

What can be done ??

The first solution is to improve the communication, better network and bypass the kernel .. the kernel generates overhead in communication using TCP/IP .. so you should use something like Infiniband or Myrinet ..

The second solution is to improve the locality of your problem, maybe decrease the number of domain subdivision by per compute node and them solve the linear system parallel in each compute node...

Hope it helps..

Fábio C. Canesin

alberto October 15, 2010 00:06

Quote:

Originally Posted by lakeat (Post 279202)
Hi foamers,

I feel mad about my extremely slow parallel computing efficiency.
They are unsteady 3D LES case. incompressible external flow.

When the grid number is around 2M, it works fine, I use 24 or 48 cpus and looks not bad, but when the grid number is around 9M, I try to use 128 cpu or 96 cpu, but the simulation just did not move for a quite long time, (more than a week).

so dear all, what's your idea, and what is your suggestion and your experience.

Any ideas and advice would be highly appreciated!

It depends on a lot of factors, so at least you should say what method did you use to decompose your case, what architecture (multicore? how do CPU/nodes are distributed, meaning how many cores per node?) you are running on, and also if you are use the version of OpenMPI provided by OpenCFD or you recompiled OpenFOAM against the system libraries. Also, what compiler did you use?

Best,

eugene October 18, 2010 07:52

We regularly run LES on large meshes with large numbers of CPUs with excellent speedup. Some things to keep in mind:

Beyond a certain number of CPUs, you need to move to infiniband or similar interconnect. Gigeth just wont hack it. Where the switch needs to occur depends on the case size, cpu speed and many other factors, but as rule of thumb, I would say anything above 32 cores requires infiniband.

Decomposition matters. If you can use a simpler decomposition like hierarchical, do. Try to keep the number of processor boundaries to a minimum (within reason). I suggest you experiment with different decompositions like (16 2 1), (8 4 1), etc. It can make a really massive difference.

Check that the slow-down is not due to some kind of disk activity, nfs or similar bottle-neck. If you have function objects or similar that read/write to disk a lot or have your case on a slow disk, you might want to distribute your case so each processor data set/mesh is local to the node it is being used on. (Check the distributed key word in decomposeParDict and the manual entry on decomposePar)

If you have an infiniband network, you either have to relink Pstream against an MPI that supports the OFED hardware stack or recompile OpenMPI to support infiniband, otherwise your infiniband will be wasted.

Hope this helps.

alberto October 18, 2010 14:49

Quote:

Originally Posted by eugene (Post 279610)
We regularly run LES on large meshes with large numbers of CPUs with excellent speedup.

Yes, same experience here with our LES on micro-reactors.

Quote:

Decomposition matters. If you can use a simpler decomposition like hierarchical, do.
Out of curiosity, what is your experience with scotch? We had interesting results with it too, but it would be helpful to know other experiences.

Best,

eugene October 18, 2010 16:44

Honestly, I haven't tried it. What I have read about scotch so far is that it produces decompositions similar to metis. For large numbers of cpus, this kind of approach simply doesn't cut the mustard. You end up with too many processors connected to too many others and parallel efficiency suffers. Somewhere there is an optimum between number of inter-processor connections vs. number of processor faces. You can see this easily by comparing a hierarchical decomposition like (128 1 1) with (64 2 1) and (8 4 4). The best performance will not be (128 1 1) or (8 4 4). (128 1 1) has a very large (processor face)/cell ratio, but the smallest number of (processor boundaries)/cell. For most cases (8 4 4) will be at the other extreme. Both are a disaster in terms of scalability - I have seen (64 2 1) run twice as fast as (128 1 1), (8 4 4) is even worse than (128 1 1). Extreme domain shapes probably influence matrix solvers as well. I must stress that this is all highly situational. If the number of CPUs is small, decomposition doesn't really matter. Cells/proc also affect scalability a lot.

Some kind of ultimate "self-optimising", hardware and algorithm aware decomposition would make a very cool Ph.D. project. At its simplest, you could just use dynamic load balancing techniques to optimise hierarchical decomposition coefficients at run time. Beyond this, you could look into profiling Pstream communication and developing decomposition methods that can be configured to perform best given a particular set of algorithms.
After working on the parallel hierarchical algorithm to allow snappyHexMesh to do dynamic load balancing, I was very interested in developing something like this. Unfortunately, it turned out to be rather difficult and there were more pressing matters to attend too. We can only hope that someone with more time, energy and bright ideas will come along to save us from the current crop of sub-optimal methods.

tehache October 19, 2010 07:31

there are tons of points... one other, perhaps trivial, but not yet mentioned thing I just found out: mpich on our cluster was not configured to use shared memory communication, thus using loopback device. I found that in some cases I can gain a lot of speed over this using the shared memory communication. Dont know why, but configuration without shared memory enabled seems to be default in mpich...

Simon Lapointe October 19, 2010 08:15

I've been running OpenFOAM on large meshes and high number of CPUs (up to 512) and the speedup was quite good. As it has been mentioned earlier, an Infiniband connection is necessary to achieve good performance on large parallel cases and we've also found that linking Pstream against the system compiled OpenMPI library supporting Infiniband makes a huge difference.

Concerning the distribution method, I've always used metis and obtained satisfactory results (my cases are mostly 3D airfoils). Eugene's post suggesting to use hierarchical decomposition if possible seems interesting and I might try it (along with scotch) in the near future.

I'm curious about the input of other members on this topic.

flavio_galeazzo October 21, 2010 05:01

My experience with large cases is very close to Simon one. I have run LES cases up to 10 million nodes on up to 256 cores, with parallel eficiency around 85%, always using Metis as decomposition strategy. The machine has Infiniband interconnect, and I have compiled OpenFoam with the system compiled OpenMPI.

About smaller cases, using grids up to 2 million nodes and a Linux cluster with gigabit ethernet, I got good scalability only up to 16-20 cores (4-5 machines).

alberto October 21, 2010 11:22

We have the same experience you had Flavio, with out LES on micro-reactors (>= 10^6 cells) using metis/scotch (scotch is actually slightly better it seems, even if the difference with respect to metis is not amazing).

Compiling against MPI libraries optimized for the architecture is key of course.

andyj October 24, 2010 00:19

Hello
You might consider the Scalasca Diagnostic Toolset. I am unsure what HPC formats are supported, but Cray Xt and IBM Blue Gene are.
There is also Kojak, the precursor to Scalasca, which runs on more systems. Both give exhaustive info on bottlenecks and problems and system performance, complete with screenshots/charts/logs.
http://www.fz-juelich.de/jsc/scalasca/overview/

Kojak:
http://www.fz-juelich.de/jsc/kojak/platforms/
Kojak Supported Platforms
•Instrumentation, Measurement, and Analysis
◦Linux IA-32, IA-64, and EM64T/x86_64 clusters with GNU, PGI, or Intel compilers
◦IBM Power3 / Power4 / Power5 / Power6 based clusters
◦SGI Mips based clusters (O2k, O3k)
◦SGI IA-64 based clusters (Altix)
◦SUN Solaris Sparc and x86 based clusters
◦DEC/HP Alpha based clusters
◦Generic UNIX workstation (clusters)
•Instrumentation and Measurement only
◦IBM BlueGene/L and BlueGene/P
◦Cray T3E, XD1 and X1, XT3, XT4
◦SiCortex
◦NEC SX
◦Hitachi SR-8000

I do not know anything about the learning curve or install. Its at least worth a glance.


--------------------------------------------------------------------------------

lakeat January 10, 2011 19:45

Good discussions, thank you all, I will try and keep you posted.

One of the major reminding for me is nfs writing speed. I will try to distribute the data.

lakeat January 17, 2011 23:11

Quote:

Originally Posted by Simon Lapointe (Post 279773)
I've been running OpenFOAM on large meshes and high number of CPUs (up to 512) and the speedup was quite good. As it has been mentioned earlier, an Infiniband connection is necessary to achieve good performance on large parallel cases and we've also found that linking Pstream against the system compiled OpenMPI library supporting Infiniband makes a huge difference.

Concerning the distribution method, I've always used metis and obtained satisfactory results (my cases are mostly 3D airfoils). Eugene's post suggesting to use hierarchical decomposition if possible seems interesting and I might try it (along with scotch) in the near future.

I'm curious about the input of other members on this topic.

Dear Simon, could you tell me what do you mean by saying "linking Pstream against the system compiled OpenMPI"?

1 PFLAGS = -DOMPI_SKIP_MPICXX
2 PINC = -I$(MPI_ARCH_PATH)/include
3 PLIBS = -L$(MPI_ARCH_PATH)/lib -lmpi

Is this setting ok, or what?

Thanks

alberto January 17, 2011 23:28

Also, take a look at the study presented at the Open Source CFD Conference 2010:

G. Shainer et al., OpenFOAM optimizations for Scale

They might give some information of interest for you.

lakeat January 18, 2011 00:12

Gotcha, Thanks, I am testing..

The problem is , which I am not sure about:
I see in our high performance center, there are different nodes, it seems not all the compute nodes are using Infiniband. Some computing nodes are quite old. I am wondering if it is possible to apply the nsf computing node at Illinois.


And also, concerning the disk activity, I am not clear. My job are submitted via SGE management system, I do not have the right to access the computing node, which means ssh computing.node.XXX.edu, doesn't work. So I am wondering, when you are using job-management system like SGE, how did you set the "root" directories? So to let the data distributed??


See my PLIBS now,
Code:

[wei@opteron]$ echo $PLIBS
-pthread -L/afs/crc.edu/x86_64_linux/openmpi/1.3.2/gnu/lib -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl

Now, after re-setting the PFLAGS, PINC, and PLIBS, I then just wclean and recompile the src/Pstreams lib, So is this okay?


Thanks,

lakeat January 18, 2011 01:01

Just noticed that they were using

Six-Core Intel X5670 @ 2.93 GHz CPUs

Memory: 24GB per node

OS: CentOS5U4, OFED 1.5.1 InfiniBand SW stack



While Im kind of frustrated, for mine is
32 HP DL160 G6 servers
Dual Quad-Core, 2.27 GHz L5520 Intel Nehalem nodes (8 cores per node, 256 total cores), 12 GB RAM each
or
393 HP DL165 G6 servers
Dual Six-Core 2.4 GHz AMD Opteron Model 2431 64/32 bit (12 cores per node, 4716 total cores), 12 GB RAM, 1 x 160 GB SATA Disk

So yesterday..

lakeat March 4, 2011 14:21

Hello all,


Some findings and updates, hope this can help those non-professionals like me.

Infiniband support is critical for speed, to run a mid size case that need many nodes, it is strongly adviced to build the code against infiniband libs.


But I also got another question,
1. Usually how many grid points you guys allocate for each cpu? :confused:
2. I am still not clear how you make Hierarchical a better option than metis. Are you aware of any general rules, or do you have any experience that it is super better than metis. If not, im gonna stay with metis.

Thanks

flavio_galeazzo March 9, 2011 03:35

I have the same experience as you about Infiniband, Daniel. It is crucial to get good speed up with more than 4-5 machines.

I normally allocate the nodes for simulation aiming for 1 second per time step, which is a good value in the system I work with (similar to your "old" cluster). The number of grid points per node depends largely on the complexity of the solver. I can allocate more grid points with a less complex solver, say an incompressible LES, than with a complex reacting flow solver.


All times are GMT -4. The time now is 04:59.