Large case parallel efficiency

#1 | lakeat (Daniel WEI (老魏)) | October 14, 2010, 11:08
Hi foamers,

I am frustrated by the extremely poor parallel efficiency I am getting.
The cases are unsteady 3D LES of incompressible external flow.

With a mesh of around 2M cells, running on 24 or 48 CPUs works reasonably well. With around 9M cells on 96 or 128 CPUs, however, the simulation made hardly any progress over a long period (more than a week).

So, dear all, what are your ideas, suggestions and experience?

Any advice would be highly appreciated!
__________________
~
Daniel WEI
-------------
NatHaz Modeling Laboratory
Department of Civil & Environmental Engineering & Earth Sciences
University of Notre Dame, USA

#2 | vinz (Vincent RIVOLA) | October 14, 2010, 11:14
The last computations I ran, with about 8 million cells and a solver derived from rhoSimpleFoam, were slower on 32 cores than on 16, so I decided to stick with 16.
However, I would be really happy to know how to improve this kind of behaviour...

Vincent

#3 | lakeat (Daniel WEI (老魏)) | October 14, 2010, 11:24
Replying to vinz (#2):

Yes, I ran into the same situation: adding CPUs made the run slower, in some cases much slower.

#4 | Canesin (Fábio César Canesin) | October 14, 2010, 12:29
It is related to the need for communication.

Every new division added to the domain means additional synchronization is needed at the subdomain boundaries. The speedup you get is a compromise between the increase in computational power (more cores) and the increase in communication.
If you keep adding cores, at some point the growth in communication cost will overtake the gain in computational power.
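
As a rough illustration of this trade-off, here is a toy cost model in Python. It is only a sketch under assumed numbers, not a measurement: the compute time per step shrinks as cells/cores, a synchronization term grows with log2(cores), and a halo term follows the surface of a cubic subdomain. The names t_cell, latency, syncs and t_face and all their values are hypothetical.

```python
import math


def step_time(cells, cores, t_cell=1e-6, latency=50e-6, syncs=1000, t_face=1e-5):
    """Toy wall time (s) for one time step. Every constant here is hypothetical."""
    compute = t_cell * cells / cores                # work shrinks as cores are added
    if cores == 1:
        return compute                              # serial run: no communication
    sync = syncs * latency * math.log2(cores)       # global reductions grow with core count
    halo = t_face * (cells / cores) ** (2.0 / 3.0)  # boundary exchange of a cubic block
    return compute + sync + halo


def best_core_count(cells, candidates, **model):
    """Core count with the lowest modelled step time."""
    return min(candidates, key=lambda n: step_time(cells, n, **model))


if __name__ == "__main__":
    counts = [2 ** k for k in range(11)]            # 1 .. 1024 cores
    for name, latency in (("slow interconnect", 50e-6), ("fast interconnect", 2e-6)):
        best = best_core_count(9e6, counts, latency=latency)
        print(f"{name}: fastest at {best} cores")
```

With these made-up constants the modelled optimum sits at a much lower core count for the high-latency network than for the low-latency one, which is the qualitative behaviour described above. The absolute numbers mean nothing for a real cluster.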

What can be done?

The first solution is to improve the communication: use a better network and bypass the kernel. The kernel adds overhead when communicating over TCP/IP, so you should use something like Infiniband or Myrinet.

The second solution is to improve the locality of your problem: perhaps reduce the number of subdomains per compute node and then solve the linear system in parallel within each compute node.

Hope it helps..

Fábio C. Canesin

#5 | alberto (Alberto Passalacqua) | October 15, 2010, 00:06
Replying to lakeat (#1):

It depends on many factors, so you should at least say which method you used to decompose your case, what architecture you are running on (multicore? how are the CPUs distributed across nodes, i.e. how many cores per node?), and whether you are using the version of OpenMPI provided by OpenCFD or you recompiled OpenFOAM against the system libraries. Also, which compiler did you use?

Best,
__________________
Alberto Passalacqua

GeekoCFD - A free distribution based on openSUSE 64 bit with CFD tools, including OpenFOAM. Available as live DVD/USB, hard drive image and virtual image.
OpenQBMM - An open-source implementation of quadrature-based moment methods

#6 | eugene (Eugene de Villiers) | October 18, 2010, 07:52
We regularly run LES on large meshes with large numbers of CPUs with excellent speedup. Some things to keep in mind:

Beyond a certain number of CPUs, you need to move to Infiniband or a similar interconnect. Gigabit Ethernet just won't hack it. Where the switch needs to occur depends on the case size, CPU speed and many other factors, but as a rule of thumb, I would say anything above 32 cores requires Infiniband.

Decomposition matters. If you can use a simpler decomposition like hierarchical, do. Try to keep the number of processor boundaries to a minimum (within reason). I suggest you experiment with different decompositions like (16 2 1), (8 4 1), etc. It can make a really massive difference.
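
To make that experiment less tedious, a small Python helper can list the candidate splits for a given core count and print a matching decomposeParDict. This is a sketch only: the function names are made up for illustration, and the dictionary layout (hierarchicalCoeffs with n, delta and order) is assumed to follow the usual tutorial files, so check it against the decomposeParDict shipped with your OpenFOAM version.

```python
import math


def hierarchical_splits(n_procs, nz_max=1):
    """All (nx, ny, nz) with nx*ny*nz == n_procs, nx >= ny >= nz and nz <= nz_max."""
    splits = []
    for nz in range(1, nz_max + 1):
        if n_procs % nz:
            continue
        rest = n_procs // nz
        for ny in range(nz, math.isqrt(rest) + 1):
            if rest % ny == 0:
                splits.append((rest // ny, ny, nz))
    return splits


def decompose_par_dict(split, order="xyz", delta=0.001):
    """Text of a hierarchical decomposeParDict (assumed keyword layout)."""
    nx, ny, nz = split
    return f"""FoamFile
{{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}}

numberOfSubdomains {nx * ny * nz};

method          hierarchical;

hierarchicalCoeffs
{{
    n           ({nx} {ny} {nz});
    delta       {delta};
    order       {order};
}}
"""


if __name__ == "__main__":
    for split in hierarchical_splits(32):
        print(split)
    print(decompose_par_dict((16, 2, 1)))
```

Each candidate can then be written to system/decomposeParDict in a copy of the case and timed over a few steps before committing to a long run.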

Check that the slow-down is not due to some kind of disk activity, NFS or a similar bottleneck. If you have function objects or similar that read/write to disk a lot, or your case sits on a slow disk, you might want to distribute your case so that each processor's data set/mesh is local to the node it is being used on. (Check the distributed keyword in decomposeParDict and the manual entry on decomposePar.)
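
One quick way to compare disks is to time a plain sequential write in the directory in question. The Python sketch below does that with a temporary file; the function name and the default sizes are arbitrary choices for illustration, and a single sequential write is only a crude stand-in for what a solver actually does when many processes write at once.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mb=64, block_kb=1024):
    """Write size_mb of zeros into a temporary file in `directory`; return MB/s."""
    block = b"\0" * (block_kb * 1024)
    n_blocks = (size_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(n_blocks):
                handle.write(block)
            handle.flush()
            os.fsync(handle.fileno())       # force the data out of the page cache
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)                     # always clean up the probe file
    return (n_blocks * block_kb / 1024) / elapsed


if __name__ == "__main__":
    # Replace with the case directory and a node-local scratch directory.
    print(write_throughput_mb_s(tempfile.gettempdir(), size_mb=16))
```

Running it once on the shared case directory and once on node-local scratch, from a compute node, gives a first hint of whether the file system is the bottleneck.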

If you have an infiniband network, you either have to relink Pstream against an MPI that supports the OFED hardware stack or recompile OpenMPI to support infiniband, otherwise your infiniband will be wasted.

Hope this helps.

#7 | alberto (Alberto Passalacqua) | October 18, 2010, 14:49
Quote (eugene, #6):
We regularly run LES on large meshes with large numbers of CPUs with excellent speedup.

Yes, same experience here with our LES on micro-reactors.

Quote (eugene, #6):
Decomposition matters. If you can use a simpler decomposition like hierarchical, do.

Out of curiosity, what is your experience with scotch? We had interesting results with it too, but it would be helpful to hear about other experiences.

Best,

#8 | eugene (Eugene de Villiers) | October 18, 2010, 16:44
Honestly, I haven't tried it. What I have read about scotch so far is that it produces decompositions similar to metis. For large numbers of CPUs, this kind of approach simply doesn't cut the mustard. You end up with too many processors connected to too many others and parallel efficiency suffers.

Somewhere there is an optimum between the number of inter-processor connections and the number of processor faces. You can see this easily by comparing hierarchical decompositions such as (128 1 1), (64 2 1) and (8 4 4). The best performance will not come from (128 1 1) or (8 4 4). (128 1 1) has a very large (processor face)/cell ratio, but the smallest number of (processor boundaries)/cell. For most cases (8 4 4) will be at the other extreme. Both are a disaster in terms of scalability: I have seen (64 2 1) run twice as fast as (128 1 1), and (8 4 4) is even worse than (128 1 1). Extreme domain shapes probably influence the matrix solvers as well.

I must stress that this is all highly situational. If the number of CPUs is small, decomposition doesn't really matter. Cells/proc also affects scalability a lot.
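
The two competing quantities can be counted exactly for an idealized case. The Python sketch below assumes a structured box mesh cut into equal blocks, which is a strong simplification of a real unstructured LES mesh, and the 256 x 256 x 128 mesh size is a hypothetical example chosen only because it is close to the case sizes discussed here.

```python
def decomposition_stats(mesh, split):
    """Processor-face and connectivity counts for a box mesh cut into equal blocks.

    mesh  = (nx, ny, nz) cells of a structured box (an idealization)
    split = (px, py, pz) subdomains per direction, as in hierarchicalCoeffs n
    """
    nx, ny, nz = mesh
    px, py, pz = split
    cells = nx * ny * nz
    # Every cut plane normal to x exposes ny*nz faces, and likewise for y and z.
    proc_faces = (px - 1) * ny * nz + (py - 1) * nx * nz + (pz - 1) * nx * ny
    # Pairs of subdomains that share a processor boundary.
    connections = (px - 1) * py * pz + px * (py - 1) * pz + px * py * (pz - 1)
    # An interior block has up to two neighbours per direction that is cut.
    max_neighbours = sum(min(p - 1, 2) for p in split)
    return {
        "faces_per_cell": proc_faces / cells,
        "connections": connections,
        "max_neighbours": max_neighbours,
    }


if __name__ == "__main__":
    mesh = (256, 256, 128)      # about 8.4M cells, hypothetical
    for split in ((128, 1, 1), (64, 2, 1), (8, 4, 4)):
        print(split, decomposition_stats(mesh, split))
```

For this box, (128 1 1) gives the most processor faces per cell but the fewest connections, (8 4 4) the reverse, and (64 2 1) lies in between on both counts. Which point is actually fastest still has to be measured on the real case and hardware.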

Some kind of ultimate "self-optimising", hardware and algorithm aware decomposition would make a very cool Ph.D. project. At its simplest, you could just use dynamic load balancing techniques to optimise hierarchical decomposition coefficients at run time. Beyond this, you could look into profiling Pstream communication and developing decomposition methods that can be configured to perform best given a particular set of algorithms.
After working on the parallel hierarchical algorithm to allow snappyHexMesh to do dynamic load balancing, I was very interested in developing something like this. Unfortunately, it turned out to be rather difficult and there were more pressing matters to attend to. We can only hope that someone with more time, energy and bright ideas will come along to save us from the current crop of sub-optimal methods.

#9 | tehache (Thomas Jung) | October 19, 2010, 07:31
There are tons of points. One other thing, perhaps trivial but not yet mentioned, that I just found out: MPICH on our cluster was not configured to use shared-memory communication, so it was using the loopback device. I found that in some cases I can gain a lot of speed by using shared-memory communication instead. I don't know why, but a configuration without shared memory enabled seems to be the default in MPICH...

#10 | Simon Lapointe | October 19, 2010, 08:15
I've been running OpenFOAM on large meshes with a high number of CPUs (up to 512) and the speedup was quite good. As mentioned earlier, an Infiniband connection is necessary to achieve good performance on large parallel cases, and we've also found that linking Pstream against the system compiled OpenMPI library supporting Infiniband makes a huge difference.

Concerning the distribution method, I've always used metis and obtained satisfactory results (my cases are mostly 3D airfoils). Eugene's post suggesting to use hierarchical decomposition if possible seems interesting and I might try it (along with scotch) in the near future.

I'm curious about the input of other members on this topic.

#11 | flavio_galeazzo (Flavio Galeazzo) | October 21, 2010, 05:01
My experience with large cases is very close to Simon's. I have run LES cases of up to 10 million nodes on up to 256 cores, with parallel efficiency around 85%, always using Metis as the decomposition strategy. The machine has an Infiniband interconnect, and I compiled OpenFOAM against the system compiled OpenMPI.

For smaller cases, using grids of up to 2 million nodes on a Linux cluster with gigabit Ethernet, I got good scalability only up to 16-20 cores (4-5 machines).
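
For anyone who wants to put a number on their own runs, parallel efficiency relative to a reference run can be computed as in the Python sketch below. The timings in the example are hypothetical placeholders, not results from this thread.

```python
def speedup(t_ref, t_n):
    """Speedup of a run taking t_n seconds relative to a reference taking t_ref."""
    return t_ref / t_n


def parallel_efficiency(t_ref, n_ref, t_n, n):
    """Measured speedup divided by the ideal speedup n / n_ref."""
    return speedup(t_ref, t_n) / (n / n_ref)


if __name__ == "__main__":
    # Hypothetical wall times per time step (seconds) at two core counts.
    t16, t32 = 10.0, 12.0
    print(f"speedup 16 -> 32 cores: {speedup(t16, t32):.2f}")
    print(f"efficiency at 32 cores: {parallel_efficiency(t16, 16, t32, 32):.2f}")
```

An efficiency close to 1 means the extra cores are being used well. A speedup below 1, as in the placeholder numbers, corresponds to a run that got slower when cores were added.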

#12 | alberto (Alberto Passalacqua) | October 21, 2010, 11:22
We have had the same experience as you, Flavio, with our LES on micro-reactors (>= 10^6 cells) using metis/scotch (scotch actually seems slightly better, though the difference with respect to metis is not large).

Compiling against MPI libraries optimized for the architecture is key of course.

#13 | andyj (Andy Jones) | October 24, 2010, 00:19
Hello,
You might consider the Scalasca diagnostic toolset. I am unsure which HPC platforms are supported, but Cray XT and IBM Blue Gene are.
There is also Kojak, the precursor to Scalasca, which runs on more systems. Both give exhaustive information on bottlenecks, problems and system performance, complete with screenshots, charts and logs.
http://www.fz-juelich.de/jsc/scalasca/overview/

Kojak:
http://www.fz-juelich.de/jsc/kojak/platforms/
Kojak Supported Platforms
•Instrumentation, Measurement, and Analysis
◦Linux IA-32, IA-64, and EM64T/x86_64 clusters with GNU, PGI, or Intel compilers
◦IBM Power3 / Power4 / Power5 / Power6 based clusters
◦SGI Mips based clusters (O2k, O3k)
◦SGI IA-64 based clusters (Altix)
◦SUN Solaris Sparc and x86 based clusters
◦DEC/HP Alpha based clusters
◦Generic UNIX workstation (clusters)
•Instrumentation and Measurement only
◦IBM BlueGene/L and BlueGene/P
◦Cray T3E, XD1 and X1, XT3, XT4
◦SiCortex
◦NEC SX
◦Hitachi SR-8000

I do not know anything about the learning curve or the install. It's at least worth a glance.



#14 | lakeat (Daniel WEI (老魏)) | January 10, 2011, 20:45
Good discussion, thank you all. I will try these suggestions and keep you posted.

One of the main takeaways for me is NFS write speed. I will try to distribute the data.

#15 | lakeat (Daniel WEI (老魏)) | January 18, 2011, 00:11
Replying to Simon Lapointe (#10):

Dear Simon, could you explain what you mean by "linking Pstream against the system compiled OpenMPI"?

PFLAGS = -DOMPI_SKIP_MPICXX
PINC = -I$(MPI_ARCH_PATH)/include
PLIBS = -L$(MPI_ARCH_PATH)/lib -lmpi

Is this setting ok, or what?

Thanks

#16 | alberto (Alberto Passalacqua) | January 18, 2011, 00:28
Also, take a look at the study presented at the Open Source CFD Conference 2010:

G. Shainer et al., OpenFOAM optimizations for Scale

It might contain some information of interest to you.

#17 | lakeat (Daniel WEI (老魏)) | January 18, 2011, 01:12
Gotcha, thanks. I am testing.

The problem, which I am not sure about, is this:
In our high performance computing center there are different kinds of nodes, and it seems that not all the compute nodes use Infiniband. Some compute nodes are quite old. I am wondering whether it is possible to apply for the NSF computing node at Illinois.


Also, I am not clear about the disk activity. My jobs are submitted via the SGE job management system, and I do not have the right to access the compute nodes directly, which means ssh computing.node.XXX.edu doesn't work. So I am wondering: when you use a job management system like SGE, how do you set the "root" directories so that the data is distributed?


See my PLIBS now,
Code:
[wei@opteron]$ echo $PLIBS
-pthread -L/afs/crc.edu/x86_64_linux/openmpi/1.3.2/gnu/lib -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl
Now, after resetting PFLAGS, PINC and PLIBS, I just ran wclean and recompiled the src/Pstream library. Is this okay?
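
Relinking only helps if the OpenMPI installation being linked was itself built with Infiniband support. One possible check, sketched in Python below, is to look through the component list printed by OpenMPI's ompi_info tool. Two assumptions are baked in: that the listing uses lines of the form "MCA btl: name (...)", and that Infiniband shows up as a BTL component called openib, which holds for OpenMPI releases of this era but not necessarily for newer ones that use other transports. The function names are made up for illustration.

```python
import shutil
import subprocess


def btl_components(ompi_info_text):
    """Names of the BTL components found in ompi_info output (assumed line format)."""
    names = []
    for line in ompi_info_text.splitlines():
        left, sep, right = line.partition(":")
        if sep and left.split() == ["MCA", "btl"]:
            fields = right.split()
            if fields:
                names.append(fields[0])
    return names


def openmpi_has_openib():
    """True/False if ompi_info is on PATH, None if it cannot be found."""
    exe = shutil.which("ompi_info")
    if exe is None:
        return None
    result = subprocess.run([exe], capture_output=True, text=True, check=False)
    return "openib" in btl_components(result.stdout)


if __name__ == "__main__":
    print("openib BTL available:", openmpi_has_openib())
```

Make sure the ompi_info found on PATH belongs to the same OpenMPI installation that PLIBS points to, otherwise the answer refers to a different build.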


Thanks,

#18 | lakeat (Daniel WEI (老魏)) | January 18, 2011, 02:01
I just noticed that they were using:

Six-Core Intel X5670 @ 2.93 GHz CPUs

Memory: 24GB per node

OS: CentOS5U4, OFED 1.5.1 InfiniBand SW stack



Meanwhile I'm kind of frustrated, because mine is:
32 HP DL160 G6 servers
Dual Quad-Core, 2.27 GHz L5520 Intel Nehalem nodes (8 cores per node, 256 total cores), 12 GB RAM each
or
393 HP DL165 G6 servers
Dual Six-Core 2.4 GHz AMD Opteron Model 2431 64/32 bit (12 cores per node, 4716 total cores), 12 GB RAM, 1 x 160 GB SATA Disk

So yesterday..

#19 | lakeat (Daniel WEI (老魏)) | March 4, 2011, 15:21
Hello all,


Some findings and updates; I hope this helps other non-professionals like me.

Infiniband support is critical for speed. To run a mid-size case that needs many nodes, it is strongly advised to build the code against the Infiniband libraries.


But I also have some further questions:
1. How many grid points do you usually allocate per CPU?
2. I am still not clear on what makes hierarchical a better option than metis. Are you aware of any general rules, or do you have experience showing that it is much better than metis? If not, I am going to stay with metis.

Thanks

#20 | flavio_galeazzo (Flavio Galeazzo) | March 9, 2011, 04:35
I have the same experience as you with Infiniband, Daniel. It is crucial for getting good speedup with more than 4-5 machines.

I normally allocate the nodes for a simulation aiming for 1 second per time step, which is a good value on the system I work with (similar to your "old" cluster). The number of grid points per node depends largely on the complexity of the solver. I can allocate more grid points with a less complex solver, say an incompressible LES, than with a complex reacting flow solver.
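
That sizing rule can be turned into a back-of-the-envelope estimate. The Python sketch below scales one measured run to a target time per step, assuming the work per step stays constant and that parallel efficiency is a single fixed factor, which is a simplification. The measured values in the example are hypothetical, and the 0.85 is simply the efficiency figure quoted earlier in the thread.

```python
import math


def cores_for_target(measured_step_s, measured_cores, target_step_s=1.0, efficiency=1.0):
    """Cores needed to reach target_step_s, scaled from one measured run."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    core_seconds = measured_step_s * measured_cores   # work per step, assumed constant
    return math.ceil(core_seconds / (target_step_s * efficiency))


def cells_per_core(cells, cores):
    """Average number of cells handled by each core."""
    return cells / cores


if __name__ == "__main__":
    # Hypothetical benchmark: 8 s per step on 16 cores for a 9M-cell mesh.
    needed = cores_for_target(8.0, 16, target_step_s=1.0, efficiency=0.85)
    print(needed, "cores,", round(cells_per_core(9_000_000, needed)), "cells per core")
```

The resulting cells-per-core figure is specific to the solver and machine used for the benchmark, in line with the point above that simpler solvers tolerate more cells per core.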