Poor performance in a massively parallel run on an SGI cluster
I would like to bring this issue to the forum since I have run out of ideas for optimizing my setup.
Some time ago I ran a case with 190 million cells on an SGI ICE cluster using 2048 cores. For parallelization I used MPI and MPT, which showed similar performance. An iteration step took approximately 10 s (pisoFoam + advection-diffusion equation).
So far so good!
Now I would like to run a 350 million cell case on the same SGI ICE cluster using 2720 cores. The mesh is decomposed with scotch, and each processor has approximately 125000 cells. The setup is identical to the previous case, but now an iteration step takes 120 s.
I already used different MPT optimization flags without success.
Most of the time is spent on the pressure correction (GAMG), although only a few iterations (max 7) are needed. I used 3 nCorrectors steps and 2 nNonOrthogonalCorrectors steps (as before). The new mesh was created using refineMesh and checkMesh reports no errors.
If I limit the maximum number of iterations to two in the GAMG settings, an iteration step takes approx. 30 s, but there must be a better solution.
In my opinion, 30000 more cells per processor cannot increase the iteration time from ~10 s to ~120 s.
Are 125000 cells too many for one core? BTW, the CPU is a Nehalem EP, X5570 running at 2.93 GHz.
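For reference, the load per core in the two runs can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the rounded totals quoted above (190 million and 350 million cells) and a perfectly even split across cores, which scotch only approximates. The function name is illustrative.

```python
def cells_per_core(total_cells: int, cores: int) -> float:
    """Average cells per core, assuming a perfectly even decomposition."""
    return total_cells / cores


# Rounded figures quoted in the post (assumed, not exact mesh sizes).
old_load = cells_per_core(190_000_000, 2048)
new_load = cells_per_core(350_000_000, 2720)

print(f"previous run: {old_load:,.0f} cells per core")
print(f"current run:  {new_load:,.0f} cells per core")
print(f"load ratio:   {new_load / old_load:.2f}")
print(f"time ratio:   {120 / 10:.0f}")  # ~120 s vs ~10 s per iteration step
```

With these numbers the load per core grows by roughly 40% while the time per step grows twelvefold, which supports the suspicion that the extra cells alone do not explain the slowdown.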
Matthias, try using PCG for the pressure, or increasing the number of cells in the coarsest level of GAMG. GAMG needs too much communication, and that is what is increasing the cost of the simulation.
I usually aim for 50k cells per processor but I haven't done too much quantification of the speeds.
I tweaked the GAMG settings for a parallel run and had pretty good speeds with this configuration (although I can't remember my original settings :)):
Here is another thread (although a little old at this point) where users discussed speedup with OF's multigrid solvers. They link to some interesting presentations as well.
I don't know if the PCG solver with the DIC preconditioner will be faster; I don't have experience with domains this large.
What you could also try is using the GAMG solver as a preconditioner. Something like:
PCG methods scale better, even though they require more iterations, so the suggestion is correct.
The number of cells you are using (~120k/processor) should be fine too.
I tested all the settings posted here, but none of them seems to be the holy grail.
@Canesin, Alberto: using only PCG with DIC seems to work best. An iteration now takes ~60 s.
@kmooney: an iteration needs about 70 s.
@Fransje: using this setup an iteration takes about 200 s, much more than using only PCG or GAMG.
Next, I will test what happens if I use 3600 cores (~98K cells per processor). Maybe I will get an improvement in the time per iteration, or maybe the communication will eat up all the time savings.
Add the following option to controlDict:
Try modifying the PCG relative tolerance to 0.1 instead of 1e-02.
But the most important thing is to look at the cluster infrastructure: how many cores are in one rack? How is the rack-to-rack communication? Maybe you need to create a local copy of the case on each node. Imagine you are sharing a folder on the head node of one rack, and then another rack three levels up in the hierarchy needs to download the mesh: that costs the time to transfer the fields and mesh plus three routing hops.
There is an option in decomposeParDict that makes it possible to save the data locally.
I will try a relative tolerance of 0.1, but doesn't increasing the relative tolerance also impair the numerical solution?
The cluster has a parallel storage system connected by IB with two rails. So the data transport should be no problem.
BTW, hopping from IB switch to IB switch takes only a few nanoseconds.
I think you're wasting a lot of time by using nCellsInCoarsestLevel 10 or 30. At this point you'll be using more time on interpolation and restriction than gaining thanks to the coarser mesh. Increasing nCellsInCoarsestLevel to 1000 should improve the GAMG performance. Also, the more levels you have, the more you will have to communicate.
As Alberto already stated, though, PCG requires less communication, so you might never be able to achieve better performance with GAMG when there are lots of processor boundaries.
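To put rough numbers on the nCellsInCoarsestLevel argument, here is a minimal Python sketch that counts the coarsening steps per processor. It assumes that each GAMG level roughly halves the cell count (pairwise agglomeration) and that coarsening stops once the count drops to nCellsInCoarsestLevel or below. Both are simplifications of what the solver actually does, and the function name is hypothetical, not part of OpenFOAM.

```python
import math


def estimated_levels(cells: int, n_coarsest: int, merge_factor: int = 2) -> int:
    """Count coarsening steps until the level has at most n_coarsest cells.

    Assumes every step divides the cell count by merge_factor (rounded up).
    """
    levels = 0
    while cells > n_coarsest:
        cells = math.ceil(cells / merge_factor)
        levels += 1
    return levels


# ~125000 cells per processor, as quoted in the first post.
for n_coarsest in (10, 30, 1000):
    print(n_coarsest, estimated_levels(125_000, n_coarsest))
```

Under these assumptions, 125000 cells per processor gives about 14 levels for a coarsest level of 10 cells, 13 for 30 and 7 for 1000, so the larger setting roughly halves the number of levels that need restriction, interpolation and communication in every cycle.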