CFD Online Discussion Forums (http://www.cfd-online.com/Forums/)
-   OpenFOAM Running, Solving & CFD (http://www.cfd-online.com/Forums/openfoam-solving/)
-   -   poor performance at massive parallel run using SGI cluster (http://www.cfd-online.com/Forums/openfoam-solving/93569-poor-performance-massive-parallel-run-using-sgi-cluster.html)

matthias October 19, 2011 10:28

poor performance at massive parallel run using SGI cluster
 
Hi all,

I would like to bring this issue to the forum, since I have run out of ideas for optimizing my setup.

Some time ago I ran a case with 190 million cells on an SGI ICE cluster using 2048 cores. For parallelization I used both MPI and SGI MPT, which showed similar performance. An iteration step took approximately 10 s (pisoFoam plus an advection-diffusion equation).
So far so good!

Now I would like to run a 350 million cell case on the same SGI ICE cluster using 2720 cores. The mesh is decomposed with scotch, giving each processor approximately 125,000 cells. The setup is otherwise identical to the previous case, but now an iteration step takes 120 s.
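
(A minimal sketch of the corresponding decomposeParDict; only the method and the subdomain count come from the description above, everything else is left out:)

Code:

// system/decomposeParDict -- illustrative sketch, not the original file
numberOfSubdomains  2720;     // ~350M cells / 2720 cores, i.e. roughly 125k-129k cells per core
method              scotch;   // graph-based partitioning, no geometric coefficients needed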

I have already tried different MPT optimization flags, without success.

Most of the time is spent in the pressure correction (GAMG), although only a few iterations (max. 7) are needed. I use three PISO correctors and two non-orthogonal correctors (as before). The new mesh was created with refineMesh, and checkMesh reports no errors.
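
(In fvSolution terms these corrector settings amount to a PISO sub-dictionary along these lines, shown only to make the setup explicit:)

Code:

PISO
{
    nCorrectors              3;   // three pressure-corrector loops per time step
    nNonOrthogonalCorrectors 2;   // two extra non-orthogonal corrections, i.e. three p-solves per corrector
}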

If I limit the maximum number of GAMG iterations to two, an iteration step takes approx. 30 s, but there must be a better solution.
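
(For reference, that limit is just the maxIter keyword in the fvSolution entry for p; a sketch with the other solver settings left out:)

Code:

p
{
    solver    GAMG;
    // ... smoother, agglomeration and tolerance settings as before ...
    maxIter   2;   // hard cap on the number of GAMG iterations per solve
}

A log excerpt: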

Code:

Time = 0.0009171

Courant Number mean: 0.0001917142724 max: 0.287876642
DILUPBiCG:  Solving for Ux, Initial residual = 3.052289654e-06, Final residual = 2.738676051e-10, No Iterations 1
DILUPBiCG:  Solving for Uy, Initial residual = 5.981403037e-05, Final residual = 6.161745588e-09, No Iterations 1
DILUPBiCG:  Solving for Uz, Initial residual = 6.441312049e-05, Final residual = 7.118668359e-09, No Iterations 1
GAMG:  Solving for p, Initial residual = 0.001449305359, Final residual = 9.574983996e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 0.0001347876892, Final residual = 2.672140607e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 3.610848126e-05, Final residual = 7.369521828e-06, No Iterations 2
time step continuity errors : sum local = 5.649504504e-14, global = -4.718524356e-16, cumulative = -5.771567132e-15
GAMG:  Solving for p, Initial residual = 0.0005869720633, Final residual = 4.412817839e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 6.206812369e-05, Final residual = 1.367993617e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 1.819577585e-05, Final residual = 5.044909251e-06, No Iterations 2
time step continuity errors : sum local = 3.866414897e-14, global = -5.610493932e-18, cumulative = -5.777177626e-15
GAMG:  Solving for p, Initial residual = 4.005752672e-05, Final residual = 7.733093149e-06, No Iterations 2
GAMG:  Solving for p, Initial residual = 1.005946488e-05, Final residual = 4.57022811e-06, No Iterations 2
GAMG:  Solving for p, Initial residual = 5.315044337e-06, Final residual = 3.059376256e-06, No Iterations 2
time step continuity errors : sum local = 2.344610732e-14, global = -8.601005524e-18, cumulative = -5.785778632e-15
DILUPBiCG:  Solving for F, Initial residual = 1.423119179e-06, Final residual = 1.152704014e-10, No Iterations 1
DILUPBiCG:  Solving for LLMM, Initial residual = 0.0001600554571, Final residual = 1.266482924e-08, No Iterations 1
DILUPBiCG:  Solving for MMMM, Initial residual = 6.3441275e-05, Final residual = 2.470064085e-09, No Iterations 1
DILUPBiCG:  Solving for NNMM, Initial residual = 0.0001706087254, Final residual = 1.25219109e-08, No Iterations 1
DILUPBiCG:  Solving for LFMFF_LDMMS, Initial residual = 5.309817933e-05, Final residual = 1.139502859e-08, No Iterations 1
DILUPBiCG:  Solving for MFMFF_LDMMS, Initial residual = 5.665481015e-05, Final residual = 3.567372018e-08, No Iterations 1
DILUPBiCG:  Solving for NFMFF_LDMMS, Initial residual = 5.804721854e-05, Final residual = 9.66232654e-09, No Iterations 1
ExecutionTime = 765.53 s  ClockTime = 814 s

Time = 0.000918

Courant Number mean: 0.0001917165609 max: 0.2878756917
DILUPBiCG:  Solving for Ux, Initial residual = 3.052300658e-06, Final residual = 2.738577117e-10, No Iterations 1
DILUPBiCG:  Solving for Uy, Initial residual = 5.97459875e-05, Final residual = 6.158196307e-09, No Iterations 1
DILUPBiCG:  Solving for Uz, Initial residual = 6.448140805e-05, Final residual = 7.136262812e-09, No Iterations 1
GAMG:  Solving for p, Initial residual = 0.001456275326, Final residual = 9.601357664e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 0.0001362132073, Final residual = 2.654885927e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 3.614924123e-05, Final residual = 7.118986273e-06, No Iterations 2
time step continuity errors : sum local = 5.455807911e-14, global = 2.641470572e-16, cumulative = -5.521631575e-15
GAMG:  Solving for p, Initial residual = 0.0005872826127, Final residual = 4.418221094e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 6.214854047e-05, Final residual = 1.378557672e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 1.834875281e-05, Final residual = 5.077590028e-06, No Iterations 2
time step continuity errors : sum local = 3.890282517e-14, global = 1.330261159e-16, cumulative = -5.388605459e-15
GAMG:  Solving for p, Initial residual = 4.023426514e-05, Final residual = 7.565715719e-06, No Iterations 2
GAMG:  Solving for p, Initial residual = 9.852083357e-06, Final residual = 4.413431217e-06, No Iterations 2
GAMG:  Solving for p, Initial residual = 5.164066985e-06, Final residual = 3.023006277e-06, No Iterations 2
time step continuity errors : sum local = 2.316040413e-14, global = 6.903163248e-17, cumulative = -5.319573826e-15
DILUPBiCG:  Solving for F, Initial residual = 1.423187154e-06, Final residual = 1.152845032e-10, No Iterations 1
DILUPBiCG:  Solving for LLMM, Initial residual = 0.0001611766838, Final residual = 1.291849175e-08, No Iterations 1
DILUPBiCG:  Solving for MMMM, Initial residual = 6.407629502e-05, Final residual = 2.486206485e-09, No Iterations 1
DILUPBiCG:  Solving for NNMM, Initial residual = 0.0001709372216, Final residual = 1.2537942e-08, No Iterations 1
DILUPBiCG:  Solving for LFMFF_LDMMS, Initial residual = 5.310868952e-05, Final residual = 1.139492543e-08, No Iterations 1
DILUPBiCG:  Solving for MFMFF_LDMMS, Initial residual = 5.666380492e-05, Final residual = 3.561810739e-08, No Iterations 1
DILUPBiCG:  Solving for NFMFF_LDMMS, Initial residual = 5.80577229e-05, Final residual = 9.6605096e-09, No Iterations 1
ExecutionTime = 801.32 s  ClockTime = 850 s

The boundary conditions and initial values are the same as in the previous case.
In my opinion, roughly 30,000 more cells per processor cannot increase the iteration time from ~10 s to ~120 s.
Are 125,000 cells too many for one core? BTW, the CPUs are Nehalem EP Xeon X5570s running at 2.93 GHz.


Best regards

Matthias

Canesin October 19, 2011 21:58

Matthias, try using PCG for the pressure, or increase the number of cells in the coarsest level of GAMG. GAMG needs too much communication, and that is what is driving up the cost of the simulation.


Code:

p
{
    solver          PCG;
    preconditioner  DIC;
    tolerance       1e-07;
    relTol          1e-02;
}

pFinal
{
    solver          PCG;
    preconditioner  DIC;
    tolerance       1e-08;
    relTol          0;
}

kmooney October 20, 2011 10:17

I usually aim for 50k cells per processor but I haven't done too much quantification of the speeds.

I tweaked the GAMG settings for a parallel run and had pretty good speeds with this configuration (although I can't remember my original settings :)):

Code:

p GAMG
{
    agglomerator            faceAreaPair;
    nCellsInCoarsestLevel   30;
    cacheAgglomeration      true;
    directSolveCoarsest     false;
    nPreSweeps              1;
    nPostSweeps             2;
    nFinestSweeps           2;
    tolerance               1e-07;
    relTol                  0.0;
    smoother                GaussSeidel;
    mergeLevels             2;
    minIter                 0;
    maxIter                 10;
};


Here is another thread (although a little old at this point) where users discussed speedups with OpenFOAM's multigrid solvers. They link to some interesting presentations as well.

http://www.cfd-online.com/Forums/ope...ward-step.html

Fransje October 20, 2011 13:31

I don't know whether the PCG solver with a DIC preconditioner will be faster, though I don't have experience with domains this large.

What you could also try is using GAMG as the preconditioner for PCG. Something like:

Code:

    p
    {
        solver          PCG;
        preconditioner
        {
            preconditioner          GAMG;
            tolerance               1e-10;
            relTol                  0;
            smoother                DICGaussSeidel;
            nPreSweeps              0;
            nPostSweeps             2;
            nFinestSweeps           2;
            cacheAgglomeration      false;
            nCellsInCoarsestLevel   10;
            agglomerator            faceAreaPair;
            mergeLevels             2;
        }
        tolerance       1e-10;
        relTol          0;
    }

Let us know if one of those settings helps!

Kind regards,

Francois.

alberto October 20, 2011 13:52

PCG methods scale better, even though they require more iterations, so the suggestion is correct.

The number of cells you are using (~120k/processor) should be fine too.

matthias October 21, 2011 06:35

I tested all the settings posted here, but none of them is the holy grail.

@Canesin, Alberto: using plain PCG with DIC seems to work best; an iteration now takes ~60 s.

@kmooney: an iteration takes about 70 s.

@Fransje: with this setup an iteration takes about 200 s, much more than with PCG or GAMG alone.

Next, I will test what happens if I use 3600 cores (~98k cells per processor). Maybe the time per iteration improves, or maybe the extra communication eats up all the savings.

Canesin October 21, 2011 07:00

Add the following option to the controlDict:

commsType nonBlocking
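
(commsType is an OptimisationSwitches entry, normally set in the global etc/controlDict rather than in the case controlDict; a minimal sketch, assuming that layout:)

Code:

// $WM_PROJECT_DIR/etc/controlDict  (the global controlDict, not system/controlDict of the case)
OptimisationSwitches
{
    commsType   nonBlocking;   // alternatives: blocking, scheduled
}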

Also try setting the PCG relative tolerance to 0.1 instead of 1e-02.

But the most important thing is to look at the cluster infrastructure: how many cores are in one rack? How is the rack-to-rack communication? Maybe you need to create a local copy of the case on each node. Imagine you are sharing a folder on the head node of one rack and another rack three levels up in the hierarchy needs to download the mesh: it then needs the time to transfer the fields and mesh plus three routing hops.

There is an option in decomposeParDict that makes it possible to store the data locally.
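
(This refers to the distributed-data entries of decomposeParDict; a rough sketch with placeholder paths, one root entry per remote processor:)

Code:

// system/decomposeParDict -- sketch of the distributed-data option; paths are placeholders
distributed     yes;
roots
(
    "/node-local/scratch/case"
    "/node-local/scratch/case"
    // ... one entry per processor beyond the master ...
);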

matthias October 21, 2011 07:42

I will try a relative tolerance of 0.1, but doesn't increasing the relative tolerance also impair the numerical solution?

The cluster has a parallel storage system connected via InfiniBand with two rails, so data transport should not be a problem.

BTW, hopping from one IB switch to another takes only a few nanoseconds.

akidess October 21, 2011 08:24

I think you're wasting a lot of time by using nCellsInCoarsestLevel 10 or 30. At that point you spend more time on restriction and interpolation than you gain from the coarser levels. Increasing nCellsInCoarsestLevel to 1000 should improve the GAMG performance. Also, the more levels you have, the more you have to communicate.
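
(A sketch of the corresponding change in the p solver entry; everything else stays as in the current GAMG setup:)

Code:

p
{
    solver                  GAMG;
    nCellsInCoarsestLevel   1000;   // stop coarsening earlier: fewer levels, less restriction/interpolation and communication
    // ... smoother, tolerance and agglomeration settings unchanged ...
}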

As Alberto already stated, though, PCG requires less communication, so with this many processor boundaries you may never be able to beat it with GAMG.

