Poor performance in a massively parallel run on an SGI cluster

October 19, 2011, 10:28   #1
Matthias Walter (matthias), Member, Rostock, Germany
Hi all,

I would like to bring this issue to the forum, since I have run out of ideas about where I can still optimize my setup.

Some time ago I ran a case with 190 million cells on an SGI ICE cluster using 2048 cores. For parallelization I used MPI and MPT, which showed similar performance. An iteration step took approximately 10 s (pisoFoam plus an advection-diffusion equation).
So far so good!

Now I would like to run a 350 million cell case on the same SGI ICE cluster using 2720 cores. The mesh is decomposed with scotch, so each processor holds approximately 125,000 cells. The setup is identical to the previous case, but now an iteration step takes 120 s.

I have already tried different MPT optimization flags without success.

Most of the time is spent in the pressure correction (GAMG), although only a few iterations (at most 7) are needed. I use 3 corrector steps (nCorrectors) and 2 non-orthogonal corrector steps, as before. The new mesh was created with refineMesh, and checkMesh reports no errors.

If I limit the maximum number of GAMG iterations to two, an iteration step takes approx. 30 s, but there must be a better solution.
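For reference, a minimal system/fvSolution sketch of the setup described above; the smoother and agglomeration entries are illustrative assumptions, not the actual dictionary from this case:

Code:
// Illustrative sketch only: shows where the GAMG iteration cap and the
// corrector counts mentioned above live; the remaining values are assumed.
solvers
{
    p
    {
        solver                GAMG;
        smoother              GaussSeidel;
        agglomerator          faceAreaPair;
        nCellsInCoarsestLevel 100;
        mergeLevels           1;
        cacheAgglomeration    true;
        tolerance             1e-07;
        relTol                0.01;
        maxIter               2;    // the cap that brings a step down to ~30 s
    }
}

PISO
{
    nCorrectors              3;
    nNonOrthogonalCorrectors 2;
}

The solver output for two consecutive time steps follows.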

Code:
Time = 0.0009171

Courant Number mean: 0.0001917142724 max: 0.287876642
DILUPBiCG:  Solving for Ux, Initial residual = 3.052289654e-06, Final residual = 2.738676051e-10, No Iterations 1
DILUPBiCG:  Solving for Uy, Initial residual = 5.981403037e-05, Final residual = 6.161745588e-09, No Iterations 1
DILUPBiCG:  Solving for Uz, Initial residual = 6.441312049e-05, Final residual = 7.118668359e-09, No Iterations 1
GAMG:  Solving for p, Initial residual = 0.001449305359, Final residual = 9.574983996e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 0.0001347876892, Final residual = 2.672140607e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 3.610848126e-05, Final residual = 7.369521828e-06, No Iterations 2
time step continuity errors : sum local = 5.649504504e-14, global = -4.718524356e-16, cumulative = -5.771567132e-15
GAMG:  Solving for p, Initial residual = 0.0005869720633, Final residual = 4.412817839e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 6.206812369e-05, Final residual = 1.367993617e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 1.819577585e-05, Final residual = 5.044909251e-06, No Iterations 2
time step continuity errors : sum local = 3.866414897e-14, global = -5.610493932e-18, cumulative = -5.777177626e-15
GAMG:  Solving for p, Initial residual = 4.005752672e-05, Final residual = 7.733093149e-06, No Iterations 2
GAMG:  Solving for p, Initial residual = 1.005946488e-05, Final residual = 4.57022811e-06, No Iterations 2
GAMG:  Solving for p, Initial residual = 5.315044337e-06, Final residual = 3.059376256e-06, No Iterations 2
time step continuity errors : sum local = 2.344610732e-14, global = -8.601005524e-18, cumulative = -5.785778632e-15
DILUPBiCG:  Solving for F, Initial residual = 1.423119179e-06, Final residual = 1.152704014e-10, No Iterations 1
DILUPBiCG:  Solving for LLMM, Initial residual = 0.0001600554571, Final residual = 1.266482924e-08, No Iterations 1
DILUPBiCG:  Solving for MMMM, Initial residual = 6.3441275e-05, Final residual = 2.470064085e-09, No Iterations 1
DILUPBiCG:  Solving for NNMM, Initial residual = 0.0001706087254, Final residual = 1.25219109e-08, No Iterations 1
DILUPBiCG:  Solving for LFMFF_LDMMS, Initial residual = 5.309817933e-05, Final residual = 1.139502859e-08, No Iterations 1
DILUPBiCG:  Solving for MFMFF_LDMMS, Initial residual = 5.665481015e-05, Final residual = 3.567372018e-08, No Iterations 1
DILUPBiCG:  Solving for NFMFF_LDMMS, Initial residual = 5.804721854e-05, Final residual = 9.66232654e-09, No Iterations 1
ExecutionTime = 765.53 s  ClockTime = 814 s

Time = 0.000918

Courant Number mean: 0.0001917165609 max: 0.2878756917
DILUPBiCG:  Solving for Ux, Initial residual = 3.052300658e-06, Final residual = 2.738577117e-10, No Iterations 1
DILUPBiCG:  Solving for Uy, Initial residual = 5.97459875e-05, Final residual = 6.158196307e-09, No Iterations 1
DILUPBiCG:  Solving for Uz, Initial residual = 6.448140805e-05, Final residual = 7.136262812e-09, No Iterations 1
GAMG:  Solving for p, Initial residual = 0.001456275326, Final residual = 9.601357664e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 0.0001362132073, Final residual = 2.654885927e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 3.614924123e-05, Final residual = 7.118986273e-06, No Iterations 2
time step continuity errors : sum local = 5.455807911e-14, global = 2.641470572e-16, cumulative = -5.521631575e-15
GAMG:  Solving for p, Initial residual = 0.0005872826127, Final residual = 4.418221094e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 6.214854047e-05, Final residual = 1.378557672e-05, No Iterations 2
GAMG:  Solving for p, Initial residual = 1.834875281e-05, Final residual = 5.077590028e-06, No Iterations 2
time step continuity errors : sum local = 3.890282517e-14, global = 1.330261159e-16, cumulative = -5.388605459e-15
GAMG:  Solving for p, Initial residual = 4.023426514e-05, Final residual = 7.565715719e-06, No Iterations 2
GAMG:  Solving for p, Initial residual = 9.852083357e-06, Final residual = 4.413431217e-06, No Iterations 2
GAMG:  Solving for p, Initial residual = 5.164066985e-06, Final residual = 3.023006277e-06, No Iterations 2
time step continuity errors : sum local = 2.316040413e-14, global = 6.903163248e-17, cumulative = -5.319573826e-15
DILUPBiCG:  Solving for F, Initial residual = 1.423187154e-06, Final residual = 1.152845032e-10, No Iterations 1
DILUPBiCG:  Solving for LLMM, Initial residual = 0.0001611766838, Final residual = 1.291849175e-08, No Iterations 1
DILUPBiCG:  Solving for MMMM, Initial residual = 6.407629502e-05, Final residual = 2.486206485e-09, No Iterations 1
DILUPBiCG:  Solving for NNMM, Initial residual = 0.0001709372216, Final residual = 1.2537942e-08, No Iterations 1
DILUPBiCG:  Solving for LFMFF_LDMMS, Initial residual = 5.310868952e-05, Final residual = 1.139492543e-08, No Iterations 1
DILUPBiCG:  Solving for MFMFF_LDMMS, Initial residual = 5.666380492e-05, Final residual = 3.561810739e-08, No Iterations 1
DILUPBiCG:  Solving for NFMFF_LDMMS, Initial residual = 5.80577229e-05, Final residual = 9.6605096e-09, No Iterations 1
ExecutionTime = 801.32 s  ClockTime = 850 s
The boundary conditions and the initial values are the same as in the previous case.
In my opinion, 30,000 more cells per processor cannot increase the iteration time from ~10 s to ~120 s.
Are 125,000 cells too many for one core? BTW, the CPU is a Nehalem EP X5570 running at 2.93 GHz.


Best regards

Matthias

October 19, 2011, 21:58   #2
Fábio César Canesin (Canesin), Member, Florianópolis
Matthias, try using PCG for the pressure, or increase the number of cells in the coarsest level of GAMG. GAMG needs too much communication; that is what is increasing the cost of the simulation.

Code:
p
{
    solver          PCG;
    preconditioner  DIC;
    tolerance       1e-07;
    relTol          1e-02;
}

pFinal
{
    solver          PCG;
    preconditioner  DIC;
    tolerance       1e-08;
    relTol          0;
}

October 20, 2011, 10:17   #3
Kyle Mooney (kmooney), Senior Member, San Francisco, CA USA
I usually aim for 50k cells per processor, but I haven't done much quantification of the speeds.

I tweaked the GAMG settings for a parallel run and got pretty good speed with this configuration (although I can't remember my original settings):

Code:
p GAMG
{
    agglomerator          faceAreaPair;
    nCellsInCoarsestLevel 30;
    cacheAgglomeration    true;
    directSolveCoarsest   false;
    nPreSweeps            1;
    nPostSweeps           2;
    nFinestSweeps         2;
    tolerance             1e-07;
    relTol                0.0;
    smoother              GaussSeidel;
    mergeLevels           2;
    minIter               0;
    maxIter               10;
};


Here is another thread (although a little old at this point) where users discussed speedup with OF's multigrid solvers. They link to some interesting presentations as well.

http://www.cfd-online.com/Forums/ope...ward-step.html

October 20, 2011, 13:31   #4
Francois (Fransje), Senior Member
I don't know if the PCG solver with a DIC preconditioner will be faster; I don't have experience with domains this large.

What you could also try is using the GAMG solver as a preconditioner. Something like:

Code:
p
{
    solver          PCG;
    preconditioner
    {
        preconditioner        GAMG;
        tolerance             1e-10;
        relTol                0;
        smoother              DICGaussSeidel;
        nPreSweeps            0;
        nPostSweeps           2;
        nFinestSweeps         2;
        cacheAgglomeration    false;
        nCellsInCoarsestLevel 10;
        agglomerator          faceAreaPair;
        mergeLevels           2;
    }
    tolerance       1e-10;
    relTol          0;
}
Let us know if one of those settings helps!

Kind regards,

Francois.

October 20, 2011, 13:52   #5
Alberto Passalacqua (alberto), Senior Member, Ames, Iowa, United States
PCG methods scale better, even though they require more iterations, so the suggestion is correct.

The number of cells you are using (~120k/processor) should be fine too.

October 21, 2011, 06:35   #6
Matthias Walter (matthias), Member, Rostock, Germany
I tested all the settings posted here, but none of them seems to be the holy grail.

@Canesin, Alberto: using only PCG with DIC seems to work best. An iteration now takes ~60 s.

@kmooney: an iteration needs about 70 s.

@Fransje: with this setup an iteration takes about 200 s, much more than using only PCG or GAMG.

Next, I will test what happens if I use 3600 cores (~98k cells per processor). Maybe I will get an improvement in the time per iteration, or maybe the communication will eat up all the time savings.

October 21, 2011, 07:00   #7
Fábio César Canesin (Canesin), Member, Florianópolis
Add the following option to the controlDict:

commsType nonBlocking
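A minimal sketch of where that switch could go; the usual location is the OptimisationSwitches dictionary in $WM_PROJECT_DIR/etc/controlDict, and depending on the OpenFOAM version it may also be honoured in the case's system/controlDict:

Code:
// Sketch only: commsType is an optimisation switch; possible values are
// blocking, nonBlocking and scheduled.
OptimisationSwitches
{
    commsType       nonBlocking;
}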

Try modifying the PCG relative tolerance to 0.1 instead of 1e-02.

But the most important thing is to look at the cluster infrastructure: how many cores are in one rack? What is the rack-to-rack communication like? Maybe you need to create a local copy of the case on each node. Imagine you are sharing a folder on the head node of one rack and another rack three levels up in the hierarchy needs to download the mesh: it will need the time to transfer the fields and the mesh plus three routing hops.

There is an option in decomposeParDict that makes it possible to keep the data locally.
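For reference, a sketch of the decomposeParDict entries that enable node-local case data; the root paths below are placeholders:

Code:
// Sketch only: 'distributed' tells the run that the decomposed case lives on
// node-local storage; 'roots' lists one local path per remote processor
// (placeholder paths shown).
distributed     yes;

roots
(
    "/local/scratch/case"
    "/local/scratch/case"
    // ... one entry per remote processor directory
);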

October 21, 2011, 07:42   #8
Matthias Walter (matthias), Member, Rostock, Germany
I will try a relative tolerance of 0.1, but doesn't increasing the relative tolerance also impair the numerical solution?

The cluster has a parallel storage system connected via InfiniBand with two rails, so the data transport should not be the problem.

BTW, hopping from IB switch to IB switch takes only a few nanoseconds.

October 21, 2011, 08:24   #9
Anton Kidess (akidess), Senior Member, Germany
I think you're wasting a lot of time by using nCellsInCoarsestLevel 10 or 30. At that point you spend more time on interpolation and restriction than you gain from the coarser mesh. Increasing nCellsInCoarsestLevel to 1000 should improve the GAMG performance (see the sketch below). Also, the more levels you have, the more you have to communicate.
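A minimal sketch of a GAMG block with a larger coarsest level; apart from nCellsInCoarsestLevel, the entries are illustrative values, not recommendations from this thread:

Code:
// Sketch only: the point is the larger nCellsInCoarsestLevel; the other
// entries are illustrative.
p
{
    solver                GAMG;
    smoother              GaussSeidel;
    agglomerator          faceAreaPair;
    nCellsInCoarsestLevel 1000;
    mergeLevels           1;
    cacheAgglomeration    true;
    tolerance             1e-07;
    relTol                0.01;
}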

As Alberto already stated, though, PCG requires less communication, so with lots of processor boundaries you might never be able to achieve better performance with GAMG.
