
October 19, 2011, 10:28 
poor performance at massive parallel run using SGI cluster

#1 
Member
Matthias Walter
Join Date: Mar 2009
Location: Rostock, Germany
Posts: 63
Rep Power: 9 
Hi all,
I would like to bring this issue to the forum because I have run out of ideas for optimizing my setup. Some time ago I ran a case with 190 million cells on an SGI ICE cluster using 2048 cores. For parallelization I used MPI and MPT, which showed similar performance. An iteration step took approximately 10 s (pisoFoam plus an advection-diffusion equation). So far so good!
Now I would like to run a 350 million cell case on the same SGI ICE cluster using 2720 cores. The mesh is decomposed with scotch, and each processor has approximately 125000 cells. The setup is the same as in the previous case, but now an iteration step takes 120 s. I have already tried different MPT optimization flags without success. Most of the time is spent in the pressure correction (GAMG), although only a few iterations (max. 7) are needed. I use 3 corrector steps (nCorrectors) and 2 non-orthogonal corrector steps (nNonOrthogonalCorrectors), as before. The new mesh was created with refineMesh, and checkMesh reports no errors. If I limit the maximum number of iterations to two in the GAMG settings, an iteration step takes approx. 30 s, but there must be a better solution.
Code:
Time = 0.0009171

Courant Number mean: 0.0001917142724 max: 0.287876642
DILUPBiCG: Solving for Ux, Initial residual = 3.052289654e-06, Final residual = 2.738676051e-10, No Iterations 1
DILUPBiCG: Solving for Uy, Initial residual = 5.981403037e-05, Final residual = 6.161745588e-09, No Iterations 1
DILUPBiCG: Solving for Uz, Initial residual = 6.441312049e-05, Final residual = 7.118668359e-09, No Iterations 1
GAMG: Solving for p, Initial residual = 0.001449305359, Final residual = 9.574983996e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 0.0001347876892, Final residual = 2.672140607e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 3.610848126e-05, Final residual = 7.369521828e-06, No Iterations 2
time step continuity errors : sum local = 5.649504504e-14, global = 4.718524356e-16, cumulative = 5.771567132e-15
GAMG: Solving for p, Initial residual = 0.0005869720633, Final residual = 4.412817839e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 6.206812369e-05, Final residual = 1.367993617e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 1.819577585e-05, Final residual = 5.044909251e-06, No Iterations 2
time step continuity errors : sum local = 3.866414897e-14, global = 5.610493932e-18, cumulative = 5.777177626e-15
GAMG: Solving for p, Initial residual = 4.005752672e-05, Final residual = 7.733093149e-06, No Iterations 2
GAMG: Solving for p, Initial residual = 1.005946488e-05, Final residual = 4.57022811e-06, No Iterations 2
GAMG: Solving for p, Initial residual = 5.315044337e-06, Final residual = 3.059376256e-06, No Iterations 2
time step continuity errors : sum local = 2.344610732e-14, global = 8.601005524e-18, cumulative = 5.785778632e-15
DILUPBiCG: Solving for F, Initial residual = 1.423119179e-06, Final residual = 1.152704014e-10, No Iterations 1
DILUPBiCG: Solving for LLMM, Initial residual = 0.0001600554571, Final residual = 1.266482924e-08, No Iterations 1
DILUPBiCG: Solving for MMMM, Initial residual = 6.3441275e-05, Final residual = 2.470064085e-09, No Iterations 1
DILUPBiCG: Solving for NNMM, Initial residual = 0.0001706087254, Final residual = 1.25219109e-08, No Iterations 1
DILUPBiCG: Solving for LFMFF_LDMMS, Initial residual = 5.309817933e-05, Final residual = 1.139502859e-08, No Iterations 1
DILUPBiCG: Solving for MFMFF_LDMMS, Initial residual = 5.665481015e-05, Final residual = 3.567372018e-08, No Iterations 1
DILUPBiCG: Solving for NFMFF_LDMMS, Initial residual = 5.804721854e-05, Final residual = 9.66232654e-09, No Iterations 1
ExecutionTime = 765.53 s ClockTime = 814 s

Time = 0.000918

Courant Number mean: 0.0001917165609 max: 0.2878756917
DILUPBiCG: Solving for Ux, Initial residual = 3.052300658e-06, Final residual = 2.738577117e-10, No Iterations 1
DILUPBiCG: Solving for Uy, Initial residual = 5.97459875e-05, Final residual = 6.158196307e-09, No Iterations 1
DILUPBiCG: Solving for Uz, Initial residual = 6.448140805e-05, Final residual = 7.136262812e-09, No Iterations 1
GAMG: Solving for p, Initial residual = 0.001456275326, Final residual = 9.601357664e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 0.0001362132073, Final residual = 2.654885927e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 3.614924123e-05, Final residual = 7.118986273e-06, No Iterations 2
time step continuity errors : sum local = 5.455807911e-14, global = 2.641470572e-16, cumulative = 5.521631575e-15
GAMG: Solving for p, Initial residual = 0.0005872826127, Final residual = 4.418221094e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 6.214854047e-05, Final residual = 1.378557672e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 1.834875281e-05, Final residual = 5.077590028e-06, No Iterations 2
time step continuity errors : sum local = 3.890282517e-14, global = 1.330261159e-16, cumulative = 5.388605459e-15
GAMG: Solving for p, Initial residual = 4.023426514e-05, Final residual = 7.565715719e-06, No Iterations 2
GAMG: Solving for p, Initial residual = 9.852083357e-06, Final residual = 4.413431217e-06, No Iterations 2
GAMG: Solving for p, Initial residual = 5.164066985e-06, Final residual = 3.023006277e-06, No Iterations 2
time step continuity errors : sum local = 2.316040413e-14, global = 6.903163248e-17, cumulative = 5.319573826e-15
DILUPBiCG: Solving for F, Initial residual = 1.423187154e-06, Final residual = 1.152845032e-10, No Iterations 1
DILUPBiCG: Solving for LLMM, Initial residual = 0.0001611766838, Final residual = 1.291849175e-08, No Iterations 1
DILUPBiCG: Solving for MMMM, Initial residual = 6.407629502e-05, Final residual = 2.486206485e-09, No Iterations 1
DILUPBiCG: Solving for NNMM, Initial residual = 0.0001709372216, Final residual = 1.2537942e-08, No Iterations 1
DILUPBiCG: Solving for LFMFF_LDMMS, Initial residual = 5.310868952e-05, Final residual = 1.139492543e-08, No Iterations 1
DILUPBiCG: Solving for MFMFF_LDMMS, Initial residual = 5.666380492e-05, Final residual = 3.561810739e-08, No Iterations 1
DILUPBiCG: Solving for NFMFF_LDMMS, Initial residual = 5.80577229e-05, Final residual = 9.6605096e-09, No Iterations 1
ExecutionTime = 801.32 s ClockTime = 850 s

In my opinion, 30000 more cells per processor cannot increase the iteration time from ~10 s to ~120 s. Are 125000 cells too much for one core? BTW, the CPU is a Nehalem EP X5570 running at 2.93 GHz.
Best regards,
Matthias
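For reference, a rough per-core load worked out from the rounded figures in the post. The 190 million and 350 million cell counts are taken at face value, so the results are approximate:

```latex
\frac{190\times10^{6}}{2048} \approx 9.3\times10^{4}\ \text{cells/core},
\qquad
\frac{350\times10^{6}}{2720} \approx 1.29\times10^{5}\ \text{cells/core}
```

The load per core grows by a factor of about 1.4, while the reported time per iteration step grows by a factor of 12 (10 s to 120 s), which suggests the slowdown is not explained by cell count alone.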

October 19, 2011, 21:58 

#2 
Member
Fábio César Canesin
Join Date: Mar 2010
Location: Florianópolis
Posts: 67
Rep Power: 8 
Matthias, try using PCG for the pressure, or increase the number of cells in the coarsest level of GAMG. GAMG needs too much communication, and that is what is increasing the cost of the simulation.
Code:
p
{
    solver          PCG;
    preconditioner  DIC;
    tolerance       1e-07;
    relTol          1e-02;
}

pFinal
{
    solver          PCG;
    preconditioner  DIC;
    tolerance       1e-08;
    relTol          0;
}

October 20, 2011, 10:17 

#3 
Senior Member
Kyle Mooney
Join Date: Jul 2009
Location: Amherst, MA USA  San Diego, CA USA
Posts: 320
Rep Power: 10 
I usually aim for 50k cells per processor, but I haven't done much quantification of the speeds.
I tweaked the GAMG settings for a parallel run and got pretty good speeds with this configuration (although I can't remember my original settings):
Code:
p GAMG
{
    agglomerator            faceAreaPair;
    nCellsInCoarsestLevel   30;
    cacheAgglomeration      true;
    directSolveCoarsest     false;
    nPreSweeps              1;
    nPostSweeps             2;
    nFinestSweeps           2;
    tolerance               1e-07;
    relTol                  0.0;
    smoother                GaussSeidel;
    mergeLevels             2;
    minIter                 0;
    maxIter                 10;
};

Here is another thread (a little old at this point) where users discussed speedup with OpenFOAM's multigrid solvers. They link to some interesting presentations as well. http://www.cfdonline.com/Forums/ope...wardstep.html

October 20, 2011, 13:31 

#4 
Senior Member
Francois
Join Date: Jun 2010
Posts: 107
Rep Power: 9 
I don't know whether the PCG solver with the DIC preconditioner will be faster, but then I don't have experience with domains this large.
What you could also try is using the GAMG solver as a preconditioner. Something like:
Code:
p
{
    solver          PCG;
    preconditioner
    {
        preconditioner          GAMG;
        tolerance               1e-10;
        relTol                  0;
        smoother                DICGaussSeidel;
        nPreSweeps              0;
        nPostSweeps             2;
        nFinestSweeps           2;
        cacheAgglomeration      false;
        nCellsInCoarsestLevel   10;
        agglomerator            faceAreaPair;
        mergeLevels             2;
    }
    tolerance       1e-10;
    relTol          0;
}

Kind regards,
Francois.

October 20, 2011, 13:52 

#5 
Senior Member
Alberto Passalacqua
Join Date: Mar 2009
Location: Ames, Iowa, United States
Posts: 1,910
Rep Power: 27 
PCG methods scale better, even though they require more iterations, so the suggestion is correct.
The number of cells you are using (~120k/processor) should be fine too.

October 21, 2011, 06:35 

#6 
Member
Matthias Walter
Join Date: Mar 2009
Location: Rostock, Germany
Posts: 63
Rep Power: 9 
I tested all the settings posted here, but none of them seems to be the holy grail.
@Canesin, Alberto: using only PCG with DIC seems to work best. An iteration now takes ~60 s.
@kmooney: an iteration needs about 70 s.
@Fransje: with this setup an iteration takes about 200 s, much more than with PCG or GAMG alone.
Next, I will test what happens if I use 3600 cores (~98K cells per proc). Maybe the time per iteration will improve, or maybe the communication will eat up all the time savings.

October 21, 2011, 07:00 

#7 
Member
Fábio César Canesin
Join Date: Mar 2010
Location: Florianópolis
Posts: 67
Rep Power: 8 
Add this option to the controlDict:
commsType nonBlocking
Also try changing the PCG relative tolerance to 0.1 instead of 1e-02. But the most important thing is to look at the cluster infrastructure: how many cores are in one rack? How is the rack-to-rack communication? Maybe you need to create a local copy of the case on each node. Imagine you are sharing a folder on the head node of one rack, and another rack three levels up in the hierarchy needs to download the mesh: it will take more than the time to transfer the fields and mesh plus three routing times. There is an option in decomposeParDict that makes it possible to save the data locally.
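A minimal sketch of the two settings mentioned above, assuming a stock OpenFOAM installation. In the releases I am aware of, commsType sits under OptimisationSwitches in the global etc/controlDict rather than in the case's system/controlDict; whether a case-level entry is honoured depends on the version, so treat the location as something to verify. The scratch directory is a hypothetical placeholder, and the number of roots entries expected also varies between versions.

```
// $WM_PROJECT_DIR/etc/controlDict (global settings)
OptimisationSwitches
{
    // accepted values: blocking, scheduled, nonBlocking
    commsType       nonBlocking;
}

// system/decomposeParDict (case settings)
// Keep the processor directories on node-local disks instead of a shared folder.
distributed     yes;
roots
(
    "/local/scratch"    // hypothetical node-local root directory
);
```

With distributed data, each node reads and writes its own processor directories, so the decomposed case has to be copied to the local disks before the run and collected afterwards.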

October 21, 2011, 07:42 

#8 
Member
Matthias Walter
Join Date: Mar 2009
Location: Rostock, Germany
Posts: 63
Rep Power: 9 
I will try a relative tolerance of 0.1, but doesn't increasing the relative tolerance also impair the numerical solution?
The cluster has a parallel storage system connected by IB with two rails, so data transport should not be a problem. BTW, hopping from IB switch to IB switch takes only a few nanoseconds.
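One common way to reconcile the two concerns, shown as a sketch rather than a tested setup: loosen relTol only for the intermediate pressure solves and keep the last corrector tight through a separate pFinal entry, as in the PCG settings posted earlier in this thread. The value 0.1 is the one suggested above; everything else is copied from that earlier post. The OpenFOAM version is not stated in the thread, so support for pFinal is an assumption.

```
p
{
    solver          PCG;
    preconditioner  DIC;
    tolerance       1e-07;
    relTol          0.1;    // looser: applies to the intermediate corrector solves
}

pFinal
{
    solver          PCG;
    preconditioner  DIC;
    tolerance       1e-08;
    relTol          0;      // final solve still converges to the absolute tolerance
}
```

Because the final pressure solve of each time step still runs to the absolute tolerance, the looser intermediate setting mainly trades accuracy inside the PISO loop for fewer solver iterations. Whether that is acceptable for this case would have to be checked against a reference run.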

October 21, 2011, 08:24 

#9 
Senior Member
Anton Kidess
Join Date: May 2009
Location: Delft, Netherlands
Posts: 1,182
Rep Power: 21 
I think you're wasting a lot of time by using nCellsInCoarsestLevel 10 or 30. At that point you spend more time on interpolation and restriction than you gain from the coarser mesh. Increasing nCellsInCoarsestLevel to 1000 should improve the GAMG performance. Also, the more levels you have, the more you will have to communicate.
As Alberto already stated, though, PCG requires less communication, so with lots of processor boundaries you might never be able to achieve better performance with GAMG.
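As a sketch of this suggestion, the GAMG block posted earlier in the thread could be kept unchanged except for the coarsest-level size. This combination was not reported as tested in the thread, and 1000 is simply the value proposed above.

```
p GAMG
{
    agglomerator            faceAreaPair;
    nCellsInCoarsestLevel   1000;   // was 30: fewer multigrid levels, less inter-processor traffic
    cacheAgglomeration      true;
    directSolveCoarsest     false;
    nPreSweeps              1;
    nPostSweeps             2;
    nFinestSweeps           2;
    tolerance               1e-07;
    relTol                  0.0;
    smoother                GaussSeidel;
    mergeLevels             2;
    minIter                 0;
    maxIter                 10;
};
```

The block uses the older "p GAMG { ... };" entry style from the earlier post. Newer releases expect "solver GAMG;" inside a "p { ... }" dictionary, so the layout may need adapting to the installed version.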
