
August 4, 2015, 05:00 
Problem with parallelization on cluster

#1 
New Member
Join Date: Sep 2014
Posts: 10
Rep Power: 3 
Hi guys,
I have a serious problem with my parallel run. I have a cluster with 5 blades and 12 CPUs per blade. When I use only 1 blade (12 CPUs), simpleFoam solves 1 iteration in about 30 seconds; with 2 blades (24 CPUs) it takes 10 seconds. But when I use 3 or 4 blades, it takes 60-90 seconds for a single iteration. How is this possible? Has anyone seen the same problem? Thanks

August 4, 2015, 16:41 

#2 
Senior Member
Alexey Matveichev
Join Date: Aug 2011
Location: Nancy, France
Posts: 1,237
Rep Power: 21 
Hi,
I guess everybody will need more details, such as:
1. Size of the problem (i.e. number of cells in the mesh).
2. Type of interconnect in your cluster.
Since the problem appears when the number of subdomains reaches 36 or more, maybe you are losing time waiting for processes to exchange data.

August 5, 2015, 05:01 

#3 
New Member
Join Date: Sep 2014
Posts: 10
Rep Power: 3 
Thanks for the answer.
The problem is a test case that I've built with about 12M cells. Our cluster doesn't have InfiniBand for the interconnect, but tests with other applications (CFX, Radioss, and others) don't show this problem during parallel runs.

August 5, 2015, 05:13 

#4 
Senior Member
Alexey Matveichev
Join Date: Aug 2011
Location: Nancy, France
Posts: 1,237
Rep Power: 21 
Hi,
Well, at this point there will be even more technical questions:
0. Is this slowdown of the solution process reproducible?
1. What solver do you use?
2. What decomposition method do you use?
3. What linear solvers do you use?
4. Does convergence of the linear solvers depend on the number of blades used?

August 5, 2015, 05:23 

#5 
New Member
Join Date: Sep 2014
Posts: 10
Rep Power: 3 
Thanks for your time.
About your questions:
0 - I've tested the cluster with several test cases, and every time I use 3 or more blades I have the same problem.
1 - simpleFoam
2 - Hierarchical with different coefficients; for example, with 48 CPUs I've used 4/4/3 or 48/1/1, obtaining different calculation times.
3 - Look at the attached files.
4 - I don't have information about convergence yet, because I'm testing with 30-40 steps, just to understand the calculation time.
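
For readers unfamiliar with the notation, the 4/4/3 and 48/1/1 splits above are the number of pieces along x, y and z in the hierarchical method. A minimal sketch of the corresponding system/decomposeParDict entries follows (FoamFile header omitted); the `delta` and `order` values are assumptions based on typical tutorial settings, not taken from the attached files.

```
numberOfSubdomains 48;

method          hierarchical;

hierarchicalCoeffs
{
    n               (4 4 3);    // 4 x 4 x 3 = 48 pieces; (48 1 1) would cut along x only
    delta           0.001;      // assumed, typical tutorial value
    order           xyz;        // assumed split order
}
```

The product of the three `n` values has to equal numberOfSubdomains.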

August 5, 2015, 05:33 

#6 
Senior Member
Alexey Matveichev
Join Date: Aug 2011
Location: Nancy, France
Posts: 1,237
Rep Power: 21 
Hi,
2. Could you visualize the decomposition? Also try scotch decomposition instead of hierarchical.
3. Try changing the linear solver for pressure to PCG (there were reports of poor GAMG performance in parallel). I.e. in fvSolution, instead of
Code:
p
{
    solver                  GAMG;
    tolerance               1e-7;
    relTol                  0.01;
    smoother                GaussSeidel;
    nPreSweeps              0;
    nPostSweeps             2;
    cacheAgglomeration      on;
    agglomerator            faceAreaPair;
    nCellsInCoarsestLevel   10;
    mergeLevels             1;
}
use
Code:
p
{
    solver          PCG;
    preconditioner  DIC;
    tolerance       1e-7;
    relTol          0.01;
}
Code:
GAMG: Solving for p, Initial residual = 0.330664, Final residual = 0.0304151, No Iterations 4

August 5, 2015, 06:48 

#7 
New Member
Join Date: Sep 2014
Posts: 10
Rep Power: 3 
Hi,
2 - I enclose one decomposition output. Do you have any suggestions about scotch decomposition?
3 - OK, I will try this solver now.
4 - The number of time steps is the same with 1 or 2 blades; it changes with 3 blades.

August 5, 2015, 08:26 

#8 
Senior Member
Alexey Matveichev
Join Date: Aug 2011
Location: Nancy, France
Posts: 1,237
Rep Power: 21 
Hi,
The decomposition is for the case of two blades, yet the strange behavior begins with 3 blades. I have nothing particular to suggest about Scotch decomposition; just set the number of subdomains.
GAMG behavior/performance also depends on the number of cells in each subdomain and on the nCellsInCoarsestLevel setting.
For further diagnostics, you can attach the solver (simpleFoam) output for the first 10 time steps with decomposition into 24 subdomains (2 blades) and 36 subdomains (3 blades).
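
A minimal sketch of a system/decomposeParDict for the scotch method, assuming the 36-subdomain (3 blades) test and omitting the FoamFile header. No direction coefficients are required; only the subdomain count changes between the two runs.

```
numberOfSubdomains 36;      // 24 for the 2-blade run

method          scotch;
```

After editing the dictionary, the case has to be decomposed again with decomposePar before the parallel run is started.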

August 5, 2015, 08:28 

#9 
New Member
Join Date: Sep 2014
Posts: 10
Rep Power: 3 
Perfect,
now I'll prepare these 2 tests with 10 steps and attach them to this post!

August 5, 2015, 09:14 

#10 
New Member
Join Date: Sep 2014
Posts: 10
Rep Power: 3 
So,
I've done several tests. I've attached the results here (simpleFoam and decomposePar logs). I've used the new p solver and scotch decomposition. Now it looks like OF scales well:
- 12 CPU: 1 step takes about 50s
- 24 CPU: 1 step takes about 30s
- 48 CPU: 1 step takes about 15s
But if I compare the new 12 and 24 CPU results with the old ones, they are slower (I've also attached the old results):
- 12 CPU: now 50s, before 30s
- 24 CPU: now 30s, before 12s
But maybe the geometry is not big enough to scale properly? If I use a bigger geometry (our real geometries have about 30-50M cells), will I get better results? Or maybe the lack of InfiniBand is responsible? Thanks

August 5, 2015, 11:22 

#11 
Senior Member
Alexey Matveichev
Join Date: Aug 2011
Location: Nancy, France
Posts: 1,237
Rep Power: 21 
Well,
This time you spend more time solving the pressure equation.
Old test:
Code:
GAMG: Solving for p, Initial residual = 0.450677, Final residual = 0.00449642, No Iterations 10
New test:
Code:
DICPCG: Solving for p, Initial residual = 0.246676, Final residual = 0.00245312, No Iterations 245
Could you run a test with the GAMG linear solver (i.e. take fvSolution from the old test) + Scotch decomposition on 36/48 cores and post the log files? Just to see what happens with GAMG when you go from 24 to a higher number of subdomains.
In general, the PCG linear solver requires more iterations to converge than GAMG. So increasing the number of cells, well, it will help create the feeling that we are scaling better, as the time spent in calculation increases when you increase the number of cells and keep the number of subdomains constant.
I would blame the hierarchical decomposition method here. With this method too many processor boundary faces were created, and since data exchange between processors is quite expensive (it is Ethernet), overall performance was poor. I have compared the output from decomposePar, and the Scotch method seems to create far fewer processor boundary faces.

August 6, 2015, 05:08 

#12 
New Member
Join Date: Sep 2014
Posts: 10
Rep Power: 3 
Morning,
I've done the test with 36/48 CPUs using GAMG, and now the computation seems to be very fast and it scales very well. Using 48 CPUs it takes 72s for 10 steps. The next step is to evaluate convergence time: I'm thinking of running 2 different tests with 5000 steps, using the 2 different pressure solvers, to find the minimum number of steps needed for good convergence. Do you think this is a good way to proceed? Thanks

August 14, 2015, 05:11 

#13 
Senior Member
Alexey Matveichev
Join Date: Aug 2011
Location: Nancy, France
Posts: 1,237
Rep Power: 21 
Hi,
Your plan is reasonable, yet I would suggest:
1. Use a convergence criterion instead of a fixed number of iterations (see, for example, the residualControl section of the fvSolution dictionary in the airFoil2D tutorial case).
2. Use setups closer to the real problems that will be solved in the future. Obviously, if you spend more time in calculations (i.e. instead of just the momentum equation, you also solve temperature/mass transfer/reaction kinetics), you scale better and better.
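
As a rough sketch of point 1, a residualControl block in the SIMPLE section of fvSolution could look like the following. The field list and the thresholds are placeholders (assumptions), to be adapted to the fields the case actually solves.

```
SIMPLE
{
    nNonOrthogonalCorrectors 0;

    residualControl
    {
        p               1e-5;   // assumed threshold on the initial residual of p
        U               1e-5;   // assumed threshold on the initial residual of U
    }
}
```

With such a block, simpleFoam stops once the initial residuals of all listed fields fall below their thresholds, or at endTime if that comes first, so both pressure solvers can be compared on the number of steps they actually need.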

Tags 
clusters, decomposepar, openfoam 2.2.x, parallel computing 