CFD Online Discussion Forums - Problem with parallelization on cluster

https://www.cfd-online.com/Forums/openfoam-solving/157643-problem-parallelization-cluster.html

GiuMan

August 4, 2015 05:00

Problem with parallelization on cluster

Hi guys,

I have a serious problem with my parallel run: I have a cluster with 5 blades and 12 CPUs per blade.
When I use only 1 blade (12 CPUs), simpleFoam solves 1 iteration in about 30 seconds; using 2 blades (24 CPUs) it takes 10 seconds, but when I use 3 or 4 blades it takes 60-90 seconds for a single iteration.

How is this possible? Has anyone run into the same problem?

Thanks

alexeym

August 4, 2015 16:41

Hi,

I guess everybody will need more details like...

1. Size of the problem (i.e. number of cells in the mesh).
2. Type of interconnect in your cluster.

Since the problem appears when the number of subdomains reaches 36 or more, maybe you are losing time waiting for processes to exchange data.

GiuMan

August 5, 2015 05:01

Thanks for the answer.

The problem is a test case that I've built with about 12M cells.
Our cluster doesn't have InfiniBand as interconnect, but tests with other applications (CFX, Radioss, and others) don't show this problem during parallel runs.

alexeym

August 5, 2015 05:13

Hi,

Well, at this point there will be even more technical questions:

0. Is this slowdown of the solution process reproducible?
1. What solver do you use?
2. What decomposition method do you use?
3. What linear solver do you use?
4. Does the convergence of the linear solvers depend on the number of blades used?

GiuMan

August 5, 2015 05:23

2 Attachment(s)

Thanks for your time,
about your questions:

0 - I've tested the cluster with several test cases, and every time I use 3 or more blades I have the same problem
1 - simpleFoam
2 - Hierarchical with different coefficients; for example, with 48 CPUs I've used 4/4/3 or 48/1/1, obtaining different calculation times
3 - See the attached files
4 - I don't have information about convergence yet, because I'm testing with 30-40 steps just to understand the calculation time

alexeym

August 5, 2015 05:33

Hi,

2. Could you visualize the decomposition? Also try scotch decomposition instead of hierarchical.

3. Try changing the linear solver for pressure to PCG (there were reports of poor GAMG performance in parallel runs). I.e. in fvSolution, instead of

Code:

    p
    {
        solver           GAMG;
        tolerance        1e-7;
        relTol           0.01;
        smoother         GaussSeidel;
        nPreSweeps       0;
        nPostSweeps      2;
        cacheAgglomeration on;
        agglomerator     faceAreaPair;
        nCellsInCoarsestLevel 10;
        mergeLevels      1;
    }

put

Code:

    p
    {
        solver           PCG;
        preconditioner   DIC;
        tolerance        1e-7;
        relTol           0.01;
    }

4. In fact, I am talking about the convergence of the linear solvers, not of the PDE solver ;) In your log file you have something like:

Code:

GAMG: Solving for p, Initial residual = 0.330664, Final residual = 0.0304151, No Iterations 4

Does the number of iterations (for the same time steps) depend on the number of blades used?

GiuMan

August 5, 2015 06:48

1 Attachment(s)

Hi,

2 - I enclose one decomposition output. Do you have any suggestions about scotch decomposition?
3 - OK, I will try this solver now
4 - The number of iterations is the same using 1 or 2 blades; it changes using 3 blades

alexeym

August 5, 2015 08:26

Hi,

The decomposition you attached is for the two-blade case, yet the strange behavior begins with 3 blades. I have nothing particular to suggest about Scotch decomposition; just set the number of subdomains ;)
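
As a rough sketch, system/decomposeParDict for Scotch can be as short as the following. The value 36 is only an assumption here (3 blades with 12 cores each); set it to the number of cores you actually run on. The FoamFile header is the usual one.

Code:

    FoamFile
    {
        version     2.0;
        format      ascii;
        class       dictionary;
        location    "system";
        object      decomposeParDict;
    }

    // Assumed value: 3 blades x 12 cores; must match the -np given to mpirun
    numberOfSubdomains 36;

    // Scotch needs no geometric coefficients
    method          scotch;

After changing it, re-run decomposePar and compare the number of processor boundary faces it reports with the hierarchical run.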

GAMG behavior/performance also depends on the number of cells in each subdomain and on the nCellsInCoarsestLevel setting.

For further diagnostics you can attach the solver (simpleFoam) output for the first 10 time steps with decomposition into 24 subdomains (2 blades) and 36 subdomains (3 blades).

GiuMan

August 5, 2015 08:28

Perfect,

now I'll prepare these 2 tests with 10 steps each and attach them to this post!

GiuMan

August 5, 2015 09:14

4 Attachment(s)

So,

I've done several tests. I've attached the results here (simpleFoam and decomposePar logs).
I've used the new p solver and scotch decomposition.

Now it looks like OpenFOAM scales well:
- 12 CPU: 1 step in about 50s
- 24 CPU: 1 step in about 30s
- 48 CPU: 1 step in about 15s

But if I compare the new 12 and 24 CPU results with the old ones, they are slower (I've also attached the old results):
12 CPU
- Now: 50s
- Before: 30s

24 CPU
- Now: 30s
- Before: 12s

But maybe the geometry is not big enough to scale properly? If I use a bigger geometry (our real geometries have about 30-50M cells), will I get better results?

Or maybe the lack of InfiniBand causes this?

Thanks

alexeym

August 5, 2015 11:22

Well,

This time you spend more time solving the pressure equation:

Old test:

Code:

GAMG:  Solving for p, Initial residual = 0.450677, Final residual = 0.00449642, No Iterations 10

New test:

Code:

DICPCG:  Solving for p, Initial residual = 0.246676, Final residual = 0.00245312, No Iterations 245

That is why the overall simulation is slower.

Could you run a test with the GAMG linear solver (i.e. take fvSolution from the old test) + Scotch decomposition on 36/48 cores and post the log files? Just to see what happens with GAMG when you go from 24 to a higher number of subdomains.

In general, the PCG linear solver requires more iterations to converge than GAMG. So increasing the number of cells will, well, help create the feeling that we are scaling better, since the time spent in calculation (relative to communication) grows as you increase the number of cells while keeping the number of subdomains constant.

I would blame the hierarchical decomposition method here. With that method too many processor boundary faces were created, and since data exchange between processors is quite expensive (it is Ethernet), overall performance was poor. I have compared the output from decomposePar, and the Scotch method seems to create far fewer processor boundary faces.
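
For reference, the hierarchical setup in question would look roughly like this in decomposeParDict. This is only a sketch: the 4/4/3 split for 48 subdomains is the one mentioned earlier in the thread, while delta and order are typical tutorial values and are assumptions here, since I have not seen the actual dictionary.

Code:

    numberOfSubdomains 48;

    method          hierarchical;

    hierarchicalCoeffs
    {
        // 4 x 4 x 3 = 48 pieces along x, y, z; the product must equal numberOfSubdomains
        n               (4 4 3);
        // Assumed tutorial defaults, not taken from the attached case
        delta           0.001;
        order           xyz;
    }

The split follows the coordinate directions only, so on a complex geometry the cuts can produce large processor patches.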

GiuMan

August 6, 2015 05:08

Morning,

I've done the test with 36/48 CPUs using GAMG, and now the computation seems to be very fast and it scales very well.
Using 48 CPUs it takes 72s for 10 steps.

The next step is to evaluate convergence time: I'm thinking of doing 2 different tests with 5000 steps, using the 2 different pressure solvers, to evaluate the minimum number of steps needed to obtain good convergence. Do you think this is a good way to proceed?

Thanks

alexeym

August 14, 2015 05:11

Hi,

Your plan is reasonable, yet I would suggest:

1. Use a convergence criterion instead of a fixed number of iterations (see the residualControl section in the fvSolution dictionary of the airFoil2D tutorial case, for example).

2. Use setups closer to the real problems that will be solved in the future. Obviously, if you spend more time in calculations (i.e. instead of just the momentum equation, you also solve temperature/mass transfer/reaction kinetics), you scale better and better ;)
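
For point 1, a sketch of what a residualControl entry can look like in the SIMPLE section of fvSolution. The thresholds and the turbulence field names below are placeholders in the style of the tutorials, not values tuned for your case; list the fields your setup actually solves.

Code:

    SIMPLE
    {
        nNonOrthogonalCorrectors 0;

        // The run stops once the initial residuals of all listed fields
        // drop below these thresholds (placeholder values)
        residualControl
        {
            p               1e-5;
            U               1e-5;
            // Assumed turbulence fields; adapt to the model in use
            "(k|epsilon|omega)" 1e-5;
        }
    }

With this in place you can set endTime generously in controlDict and compare the two pressure solvers by the number of iterations and wall time needed to reach the same criterion.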

All times are GMT -4.