
Performance of GGI case in parallel

Post #1 - May 13, 2009, 09:17
Performance of GGI case in parallel
Hannes Kröger (Senior Member, Rostock, Germany)
Hello everyone,

I just tried to run a case using interDyMFoam in parallel. The case consists of a non-moving outer mesh and an inner cylindrical mesh that is rotating (with a surface-piercing propeller in it). I use GGI to connect both meshes. The inner mesh is polyhedral and the outer one hexahedral.
The entire case consists of approximately 1 million cells (most of them in the inner mesh).

I have run this case in parallel on different numbers of processors on an SMP machine with 8 quad-core Opteron processors (decomposition method metis):

#Proc   time per timestep   speedup
 1      360 s               1
 4      155 s               2.3
 8      146 s               2.4
16      130 s               2.7

So the speedup doesn't even reach 3. A similar case where the whole domain is rotating and the mesh consists only of polyhedra shows a linear speedup up to 8 processors and a decreasing parallel efficiency beyond that.

I wonder whether this has to do with the GGI interface. I tried to stitch it and repeat the test, but unfortunately stitchMesh failed. Does anyone have an idea how to improve the parallel efficiency?
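For reference, the parallel runs were launched in the usual way; roughly like this (a sketch only, the exact mpirun options depend on your MPI installation):

# decompose the case (decomposeParDict uses method metis)
decomposePar
# run the solver in parallel, here on 8 processors as an example
mpirun -np 8 interDyMFoam -parallel > log.interDyMFoam 2>&1
# reconstruct the decomposed results afterwards
reconstructPar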

Best regards, Hannes

PS: Despite the poor parallel efficiency, the case itself seems to run fine. Typical output:

Courant Number mean: 0.0005296648 max: 29.60281 velocity magnitude: 56.38879
GGI pair (slider, inside_slider) : 1.694001 1.692101 Diff = 1.71989e-05 or 0.001015283 %
Time = 0.004368
Execution time for mesh.update() = 5.04 s
Evaluation of GGI weighting factors:
Largest slave weighting factor correction : 0.05648662 average: 0.001119383
Largest master weighting factor correction: 0.02077904 average: 0.0006102291
--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
time step continuity errors : sum local = 2.139033e-14, global = -1.99808e-16, cumulative = -1.372663e-05
PCG: Solving for pcorr, Initial residual = 1, Final residual = 0.000847008, No Iterations 17
PCG: Solving for pcorr, Initial residual = 0.08500426, Final residual = 0.0003398914, No Iterations 4
time step continuity errors : sum local = 2.889153e-17, global = -1.458812e-19, cumulative = -1.372663e-05
MULES: Solving for gamma
Liquid phase volume fraction = 0.6001454 Min(gamma) = 0 Max(gamma) = 1
MULES: Solving for gamma
Liquid phase volume fraction = 0.6001454 Min(gamma) = 0 Max(gamma) = 1
MULES: Solving for gamma
Liquid phase volume fraction = 0.6001454 Min(gamma) = 0 Max(gamma) = 1
MULES: Solving for gamma
Liquid phase volume fraction = 0.6001454 Min(gamma) = 0 Max(gamma) = 1
MULES: Solving for gamma
Liquid phase volume fraction = 0.6001454 Min(gamma) = 0 Max(gamma) = 1
MULES: Solving for gamma
Liquid phase volume fraction = 0.6001454 Min(gamma) = 0 Max(gamma) = 1
MULES: Solving for gamma
Liquid phase volume fraction = 0.6001454 Min(gamma) = 0 Max(gamma) = 1
MULES: Solving for gamma
Liquid phase volume fraction = 0.6001454 Min(gamma) = 0 Max(gamma) = 1
smoothSolver: Solving for Ux, Initial residual = 7.74393e-06, Final residual = 1.011977e-07, No Iterations 1
smoothSolver: Solving for Uy, Initial residual = 3.776374e-06, Final residual = 4.505313e-08, No Iterations 1
smoothSolver: Solving for Uz, Initial residual = 1.477087e-05, Final residual = 1.429332e-07, No Iterations 1
GAMG: Solving for pd, Initial residual = 4.955159e-05, Final residual = 9.252356e-07, No Iterations 2
GAMG: Solving for pd, Initial residual = 6.533451e-06, Final residual = 2.441995e-07, No Iterations 2
time step continuity errors : sum local = 1.379978e-12, global = 6.242956e-15, cumulative = -1.372663e-05
GAMG: Solving for pd, Initial residual = 3.37058e-06, Final residual = 1.664205e-07, No Iterations 2
PCG: Solving for pd, Initial residual = 1.65434e-06, Final residual = 4.308411e-09, No Iterations 5
time step continuity errors : sum local = 2.435524e-14, global = 7.917505e-17, cumulative = -1.372663e-05
ExecutionTime = 81834.14 s ClockTime = 81851 s

Post #2 - May 27, 2009, 08:23
Solved
Hannes Kröger (Senior Member, Rostock, Germany)
I just updated to SVN revision 1266 because of the performance updates for GGI.
It helped. The above case now scales linearly up to 8 processors. Thanks for this, Hrv.
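In case anyone wants to do the same, the update itself was roughly the following (a sketch; the exact path depends on where your 1.5-dev sources live):

# inside the OpenFOAM-1.5-dev source tree
cd $WM_PROJECT_DIR
# pull the revision with the GGI performance updates
svn update -r 1266
# recompile
./Allwmake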

Best regards, Hannes

Post #3 - May 31, 2009, 06:37
Hai Yu (Member, Harbin)
My experience also shows that 8 cores gives the best speed, while it slows down at 16 cores.
I suspect this is because each of my machines has 8 cores, so a 16-core run has to communicate between two machines, which is much less efficient.

Post #4 - September 9, 2009, 10:28
BastiL (Senior Member)
I have encountered similar problems with GGI performance in parallel, and I still get no speedup beyond 8 cores. This is a serious limitation for me, since my large cases typically run on 32 cores and contain lots of interfaces; running them on 8 cores is painfully slow. Hrv: are there plans to further improve the parallel performance of GGI?

Regards

BastiL

Post #5 - October 28, 2009, 15:05
GGI in parallel
Dnyanesh Digraskar (New Member, Amherst, MA, United States)
Hello All,

I am running turbDyMFoam with GGI on a full wind turbine, so the mesh is large (~4 million cells). I am having problems running in parallel on 32 processors: it runs very slowly and eventually one of the processes dies.

The same job runs perfectly fine in serial, but this case would take a very long time to finish that way.

I would be very thankful if you could shed some light on improving the GGI parallel performance.

Thank you

--
Dnyanesh

Post #6 - October 31, 2009, 13:27
Martin Beaudoin (Senior Member)
Hello Dnyanesh,

I need a bit more information in order to try to help you out.

  1. Which svn version of 1.5-dev are you running? Please make sure you are running with the latest svn release in order to get the best GGI implementation available.
  2. Your problem might be hardware related. Can you provide some speed-up numbers you achieved with your hardware while running OpenFOAM simulations on some non-GGI test cases?
  3. Could you provide a bit more information about your hardware setup? Mostly about the interconnect between the computing nodes, and the memory available on the nodes?
  4. Are you using something like VMware machines for your simulation? I have seen that one before...
  5. I would like to see the following piece of information from your case:
    • file constant/polyMesh/boundary
    • file system/controlDict
    • file system/decomposeParDict
    • a log file of your failed parallel run
    • the exact command you are using to start your failed parallel run
    • Are you using MPI? If so, which flavor and which version?
Please note that this is the kind of information you can provide up front with your question; it really helps other people help you out quickly.

Regards,

Martin


Post #7 - October 31, 2009, 14:56
Dnyanesh Digraskar (New Member, Amherst, MA, United States)
Hello Mr. Beaudoin,

Thank you for your reply. I realize I should have given all the details up front. I am sorry for that; it won't happen again.

1. I was earlier using the latest SVN version, but since I was facing the parallel problems, I read more and thought I should follow the ERCOFTAC page, so I reverted to revision 1238. I will upgrade to the latest one now.

2. I am quite sure that it is not a hardware-related issue. I am running my cases on our college supercomputer cluster, and I have successfully run MRFSimpleFoam on 32 cores for quite a long time, with linear speed-up up to 32 cores.
Some numbers from my case:
150K cells / processor - 12 procs - 8 s / time step
110K cells / processor - 24 procs - 6 s / time step
about 75K cells / processor - 32 procs - 2.5 s / time step

Beyond that, the time per step stays almost constant and eventually starts to increase.

3. The cluster has 72 nodes with 8 processors per node. Each node has 4 GB of RAM. The interconnect between the nodes is Gigabit Ethernet.

4. I am not using a VMWare machine.

5. The required files are copied below. Answers to some more questions:
a. MPI version is mpich2-1.0.8
b. command used : mpiexec-pbs turbDyMFoam -parallel > outfile
c. boundary file:
FoamFile
{
    version     2.0;
    format      ascii;
    class       polyBoundaryMesh;
    location    "constant/polyMesh";
    object      boundary;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

11
(
    outerSliderOutlet
    {
        type            ggi;
        nFaces          25814;
        startFace       5565326;
        shadowPatch     innerSliderOutlet;
        zone            outerSliderOutlet_zone;
        bridgeOverlap   false;
    }
    outerSliderWall
    {
        type            ggi;
        nFaces          43870;
        startFace       5591140;
        shadowPatch     innerSliderWall;
        zone            outerSliderWall_zone;
        bridgeOverlap   false;
    }
    outerSliderInlet
    {
        type            ggi;
        nFaces          25814;
        startFace       5635010;
        shadowPatch     innerSliderInlet;
        zone            outerSliderInlet_zone;
        bridgeOverlap   false;
    }
    innerSliderOutlet
    {
        type            ggi;
        nFaces          18596;
        startFace       5660824;
        shadowPatch     outerSliderOutlet;
        zone            innerSliderOutlet_zone;
        bridgeOverlap   false;
    }
    innerSliderInlet
    {
        type            ggi;
        nFaces          1148;
        startFace       5679420;
        shadowPatch     outerSliderInlet;
        zone            innerSliderInlet_zone;
        bridgeOverlap   false;
    }
    innerSliderWall
    {
        type            ggi;
        nFaces          5424;
        startFace       5680568;
        shadowPatch     outerSliderWall;
        zone            innerSliderWall_zone;
        bridgeOverlap   false;
    }
    tower_plate
    {
        type            wall;
        nFaces          13024;
        startFace       5685992;
    }
    rotor
    {
        type            wall;
        nFaces          19180;
        startFace       5699016;
    }
    outlet
    {
        type            patch;
        nFaces          2052;
        startFace       5718196;
    }
    outer_wall
    {
        type            wall;
        nFaces          10390;
        startFace       5720248;
    }
    inlet
    {
        type            patch;
        nFaces          2052;
        startFace       5730638;
    }
)

d. CONTROLDICT:

FoamFile
{
version 2.0;
format ascii;
class dictionary;
object controlDict;
}

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

applicationClass icoTopoFoam;

startFrom startTime;

startTime 0.108562;

stopAt endTime;

endTime 5;

deltaT 0.05;

writeControl timeStep;

writeInterval 20;

cycleWrite 0;

writeFormat ascii;

writePrecision 6;

writeCompression uncompressed;

timeFormat general;

timePrecision 6;

runTimeModifiable yes;

adjustTimeStep yes;
maxCo 1;

maxDeltaT 1.0;

functions
(
ggiCheck
{
// Type of functionObject
type ggiCheck;

phi phi;

// Where to load it from (if not already in solver)
functionObjectLibs ("libsampling.so");
}
);

e. decomposeParDict:

FoamFile
{
version 2.0;
format ascii;

root "";
case "";
instance "";
local "";

class dictionary;
object decomposeParDict;
}

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //


numberOfSubdomains 8;

// Patches or Face Zones which need to face both cells on the same CPU
//preservePatches (innerSliderInlet outerSliderInlet innerSliderWall outerSliderWall outerSliderOutlet innerSliderOutlet);
//preserveFaceZones (innerSliderInlet_zone outerSliderInlet_zone innerSliderWall_zone outerSliderWall_zone outerSliderOutlet_zone innerSliderOutlet_zone);


// Face zones which need to be present on all CPUs in its entirety
globalFaceZones (innerSliderInlet_zone outerSliderInlet_zone innerSliderWall_zone outerSliderWall_zone outerSliderOutlet_zone innerSliderOutlet_zone);

method metis;


simpleCoeffs
{
n (4 2 1);
delta 0.001;
}

hierarchicalCoeffs
{
n (1 1 1);
delta 0.001;
order xyz;
}
metisCoeffs
{
processorWeights
(
1
1
1
1
1
1
1
1
);
}
manualCoeffs
{
dataFile "cellDecomposition";
}

distributed no;

roots
(
);

f. dynamicMeshDict:

dynamicFvMeshLib    "libtopoChangerFvMesh.so";
dynamicFvMesh       mixerGgiFvMesh;

mixerGgiFvMeshCoeffs
{
    coordinateSystem
    {
        type        cylindrical;
        origin      (0 0 0);
        axis        (0 0 1);
        direction   (0 1 0);
    }

    rpm     -72;

    slider
    {
        moving ( innerSliderInlet innerSliderWall innerSliderOutlet );
        static ( outerSliderInlet outerSliderWall outerSliderOutlet );
    }
}

NOTE: I have a doubt. My rotating zone has the finest mesh, so the GGI patches have the largest number of faces. When I use globalFaceZones in decomposeParDict, does it copy all the GGI faces to all processors? If so, it would run really slowly, because interpolating between ~100K faces and communicating the data will take time. Please forgive me if what I am thinking is wrong.

Thank you very much for your help. I am grateful.
Sincerely,

--
Dnyanesh Digraskar


Post #8 - October 31, 2009, 22:13
Martin Beaudoin (Senior Member)
Hello Dnyanesh,

Thank you for the information, this is much more helpful.

From the information present in your boundary file, I can see that your GGI interfaces are indeed composed of large sets of facets.

With the current implementation of the GGI, this will have an impact, because the GGI faceZones are shared across all the processors and the communication will take its toll.
Also, one internal algorithm of the GGI is a bit slow when using very large numbers of facets for the GGI patches (my bad here, but I am working on it...).

But not to the point where a simulation should crash and burn the way you are describing.

So another important piece of information I need is your simulation log file; not the PBS log file, but the log messages generated by turbDyMFoam during your 32-processor parallel run.

This file is probably too large to post on the Forum in full, so I would like to see at least the very early log messages, from line #1 (the turbDyMFoam splash header) down to, let's say, the 10th simulation time step.

I also need to see the log for the last 10 time steps, just before your application crashed.

As a side note: as I mentioned, I am currently working on some improvements to the GGI in order to speed up the code when using GGI patches with a large number of facets (100K and more).

My research group needs to run large GGI cases like that, so getting this nailed down asap is a priority for me. We will contribute our modifications to Hrv's dev version, so you will have access to the improvements as well.

Regards,

Martin



Post #9 - November 1, 2009, 14:42
Dnyanesh Digraskar (New Member, Amherst, MA, United States)
Dear Mr. Beaudoin,

Sorry for the somewhat late reply. The turbDyMFoam output is attached below. The code does not crash because of the solver settings; it just hangs at some step during the calculation and finally dies with an MPI error.

After carefully looking at the output of each time step, I have observed that the most time-consuming part of the solution is the GGI interpolation step. That is where the solver takes about 2-3 minutes before posting any output.

Following is the turbDyMFoam output:

/*---------------------------------------------------------------------------*\
| ========= | |
| \\ / F ield | OpenFOAM: The Open Source CFD Toolbox |
| \\ / O peration | Version: 1.5-dev |
| \\ / A nd | Revision: 1388 |
| \\/ M anipulation | Web: http://www.OpenFOAM.org |
\*---------------------------------------------------------------------------*/
Exec : turbDyMFoam -parallel
Date : Nov 01 2009
Time : 14:13:26
Host : node76
PID : 26246
Case : /home/ddigrask/OpenFOAM/ddigrask-1.5-dev/run/fall2009/ggi/turbineGgi_bigMesh
nProcs : 32
Slaves :
31
(
node76.26247
node76.26248
node76.26249
node76.26250
node76.26251
node76.26252
node76.26253
node23.22676
node23.22677
node23.22678
node23.22679
node23.22680
node23.22681
node23.22682
node23.22683
node42.22800
node42.22801
node42.22802
node42.22803
node42.22804
node42.22805
node42.22806
node42.22807
node31.31933
node31.31934
node31.31935
node31.31936
node31.31937
node31.31938
node31.31939
node31.31940
)
Pstream initialized with:
floatTransfer : 0
nProcsSimpleSum : 0
commsType : nonBlocking

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time

Create dynamic mesh for time = 0

Selecting dynamicFvMesh mixerGgiFvMesh
void mixerGgiFvMesh::addZonesAndModifiers() : Zones and modifiers already present. Skipping.
Mixer mesh:
origin: (0 0 0)
axis : (0 0 1)
rpm : -72
Reading field p

Reading field U

Reading/calculating face flux field phi

Initializing the GGI interpolator between master/shadow patches: outerSliderOutlet/innerSliderOutlet
Evaluation of GGI weighting factors:
Largest slave weighting factor correction : 0.00102394 average: 1.66438e-06
Largest master weighting factor correction: 0.0922472 average: 0.000376558

--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
Initializing the GGI interpolator between master/shadow patches: outerSliderWall/innerSliderWall
Evaluation of GGI weighting factors:
Largest slave weighting factor correction : 0.00104841 average: 0.000148099
Largest master weighting factor correction: 2.79095e-06 average: 4.34772e-09

--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
Initializing the GGI interpolator between master/shadow patches: outerSliderInlet/innerSliderInlet
Evaluation of GGI weighting factors:
Largest slave weighting factor correction : 1.51821e-05 average: 3.4379e-07
Largest master weighting factor correction: 0.176367 average: 0.00114207

--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
Selecting incompressible transport model Newtonian
Selecting RAS turbulence model SpalartAllmaras
Reading field rAU if present


Starting time loop

Courant Number mean: 0.00864601 max: 1.0153 velocity magnitude: 10
deltaT = 0.000492466
--> FOAM Warning :
From function dlLibraryTable:pen(const dictionary& dict, const word& libsEntry, const TablePtr tablePtr)
in file lnInclude/dlLibraryTableTemplates.C at line 68
library "libsampling.so" did not introduce any new entries

Creating ggi check
Time = 0.000492466

Initializing the GGI interpolator between master/shadow patches: outerSliderOutlet/innerSliderOutlet
Evaluation of GGI weighting factors:
Largest slave weighting factor correction : 0.000966753 average: 1.68023e-06
Largest master weighting factor correction: 0.0926134 average: 0.000376611

--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
Initializing the GGI interpolator between master/shadow patches: outerSliderWall/innerSliderWall
Evaluation of GGI weighting factors:
Largest slave weighting factor correction : 0.00104722 average: 0.000148213
Largest master weighting factor correction: 2.90604e-06 average: 4.45196e-09

--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
Initializing the GGI interpolator between master/shadow patches: outerSliderInlet/innerSliderInlet
Evaluation of GGI weighting factors:
Largest slave weighting factor correction : 1.56139e-05 average: 3.5192e-07
Largest master weighting factor correction: 0.179572 average: 0.0011419

--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
PBiCG: Solving for Ux, Initial residual = 1, Final residual = 1.15144e-06, No Iterations 9
PBiCG: Solving for Uy, Initial residual = 1, Final residual = 1.69491e-06, No Iterations 9
PBiCG: Solving for Uz, Initial residual = 1, Final residual = 5.0914e-06, No Iterations 7
GAMG: Solving for p, Initial residual = 1, Final residual = 0.0214318, No Iterations 6
time step continuity errors : sum local = 1.50927e-07, global = -1.79063e-08, cumulative = -1.79063e-08
GAMG: Solving for p, Initial residual = 0.284648, Final residual = 0.00377362, No Iterations 2
time step continuity errors : sum local = 5.39845e-07, global = 3.49001e-09, cumulative = -1.44163e-08
(END)


This is not even one complete time step; it is all the code has run. After this, the code quits with an MPI error:

mpiexec-pbs: Warning: tasks 0-29,31 died with signal 15 (Terminated).
mpiexec-pbs: Warning: task 30 died with signal 9 (Killed).

Thank you again for your help.

I am also trying to run the same case with just 2 GGI patches (instead of 6), i.e. ggiInside and ggiOutside, but even this does not make it run any faster.

Sincerely,

--
Dnyanesh Digraskar


Post #10 - November 1, 2009, 16:57
Martin Beaudoin (Senior Member)
Hello,

Some comments:

1: It would be useful to see a stack trace in your log file when your run aborts. Could you set the environment variable FOAM_ABORT=1 and make sure every parallel task has this variable set as well (see the sketch at the end of this post)? That way, we could see from the stack trace in the log file where the parallel tasks are crashing.

2: You said your cluster has 72 nodes, 8 processors per node and each node has 4 GB RAM.

3: From your log file, we can see that you have 8 parallel tasks running on each node. Overall, your parallel run is using only 4 nodes on your cluster (node76, node23, node42 and node31).

4: So basically, for a mesh of ~4 million cells, you are using only 4 computers, each with only 4 GB of RAM, and 8 tasks per node competing simultaneously for that memory.

Am I right?

If so, because of your large mesh, your 4 nodes probably don't have enough memory available and could be swapping to virtual memory on the hard drive, which is quite slow.

And depending on your memory bus architecture, your 8 tasks will have to compete for access to the memory bus, which will slow you down as well.

Did you mean 4 GB of RAM per processor instead, which would give you 32 GB of RAM per node?

Could you just double-check that your cluster information is accurate?
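Regarding point 1, here is a sketch of what I mean, assuming a PBS job script and an mpiexec that forwards the submitting shell's environment to all parallel tasks (please check your MPI documentation if it does not):

# in the PBS job script, before launching the solver
export FOAM_ABORT=1
mpiexec-pbs turbDyMFoam -parallel > outfile 2>&1
# with FOAM_ABORT set, a failing task should leave a stack trace in the log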

Martin




Post #11 - November 1, 2009, 19:59
Dnyanesh Digraskar (New Member, Amherst, MA, United States)
Hello Mr. Beaudoin,

Thank you for your reply. I was a little bit confused between cores and processors. The cluster is

72 nodes --- 8 cores per node --- 4 GB RAM per node.

My information about memory per node (computer) is correct.

I have also tried running the same job on 32 nodes with one process per node; that takes even more time than this.

I will post the stack trace log soon.
I will also try running the case on more cores (maybe 48 or 56) in order to avoid the memory bottleneck.

Thank you again for help.

Sincerely,

--
Dnyanesh Digraskar

Post #12 - November 2, 2009, 16:22
BastiL (Senior Member)
Martin,

this sounds great to me, since I have similar problems with large models containing many GGI pairs. I am really looking forward to this.

Regards BastiL

Post #13 - November 2, 2009, 17:23
Dnyanesh Digraskar (New Member, Amherst, MA, United States)
Hello Mr. Beaudoin,

Some updates to my previous post. After enabling FOAM_ABORT=1, I get a more detailed MPI error message than the earlier one:

Create time

Create dynamic mesh for time = 0

Selecting dynamicFvMesh mixerGgiFvMesh
void mixerGgiFvMesh::addZonesAndModifiers() : Zones and modifiers already present. Skipping.
Mixer mesh:
origin: (0 0 0)
axis : (0 0 1)
rpm : -72
Reading field p

Reading field U

Reading/calculating face flux field phi

Initializing the GGI interpolator between master/shadow patches: outerSliderOutlet/innerSliderOutlet
Evaluation of GGI weighting factors:
Largest slave weighting factor correction : 0.00102394 average: 1.66438e-06
Largest master weighting factor correction: 0.0922472 average: 0.000376558

--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
Initializing the GGI interpolator between master/shadow patches: outerSliderWall/innerSliderWall
Evaluation of GGI weighting factors:
Largest slave weighting factor correction : 0.00104841 average: 0.000148099
Largest master weighting factor correction: 2.79095e-06 average: 4.34772e-09

--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
Initializing the GGI interpolator between master/shadow patches: outerSliderInlet/innerSliderInlet
Evaluation of GGI weighting factors:
Largest slave weighting factor correction : 1.51821e-05 average: 3.4379e-07
Largest master weighting factor correction: 0.176367 average: 0.00114207

--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
Selecting incompressible transport model Newtonian
Selecting RAS turbulence model SpalartAllmaras
Reading field rAU if present
Working directory is /home/ddigrask/OpenFOAM/ddigrask-1.5-dev/run/fall2009/ggi/turbineGgi_bigMesh
--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
Selecting incompressible transport model Newtonian
Selecting RAS turbulence model SpalartAllmaras
Reading field rAU if present


Starting time loop

Courant Number mean: 0.00865431 max: 1.0153 velocity magnitude: 10
deltaT = 0.000492466
--> FOAM Warning :
From function dlLibraryTable:pen(const dictionary& dict, const word& libsEntry, const TablePtr tablePtr)
in file lnInclude/dlLibraryTableTemplates.C at line 68
library "libsampling.so" did not introduce any new entries

Creating ggi check
Time = 0.000492466

Initializing the GGI interpolator between master/shadow patches: outerSliderOutlet/innerSliderOutlet
Evaluation of GGI weighting factors:
Largest slave weighting factor correction : 0.000966753 average: 1.68023e-06
Largest master weighting factor correction: 0.0926134 average: 0.000376611

--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
Initializing the GGI interpolator between master/shadow patches: outerSliderWall/innerSliderWall
Evaluation of GGI weighting factors:
Largest slave weighting factor correction : 0.00104722 average: 0.000148213
Largest master weighting factor correction: 2.90604e-06 average: 4.45196e-09

--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
--> FOAM Warning :
From function min(const UList<Type>&)
in file lnInclude/FieldFunctions.C at line 342
empty field, returning zero
Initializing the GGI interpolator between master/shadow patches: outerSliderInlet/innerSliderInlet
Evaluation of GGI weighting factors:
Largest slave weighting factor correction : 1.56139e-05 average: 3.5192e-07
Largest master weighting factor correction: 0.179572 average: 0.0011419

Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=0x7f4487e25010, count=1052888, MPI_PACKED, dest=0, tag=1, MPI_COMM_WORLD) failed
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)[cli_8]: aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=0x7f4487e25010, count=1052888, MPI_PACKED, dest=0, tag=1, MPI_COMM_WORLD) failed
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)


Again, this is not even one complete time step and the code quits after almost 20 mins.

2. I have also tried running the same case on 56 and 64 cores (i.e. 7 and 8 nodes, respectively). There was no change in the outcome.

3. Mr. Oliver Petit suggested that I manually create a movingCells cellZone. I will try that to see if it helps.

Thank you.

Sincerely.

--
Dnyanesh Digraskar

Post #14 - November 2, 2009, 19:31
Martin Beaudoin (Senior Member)
Hello,

Quote:
Originally Posted by ddigrask

Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=0x7f4487e25010, count=1052888, MPI_PACKED, dest=0, tag=1, MPI_COMM_WORLD) failed
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)[cli_8]: aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=0x7f4487e25010, count=1052888, MPI_PACKED, dest=0, tag=1, MPI_COMM_WORLD) failed
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
You mean this is the complete stack trace? Nothing more?

It only says that the run crashed in an MPI operation. We don't know where: it could be in the GGI code, in the solver, or anywhere MPI is being used. So unfortunately, this stack trace is useless.

I don't have enough information to help you much more.

Try logging onto your compute nodes to see whether you have enough memory while the parallel job runs; 20 minutes gives you plenty of time to catch this.

Also check whether your nodes are swapping to disk for virtual memory; a couple of standard commands for this are sketched below.
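For example, with standard Linux tools (nothing OpenFOAM-specific) while the job is running:

ssh node76        # one of the compute nodes listed in your log
free -m           # how much RAM and swap is currently in use
vmstat 5          # non-zero si/so columns mean the node is swapping
top               # memory use of the individual turbDyMFoam tasks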

I hope to be able to contribute some improvements to the GGI soon. I do not know if this will help you. Let's hope for the best.

Regards,

Martin

Post #15 - November 2, 2009, 19:34
Martin Beaudoin (Senior Member)
Hello BastiL,

Out of curiosity, how large is large?

I mean how many GGIs, and how many faces per GGI pairs?

Martin


Post #16 - November 3, 2009, 03:15
BastiL (Senior Member)
Quote:
Originally Posted by mbeaudoin
Hello BastiL,

Out of curiosity, how large is large?

I mean how many GGIs, and how many faces per GGI pairs?

Martin
About 40 million cells and about 15 GGI pairs, with very different numbers of faces per GGI.

Post #17 - November 26, 2009, 04:45
BastiL (Senior Member)
Martin,

I am wondering how your work is coming along. Is there some way I can support it, e.g. by testing improvements with our models? If so, please let me know. Thanks.

Regards BastiL

Post #18 - November 27, 2009, 09:54
Martin Beaudoin (Senior Member)
Hey BastiL,

I am actively working on that one.
Thanks for the offer. I will keep you posted.

Regards,

Martin


Post #19 - March 16, 2010, 03:17
BastiL (Senior Member)
Quote:
Originally Posted by mbeaudoin
I am actively working on that one.
Thanks for the offer. I will keep you posted.
Martin,

I am wondering how the work on the GGI implementation is proceeding? Thanks.

Regards Bastian

Post #20 - March 17, 2010, 08:43
Hrvoje Jasak (Senior Member, London, England)
Actually, I've got an update for you. There is a new layer of optimisation code built into the GGI interpolation, aimed at sorting out the loss of performance in parallel for a large number of CPUs. In short, each GGI will recognise whether it is located on a single CPU or not and, based on this, it will adjust the communications pattern in parallel.

This has shown good improvement on a number of cases I have tried, but you need to be careful about the parallel decomposition you choose (see the sketch below). There are two further optimisation steps we can do, but they are much more intrusive. I am delaying these until we start doing projects with real multi-stage compressors (lots of GGIs) and until we get the mixing plane code rocking (friends involved here).
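To illustrate the kind of decomposition control I mean, here is a sketch based on the (commented-out) entries in the decomposeParDict posted earlier in this thread; it tries to keep the cells on both sides of each GGI pair on the same CPU, as the comment in that file describes, so that the single-CPU communication path has a chance to be used. Treat it as a sketch only: whether these constraints are honoured by your decomposition method, and whether the resulting load balance is acceptable, has to be checked case by case.

// keep the cells on both sides of each GGI pair on the same CPU (sketch)
preservePatches
(
    innerSliderInlet outerSliderInlet
    innerSliderWall outerSliderWall
    innerSliderOutlet outerSliderOutlet
);

// the GGI face zones must still be present on all CPUs in their entirety
globalFaceZones
(
    innerSliderInlet_zone outerSliderInlet_zone
    innerSliderWall_zone outerSliderWall_zone
    innerSliderOutlet_zone outerSliderOutlet_zone
);

method metis;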

Further updates are likely to follow, isn't that right Martin?

Hrv
__________________
Hrvoje Jasak
Providing commercial FOAM/OpenFOAM and CFD Consulting: http://wikki.co.uk
