CFD Online Discussion Forums - Parallel Performance 2.1.1/AMI vs. 1.6-ext/GGI

timo_IHS

March 13, 2013 10:34

Parallel Performance 2.1.1/AMI vs. 1.6-ext/GGI

Hello everybody,

I have found that the parallel performance of AMI is poor for simulations on more than ~100 cores.

Description of my (test) case:
-40M elements
-4 different meshes coupled by GGI/AMI
-one of them is rotating (turbine)
-transientSimpleDyMFoam
-partitions: 128, 256, 512, 1024
-versions: 1.6-ext and 2.1.1

Has anybody seen similar results, or does anyone have suggestions for improving this?

Best regards,
Timo

http://www.cfd-online.com/Forums/mem...84-speedup.png

Fuchs

March 13, 2013 12:16

Hello everyone,

and thanks to Timo for highlighting this important topic and sharing his experience with the community. Could you please add to your post the speed-up plot that compares both methods for your test case?

The principal outcome of the study was that the AMI performance in OpenFOAM-2.1.1 stagnates for parallel computations above approx. 100 cores: the speed-up stays at unity, where the speed-up is measured relative to the performance on 128 cores. In contrast, the GGI in OpenFOAM-1.6-ext seems to perform fairly well ("globalFaceZones" was used during decomposition), with a speed-up of approx. 3.9 between 128 and 1024 cores.
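
To put the two numbers above into perspective, here is a minimal sketch of the arithmetic, assuming speed-up is defined as the baseline wall-clock time divided by the wall-clock time on more cores. The function names are hypothetical and no timings from the study are reproduced here.

```cpp
#include <cassert>

// Speed-up relative to a baseline run: S = T_base / T_n.
// Both wall-clock times must be positive.
double relativeSpeedup(double baselineSeconds, double seconds)
{
    assert(baselineSeconds > 0.0 && seconds > 0.0);
    return baselineSeconds / seconds;
}

// Parallel efficiency relative to the baseline core count:
// E = S / (n / n_base), so E = 1 means ideal scaling.
double relativeEfficiency(double speedup, int baselineCores, int cores)
{
    assert(baselineCores > 0 && cores > 0);
    return speedup * baselineCores / cores;
}
```

Going from 128 to 1024 cores is a factor of 8, so a speed-up of 3.9 corresponds to a relative efficiency of about 0.49, while a speed-up of unity corresponds to 0.125.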

best regards,
Marian

timo_IHS

March 13, 2013 12:38

1 Attachment(s)

Now it should be visible...

gschaider

March 14, 2013 14:33

Quote:

Originally Posted by Fuchs (Post 413747)

That's interesting (though I don't really have much experience with either GGI or AMI). Anyway: have you done a similar comparison with a non-GGI/AMI case to make sure that this is a problem of the method and not of the communication layer (for instance the commsType switch)? And what are the calculation times you're speeding up from (are they comparable)?

linnemann

March 15, 2013 07:19

1 Attachment(s)

Hi

I have a completely different result.

It's based on a real pump geometry with 7 interfaces.

timo_IHS

April 2, 2013 06:20

Thanks for the suggestions.

For cases without an interface there is no performance problem.
For 2.1.1 I get a segmentation fault with commsType blocking.
The computational time for 128 cores is (almost) comparable.

@linnemann: you only measured the speed-up up to 32 cores!
BTW: how many elements do you have in total?

Best regards,
Timo

linnemann

April 2, 2013 06:30

Yes, I only went up to 32 cores, but our cases are normally run on 12-24 cores, so there is no need to go above that. The cell count is roughly 750k, all hex.

hjasak

June 7, 2013 08:16

Please put all GGIs into a single patch (pair) and you will get massively better scaling.

Hrv

timo_IHS

June 7, 2013 11:00

Hello Prof. Jasak,

do I understand you correctly that you recommend putting a ggi pair, or rather its adjacent cells, on a single processor to get better performance? Henry already told me this, but I haven't tried it for the following reasons:
The ggi patch has a cylindrical shape, which leads to a very bad distribution of the elements on the "ggi" processor, and the ggi patches have between 70k and 100k faces. With this method I would have to keep ~170k elements on one processor. This leads to a large imbalance with respect to the target of ~40k elements per processor.
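
As a rough check of these numbers, the sketch below computes the average cell count per processor and the resulting imbalance factor. The helper names are hypothetical, and the imbalance is simply taken as the heaviest processor divided by the average.

```cpp
#include <cassert>

// Average number of cells per processor for an even decomposition.
double cellsPerProcessor(double totalCells, int nProcessors)
{
    assert(nProcessors > 0);
    return totalCells / nProcessors;
}

// Load imbalance: ratio of the most heavily loaded processor to the average.
// 1 means perfectly balanced; the slowest processor sets the pace.
double loadImbalance(double maxCellsOnOneProcessor, double averageCells)
{
    assert(averageCells > 0.0);
    return maxCellsOnOneProcessor / averageCells;
}
```

For the 40M element case on 1024 partitions the average is about 39k cells per processor, which matches the ~40k target. A processor holding ~170k elements would carry roughly 4.25 times that target.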

Best regards,
Timo

hjasak

June 7, 2013 11:18

No, what I said is that in a multi-stage machine you can take all rotating sides and put them into one ggi patch and all stationary sides and put them into another ggi patch.

The pair of patches then makes a single ggi interface and this will make it run much faster: each ggi pair causes one additional parallel comms per iteration.

I don't care about the ggi distribution on various processors or imbalance in ggi work. What matters is the balance of CELLS per processor and this is easy to achieve.

What we saw from the previous picture is that having 7 GGI pairs ruins the performance because they speak 7 (additional) times instead of once.

Hope this helps,

Hrv

timo_IHS

July 22, 2013 08:29

Hello Prof. Jasak,

my file system crashed...
So I tried to merge all rotating patches into one patch.
The problem is that I have one interface that couples stationary to stationary, and this does not work with mixerGgiFvMesh.
So I hacked the code and replaced the following in mixerGgiFvMesh/mixerGgiFvMesh.C:

Code:

    // Grab the ggi patches on the moving side
    wordList movingPatches(dict_.subDict("slider").lookup("moving"));

    forAll (movingPatches, patchI)
    {
        const label movingSliderID =
            boundaryMesh().findPatchID(movingPatches[patchI]);

        if (movingSliderID < 0)
        {
            FatalErrorIn("void mixerGgiFvMeshTK::calcMovingMasks() const")
                << "Moving slider named " << movingPatches[patchI]
                << " not found.  Valid patch names: "
                << boundaryMesh().names() << abort(FatalError);
        }

        const ggiPolyPatch& movingGgiPatch =
            refCast<const ggiPolyPatch>(boundaryMesh()[movingSliderID]);

        const labelList& movingSliderAddr = movingGgiPatch.zone();

        forAll (movingSliderAddr, faceI)
        {
            const face& curFace = f[movingSliderAddr[faceI]];

            forAll (curFace, pointI)
            {
                movingPointsMask[curFace[pointI]] = 1;
            }
        }
    }

with

Code:

    wordList movingFaceZones(dict_.subDict("slider").lookup("movingFaceZones"));

    forAll (movingFaceZones, faceZoneI)
    {
        Info<< "movingFaceZones Name: " << movingFaceZones[faceZoneI]
            << endl;

        faceZoneID zoneID(movingFaceZones[faceZoneI], faceZones());

        const labelList& movingSliderAddr = faceZones()[zoneID.index()];

        forAll (movingSliderAddr, faceI)
        {
            const face& curFace = f[movingSliderAddr[faceI]];

            forAll (curFace, pointI)
            {
                movingPointsMask[curFace[pointI]] = 1;
            }
        }
    }

For a simple pipe test case this works, but for a real case it does not: my movingCellZone no longer rotates, except for the interface patch.

So, is it valid to do this with the faceZone as written above?

timo_IHS

July 22, 2013 08:36

Another thing is the number of faces on the ggi.
For the 7-stage turbine: how many faces are on the ggi patches?

I have ~200000 faces on the ggi. So I run into an n^2 problem:

In OpenFOAM/interpolations/GGIInterpolation/GGIInterpolationWeights.C, around line 107, it says:

Code:

    // First, find a rough estimate of each slave and master facet
    // neighborhood by filtering out all the faces located outside of
    // an Axis-Aligned Bounding Box (AABB).  Warning: This algorithm
    // is based on the evaluation of AABB boxes, which is pretty fast;
    // but still the complexity of the algorithm is n^2, wich is
    // pretty bad for GGI patches composed of 100,000 of facets...  So
    // here is the place where we could certainly gain major speedup
    // for larger meshes.

My question: how could I/we gain speedup for larger meshes?
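
To make the n^2 cost concrete, here is a minimal sketch of a brute-force AABB filter of the kind that comment describes. It is not the code from GGIInterpolationWeights.C; the `Aabb` type and the function names are hypothetical and only illustrate why every master facet ends up being tested against every slave facet.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Axis-aligned bounding box of one facet (hypothetical minimal type).
struct Aabb
{
    std::array<double, 3> min;
    std::array<double, 3> max;
};

// Two boxes overlap if their intervals overlap on every axis.
bool overlaps(const Aabb& a, const Aabb& b)
{
    for (std::size_t d = 0; d < 3; ++d)
    {
        if (a.max[d] < b.min[d] || b.max[d] < a.min[d])
        {
            return false;
        }
    }
    return true;
}

// Brute-force candidate search: every master box is tested against every
// slave box, so the number of tests is nMaster * nSlave.
std::vector<std::vector<std::size_t>> candidateNeighbours
(
    const std::vector<Aabb>& master,
    const std::vector<Aabb>& slave
)
{
    std::vector<std::vector<std::size_t>> candidates(master.size());
    for (std::size_t i = 0; i < master.size(); ++i)
    {
        for (std::size_t j = 0; j < slave.size(); ++j)
        {
            if (overlaps(master[i], slave[j]))
            {
                candidates[i].push_back(j);
            }
        }
    }
    return candidates;
}
```

Each single test is cheap, but with 100,000 facets on each side the double loop performs 10^10 of them, which is why a spatial search structure pays off for large patches.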

mbeaudoin

July 22, 2013 16:40

Hello Timo,

> I have ~200000 faces on the ggi. So I run into an n^2 problem:

Well, that would be true if you were still using the AABB search algorithm for finding the GGI facet neighbours, or an old version of 1.6-ext.

Almost 2 years ago, I introduced an octree-based search algorithm that speeds things up quite a bit when searching for GGI facet neighbours.

This is now the default search algorithm for the GGI (take a look at the constructors for Foam::ggiPolyPatch), so you should no longer run into the n^2 problem you are mentioning.

Best,

Martin

timo_IHS

July 23, 2013 02:56

Hello Martin,

okay, I understand.

Do you think it might be worth testing these parameters? And if so, do you have a rule of thumb for large facets?

Code:

       //  For GGI patches larger than ~100K facets, your mileage may vary.
       //  So these 3 control parameters are adjustable using the following
       //  global optimization switches:
       //
       //     GGIOctreeSearchMinNLevel
       //     GGIOctreeSearchMaxLeafRatio
       //     GGIOctreeSearchMaxShapeRatio

Best regards,
Timo

mbeaudoin

July 23, 2013 09:41

Hello Timo,

> Do you think it might be worth testing these parameters?
Yup, for a large number of GGI facets, definitely.

You can use the OptimisationSwitches section of your global controlDict file to play with these.

Here are the default values, taken from GGIInterpolationQuickRejectTests.C:

Code:

    debug::optimisationSwitch("GGIOctreeSearchMinNLevel", 3)
    debug::optimisationSwitch("GGIOctreeSearchMaxLeafRatio", 3)
    debug::optimisationSwitch("GGIOctreeSearchMaxShapeRatio", 1)
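
As a minimal sketch, assuming the usual dictionary syntax of the global controlDict (the one in the etc directory of the installation), the three switches would be set as shown below. The values are the defaults quoted above and are not a tuning recommendation; any other entries already present in the OptimisationSwitches section stay untouched.

```
OptimisationSwitches
{
    // Defaults from GGIInterpolationQuickRejectTests.C; adjust for testing.
    GGIOctreeSearchMinNLevel        3;
    GGIOctreeSearchMaxLeafRatio     3;
    GGIOctreeSearchMaxShapeRatio    1;
}
```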

> And if so, do you have a rule of thumb for large facets?
Not really. The default values I came up with are based on my own tests, using smaller meshes than yours.

You can have a look at the header of octree.H for some comments on the values of those three parameters.

Code:

    The construction on the depth of the tree is:
      - one can specify a minimum depth
        (though the tree will never be refined if all leaves contain <=1
         shapes)
      - after the minimum depth two statistics are used to decide further
        refinement:
        - average number of entries per leaf (leafRatio). Since inside a
          leaf most algorithms are n or n^2 this value has to be small.
        - average number of leaves a shape is in. Because of bounding boxes,
          a single shape can be in multiple leaves. If the bbs are large
          compared to the leaf size this multiplicity can become extremely
          large and will become larger with more levels.
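
The refinement rule described in that header comment can be sketched roughly as follows. This is only an illustration of how the three parameters interact, not the octree implementation itself: the `TreeStats` type and `refineFurther` are hypothetical, and the real code may order or combine the tests differently (for example strict versus non-strict comparisons).

```cpp
#include <cassert>

// Statistics of the tree at its current depth (hypothetical helper type).
struct TreeStats
{
    int level;          // current depth of the tree
    long nLeaves;       // number of non-empty leaves
    long nLeafEntries;  // sum over leaves of shapes stored in each leaf
    long nShapes;       // number of distinct shapes (GGI facets)
};

// Decide whether to add another level, following the rule described in the
// header comment: always refine up to minLevel, then refine only while leaves
// are still too full and shapes are not yet duplicated too often.
bool refineFurther
(
    const TreeStats& s,
    int minLevel,
    double maxLeafRatio,
    double maxShapeRatio
)
{
    assert(s.nLeaves > 0 && s.nShapes > 0);

    if (s.level < minLevel)
    {
        return true;
    }

    // Average number of entries per leaf.
    const double leafRatio = double(s.nLeafEntries) / s.nLeaves;
    // Average number of leaves each shape appears in.
    const double shapeRatio = double(s.nLeafEntries) / s.nShapes;

    return leafRatio > maxLeafRatio && shapeRatio <= maxShapeRatio;
}
```

Read this way, a larger minimum level forces a deeper tree regardless of the statistics, a smaller leaf ratio asks for emptier leaves, and the shape ratio caps how often large facets may be duplicated across leaves.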

Don't hesitate to report any nice findings based on your large meshes.

Best,

Martin

timo_IHS

April 25, 2014 07:33

Hi Martin,

I know it was quite a while ago, but I found my results again.
I played around a bit (not really scientifically) with the parameters for a 14M element case and got a 5% performance improvement for the configuration 5-2-1.

Best,
Timo
