Parallel Performance 2.1.1/AMI vs. 1.6-ext/GGI

#1 | March 13, 2013, 11:34 | timo_IHS (Timo K., University of Stuttgart)
Hello everybody,

I have found that the parallel performance of AMI is poor for simulations on more than ~100 cores.

Description of my (test) case:
-40M elements
-4 different meshes coupled by GGI/AMI
-one of them is rotating (turbine)
-transientSimpleDyMFoam
-partitions: 128, 256, 512, 1024
-versions: 1.6-ext and 2.1.1

Has anybody seen similar results, or does anyone have suggestions for improving this?

Best regards,
Timo


#2 | March 13, 2013, 13:16 | Fuchs (Marian Fuchs, Berlin, Germany)
Hello everyone,

and thanks to Timo for highlighting this important topic and sharing his experience with the community. Could you be so kind as to add the speed-up plot that compares both methods for your test case to your post? The principal outcome of the study was that the AMI performance in OpenFOAM-2.1.1 stagnates for parallel computations above approx. 100 cores (the speed-up stays at unity, measured relative to the run on 128 cores). In contrast, the GGI in OpenFOAM-1.6-ext seems to perform fairly well ("globalFaceZones" was used during decomposition): the speed-up between 128 and 1024 cores is approx. 3.9.

best regards,
Marian
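
For reference, the relative speed-up quoted above can be turned into a parallel efficiency. The sketch below is not taken from either code base, and the function names are made up for illustration. It assumes that speed-up means the wall-clock time on the baseline core count divided by the wall-clock time on the larger core count.

```cpp
#include <cassert>

// Relative speed-up: time on the baseline core count divided by the
// time on the larger core count (hypothetical helper, not OpenFOAM API).
double relativeSpeedup(double baselineTime, double time)
{
    assert(baselineTime > 0.0 && time > 0.0);
    return baselineTime / time;
}

// Parallel efficiency relative to the baseline: measured speed-up
// divided by the ideal speed-up (the ratio of the core counts).
double relativeEfficiency(double speedup, int baselineCores, int cores)
{
    assert(baselineCores > 0 && cores >= baselineCores);
    const double idealSpeedup = static_cast<double>(cores) / baselineCores;
    return speedup / idealSpeedup;
}
```

With the numbers reported here, relativeEfficiency(3.9, 128, 1024) is 3.9 / 8, i.e. roughly 49% for the GGI run, while a speed-up of unity over the same range corresponds to 12.5% for the AMI run.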

#3 | March 13, 2013, 13:38 | timo_IHS
Now it should be visible...
Attached image: speedupRun.jpg

#4 | March 14, 2013, 15:33 | gschaider (Bernhard Gschaider)
In reply to Fuchs (post #2):
That's interesting (though I don't really have much experience with either GGI or AMI). Anyway: have you done a similar comparison with a non-GGI/AMI case to make sure that this is a problem of the method and not of the communication layer (for instance the commsType switch)? And what are the calculation times you're speeding up from (are they comparable)?
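
For readers unfamiliar with the switch mentioned here: commsType is set in the OptimisationSwitches section of the global controlDict. A minimal fragment follows; the value shown is only an example, not a recommendation from this thread.

```
OptimisationSwitches
{
    // One of: blocking, scheduled, nonBlocking
    commsType       nonBlocking;
}
```
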
__________________
Note: I don't use "Friend"-feature on this forum out of principle. Ah. And by the way: I'm not on Facebook either. So don't be offended if I don't accept your invitation/friend request

#5 | March 15, 2013, 08:19 | linnemann (Niels Nielsen, NJ - Denmark)
Hi

I have a completely different result.

It's based on a real pump geometry with 7 interfaces.
Attached image: 2013.03.15.000004.jpg
__________________
Linnemann

PS. I do not do personal support, so please post in the forums.

#6 | April 2, 2013, 06:20 | timo_IHS
Thanks for the suggestions.

For cases without an interface there is no performance problem.
For 2.1.1 I get a segmentation fault with commsType blocking.
The computational time for 128 cores is (almost) comparable.

@linnemann: you measured the speed-up only up to 32 cores!
BTW: how many elements do you have in total?

Best regards,
Timo

#7 | April 2, 2013, 06:30 | linnemann
Yes, I only went up to 32 cores, but our cases normally run on 12-24 cores, so there was no need to go beyond that. The cell count is roughly 750k, all hex.

#8 | June 7, 2013, 08:16 | hjasak (Hrvoje Jasak, London, England)
Please put all GGIs into a single patch (pair) and you will get massively better scaling.

Hrv
__________________
Hrvoje Jasak
Providing commercial FOAM/OpenFOAM and CFD Consulting: http://wikki.co.uk

#9 | June 7, 2013, 11:00 | timo_IHS
Hello Prof. Jasak,

do I understand you correctly that you recommend putting a ggi pair, or rather its adjacent cells, on a single processor to get better performance? Henry already suggested this to me, but I haven't tried it for the following reasons:
The ggi patch is cylindrical, which leads to a very poor distribution of the elements on the "ggi" processor, and the ggi patches have between 70k and 100k faces each. With this method I would have to keep ~170k elements on one processor, which is a large imbalance compared with my target of ~40k elements per processor.

Best regards,
Timo

#10 | June 7, 2013, 11:18 | hjasak
No, what I said is that in a multi-stage machine you can take all rotating sides and put them into one ggi patch, and all stationary sides and put them into another ggi patch.

The pair of patches then makes a single ggi interface, and this will make it run much faster: each ggi pair causes one additional parallel communication per iteration.

I don't care about the ggi distribution on various processors or imbalance in ggi work. What matters is the balance of CELLS per processor, and this is easy to achieve.

What we saw from the previous picture is that having 7 GGI pairs ruins the performance because they communicate 7 (additional) times instead of once.

Hope this helps,

Hrv
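
As an illustration of the layout described in this post, a merged interface in constant/polyMesh/boundary could look roughly like the fragment below. The patch and zone names are hypothetical and the face counts are placeholders. The keywords are the ones used for ggi patches in 1.6-ext. How the original patches are merged is not covered in this thread.

```
// constant/polyMesh/boundary (fragment)
// Hypothetical names; nFaces and startFace are placeholders.
allRotatingSides
{
    type            ggi;
    nFaces          100000;
    startFace       0;
    shadowPatch     allStationarySides;
    zone            allRotatingSidesZone;
    bridgeOverlap   false;
}
allStationarySides
{
    type            ggi;
    nFaces          100000;
    startFace       100000;
    shadowPatch     allRotatingSides;
    zone            allStationarySidesZone;
    bridgeOverlap   false;
}
```

Whether bridgeOverlap can stay off depends on the geometry of the merged interface, so treat that entry as an assumption as well.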

#11 | July 22, 2013, 08:29 | timo_IHS
Hello Prof. Jasak,

my file system had crashed...
So I tried to merge all rotating patches into one patch.
The problem is that I have one interface that couples stationary to stationary, and this does not work with mixerGgiFvMesh.
So I hacked the code and replaced the following in mixerGgiFvMesh/mixerGgiFvMesh.C:
Code:
    // Grab the ggi patches on the moving side
    wordList movingPatches(dict_.subDict("slider").lookup("moving"));

    forAll (movingPatches, patchI)
    {
        const label movingSliderID =
            boundaryMesh().findPatchID(movingPatches[patchI]);

        if (movingSliderID < 0)
        {
            FatalErrorIn("void mixerGgiFvMeshTK::calcMovingMasks() const")
                << "Moving slider named " << movingPatches[patchI]
                << " not found.  Valid patch names: "
                << boundaryMesh().names() << abort(FatalError);
        }

        const ggiPolyPatch& movingGgiPatch =
            refCast<const ggiPolyPatch>(boundaryMesh()[movingSliderID]);

        const labelList& movingSliderAddr = movingGgiPatch.zone();

        forAll (movingSliderAddr, faceI)
        {
            const face& curFace = f[movingSliderAddr[faceI]];

            forAll (curFace, pointI)
            {
                movingPointsMask[curFace[pointI]] = 1;
            }
        }
    }
with

Code:
    // Read face zone names instead of ggi patch names for the moving side
    wordList movingFaceZones(dict_.subDict("slider").lookup("movingFaceZones"));

    forAll (movingFaceZones, faceZoneI)
    {
        Info<< "movingFaceZones Name: " << movingFaceZones[faceZoneI]
            << endl;

        // Look the zone up by name in the mesh face zones
        faceZoneID zoneID(movingFaceZones[faceZoneI], faceZones());

        const labelList& movingSliderAddr = faceZones()[zoneID.index()];

        // Mark every point of every face in the zone as moving
        forAll (movingSliderAddr, faceI)
        {
            const face& curFace = f[movingSliderAddr[faceI]];

            forAll (curFace, pointI)
            {
                movingPointsMask[curFace[pointI]] = 1;
            }
        }
    }
For a simple pipe test case this works, but for a real case it does not: my movingCellZone no longer rotates, except for the interface patch.

So, is it valid to use the faceZone as written above?

#12 | July 22, 2013, 08:36 | timo_IHS
Another issue is the number of faces on the ggi.
For the 7-stage turbine: how many faces are on the ggi patches?

I have ~200000 faces on the ggi, so I run into an n^2 problem.

In OpenFOAM/interpolations/GGIInterpolation/GGIInterpolationWeights.C (around line 107)
the following comment appears:
Code:
    // First, find a rough estimate of each slave and master facet
    // neighborhood by filtering out all the faces located outside of
    // an Axis-Aligned Bounding Box (AABB).  Warning: This algorithm
    // is based on the evaluation of AABB boxes, which is pretty fast;
    // but still the complexity of the algorithm is n^2, wich is
    // pretty bad for GGI patches composed of 100,000 of facets...  So
    // here is the place where we could certainly gain major speedup
    // for larger meshes.

My question: how could I/we gain speedup for larger meshes?
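
To make the n^2 remark in that comment concrete, here is a minimal standalone sketch of a brute-force bounding-box quick-reject. It is not the code from GGIInterpolationWeights.C; the Aabb struct and the function names are invented for illustration, and only the general idea (test every master box against every slave box) follows the comment.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Axis-aligned bounding box of one facet (illustrative type, not the
// OpenFOAM boundBox class).
struct Aabb
{
    double min[3];
    double max[3];
};

// Two boxes overlap only if their extents overlap along all three axes.
bool overlaps(const Aabb& a, const Aabb& b)
{
    for (int d = 0; d < 3; ++d)
    {
        // Both boxes are expected to be well-formed
        assert(a.min[d] <= a.max[d] && b.min[d] <= b.max[d]);

        if (a.max[d] < b.min[d] || b.max[d] < a.min[d])
        {
            return false;
        }
    }
    return true;
}

// Brute-force quick-reject: every master box is tested against every
// slave box, so the work is nMaster * nSlave box tests.
std::vector<std::pair<std::size_t, std::size_t> > candidatePairs
(
    const std::vector<Aabb>& master,
    const std::vector<Aabb>& slave
)
{
    std::vector<std::pair<std::size_t, std::size_t> > pairs;

    for (std::size_t i = 0; i < master.size(); ++i)
    {
        for (std::size_t j = 0; j < slave.size(); ++j)
        {
            if (overlaps(master[i], slave[j]))
            {
                pairs.push_back(std::make_pair(i, j));
            }
        }
    }
    return pairs;
}
```

Each box test is cheap, but for two patches of 100,000 facets each the double loop performs 10^10 of them, which is why the comment flags this step for large patches.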

#13 | July 22, 2013, 16:40 | mbeaudoin (Martin Beaudoin)
Hello Timo,

> I have ~200000 faces on the ggi. So I run in a n^2 problem:

Well, that would be true if you were still using the AABB search algorithm for finding the GGI facet neighbours, or an old version of 1.6-ext.

Almost 2 years ago, I introduced an octree-based search algorithm that speeds things up quite a bit when searching for GGI facet neighbours.

This is now the default search algorithm for the GGI (take a look at the constructors for Foam::ggiPolyPatch), so you should no longer run into the n^2 problem you are mentioning.

Best,

Martin


#14 | July 23, 2013, 02:56 | timo_IHS
Hello Martin,

okay, I understand.

Do you think it might be worth testing these parameters? And if so, do you have a rule of thumb for large numbers of facets?

Code:
       //  For GGI patches larger than ~100K facets, your mileage may vary.
       //  So these 3 control parameters are adjustable using the following
       //  global optimization switches:
       //
       //     GGIOctreeSearchMinNLevel
       //     GGIOctreeSearchMaxLeafRatio
       //     GGIOctreeSearchMaxShapeRatio
Best regards,
Timo

#15 | July 23, 2013, 09:41 | mbeaudoin
Hello Timo,

> Do you think it might be worth testing these parameters?
Yup, for a large number of GGI facets, definitely.

You can use the OptimisationSwitches section of your global controlDict file to play with these.

Here are the default values, taken from GGIInterpolationQuickRejectTests.C:

Code:
    debug::optimisationSwitch("GGIOctreeSearchMinNLevel", 3)
    debug::optimisationSwitch("GGIOctreeSearchMaxLeafRatio", 3)
    debug::optimisationSwitch("GGIOctreeSearchMaxShapeRatio", 1)
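
Put together, the entry in the global controlDict could look like the fragment below. It only restates the defaults listed above in dictionary form; the exact file location (for example the etc/controlDict of the installation) is an assumption here.

```
OptimisationSwitches
{
    // Defaults quoted above from GGIInterpolationQuickRejectTests.C
    GGIOctreeSearchMinNLevel        3;
    GGIOctreeSearchMaxLeafRatio     3;
    GGIOctreeSearchMaxShapeRatio    1;
}
```
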
> And if so, do you have a rule of thumb for large facets?
Not really. The default values I came up with are based on my own tests, using smaller meshes than yours.

You can have a look at the header from octree.H for some comments on the values for those three parameters.

Code:
    The construction on the depth of the tree is:
      - one can specify a minimum depth
        (though the tree will never be refined if all leaves contain <=1
         shapes)
      - after the minimum depth two statistics are used to decide further
        refinement:
        - average number of entries per leaf (leafRatio). Since inside a
          leaf most algorithms are n or n^2 this value has to be small.
        - average number of leaves a shape is in. Because of bounding boxes,
          a single shape can be in multiple leaves. If the bbs are large
          compared to the leaf size this multiplicity can become extremely
          large and will become larger with more levels.
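
The refinement rule in this header comment can be sketched as a small decision function. This is one possible reading of the comment, not the actual logic in the octree source; the struct, its fields, and the function names are all invented for illustration.

```cpp
#include <cassert>

// Tree statistics after a refinement pass (illustrative struct; the
// names are not taken from octree.H).
struct TreeStats
{
    int depth;            // current deepest level
    long nShapes;         // number of shapes (e.g. facet boxes) stored
    long nLeaves;         // number of leaves in the tree
    long nEntries;        // shape entries summed over all leaves
    long maxLeafEntries;  // largest number of shapes in any single leaf
};

// Average number of entries per leaf.
double leafRatio(const TreeStats& s)
{
    assert(s.nLeaves > 0);
    return static_cast<double>(s.nEntries) / s.nLeaves;
}

// Average number of leaves each shape appears in.
double shapeRatio(const TreeStats& s)
{
    assert(s.nShapes > 0);
    return static_cast<double>(s.nEntries) / s.nShapes;
}

// Assumed rule: never refine once every leaf holds at most one shape;
// otherwise always refine up to the minimum depth; beyond it, refine
// only while leaves are still too full and shapes are not yet
// duplicated across too many leaves.
bool refineFurther
(
    const TreeStats& s,
    int minNLevel,
    double maxLeafRatio,
    double maxShapeRatio
)
{
    if (s.maxLeafEntries <= 1)
    {
        return false;
    }
    if (s.depth < minNLevel)
    {
        return true;
    }
    return leafRatio(s) > maxLeafRatio && shapeRatio(s) <= maxShapeRatio;
}
```

Under this reading, a larger minimum depth forces extra levels regardless of the two ratios, while the ratio limits decide when refinement stops after that.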
Don't hesitate to report any nice findings based on your large meshes.

Best,

Martin


#16 | April 25, 2014, 07:33 | timo_IHS
Hi Martin,

It has been quite a while, but I found my results again.
I played around a bit (not really scientifically) with the parameters for a 14M-element case and got a 5% performance improvement with the configuration 5-2-1.

Best,
Timo
