Parallel Performance 2.1.1/AMI vs. 1.6-ext/GGI
Hello everybody,
I have found out that the parallel performance of AMI is poor for simulations above ~100 cores. Description of my (test) case:
- 40M elements
- 4 different meshes coupled by GGI/AMI
- one of them is rotating (turbine)
- transientSimpleDyMFoam
- partitions: 128, 256, 512, 1024
- versions: 1.6-ext and 2.1.1
Has anybody seen similar results, or does anybody have suggestions for improvement? Best regards, Timo
http://www.cfd-online.com/Forums/mem...84-speedup.png |
Hello everyone,
and thanks to Timo for highlighting this important topic and sharing his experience with the community. Timo, could you be so kind as to add the speed-up plot comparing both methods for your test case to your post? The principal outcome of the study was that the AMI performance in OpenFOAM-2.1.1 stagnates for parallel computations above approx. 100 cores (the speed-up is unity, measured relative to the performance on 128 cores). In contrast, the GGI in OpenFOAM-1.6-ext seems to perform fairly well ("globalFaceZones" was used during decomposition): the speed-up between 128 and 1024 cores is approx. 3.9. Best regards, Marian |
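To put the quoted numbers in perspective, the speed-up relative to the 128-core run can be converted into a parallel efficiency. The symbols $T_N$ (wall-clock time on $N$ cores), $S$ and $E$ are notation assumed here and do not come from the thread; the values 3.9 and 1 are the ones quoted in the post above.

```latex
S(N) = \frac{T_{128}}{T_N}, \qquad
E(N) = \frac{S(N)}{N/128}

E_{\mathrm{GGI}}(1024) \approx \frac{3.9}{8} \approx 0.49, \qquad
E_{\mathrm{AMI}}(1024) \approx \frac{1}{8} \approx 0.13
```

Under this definition the GGI run keeps roughly half of the ideal eightfold gain between 128 and 1024 cores, while a speed-up of unity means the additional cores bring no gain at all.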
1 Attachment(s)
Now it should be visible...
|
1 Attachment(s)
Hi
I have a completely different result. It's based on a real pump geometry with 7 interfaces. |
Thanks for the suggestions.
For cases without an interface there is no performance problem. For 2.1.1 I get a segmentation fault with commsType blocking. The computational time for 128 cores is (almost) comparable. @linnemann: you only did the speed-up test up to 32 cores! BTW: how many elements do you have in total? Best regards, Timo |
Yes, I only did it with up to 32 cores, but our cases are normally handled with 12-24, so there is no need to go above that. The cell count is roughly 750k, all hex.
|
Please put all GGIs into a single patch (pair) and you will get massively better scaling.
Hrv |
Hello Prof. Jasak,
do I understand you correctly that you recommend putting a ggi pair, or rather its adjacent cells, on a single processor to get better performance? Henry told me this already, but I haven't tried it for the following reasons: the ggi patch has a cylindrical shape, which leads to a very bad distribution of the "ggi" processor's elements, and the ggi patches have between 70k and 100k faces. With this method I would have to keep ~170k elements on one processor, which is a large imbalance given the aim of using ~40k elements per processor. Best regards, Timo |
No, what I said is that in a multi-stage machine you can take all rotating sides and put them into one ggi patch and all stationary sides and put them into another ggi patch.
The pair of patches then makes a single ggi interface and this will make it run much faster: each ggi pair causes one additional parallel comms per iteration. I don't care about the ggi distribution on various processors or imbalance in ggi work. What matters is the balance of CELLS per processor and this is easy to achieve. What we saw from the previous picture is that having 7 GGI pairs ruins the performance because they speak 7 (additional) times instead of once. Hope this helps, Hrv |
Hello Prof. Jasak,
my file system crashed... So I tried to merge all rotating patches into one patch. The problem is: I have one interface that couples stationary to stationary, and this does not work with mixerGgiFvMesh. So I hacked the code and replaced, in mixerGgiFvMesh/mixerGgiFvMesh.C,
Code:
// Grab the ggi patches on the moving side
with
Code:
wordList movingFaceZones(dict_.subDict("slider").lookup("movingFaceZones"));
So, am I allowed to do it with the faceZone as written above? |
Another thing is the number of faces on the ggi.
With the 7-stage turbine: how many faces are on the ggi patches? I have ~200000 faces on the ggi. So I run into an n^2 problem: in OpenFOAM/interpolations/GGIInterpolation/GGIInterpolationWeights.C, around line 107, there is written
Code:
// First, find a rough estimate of each slave and master facet
My question: how could I/we gain speedup for larger meshes? |
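The n^2 concern raised above can be illustrated with a brute-force bounding-box quick-reject pass, in which every master facet is tested against every slave facet. This is only a minimal sketch under stated assumptions, not the code in GGIInterpolationWeights.C: the names `Box`, `overlaps`, `SearchStats` and `bruteForceCandidates` are hypothetical and are not OpenFOAM types or functions.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical axis-aligned bounding box of one facet (not an OpenFOAM type).
struct Box
{
    double min[3];
    double max[3];
};

// True if the two boxes overlap in all three coordinate directions.
bool overlaps(const Box& a, const Box& b)
{
    for (int d = 0; d < 3; ++d)
    {
        if (a.max[d] < b.min[d] || b.max[d] < a.min[d])
        {
            return false;
        }
    }
    return true;
}

// Counters for the brute-force pass: number of box tests performed and
// number of master/slave pairs kept as neighbour candidates.
struct SearchStats
{
    std::size_t nTests;
    std::size_t nCandidates;
};

// Test every master box against every slave box.
// The work is master.size() * slave.size() box tests.
SearchStats bruteForceCandidates
(
    const std::vector<Box>& master,
    const std::vector<Box>& slave
)
{
    SearchStats stats{0, 0};
    for (const Box& m : master)
    {
        for (const Box& s : slave)
        {
            ++stats.nTests;
            if (overlaps(m, s))
            {
                ++stats.nCandidates;
            }
        }
    }
    return stats;
}
```

The number of box tests is the product of the two facet counts, so if both sides of the interface had on the order of 200000 facets, a pass like this would perform about 4 x 10^10 tests. A spatial search structure avoids most of them by only comparing facets that lie in the same region.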
Hello Timo,
> I have ~200000 faces on the ggi. So I run into an n^2 problem:
Well, that would be true if you were still using the AABB search algorithm for finding the GGI facet neighbours, or an old version of 1.6-ext. Almost 2 years ago, I introduced an octree-based search algorithm that speeds things up quite a bit when searching for GGI facet neighbours. This is now the default search algorithm for the GGI (take a look at the constructors for Foam::ggiPolyPatch), so you should no longer run into the n^2 problem you are mentioning. Best, Martin
|
Hello Martin,
okay, I understand. Do you think it might be worth testing these parameters? And if so, do you have a rule of thumb for large numbers of facets?
Code:
// For GGI patches larger than ~100K facets, your mileage may vary.
Timo |
Hello Timo,
> Do you think it might be worth testing these parameters?
Yup, for a large number of GGI facets, definitely. You can use the OptimisationSwitches section of your global controlDict file to play with these. Here are the default values, taken from GGIInterpolationQuickRejectTests.C:
Code:
debug::optimisationSwitch("GGIOctreeSearchMinNLevel", 3)
> And if so, do you have a rule of thumb for large numbers of facets?
Not really. The default values I came up with are based on my own tests, using smaller meshes than yours. You can have a look at the header of octree.H for some comments on the values for those three parameters.
Code:
The construction on the depth of the tree is:
Best, Martin
|
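As a rough illustration of the advice above, an override in the OptimisationSwitches section of the global controlDict might look like the fragment below. Only the switch GGIOctreeSearchMinNLevel and its default of 3 appear in this thread; the other two octree parameters are not named here and are therefore left out, and the exact layout of the section in a given installation is an assumption.

```
OptimisationSwitches
{
    // Default value according to the post above; change it to experiment
    GGIOctreeSearchMinNLevel    3;
}
```

Since the defaults were tuned on smaller meshes, any change is best checked by timing the same case before and after.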
Hi Martin,
It was quite a while ago, but I found my results again. I played around a bit (not really scientifically) with the parameters for a 14M-element case and got a 5% performance improvement for the configuration 5-2-1. Best, Timo |