massively parallel AMI: decomposePar with singleProcessorFaceSets constraint

louisgag · April 9, 2020, 12:47

I'm not able to get a balanced distribution with the automatic algorithms (scotch, kahip) when I have many patch pairs that are constrained to remain on the same processor.

I have a single AMI zone, but it has more cells than a balanced decomposition would give me. Both AMI patches together touch 44k cells, but I would like to have 10k-14k cells per processor.

What I do is slice the cylinder into more AMI patch pairs, convert them to sets, and then give those as constrained sets in decomposeParDict.

When I leave the processor as -1, every set gets assigned to the same processor and I am back to having an unbalanced decomposition. I think it has to do with these lines of code: https://www.openfoam.com/documentati...ce.html#l00284

When I impose different processors myself, taking for example 5 or 50 subslices of my AMI cylinder, I still end up unbalanced. Some processors even end up with 0 cells!

Is my only option to go to manual decomposition? As an alternative I'm thinking of constraining the AMI to at least each remain on the same computer socket, using multilevel or multiregion decomposition, but I have the feeling that means a lot of hand tuning and wasted time.

Does anyone have experience with AMI interfaces that have more cells than an optimal parallel simulation can take? Comments/advice would be appreciated.

simply-alex · December 1, 2020, 02:17

Hey Louis, did you ever find an effective solution to this issue? I am having the same problem.

All I can think of is splitting the large cyclic patches into multiple smaller ones and grouping them into multiple faceSets.

louisgag · December 1, 2020, 03:32

Hi Alexandra,

I did try what you mention (so making "slices" of the AMIs), it works to some extent, but I usually still end up with a struggling decomposition algorithm that gives a quite unbalanced decomposition, even away from the AMI patches. It was not helping parallelization speedup.

For my last attempts, I ended up doing multi-region decomposition:
1. Split the domain into regions: outer, AMI faces, and inner.
2. Run a modified* decomposePar with the -cellDist option, region by region, using for example scotch on the outer and inner regions and a simple, manual, or hierarchical decomposition on the faces of the AMIs.
3. Use the resulting cell distribution to manually decompose your original mesh (the one that was not split into regions).
*: I had to write my own decomposePar utility to do it, because otherwise each region is decomposed using the same processors, whereas here you want, for example, region0 on procs[00-05], region1 on procs[06-07], region2 on procs[08-16], etc. If you do it using Python on the created cellDist files it takes forever, but with the modified decomposePar utility it's quite fast.
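The bookkeeping behind step 3 can be sketched in a few lines of Python. This is only an illustration of the processor-offset idea, not the modified decomposePar utility mentioned above (which is not shown in this thread). The function name, the dictionary layout, and the tiny 9-cell example are all hypothetical, and real cellDist files would need to be parsed first.

```python
from typing import Dict, List


def merge_region_distributions(
    n_cells: int,
    region_cell_ids: Dict[str, List[int]],
    region_local_procs: Dict[str, List[int]],
    region_proc_count: Dict[str, int],
) -> List[int]:
    """Return one global processor id per cell of the original mesh.

    region_cell_ids[r][k]    : original-mesh cell id of the k-th cell of region r
    region_local_procs[r][k] : processor of that cell, numbered from 0 within r
    region_proc_count[r]     : number of processors reserved for region r
    """
    cell_dist = [-1] * n_cells
    offset = 0
    # Dict insertion order defines which block of processors each region gets.
    for region, cell_ids in region_cell_ids.items():
        n_procs = region_proc_count[region]
        for cell, local_proc in zip(cell_ids, region_local_procs[region]):
            if not 0 <= local_proc < n_procs:
                raise ValueError(f"{region}: processor {local_proc} out of range")
            cell_dist[cell] = offset + local_proc
        offset += n_procs
    if -1 in cell_dist:
        raise ValueError("some cells of the original mesh belong to no region")
    return cell_dist


# Hypothetical toy case: 6 + 2 + 9 processors, i.e. procs 0-5, 6-7 and 8-16.
cell_ids = {"region0": [0, 1, 2, 3], "region1": [4, 5], "region2": [6, 7, 8]}
local_procs = {"region0": [0, 5, 2, 0], "region1": [1, 0], "region2": [0, 8, 3]}
proc_count = {"region0": 6, "region1": 2, "region2": 9}

cell_dist = merge_region_distributions(9, cell_ids, local_procs, proc_count)
print(cell_dist)
```

The resulting list has one entry per cell of the unsplit mesh, which is the kind of data a manual decomposition expects.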

Just as a heads up: in the end I did not gain any parallel speedup by doing so. The AMI algorithm is apparently not well parallelized, so even when all the cells are on one or a few processors you are still slow.
I gave a splash talk about this last summer, only images here, but the plot on the last slide is self-explanatory ("free decomp" is when I did not do anything to keep the AMI faces/cells on a single processor or a set of processors)...

Eugene de Villiers also hinted to me that precomputing the AMI weights would help and had apparently already been done by someone, but I never found the related code.

Kind regards,

JMcnetee · September 6, 2021, 13:31

I am finding that OF is not scaling with the number of processors (on Rescale). Based on the comments below it may be due to the AMI that I am using.

Is there anything more concrete available on how OpenFOAM scales when using an AMI? Is there a better alternative that scales better with the number of processors? I would like to use > 100 processors.

louisgag · September 7, 2021, 03:01

AMI is probably your best bet. Try to make the interface as clean as possible and avoid using specific rules in your decomposition method. Lately I've been able to get much better scaling than before, also with a higher number of processors than what you mention. I think, however, that the architecture of the supercomputer you're using also plays a major role...

JMcnetee · September 15, 2021, 08:42

Hi Louis

Based on the fact that you appear to be in Stuttgart, I imagine that you are familiar with the crossflow turbine problem.

I am modelling a crossflow hydrokinetic turbine in OpenFOAM. Can you provide some more details of how you are able to scale up successfully to a large number of processors?

All I can get to is 60 cores.

Best Regards
Jarlath

louisgag · September 15, 2021, 08:54

No, I am not familiar with it. I do cycloidal rotors: https://www.youtube.com/watch?v=q0pafX63_x0

Give me more details about your case and I'll see if I can pinpoint something, but scaling is, in my experience, very sensitive... you need to do a lot of test runs...

kind regards

JMcnetee · September 15, 2021, 09:08

Hi Louis

Based on your video, I think we are working on the exact same problem. In general we use fixed pitch foils, but we are looking at variable pitch also.

https://orpc.co/our-solutions/turbine-generator-unit

So the problem setup is likely very similar.

I usually set up a cylindrical AMI around the rotor zone, and generally there is only one AMI. In principle I'd like to model with y+ around 1 on the blades. That leads to large meshes.

I usually use Scotch as the decomposition method, with no constraints.

louisgag · September 16, 2021, 03:11

How many cells per processor do you have?
How many cells overall and how many AMI faces?
Are the AMI cell faces similarly sized on both sides of your interface?
What solver do you use? And what Courant number? (I use pimple and a very high Courant number to avoid "wasting" time with AMI)

sqek · September 20, 2021, 13:49

Hello,
I've just had (and solved!) a similar issue.
I already had my AMI patches split into lots of smaller patches, and each patch pair had a faceSet, used in singleProcessorFaceSets.
But the constraint forces any cell that shares a *point* with a face in a singleProcessorFaceSet onto that set's processor, and since my split AMI patches were next to each other, they all shared points and were all forced onto the same processor.
So for patches with names

Code:

AMI1_M, AMI1_S, AMI2_M, AMI2_S, AMI3_M, AMI3_S, etc

and a topoSetDict with

Code:

{ name AMI1Set; type faceSet; action new; source patchToFace; sourceInfo { name "AMI1_[MS]"; } }

for each AMI patch pair, the point shared by AMI1_M and AMI2_M still forced them to be on the same processor.

The solution was to get topoSet to remove any faces that share points between two patches, so to get

Code:

 /   AMI1_M    \ /   AMI2_M    \
o---o---o---o===o===o---o---o---o
 \___AMI1Set___/ \___AMI2Set___/

to change to

Code:

 /   AMI1_M    \ /   AMI2_M    \
o---o---o---o===o===o---o---o---o
 \_AMI1Set_/         \_AMI2Set_/

(-s represent faces/cells, o's represent points, and =s are the highlighted faces/cells; hopefully this makes some sense)
Before, the cells represented by === were part of both singleProcessorFaceSet constraints, because they shared the middle point with both faceSets; after, each cell is in only one constraint.
I did this using topoSet with something like

Code:

{ name AMI1Set; type faceSet; action new; source patchToFace; sourceInfo { name "AMI1_[MS]"; } }
{ name points1; type pointSet; action new; source faceToPoint; sourceInfo { option all; set AMI1Set; } }
{ name AMI2Set; type faceSet; action new; source patchToFace; sourceInfo { name "AMI2_[MS]"; } }
{ name points2; type pointSet; action new; source faceToPoint; sourceInfo { option all; set AMI2Set; } }
{ name AMI1Set; type faceSet; action delete; source pointToFace; sourceInfo { set points2; option any; } }
{ name AMI2Set; type faceSet; action delete; source pointToFace; sourceInfo { set points1; option any; } }

The faces that have been removed from the sets are still kept on the same processor as either AMI1Set or AMI2Set, because they belong to cells that touch the end points of the faces which have been kept. Basically, imagine decomposePar making a pointSet from the faceSets you give it, then making a cellSet from those points with option any, then keeping all cells in that cellSet on the same processor.
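To make the picture above concrete, here is a toy 1D model of the eight-face diagram in Python. It only mimics the "faces to points to cells with option any" reasoning described in this post; it is not decomposePar's actual code, and all function names are made up for the illustration. In the chain, face i and cell i both span points i and i+1.

```python
def points_of(face_set):
    """All points used by the faces in face_set (face i spans points i, i+1)."""
    return {p for f in face_set for p in (f, f + 1)}


def constrained_cells(face_set, n_cells):
    """Cells sharing at least one point with any face in face_set."""
    pts = points_of(face_set)
    return {c for c in range(n_cells) if c in pts or c + 1 in pts}


def trim(face_set, other_points):
    """Drop faces that touch any point of the other set (the 'delete' step)."""
    return {f for f in face_set if f not in other_points and f + 1 not in other_points}


n_cells = 8
ami1 = {0, 1, 2, 3}  # faces of AMI1Set
ami2 = {4, 5, 6, 7}  # faces of AMI2Set

# Untrimmed sets: the cells on both sides of the shared point fall in both constraints.
overlap_before = constrained_cells(ami1, n_cells) & constrained_cells(ami2, n_cells)

# Trim each set using the points of the *untrimmed* other set.
ami1_trimmed = trim(ami1, points_of(ami2))
ami2_trimmed = trim(ami2, points_of(ami1))
cells1 = constrained_cells(ami1_trimmed, n_cells)
cells2 = constrained_cells(ami2_trimmed, n_cells)
overlap_after = cells1 & cells2

print(sorted(overlap_before), sorted(overlap_after))
print(sorted(cells1), sorted(cells2))
```

In this toy case the two constraints overlap on cells 3 and 4 before trimming and not at all afterwards, while every cell still ends up in exactly one constraint.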

This way the constraints become much less severe, and the decomposition gets much better, as long as you split the cyclicAMI into small enough patches (i.e. with far fewer adjacent cells per patch than your target/average cells per processor).
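With many patch pairs, writing the topoSet actions by hand gets tedious. A small Python generator along these lines could produce them; it simply repeats the entry pattern from the code above for N pairs, assuming the AMI1_M / AMI1_S naming used in this post. The function name is hypothetical, and trimming every set against every other set is a conservative choice that grows quadratically with the number of pairs. If you know which slices are adjacent, trimming against neighbours only should be enough.

```python
from typing import List


def topo_set_actions(n_pairs: int) -> List[str]:
    """Build topoSetDict action entries for patch pairs AMI1_[MS] .. AMIn_[MS]."""
    actions = []
    # First create every faceSet and its pointSet, so that the pointSets
    # describe the untrimmed sets.
    for i in range(1, n_pairs + 1):
        actions.append(
            f'{{ name AMI{i}Set; type faceSet; action new; source patchToFace; '
            f'sourceInfo {{ name "AMI{i}_[MS]"; }} }}'
        )
        actions.append(
            f'{{ name points{i}; type pointSet; action new; source faceToPoint; '
            f'sourceInfo {{ option all; set AMI{i}Set; }} }}'
        )
    # Then remove from each faceSet the faces touching any other set's points.
    for i in range(1, n_pairs + 1):
        for j in range(1, n_pairs + 1):
            if i != j:
                actions.append(
                    f'{{ name AMI{i}Set; type faceSet; action delete; source pointToFace; '
                    f'sourceInfo {{ set points{j}; option any; }} }}'
                )
    return actions


two_pairs = topo_set_actions(2)
print("\n".join(two_pairs))
```

For two pairs this prints the same six entries as listed above, in the same order.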

Hope this helps!

JMcnetee · September 21, 2021, 16:56

60 processors
11.5 million cells

Number of processor faces = 574347
Max number of cells = 191344 (0.999487% above average 189450)
Max number of processor patches = 22 (204.147% above average 7.23333)
Max number of faces between processors = 35445 (85.1407% above average 19144.9)
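As a quick sanity check on how to read these figures, the "above average" percentages can be recomputed in Python from the numbers quoted above. The helper name is made up, and the factor of 2 assumes that each processor face is counted once in the total but seen by the two processors sharing it, which is consistent with the reported average of 19144.9.

```python
def percent_above(maximum: float, average: float) -> float:
    """How far the maximum lies above the average, in percent."""
    return 100.0 * (maximum - average) / average


n_procs = 60
processor_faces = 574347

# Assumption: every processor face belongs to two processors.
avg_faces_per_proc = 2 * processor_faces / n_procs

patches_excess = percent_above(22, 7.23333)
faces_excess = percent_above(35445, avg_faces_per_proc)

print(f"average faces between processors: {avg_faces_per_proc:.1f}")
print(f"max processor patches: {patches_excess:.3f}% above average")
print(f"max faces between processors: {faces_excess:.4f}% above average")
```

So the worst processor talks to about three times as many neighbours as the average one, and exchanges almost twice as many faces.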

One AMI. It uses two patches (AMI_inner and AMI_outer). The meshes on these two patches should have been exactly the same, but I just noticed that the log file says that there are different numbers of faces on source and target.

"AMI: Creating addressing and weights between 27332 source faces and 28772 target faces"

There are three surfaces to the AMI, the cylinder surface, and the two end caps. The mesh size on all three surfaces is similar.

Using pimpleFoam

"Courant Number mean: 0.0945093 max: 1315.62"

I am using a "time step" of 1 degree. If I increase the time step to 4 degrees, the results are still good, but a little different from the 1 degree run. As the time step increases to 6 and 8 degrees, the results start to deviate from the 1 degree case.

Cheers!
Jarlath

