Hybrid OpenMP+MPI optimisation

geronimo_750 · March 30, 2022, 15:20

Hi All,

I am trying to optimize a run utilizing the hybrid parallelization approach implemented in SU2.

I am trying to run using 2 nodes, 2 tasks per node, and 16 CPUs per task with the following SLURM script:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_WAIT_POLICY=ACTIVE

mpirun -n $SLURM_NTASKS --bind-to none SU2_CFD -t $SLURM_CPUS_PER_TASK Config.cfg

The mesh has 20M cells, and the run works fine with sensible results.

The problem I am having, however, is that I cannot optimise the performance of the approach, and I keep receiving the following warning:

WARNING: On 4 MPI ranks the coloring efficiency was less than 0.875 (min value was 0.0625). Those ranks will now use a fallback strategy, better performance may be possible with a different value of config option EDGE_COLORING_GROUP_SIZE (default 512).

I have tried different values of EDGE_COLORING_GROUP_SIZE (e.g., 32, 64, 128, 512, 1028), but I keep receiving the same message.

If anybody can shed some light on this, that would be much appreciated!

Thanks a lot for your help.

pcg · March 31, 2022, 06:28

Hi Marco,
If you have multigrid turned on, that message might be for the coarse grids (where it does not matter much).
I assume the nodes have 2 CPUs of 16 cores each?
In general you never want to bind to none; the ranks should bind to numa-node, and the threads should bind to cores.
Finally, with only 2 nodes and a large mesh, it is possible that the communication costs are not high enough to offset the OpenMP overhead. How much slower is it compared to just MPI? You can try using more tasks and fewer CPUs per task.
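
As a rough illustration of "the threads should bind to cores", the sketch below uses the standard OpenMP environment variables OMP_PLACES and OMP_PROC_BIND. Whether SU2 needs anything beyond these is not stated in this thread, and the fallback of 16 threads is only there so the snippet also runs outside a SLURM job.

```shell
#!/bin/bash
# Sketch only: pin OpenMP threads to cores via standard OpenMP variables.
# The fallback of 16 threads is an assumption for running outside SLURM.
threads=${SLURM_CPUS_PER_TASK:-16}

export OMP_NUM_THREADS=$threads   # note the plural: OMP_NUM_THREADS
export OMP_PLACES=cores           # one place per physical core
export OMP_PROC_BIND=close        # keep a rank's threads on neighbouring places
export OMP_DISPLAY_ENV=true       # the OpenMP runtime prints its settings at startup

echo "threads per rank: $OMP_NUM_THREADS, places: $OMP_PLACES, bind: $OMP_PROC_BIND"
```

With OMP_DISPLAY_ENV set, the runtime reports the values it actually picked up, which is a cheap way to confirm the binding settings reached the solver.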

geronimo_750 · March 31, 2022, 08:03

Hi Pedro,

Thanks for your feedback.

Just to answer your questions:

No, I do not have multigrid on. Unfortunately, that message is for the full mesh!
Each node has 2 AMD EPYC 7702 64-core processors, i.e. 128 cores in total per node.
I was having some problems with bind to numa, but I will try bind to numa-node.

The main reason why I am trying to use the Hybrid approach is that our cluster suffers from "bandwidth congestion"!

If we use only MPI, the performance degrades noticeably during heavy usage of the cluster, especially if SU2 runs on multiple nodes.
We can obviously run on a single node, but that greatly increases the waiting time in the queue!

What we are hoping to achieve with the hybrid approach is to minimize MPI communication, so that performance is less affected by how the cluster is being used (if that makes sense!).

Thanks again for your help.

Marco

pcg · March 31, 2022, 12:27

Understood. For that CPU it will be essential to bind to numa and to use at least 1 MPI rank per NUMA node.
From this "(min value was 0.0625)" (1/16) it looks like the coloring is failing. Since the mesh is large, you may try increasing the group size further.
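
One way to try larger group sizes systematically is to generate one config per candidate value. The sketch below is only an illustration: the candidate values, the Config_gs*.cfg file names and the stand-in config are assumptions, and only the option name EDGE_COLORING_GROUP_SIZE comes from the warning itself.

```shell
#!/bin/bash
# Sketch only: write one config per candidate EDGE_COLORING_GROUP_SIZE.
base=${BASE_CFG:-Config.cfg}
outdir=$(mktemp -d)

# Stand-in config so the sketch runs without a real SU2 case; drop for real use.
if [ ! -f "$base" ]; then
  base="$outdir/Config.cfg"
  printf '%% stand-in config\nEDGE_COLORING_GROUP_SIZE= 512\n' > "$base"
fi

# Candidate values are hypothetical; the thread only says "increase it more".
for gs in 1024 2048 4096 8192; do
  out="$outdir/Config_gs${gs}.cfg"
  # Copy everything except an existing setting, then append the new value.
  grep -v '^EDGE_COLORING_GROUP_SIZE' "$base" > "$out"
  echo "EDGE_COLORING_GROUP_SIZE= $gs" >> "$out"
  echo "wrote $out"
done
```

Each generated file can then be run for a few iterations to see whether the warning disappears and how the time per iteration changes.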

flotus1 · April 5, 2022, 06:02

Not that I would know the first thing about SU2 in particular, but for hybrid parallelization with MPI+OpenMP it is often necessary to take the underlying hardware into account.

AMD EPYC 7702 CPUs are quite complex in that regard. For each CPU you have:

up to 4 NUMA nodes with NPS=4, which is probably the better setting compared to NPS=1
8 dies
two segments of L3 cache within each die. Communicating outside of these 16MB of L3 cache can be slow

For an MPI+OpenMP approach, my first order of business would be to set NPS=4 in the BIOS. Consult the output of lstopo, lscpu or numactl --hardware to see how many NUMA nodes you have.
Then limit each OpenMP region to a single NUMA node (now 4 per CPU or 8 per node, with 16 CPU cores each), and let MPI handle communication across each NUMA node.
Depending on the communication/synchronization requirements of the solver, it might even be necessary to go one step further: each OpenMP region only spans a single segment of L3 cache (containing 4 CPU cores).
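
The two layouts described above translate into rank and thread counts as follows. This is plain arithmetic on the figures quoted in this thread (2 CPUs per node, 4 NUMA nodes per CPU, 16 cores per NUMA node, 4 cores per L3 segment); the variable names are made up for the sketch.

```shell
#!/bin/bash
# Sketch only: ranks x threads per compute node for the two layouts.
cpus_per_node=2
numa_per_cpu=4
cores_per_numa=16
cores_per_l3=4

cores_per_node=$(( cpus_per_node * numa_per_cpu * cores_per_numa ))

# Layout 1: one MPI rank per NUMA node, threads fill that NUMA node.
ranks_numa=$(( cpus_per_node * numa_per_cpu ))
threads_numa=$cores_per_numa

# Layout 2: one MPI rank per L3 segment, threads fill that segment.
ranks_l3=$(( cores_per_node / cores_per_l3 ))
threads_l3=$cores_per_l3

echo "cores per node: $cores_per_node"
echo "per NUMA node : $ranks_numa ranks x $threads_numa threads"
echo "per L3 segment: $ranks_l3 ranks x $threads_l3 threads"
```

Both layouts use all 128 cores of a node; they differ only in how much of the work is shared between threads rather than exchanged over MPI.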

geronimo_750 · April 5, 2022, 18:21

Hello Alex, thanks a lot for your answer, but I must say I am a bit lost!

Unfortunately, altering the BIOS is not an option: this is an HPC facility and I do not have control over it.

Secondly, I ran the numactl --hardware command, and below is the output I got:

[[kelvin2] ~]$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 128426 MB
node 0 free: 94622 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 129020 MB
node 1 free: 120385 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 2 size: 64508 MB
node 2 free: 55138 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 3 size: 64496 MB
node 3 free: 57355 MB
node 4 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 4 size: 129004 MB
node 4 free: 110361 MB
node 5 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 5 size: 129020 MB
node 5 free: 122277 MB
node 6 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 6 size: 64508 MB
node 6 free: 58994 MB
node 7 cpus: 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 7 size: 64508 MB
node 7 free: 51406 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 12 12 12 32 32 32 32
1: 12 10 12 12 32 32 32 32
2: 12 12 10 12 32 32 32 32
3: 12 12 12 10 32 32 32 32
4: 32 32 32 32 10 12 12 12
5: 32 32 32 32 12 10 12 12
6: 32 32 32 32 12 12 10 12
7: 32 32 32 32 12 12 12 10

And there I am kind of lost! :-) I understand that the 128 cores available are divided in 8 nodes (dies??), the second line should be the memory available and free for each "node" but the meaning of node distance is unclear to me.

Moreover, (and apologies for my ignorance) I am not really sure on how to interpret what you wrote about “let MPI handle communication across each NUMA node”.

Does it mean that, for the hardware I have, I should limit the shared memory operation (-t) to 8?

Thanks again!

pcg · April 5, 2022, 18:42

Yep, a minimum of 8 tasks per node, with 16 CPUs per task. And those tasks should bind to numa node.
Given the L3 cache detail, 16 tasks per node with 8 CPUs each may indeed be better.
With fewer CPUs per task it will be easier to find a suitable color group size that is efficient.
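
Put together, the first suggestion (8 tasks per node, 16 CPUs per task, tasks bound to NUMA nodes) could look like the job script sketched below. It assumes Open MPI's mpirun, whose --map-by ppr:1:numa and --bind-to numa options place and confine one rank per NUMA node; other MPI launchers use different flags. Outside a SLURM job the script only prints the command it would run.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8     # one MPI rank per NUMA node (8 per node here)
#SBATCH --cpus-per-task=16      # one OpenMP thread per core of that NUMA node

# Fall back to the values requested above when run outside SLURM (dry run).
ntasks=${SLURM_NTASKS:-16}
threads=${SLURM_CPUS_PER_TASK:-16}

export OMP_NUM_THREADS=$threads

# Assumption: Open MPI launcher. One rank per NUMA node, bound to that node.
cmd=(mpirun -n "$ntasks" --map-by ppr:1:numa --bind-to numa
     SU2_CFD -t "$threads" Config.cfg)

if [ -n "$SLURM_JOB_ID" ]; then
  "${cmd[@]}"                   # inside a job: launch the solver
else
  echo "dry run: ${cmd[*]}"     # elsewhere: just show the command
fi
```

For the 16 tasks x 8 CPUs variant, only the two #SBATCH values change, but the mapping would then need two ranks per NUMA node (ppr:2:numa) instead of one.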

flotus1 · April 6, 2022, 04:59

Quote:

And there I am kind of lost! :-) I understand that the 128 cores available are divided in 8 nodes (dies??), the second line should be the memory available and free for each "node" but the meaning of node distance is unclear to me.

Moreover, (and apologies for my ignorance) I am not really sure on how to interpret what you wrote about “let MPI handle communication across each NUMA node”.

From the output of numactl, we can see that NPS=4 is already set. Hence the 8 NUMA nodes total, 4 per CPU.
1 NUMA node with NPS=4 corresponds to a memory controller. Each CPU has 4 dual-channel memory controllers.
CPU dies are one more layer of segmentation, with 2 dies per memory controller.
And lastly, each die has two "CCX", each with its own separate chunk of L3 cache.

"Distance" in this output gives you a first rough idea of how fast communication is between the individual NUMA nodes. E.g. the 1st line, 1st column entry is 10, meaning that communicating within the first NUMA node is relatively fast. At the other end, the 1st line, 8th column entry is 32, so communication between cores on NUMA nodes 0 and 7 is much slower.
Don't read too much into that for now, it just tells us what we already know: communication within a NUMA node is faster than communication between NUMA nodes. Hence my recommendation of keeping OpenMP regions contained within a NUMA node.
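
To make the distance table easier to read, the sketch below groups the NUMA nodes by closeness, using the distance rows copied from the numactl output in this thread. The threshold of 12 is an assumption read off this particular output (10 = local, 12 = same CPU, 32 = other CPU), not a general rule.

```shell
#!/bin/bash
# Sketch only: for each NUMA node, list the nodes at distance <= 12.
distances='0: 10 12 12 12 32 32 32 32
1: 12 10 12 12 32 32 32 32
2: 12 12 10 12 32 32 32 32
3: 12 12 12 10 32 32 32 32
4: 32 32 32 32 10 12 12 12
5: 32 32 32 32 12 10 12 12
6: 32 32 32 32 12 12 10 12
7: 32 32 32 32 12 12 12 10'

# Column i (i >= 2) of each row holds the distance to NUMA node i-2.
near=$(echo "$distances" | awk '{
  line = ""
  for (i = 2; i <= NF; i++)
    if ($i <= 12) line = line (line == "" ? "" : " ") (i - 2)
  print $1, line
}')
echo "$near"
```

Nodes 0 to 3 only list each other, and so do nodes 4 to 7, which matches the two-CPU layout: the value 32 marks a hop to the other CPU.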

But another issue sticks out: memory population on this machine is unbalanced. You can see it from the different sizes of the NUMA nodes. This should really be avoided, otherwise it can cause a performance regression with these CPUs. It strikes me as very odd that an HPC facility would run their nodes like this.

I wish I could help you more, but I am not familiar with the nomenclature of SU2.

geronimo_750 · April 6, 2022, 07:29

Quote:

Originally Posted by flotus1

But another issue sticks out: memory population on this machine is unbalanced. You can see it from the different sizes of the NUMA nodes. This should really be avoided, otherwise it can cause a performance regression with these CPUs. It strikes me as very odd that an HPC facility would run their nodes like this.

This is something I have been discussing with the HPC system guys for more than 2 years, but for whatever reason they do not want to address the problem!

Thanks again for your explanation, it is much appreciated.
