Hybrid OpenMP+MPI optimisation

March 30, 2022, 15:20
#1
geronimo_750 (Marco), New Member
Hi All,

I am trying to optimize a run utilizing the hybrid parallelization approach implemented in SU2.

I am trying to run on 2 nodes, with 2 tasks per node and 16 CPUs per task, using the following SLURM script:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_WAIT_POLICY=ACTIVE

mpirun -n $SLURM_NTASKS --bind-to none SU2_CFD -t $SLURM_CPUS_PER_TASK Config.cfg

The mesh has 20M cells, and the run works fine with sensible results.

The problem is that I cannot optimise the performance of this approach, and I keep receiving the following warning:

WARNING: On 4 MPI ranks the coloring efficiency was less than 0.875 (min value was 0.0625). Those ranks will now use a fallback strategy, better performance may be possible with a different value of config option EDGE_COLORING_GROUP_SIZE (default 512).

I have tried different values of EDGE_COLORING_GROUP_SIZE (e.g., 32, 64, 128, 512, 1028), but I keep receiving the same message.

If anybody can shed some light on this, that would be much appreciated!

Thanks a lot for your help.

March 31, 2022, 06:28
#2
pcg (Pedro Gomes), Senior Member
Hi Marco,
If you have multigrid turned on, that message might be for the coarse grids (where it does not matter much).
I assume the nodes have 2 CPUs of 16 cores each?
In general you never want to bind to none; it should be numa-node, and the threads should bind to cores.
Finally, with only 2 nodes and a large mesh, it is possible that the communication costs are not high enough to offset the OpenMP overhead. How much slower is it compared to just MPI? You can try using more tasks and fewer CPUs per task.
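
As a rough illustration of the binding advice above, here is a minimal sketch. It assumes Open MPI's mpirun, where the NUMA binding target is spelled "numa", and an OpenMP runtime that honours the standard OMP_PLACES and OMP_PROC_BIND variables. The helper launch_line is hypothetical and only prints the command so it can be inspected before use.

```shell
#!/bin/bash
# Sketch only. Assumes Open MPI's mpirun (--map-by / --bind-to) and an OpenMP
# runtime that honours the standard OMP_PLACES / OMP_PROC_BIND variables.

# Pin each OpenMP thread to its own core, packed close to its parent rank.
export OMP_PLACES=cores
export OMP_PROC_BIND=close

# Hypothetical helper: print the launch line instead of running it, so the
# binding options can be checked before submitting the job.
# Arguments: MPI ranks, threads per rank, SU2 config file.
launch_line() {
  echo "mpirun -n $1 --map-by numa --bind-to numa SU2_CFD -t $2 $3"
}

launch_line 4 16 Config.cfg
```

With Open MPI, adding --report-bindings to the real command should show where each rank ended up, which is a quick way to confirm the binding took effect.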

March 31, 2022, 08:03
#3
geronimo_750 (Marco), New Member
Hi Pedro,

Thanks for your feedback.

To answer your questions:
  • No, I do not have multigrid on. Unfortunately, that message is for the full mesh!
  • Each node has two AMD EPYC 7702 64-core processors
    • 128 cores in total per node
  • I was having some problems with bind to numa, but I will try bind to numa-node

The main reason I am trying the hybrid approach is that our cluster suffers from "bandwidth congestion"!

If we use only MPI, the performance degrades noticeably during heavy usage of the cluster, especially if SU2 runs on multiple nodes.
We can obviously run on a single node, but that greatly increases the waiting time in the queue!

What we are hoping is that the hybrid approach will let us minimize MPI communication, so that performance is less affected by how the rest of the cluster is being used (if that makes sense!).

Thanks again for your help.


Marco




March 31, 2022, 12:27
#4
pcg (Pedro Gomes), Senior Member
Understood. For that CPU it will be essential to bind to numa and to use at least 1 MPI rank per NUMA node.
From this "(min value was 0.0625)" (1/16) it looks like the coloring is failing. Since the mesh is large, you may try increasing the group size further.
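
One way to try several larger group sizes is sketched below. The helper name, the output file names and the trial values (1024, 2048, 4096) are illustrative assumptions, not recommendations. The only assumption about SU2 itself is that the option is set in the config file in the usual "OPTION= value" form.

```shell
#!/bin/bash
# Sketch only: the helper name, file names and trial values are illustrative.

# Write a copy of an SU2 config with EDGE_COLORING_GROUP_SIZE set to a trial
# value, and print the name of the new file.
# Arguments: base config file (ending in .cfg), trial group size.
make_trial_cfg() {
  base=$1
  size=$2
  out="${base%.cfg}_gs${size}.cfg"
  # Drop any existing setting, then append the trial value.
  grep -v '^[[:space:]]*EDGE_COLORING_GROUP_SIZE' "$base" > "$out"
  echo "EDGE_COLORING_GROUP_SIZE= ${size}" >> "$out"
  echo "$out"
}

# Generate one config per trial size if the base config is present.
if [ -f Config.cfg ]; then
  for size in 1024 2048 4096; do
    make_trial_cfg Config.cfg "$size"
  done
fi
```

Each generated file can then be run for a few iterations to see whether the coloring warning disappears.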

April 5, 2022, 06:02
#5
flotus1 (Alex), Super Moderator
Not that I would know the first thing about SU2 in particular, but for hybrid parallelization with MPI+OpenMP it is often necessary to take the underlying hardware into account.

AMD EPYC 7702 CPUs are quite complex in that regard. For each CPU you have:
  • up to 4 NUMA nodes with NPS=4, which is probably the better setting compared to NPS=1
  • 8 dies
  • two segments of L3 cache within each die. Communicating outside of these 16MB of L3 cache can be slow

For an MPI+OpenMP approach, my first order of business would be to set NPS=4 in the BIOS. Consult the output of lstopo, lscpu or numactl --hardware to see how many NUMA nodes you have.
Then limit each OpenMP region to a single NUMA node (now 4 per CPU or 8 per node with 16 CPU cores each), and let MPI handle communication across each NUMA node.
Depending on the communication/synchronization requirements of the solver, it might even be necessary to go one step further: each OpenMP region only spans a single segment of L3 cache (containing 4 CPU cores).
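
To read the NUMA layout off quickly, a small sketch like the following could summarise the numactl --hardware output. The helper name numa_summary is hypothetical, and it assumes the "node N cpus: ..." line format that numactl prints, with every NUMA node holding the same number of cores.

```shell
#!/bin/bash
# Sketch only: summarise `numactl --hardware` output read from stdin.
# The helper name numa_summary is hypothetical.
numa_summary() {
  awk '
    # Lines look like "node 0 cpus: 0 1 2 ...": three label fields, then cores.
    /^node [0-9]+ cpus:/ { nodes++; cores += NF - 3 }
    END {
      # Average core count; exact when all NUMA nodes are the same size.
      if (nodes > 0)
        printf "numa_nodes=%d cores_per_numa_node=%d\n", nodes, cores / nodes
    }'
}

# Run on the local machine only when numactl is installed.
if command -v numactl >/dev/null 2>&1; then
  numactl --hardware | numa_summary
fi
```

The cores-per-NUMA-node figure is then a natural upper bound for the number of OpenMP threads per MPI rank.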

April 5, 2022, 18:21
#6
geronimo_750 (Marco), New Member
Hello Alex, thanks a lot for your answer, but I must say I am a bit lost!

Unfortunately, altering the BIOS is not an option: this is an HPC facility and I do not have control over it.

Secondly, I ran the numactl --hardware command, and below is the output I got:

[[kelvin2] ~]$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 128426 MB
node 0 free: 94622 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 129020 MB
node 1 free: 120385 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 2 size: 64508 MB
node 2 free: 55138 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 3 size: 64496 MB
node 3 free: 57355 MB
node 4 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 4 size: 129004 MB
node 4 free: 110361 MB
node 5 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 5 size: 129020 MB
node 5 free: 122277 MB
node 6 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 6 size: 64508 MB
node 6 free: 58994 MB
node 7 cpus: 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 7 size: 64508 MB
node 7 free: 51406 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 12 12 12 32 32 32 32
1: 12 10 12 12 32 32 32 32
2: 12 12 10 12 32 32 32 32
3: 12 12 12 10 32 32 32 32
4: 32 32 32 32 10 12 12 12
5: 32 32 32 32 12 10 12 12
6: 32 32 32 32 12 12 10 12
7: 32 32 32 32 12 12 12 10


And there I am kind of lost! :-) I understand that the 128 cores available are divided into 8 nodes (dies??) and that the second line should be the memory available and free for each "node", but the meaning of node distance is unclear to me.

Moreover (and apologies for my ignorance), I am not really sure how to interpret what you wrote about “let MPI handle communication across each NUMA node”.

Does it mean that, for the hardware I have, I should limit the shared-memory operation (-t) to 8?

Thanks again!

April 5, 2022, 18:42
#7
pcg (Pedro Gomes), Senior Member
Yep, a minimum of 8 tasks per node, and 16 CPUs per task. And those tasks should bind to numa node.
Given the L3 cache detail, 16 tasks per node with 8 CPUs each may indeed be better.
With fewer CPUs per task it will be easier to find a color group size that is efficient.
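
The two layouts above follow from dividing the 128 cores per node by the thread count per rank. A minimal sketch of that arithmetic, with a hypothetical helper that prints the matching SLURM options:

```shell
#!/bin/bash
# Sketch only: the helper name is hypothetical; 128 is the per-node core count
# reported earlier in the thread.

# Print SLURM options that fill a node with ranks of a given thread count.
# Arguments: cores per node, OpenMP threads per MPI rank.
slurm_layout() {
  echo "--ntasks-per-node=$(( $1 / $2 )) --cpus-per-task=$2"
}

slurm_layout 128 16   # one rank per 16-core NUMA node
slurm_layout 128 8    # two ranks per 16-core NUMA node
```

The value passed to SU2_CFD with -t would then match --cpus-per-task, as in the original script.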

April 6, 2022, 04:59
#8
flotus1 (Alex), Super Moderator
Quote:
And there I am kind of lost! :-) I understand that the 128 cores available are divided into 8 nodes (dies??) and that the second line should be the memory available and free for each "node", but the meaning of node distance is unclear to me.

Moreover (and apologies for my ignorance), I am not really sure how to interpret what you wrote about “let MPI handle communication across each NUMA node”.

From the output of numactl, we can see that NPS=4 is already set, hence the 8 NUMA nodes in total, 4 per CPU.
1 NUMA node with NPS=4 corresponds to a memory controller. Each CPU has 4 dual-channel memory controllers.
CPU dies are one more layer of segmentation, with 2 dies per memory controller.
And lastly, each die has two "CCX", each with its own separate chunk of L3 cache.
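
Put as arithmetic for one 64-core CPU, the hierarchy above works out as in this small sketch. The variable names are only illustrative.

```shell
#!/bin/bash
# Sketch only: core counts per level for a 64-core CPU, following the
# hierarchy described above (4 NUMA nodes, 2 dies each, 2 CCX per die).
cores_per_cpu=64
cores_per_numa=$(( cores_per_cpu / 4 ))
cores_per_die=$(( cores_per_numa / 2 ))
cores_per_ccx=$(( cores_per_die / 2 ))
echo "NUMA node: $cores_per_numa, die: $cores_per_die, CCX: $cores_per_ccx"
```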

"Distance" in this output gives you a first rough idea of how fast communication is between the individual NUMA nodes. E.g. the 1st line, 1st column entry is 10, meaning that communication within the first NUMA node is relatively fast. At the other end, the 1st line, 8th column entry is 32, so communication between cores on NUMA nodes 0 and 7 is much slower.
Don't read too much into that for now; it just tells us what we already know: intra-node communication is faster than inter-node communication. Hence my recommendation of keeping OpenMP regions contained within a NUMA node.

But another issue sticks out: memory population on this machine is unbalanced. You can see it from the different sizes of the NUMA nodes. This should really be avoided, otherwise it can cause performance regressions with these CPUs. It strikes me as very odd that an HPC facility would run their nodes like this.

I wish I could help you more, but I am not familiar with the nomenclature of SU2.

April 6, 2022, 07:29
#9
geronimo_750 (Marco), New Member
Quote:
Originally Posted by flotus1 View Post
But another issue sticks out: memory population on this machine is unbalanced. You can see it from the different sizes of the NUMA nodes. This should really be avoided, otherwise it can cause performance regressions with these CPUs. It strikes me as very odd that an HPC facility would run their nodes like this.

This is something I have been discussing with the HPC system guys for more than 2 years, but for whatever reason they do not want to address the problem!


Thanks again for your explanation, it is much appreciated.