
MPI & shared memory


December 15, 2018, 02:57
MPI & shared memory
  #1
usv001
Senior Member
Join Date: Sep 2015
Location: Singapore
Posts: 102
Dear Foamers,

I would like to know if the following is possible:

Say that I am running a case in parallel. Assuming that all the cores are within the same node, is it possible to allocate shared memory on the heap that is visible to all the cores? Specifically, if each processor creates a field as shown below,

Code:
scalarField* fieldPtr(new scalarField(n));
Can one core access the field created by another core using the pointer address?

Has anyone implemented something like this before? If so, how to go about doing it?

USV

December 30, 2018, 05:30
  #2
olesen (Mark Olesen)
Senior Member
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,684
Currently there is no DMA or RDMA wrapping in OpenFOAM. You will have to create your own MPI communicators, access windows, etc.
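
Outside of OpenFOAM, the usual building blocks are the MPI-3 shared-memory windows. A bare sketch (plain MPI with a made-up field size, not something wrapped by OpenFOAM) would look roughly like this:

Code:
#include <mpi.h>
#include <cstdio>

// Node-local shared allocation: one rank allocates, every rank on the
// same node gets a pointer into the same memory through the window.
int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // Communicator containing only the ranks on this node.
    MPI_Comm nodeComm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodeComm);

    int nodeRank;
    MPI_Comm_rank(nodeComm, &nodeRank);

    // Node-rank 0 allocates the whole field; the others allocate 0 bytes.
    const MPI_Aint nCells = (nodeRank == 0 ? 1000 : 0);
    double* base = nullptr;
    MPI_Win win;
    MPI_Win_allocate_shared(nCells*sizeof(double), sizeof(double),
                            MPI_INFO_NULL, nodeComm, &base, &win);

    // Every rank queries the base address of node-rank 0's segment.
    MPI_Aint size;
    int dispUnit;
    double* field = nullptr;
    MPI_Win_shared_query(win, 0, &size, &dispUnit, &field);

    MPI_Win_fence(0, win);               // open an access epoch
    if (nodeRank == 0) field[0] = 42.0;  // direct store by one rank
    MPI_Win_fence(0, win);               // make the store visible to all
    std::printf("rank %d sees %g\n", nodeRank, field[0]);

    MPI_Win_free(&win);
    MPI_Comm_free(&nodeComm);
    MPI_Finalize();
}
Wiring that into OpenFOAM fields and keeping the synchronisation correct is the real work; the snippet only shows the raw mechanism.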

December 30, 2018, 23:47
MPI/OpenMP Hybrid Programming in OpenFOAM
  #3
usv001
Senior Member
Join Date: Sep 2015
Location: Singapore
Posts: 102
Thank you Mark.

After a little scouring of the Internet, I came to the same conclusion. However, there is a simple but limited solution, which is to use OpenMP. Since I created my own schemes and solver, I was able to incorporate quite a bit of OpenMP parallelism into the code. For those using existing solvers/schemes, unfortunately, this won't help much unless you rewrite the schemes with OpenMP pragmas.

To compile with OpenMP, add the '-fopenmp' flag in the file '$WM_PROJECT_DIR/wmake/rules/linux64Gcc/c++Opt', so that it reads like this:
Code:
$:cat $WM_PROJECT_DIR/wmake/rules/linux64Gcc/c++Opt
c++DBUG     = 
c++OPT      = -O2 -fopenmp
Remember to source the etc/bashrc file before compiling.

In your solver/scheme, you may need to include <omp.h> (strictly, only calls to the OpenMP runtime functions require it; the pragmas themselves do not). After that, you're pretty much set. You can parallelize loops as follows:

Code:
#pragma omp parallel for
forAll(neighbour, celli)
{
    ...
}
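For a self-contained illustration (plain C++ with a made-up field, not an actual OpenFOAM scheme), the pattern is just a normal indexed loop, which is what forAll expands to; a reduction clause covers the common case of summing over cells:

Code:
#include <vector>

// Hypothetical example: sum a cell field in parallel. Each thread keeps a
// private partial sum and the reduction clause combines them at the end.
double fieldSum(const std::vector<double>& phi)
{
    double total = 0.0;

    #pragma omp parallel for reduction(+:total)
    for (long celli = 0; celli < static_cast<long>(phi.size()); ++celli)
    {
        total += phi[celli];
    }

    return total;
}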
When running in MPI/OpenMP mode, I decomposed the domain into the number of NUMA nodes that I am using rather than the total number of cores. After that, I set the number of OMP threads to the number of cores present in each NUMA node using the environment variable 'OMP_NUM_THREADS'. For instance, let's say that there are 4 NUMA nodes in each socket and each NUMA node consists of 6 cores (i.e. 24 cores per socket). If I wish to use 2 sockets in total, I can decompose the domain into 8 sub-domains and run the solver as follows:

Code:
export OMP_NUM_THREADS=6
mpirun -np 8 --map-by ppr:1:numa:pe=6 solver -parallel
This starts 8 MPI processes (one per sub-domain), maps each process to one NUMA node, and allocates 6 cores/threads to each process. Within each NUMA node, OpenMP can then parallelize over the 6 available cores/threads.

A word of caution though: this may not run any faster (in fact, it ran much slower in many cases) unless a significant portion of the code (i.e. the heavy-duty loops) is parallelized and the OpenMP overhead is kept small. Usually, the benefits start showing at higher core counts, when MPI traffic starts to dominate. In other cases, I think the built-in MPI parallelism alone is more efficient.

Lastly, I am no expert in these areas, just an amateur, so there could be things I am missing and better ways of doing this. Feel free to correct my mistakes and suggest improvements.

Cheers,
USV

December 31, 2018, 04:06
  #4
olesen (Mark Olesen)
Senior Member
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,684
Quote:
Originally Posted by usv001 View Post
Thank you Mark.

After a little scouring of the Internet, I came to the same conclusion. However, there is a simple but limited solution which is to use OpenMP.
...
To compile with OpenMP, add the '-fopenmp' flag in the file '$WM_PROJECT_DIR/wmake/rules/linux64Gcc/c++Opt', so that it reads like this:
Code:
$:cat $WM_PROJECT_DIR/wmake/rules/linux64Gcc/c++Opt
c++DBUG     = 
c++OPT      = -O2 -fopenmp
The preferred method is to use the COMP_OPENMP and LINK_OPENMP definitions instead (in your Make/options file) and do NOT touch the wmake rules. Apart from less editing, easier upgrading, etc., these are also defined for clang and Intel as well as gcc.
Take a look at the cfmesh integration for examples of using these defines, as well as various openmp directives.
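As a rough sketch (check the cfmesh or Test-openmp files for the real thing; the finiteVolume entries below just stand in for whatever your code already links), an application's Make/options would carry the two variables like this:

Code:
EXE_INC = \
    $(COMP_OPENMP) \
    -I$(LIB_SRC)/finiteVolume/lnInclude

EXE_LIBS = \
    $(LINK_OPENMP) \
    -lfiniteVolume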

Note that it is also good practice (I think) to guard your OpenMP pragmas with #ifdef/#endif so that you can rapidly enable/disable them. Sometimes debugging MPI + OpenMP can be rather "challenging".

December 31, 2018, 04:15
  #5
olesen (Mark Olesen)
Senior Member
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,684
Quote:
Originally Posted by usv001 View Post
A word of caution though: this may not run any faster (in fact, it ran much slower in many cases) unless a significant portion of the code (i.e. the heavy-duty loops) is parallelized and the OpenMP overhead is kept small.
Memory bandwidth affects many codes (not just OpenFOAM). You should give this a read:
https://www.ixpug.org/images/docs/IX...g-OpenFOAM.pdf

December 31, 2018, 19:02
  #6
zhangyan (Yan Zhang)
Senior Member
Join Date: May 2014
Posts: 120
Hi,
I'm also interested in this issue.
I would like to ask whether it is possible to create a shared class whose member variables consume a lot of memory.

PS: For OpenMP in OpenFOAM, I've found a GitHub repository.
__________________
https://openfoam.top

January 2, 2019, 08:00
  #7
usv001
Senior Member
Join Date: Sep 2015
Location: Singapore
Posts: 102
Hello Mark,

Quote:
Originally Posted by olesen View Post
The preferred method is to use the COMP_OPENMP and LINK_OPENMP definitions instead (in your Make/options file) and do NOT touch the wmake rules. Apart from less editing, easier upgrading etc, these are also defined for clang and Intel as well as gcc.
Take a look at the cfmesh integration for examples of using these defines, as well as various openmp directives.
That looks interesting. I tried to look for them but couldn't find anything relevant. Could you please post an example of how the Make/options file should look?

By the way, when OpenMP is not linked, the relevant pragmas are ignored by the compiler. This happens in both GCC and ICC. I don't use Clang though. So, I guess there is no need for guards.

Quote:
Originally Posted by olesen View Post
Memory bandwidth affects many codes (not just OpenFOAM). You should give this a read :
https://www.ixpug.org/images/docs/IX...g-OpenFOAM.pdf
I agree with you completely. I have been doing some preliminary profiling of my code and memory accesses are taking up nearly 80% of the computation time! Clearly, OpenFOAM would do better with more vectorization.

USV

January 2, 2019, 09:05
  #8
olesen (Mark Olesen)
Senior Member
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,684
Quote:
Originally Posted by usv001 View Post
Hello Mark,

That looks interesting. I tried to look for them but couldn't find anything relevant. Could you please post an example of how the Make/options file should look?

By the way, when OpenMP is not linked, the relevant pragmas are ignored by the compiler. This happens in both GCC and ICC. I don't use Clang though. So, I guess there is no need for guards.

The simplest example is applications/test/openmp/Make/options (in 1712 and later).


If you check the corresponding source file (Test-openmp.C), you'll perhaps see what I mean about the guards. As a minimum, you need a guard around the #include <omp.h> statement.
After that you can decide to use any of the following approaches:
  1. Just use the pragmas and let the compiler decide to use/ignore.
  2. Guard with the standard #ifdef _OPENMP
  3. Guard with the cfmesh/OpenFOAM #ifdef USE_OMP

The only reason I suggest the USE_OMP guard is that it lets you explicitly disable OpenMP for benchmarking and debugging by changing only the Make/options entry. If you don't need that, no worries; a minimal guarded sketch follows below.
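
To make the guards concrete, a minimal sketch (a hypothetical loop, not copied from Test-openmp.C) combining options 2 and 3 might look like this:

Code:
// Guard the include so the file still compiles without OpenMP.
#ifdef _OPENMP
    #include <omp.h>
#endif

// With USE_OMP left undefined in Make/options, the pragma disappears
// and the loop simply runs serially.
void relax(double* fld, const double* nbr, const long nCells)
{
    #ifdef USE_OMP
    #pragma omp parallel for
    #endif
    for (long celli = 0; celli < nCells; ++celli)
    {
        fld[celli] = 0.5*(fld[celli] + nbr[celli]);
    }
}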



Quote:

I agree with you completely. I have been doing some preliminary profiling of my code and memory accesses are taking up nearly 80% of the computation time! Clearly, OpenFOAM would do better with more vectorization.
I wouldn't draw the same conclusion at all; rather, vectorization makes the most sense when the arithmetic intensity is much higher (see the roofline model in the CINECA presentation).
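For reference, the roofline model bounds attainable performance by min(peak flops, memory bandwidth x arithmetic intensity), with arithmetic intensity measured in flops per byte moved; typical OpenFOAM-style stencil loops sit well below the ridge point, so the bandwidth term is the binding limit and wider vectors alone do not raise it.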







