What are your views on MPI vs MPI+OpenMP?

#1 | September 23, 2022, 07:32 | Sayan Bhattacharjee (aerosayan), Senior Member

Hello everyone,

I was reading open-source codes and observed that some devs use MPI+OpenMP for their solvers. I understand the reasoning, but currently I'm finding OpenMP code to be much more problematic. Apart from increasing the complexity of the solver, OpenMP loops apparently have a high startup cost.

Surprisingly, pure MPI seems like the better option for me.

I haven't written implicit solvers with MPI, so I'm a little inexperienced in that regard, but distributing mesh partitions across multiple processes and letting the code run on each process in a single-threaded fashion seems like a very simple way to do things.

I like that the code running on each process can be single-threaded for the most part, and only needs to transfer boundary information between neighbors.

I like it because writing our linear solvers, among other things, becomes easy.

Parallelizing even a Gaussian elimination solver with OpenMP is somewhat complicated, and has not been fruitful for me.

I like OpenMP for SIMD vectorization, though. #pragma omp simd, for example, instructs the compiler to generate optimized SIMD code. I'm also interested in its graph-based task execution model. Professor Jack Dongarra's group apparently uses this to find the most efficient pathways through things like QR decomposition or Cholesky factorization solvers.
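
For example, a minimal sketch of the kind of loop I mean (the kernel and names are made up for illustration):

Code:
/* axpy-style kernel; "#pragma omp simd" asks the compiler to vectorize
   the loop. Compile with -fopenmp or -fopenmp-simd (GCC/Clang). */
void axpy(int n, double a, const double * restrict x, double * restrict y)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}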

What is your view on MPI vs MPI+OpenMP?

Thanks and regards
~Sayan

#2 | September 23, 2022, 08:22 | Alex (flotus1), Super Moderator

My perspective on the matter is fairly simple:
Pure MPI is a good starting point no matter what your plans might be in the future.
One of the few reasons I know for adding OpenMP parallelization later: you need better scaling at high core counts. As long as you are satisfied with the scaling for the cases you typically run, it is just not worth the effort.
One thing gets easier for OpenMP when it is implemented on top of MPI: you don't have to worry about NUMA. That gets handled by having each OpenMP domain span only one NUMA node.

#3 | September 23, 2022, 08:54 | Paolo Lampitella (sbaffini), Senior Member

I agree, pure MPI is just my sweet spot for now. With little to zero specialized knowledge I can go to O(10^3) processes without efficiency problems. And while it was just for testing purposes, I could also go to O(10^4) without clear issues (though the efficiency was lower for my test cases).

Simplifying a bit: MPI just works out of the box, using common sense and awareness.

The amount of stuff you need to care about in OpenMP to make a good code has never been approachable for me, given my constraints.

Also, since you basically need to work at the loop level, and each loop is different, a lot of practical work is needed to implement and maintain it.

Finally, when I first investigated this, it turned out that my MPI test code was faster on a single node than my OpenMP test code. I didn't need more than that to make a decision.

However, as mentioned by flotus, scalability on some systems or problems requires both, as beyond a certain number of MPI processes communication will become significant even for very little data. At that point, if you can put in (and sustain over time) the effort of a mixed code with GOOD OpenMP, you can gain a possible 10x-50x increase in the number of processes, if it is needed (roughly estimated from the current number of cores on typical nodes).

#4 | September 23, 2022, 17:50 | Sayan Bhattacharjee (aerosayan)

Quote:
Originally Posted by flotus1
One of the few reasons I know for adding in OpenMP parallelization later: you need better scaling on high core counts.
Thanks for your answer. I also received a lot of insight from Paolo, but I didn't understand this part, i.e. how OpenMP helps with scaling.

Historically, OpenMP didn't scale very well to high core counts, so I'm assuming the idea is to use OpenMP locally on each node and MPI to distribute data between the nodes, like the many devs mentioned in my question are doing.

I'm assuming it's to take advantage of shared memory access within each node, thus reducing the MPI communication overhead locally on each node.

But I also learned a little from Paolo about MPI being able to access shared memory, just like OpenMP does, and without much communication overhead. I think it's called one-sided communication.

It's a little advanced for me right now, but I would probably like to use it, if it's actually good.

Can MPI shared memory access be used instead of OpenMP? And is it as good as OpenMP, or better, in your expert opinion?

I found a few tutorials and PDFs about this MPI shared-memory topic, and have yet to work through them.
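
From what I've read so far, the MPI-3 shared-memory approach looks roughly like this (a minimal sketch based on my reading of the standard API, not tested code; the array size is arbitrary):

Code:
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group the ranks that live on the same node. */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        0, MPI_INFO_NULL, &nodecomm);

    int noderank, nodesize;
    MPI_Comm_rank(nodecomm, &noderank);
    MPI_Comm_size(nodecomm, &nodesize);

    /* Each rank contributes a slab of a node-shared array. */
    double *myslab;
    MPI_Win win;
    MPI_Win_allocate_shared(1000 * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, nodecomm, &myslab, &win);

    /* Get a direct pointer to a neighbour rank's slab; after proper
       synchronization (e.g. MPI_Win_fence, or a barrier plus
       MPI_Win_sync), it can be read like ordinary memory. */
    MPI_Aint size;
    int disp;
    double *peer;
    MPI_Win_shared_query(win, (noderank + 1) % nodesize,
                         &size, &disp, &peer);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}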

#5 | September 23, 2022, 18:14 | Paolo Lampitella (sbaffini)

Note that these are two distinct features.

What I was referring to in my answers is the simple fact that a code using only MPI works just fine both on a single node (where it may or may not take advantage of the shared memory, depending on how the MPI library is implemented) and on multiple nodes.

The remote memory access feature in MPI is also a sort of shared-memory approach, but in my opinion it should be used carefully and not as a replacement for OpenMP. That is, don't code basic exchanges or loops to leverage it. One place where it may be useful is if you need to monitor the progress of a job split across processes. Each one has its own progress, and in the use case I have in mind it would be too costly to do regular communication just to monitor global progress. In this scenario you may have each worker process remotely update a global progress monitor stored only on the master. That is, just minor things done once, outside the main iterations.
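
To fix ideas, a sketch of what I mean (hypothetical, I haven't run this; rank 0 plays the master holding the counter):

Code:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long progress = 0;   /* the global counter lives on rank 0 */
    MPI_Win win;
    MPI_Win_create(&progress, sizeof(long), sizeof(long),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    for (int chunk = 0; chunk < 100; ++chunk) {
        /* ... a chunk of real work would go here ... */

        long one = 1;    /* remotely add 1 to the master's counter */
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Accumulate(&one, 1, MPI_LONG, 0, 0, 1, MPI_LONG,
                       MPI_SUM, win);
        MPI_Win_unlock(0, win);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {     /* lock own window before reading it locally */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        printf("chunks completed: %ld\n", progress);
        MPI_Win_unlock(0, win);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}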

The scenario where OpenMP helps MPI is instead when you have a single MPI process per node and all the cores on that node use OpenMP to access their shared memory. As you noticed, the advantage is that you typically get a factor of 10 or more reduction in the global number of MPI processes (depending on how many cores you have in a node), which can make a difference for certain cases or machines, especially for some global communication patterns.

But, again, the general rule of thumb is to care about something only when you actually hit its limits. That has never happened to me, to date.

#6 | September 23, 2022, 18:19 | Alex (flotus1)

MPI has a shared memory model, correct. But it's not the same as using OpenMP.

Let's assume we are using MPI with domain decomposition.
With a well thought-out MPI solver, the parallelization overhead (and thus the loss of strong scaling) comes from the domain boundaries. More sub-domains means a higher boundary-to-volume ratio, and thus more parallel overhead.
This is where OpenMP can help. Let's say you want to run on 16384 cores.
With pure MPI, you need to partition the mesh into 16384 sub-domains. OpenMP allows you to reduce this number. E.g. if the cluster has NUMA nodes that span 8 cores each, you only need to decompose the mesh into 16384/8 = 2048 sub-domains. Communication between the sub-domains is handled by MPI, just as in the pure MPI solver, and the actual computations within a sub-domain are parallelized over 8 OpenMP threads.
This results in less boundary area, less communication volume, and ultimately lower parallelization overhead, which should then translate into better scaling than pure MPI.
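
A rough sketch of how the two levels fit together (illustrative only; the function names and the Open MPI launch line in the comment are my assumptions, not from any specific solver):

Code:
#include <mpi.h>

/* One MPI rank per NUMA domain, e.g. launched with Open MPI as
     mpirun --map-by ppr:1:numa --bind-to numa -x OMP_NUM_THREADS=8 ./solver
   exchange_halos() stands for a hypothetical MPI boundary exchange. */
void time_step(double *u, double *u_new, int ncells,
               void (*exchange_halos)(double *))
{
    exchange_halos(u);          /* MPI traffic: sub-domain boundaries only */

    #pragma omp parallel for    /* threads split the sub-domain interior   */
    for (int c = 0; c < ncells; ++c)
        u_new[c] = u[c];        /* placeholder for the real stencil update */
}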

Edit, just for perspective: the core count of 16384 is a bad example. Edge cases aside, OpenMP+MPI is typically done for solvers that need to run on 100k cores and more.

#7 | September 28, 2022, 04:04 | Sayan Bhattacharjee (aerosayan)

Quote:
Originally Posted by flotus1
MPI has a shared memory model, correct. But it's not the same as using OpenMP. [...]
I haven't learned OpenMP's task-based parallelism yet, but I think it might help here.

I somewhat disagree with Paolo that OpenMP has to be used at the loop level. Yes, it is often used that way, but how we use it matters too.

I think I have a good point here, and I'm not just being a contrarian.

I think it's entirely possible to use OpenMP as a portable implementation of native threading, and use it exactly like pthreads.

What I mean is (and I hope I can convey the idea precisely) that we can use OpenMP to set up one big outer parallel region that distributes work to, say, 8 local cores, where each thread NEVER creates any other OpenMP parallel region again.

That is, something like this:

Code:
// requires <omp.h>
#pragma omp parallel num_threads(numLocalProcessors)
{
    int iproc = omp_get_thread_num();
    SolveNavierStokesEquation(iproc, AllDataRequired);
}

Internally, we could segregate our data so that it can be accessed by indexing with iproc. Imagine all of our local data stored in an array of objects.
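
Something like this (a hypothetical sketch; Partition and the two helper functions are made-up names):

Code:
#include <omp.h>

typedef struct {
    double *u;       /* solution on this thread's mesh partition */
    double *halo;    /* copies of neighbour boundary data        */
    int     ncells;
} Partition;

extern void copy_neighbour_boundaries(Partition *part, int iproc);
extern void advance_one_step(Partition *p);

void solve(Partition *part, int nthreads, int nsteps)
{
    #pragma omp parallel num_threads(nthreads)
    {
        int iproc = omp_get_thread_num();
        for (int step = 0; step < nsteps; ++step) {
            #pragma omp barrier              /* neighbours finished the step */
            copy_neighbour_boundaries(part, iproc);
            #pragma omp barrier              /* halos are filled everywhere  */
            advance_one_step(&part[iproc]);  /* serial, touches own data only */
        }
    }
}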

This is why I said that OpenMP can be used as a portable implementation of native threading, like pthreads or winthreads.

Because in those programming paradigms we do exactly this: we call functions to run on different threads, and just sit and wait for them to finish.

Synchronization needs to be handled carefully. If possible, the code should be written without any deadlocks.

I think that's possible; otherwise we'll use barriers before sharing the boundary data.

I think it's definitely possible to do this.

#8 | September 28, 2022, 04:07 | Sayan Bhattacharjee (aerosayan)

I think this method is useful because we can maintain the simplicity that was present in the MPI code, i.e. each processor works on its own set of data serially and doesn't care about anything else.

All required neighbor data will be provided to it, either with proper deadlock-free code or, in the worst case, by using barriers.

If this doesn't work, we can always use pthreads or another portable native threading library.

#9 | September 28, 2022, 04:38 | Alex (flotus1)

I don't want to discourage you, but frankly I have little faith in your idea. Feel free to prove me wrong.

OpenMP is used at the loop level when the intent is to increase the scaling of an MPI solver based on domain decomposition.
Assuming your idea works from a correctness standpoint, my gut feeling is that the housekeeping necessary to make it work will either negate the performance gains, or become so complicated by itself that plain old loop-level OpenMP would be easier. Possibly both.

#10 | September 28, 2022, 05:08 | Paolo Lampitella (sbaffini)

Quote:
Originally Posted by aerosayan
I think it's entirely possible to use OpenMP as a portable implementation of native threading, and use it exactly like pthreads. [...]
Yeah, don't take my statements on OpenMP too seriously, as they come from someone who hasn't coded with it in 10 years. What I wanted to convey is that it is just more complex to handle than pure MPI.

Even if the single loop makes sense (and I have no reason to doubt that, simply because I don't know), it just saves you from a possible performance bottleneck; you still have to decide, for every single part of the code, whether every thread on the node is meant to do it or not. To me that's a lot of code to write and maintain, with little bang for the buck in my case.

#11 | September 28, 2022, 05:21 | Sayan Bhattacharjee (aerosayan)

Quote:
Originally Posted by sbaffini
Even if the single loop makes sense (and I have no reason to doubt that, simply because I don't know), it just saves you from a possible performance bottleneck; you still have to decide, for every single part of the code, whether every thread on the node is meant to do it or not. To me that's a lot of code to write and maintain, with little bang for the buck in my case.
All the processes are supposed to run at once.

We don't decide which parts run serially and which in parallel.

Imagine it like a smaller MPI: each process runs only on its own partition of the mesh, and it runs without creating any child threads.

So it will look like the code is running serially on each OpenMP thread.

Respectfully, I don't agree with flotus right now that there will be lots of housekeeping required and that it will degrade performance.

The data is in shared memory, and if a copy is required, it will be fast. Possibly faster than if done through MPI.

Some barriers or mutexes may be required before accessing data.

Buuuut I've been routinely wrong in the past, so I hope I'm not wrong this time.

#12 | September 28, 2022, 05:47 | Alex (flotus1)

You have every right to disagree with my assessment at this stage. After all, it is mostly based on hunches and guesses.
I myself have toyed with tangentially related ideas for OpenMP in the past. But I quickly chickened out when it came to actually implementing them.

All I'm saying is this: if I had the task of increasing the scaling of an MPI solver by adding OpenMP, I would go with the tried and tested method of loop parallelism. If you have the time to try something new without guaranteed success, that's a great place to be in.

#13 | September 28, 2022, 05:48 | Paolo Lampitella (sbaffini)

Quote:
Originally Posted by aerosayan
Some barriers or mutexes may be required before accessing data.
I was referring to this. Every variable that I now know for sure is handled by a single process would become potentially shared. I have to make sure that another thread won't update, say, the dummy index of my for loop. Maybe it is super easy, yet I have to think about this for every variable and every loop.

In MPI, every serial part of the code is just naturally redundant with no particular hassle. In contrast, with OpenMP I have to decide which thread will do what, unless I want to explicitly duplicate ALL the variables for each thread... which in the end is, indeed, MPI without communication.

Let me also highlight again that, if you have a well-coded MPI part (not super smart, just reasonable), with possibly no collective operations in the iterations, non-blocking send/recv interleaved with computation wherever possible, and a code with high arithmetic intensity (like a coupled solver), then pure MPI is just going to work up to 10k processes without a glitch. But then you also need to ask who is going to pay for 10k processes or more... that's on the order of $1k per hour.
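
By non-blocking send/recv interleaved with computation I mean a pattern like this (a generic sketch, not my actual code; the update_* helpers are hypothetical):

Code:
#include <mpi.h>

extern void update_interior_cells(double *u);
extern void update_boundary_cells(double *u, double **recvbuf);

void halo_exchange_step(double *u, int nneigh, const int *neigh,
                        double **sendbuf, double **recvbuf, const int *len)
{
    MPI_Request req[2 * nneigh];   /* C99 VLA, fine for a sketch */

    for (int n = 0; n < nneigh; ++n) {   /* 1. post all halo exchanges */
        MPI_Irecv(recvbuf[n], len[n], MPI_DOUBLE, neigh[n], 0,
                  MPI_COMM_WORLD, &req[2 * n]);
        MPI_Isend(sendbuf[n], len[n], MPI_DOUBLE, neigh[n], 0,
                  MPI_COMM_WORLD, &req[2 * n + 1]);
    }

    update_interior_cells(u);            /* 2. compute while messages fly */

    MPI_Waitall(2 * nneigh, req, MPI_STATUSES_IGNORE);

    update_boundary_cells(u, recvbuf);   /* 3. finish cells needing halos */
}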

EDIT: Again, however, this is just about me and how I decided on full MPI. At the moment I'm doing code development, bug fixing, developer/user/theory documentation, the website, GUI design and testing, cloud design and testing, maintaining and updating the software toolchain, grant proposals and follow-up, plus some regular consulting jobs... my code can't be a pain, not even for a day (it already is too much).

#14 | September 28, 2022, 05:55 | Sayan Bhattacharjee (aerosayan)

Quote:
Originally Posted by sbaffini
[...] But then you also need to ask who is going to pay for 10k processes or more... that's on the order of $1k per hour.
True.

My additional incentive for MPI+OpenMP is to run on smaller hardware. Say, a laptop with 4-16 cores.

If we can use shared data, then it would remove the MPI communication on the node (a cheap laptop).

Deadlocks and mutexes are an issue, but I'll have to study what I can do about them.

I will try it first with my explicit code and see what happens.

#15 | September 29, 2022, 05:28 | Alex (flotus1)

Quote:
Originally Posted by aerosayan
If we can use shared data, then it would remove the MPI communication on the node (a cheap laptop).
Why would you want to remove MPI communication at low core counts?
That's where the performance impact should be negligible anyway. And if it isn't, the first order of business would be to improve the MPI implementation.

#16 | October 2, 2022, 07:51 | Sayan Bhattacharjee (aerosayan)

I did some research. What I mentioned is definitely possible, but flotus is also right.

Copying boundary information would be costly, and should be avoided if possible. A solution might be to apply parallelism at a lower level, though not as low as the per-loop level.

That's where I somewhat disagree: per-loop parallelism would become a little cumbersome.

I think CalculiX CCX does this well.

I think they have found a good balance: CCX uses pthreads for the parallel assembly of the stiffness and mass matrices on different local threads. It's low level, but not on a per-loop basis.
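
Roughly, the pattern looks like this (my own illustration of the idea, not actual CCX source; assemble_element is a made-up name):

Code:
#include <pthread.h>

#define NTHREADS 4

extern void assemble_element(int e);   /* hypothetical per-element assembly */

typedef struct { int first, last; } Slice;   /* element range for a thread */

static void *assemble_slice(void *arg)
{
    Slice *s = (Slice *)arg;
    for (int e = s->first; e < s->last; ++e)
        assemble_element(e);
    return NULL;
}

void parallel_assembly(int nelem)
{
    pthread_t tid[NTHREADS];
    Slice slice[NTHREADS];
    int chunk = (nelem + NTHREADS - 1) / NTHREADS;

    for (int t = 0; t < NTHREADS; ++t) {
        slice[t].first = t * chunk;
        slice[t].last  = (t + 1) * chunk < nelem ? (t + 1) * chunk : nelem;
        pthread_create(&tid[t], NULL, assemble_slice, &slice[t]);
    }
    for (int t = 0; t < NTHREADS; ++t)   /* sit and wait for them to finish */
        pthread_join(tid[t], NULL);
}

/* Note: the per-element work is independent here; the scatter-add into
   the global matrix would still need a lock, colouring, or per-thread
   buffers to stay thread-safe. */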

[Attached image: pthread.png]
