
What's the MPI communication pattern for exchanging boundary info between partitions?

September 15, 2022, 19:24   #1
Sayan Bhattacharjee (aerosayan), Senior Member
Hi everyone,

I'm trying to figure out the most robust and efficient way to exchange boundary information between partitions.

To keep things simple, I'm thinking of just broadcasting every partition's boundary information to every other partition, i.e. broadcasting the data with a unique tag for each partition and letting the other partitions decide which pieces they need and want to receive.

Then, after receiving the data, I'm thinking of inserting it directly into the boundary cells of each partition.

Is that good enough?

I don't know if it's efficient to broadcast every partition's boundary information to every other partition. Seems kind of sketchy, to be honest.

Thanks!
Attached Images: mesh-decomposition.png (32.0 KB)

September 16, 2022, 01:18   #2
Arjun (arjun), Senior Member
It's a difficult issue.

In Wildkatze it is done by simply using send-receive. That works because if a processor has to send some data to another processor, then that processor also has to receive the same data from it. So if both processors perform the send-receive at the same time, it works out. This can be organised by looping over the processors from 0 to P_total - 1, sending the data and asking the receiving process to receive it.

This has worked well so far in Wildkatze.
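A minimal sketch of such a paired exchange, assuming per-neighbor buffers and counts are already set up (the names below are illustrative, not Wildkatze's actual code):

Code:
/* Paired exchange: every rank that sends to a neighbor also receives
 * from it, so a combined send-receive cannot deadlock. */
#include <mpi.h>

void exchange_with_neighbors(int nranks, int myrank,
                             const int *send_count, double **send_buf,
                             const int *recv_count, double **recv_buf)
{
    for (int p = 0; p < nranks; ++p) {
        if (p == myrank || (send_count[p] == 0 && recv_count[p] == 0))
            continue;
        /* Send to p and receive from p in a single call. */
        MPI_Sendrecv(send_buf[p], send_count[p], MPI_DOUBLE, p, 0,
                     recv_buf[p], recv_count[p], MPI_DOUBLE, p, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}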

September 16, 2022, 05:43   #3
Joern Beilke (JBeilke), Senior Member
Please have a look at:

https://www.openfoam.com/news/main-n...v2206/parallel

There was a detailed article about this research somewhere, but I cannot find it at the moment.

September 16, 2022, 06:52   #4
Paolo Lampitella (sbaffini), Senior Member
Quote:
Originally Posted by aerosayan View Post
To keep things simple, I'm thinking of just broadcasting every partition's boundary information to every other partition... Is that good enough?
Nope, this is not going to work at scale. It's a huge amount of data, and it's a sort of unoptimized all-to-all, which would swamp most systems even with much less data.

Still, I think you need to clarify which part you are referring to and which part is giving you trouble.

In practice, in my view, there are three phases in the code, punctuated by two very specific tasks:

PHASE 1

Before mesh partitioning (which is TASK 1). Here most things are kind of brute force. Either the master process does it and then sends the info to the others, or you use a parallel partitioner that does it all in one go. By the end of it, each process should have a list of its own global cell numbers, nothing more. The info required by the partitioner is necessarily read from the mesh file on the fly, storing only the part the partitioner needs.

PHASE 2

This is the part where the output from the partitioner is used to build the structures needed to do the parallel exchanges efficiently. Assuming this is the part giving you trouble (below I will also cover the case where the trouble is instead in PHASE 3), what you need here is to have each process read the mesh part it owns AND the mesh in the number of ghost layers your algorithm requires. As mentioned in other posts, these cells will be numbered, respectively, from 1 to Nlocal and from Nlocal + 1 to Nlocal + Nghosts. From the mesh-storage perspective, ghost cells are stored just like regular cells, yet they won't have any connectivity with other ghost cells outside the layers you decided to keep (you won't need that, and I suggest exchanging gradients as well to avoid any issue here).
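As an illustration only (0-based indexing, made-up names), the storage layout described above could look like this:

Code:
/* Owned cells in [0, n_local), ghost copies appended in
 * [n_local, n_local + n_ghost). */
#include <stdlib.h>

typedef struct {
    int     n_local;   /* cells owned by this rank                   */
    int     n_ghost;   /* ghost copies of cells owned by other ranks */
    double *phi;       /* one field value per cell, owned + ghost    */
} Field;

Field field_alloc(int n_local, int n_ghost)
{
    Field f;
    f.n_local = n_local;
    f.n_ghost = n_ghost;
    f.phi     = malloc((size_t)(n_local + n_ghost) * sizeof(double));
    return f;
}
/* The solver only updates phi[0..n_local-1]; phi[n_local..] is written
 * exclusively by the parallel exchange that fills the ghost layer. */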

Now, to conclude PHASE 2, you need TASK 2, which is itself made of two parts.

First, each process needs to know, for each of its ghost cells, which process actually owns it. One scalable way, which I forgot to mention in my previous posts, is to store the output of the partitioner as a partitioned array across the processes. That is, for N cells and p processes, assuming for simplicity that N is an integer multiple of p (there is some bookkeeping otherwise, but you can figure that out by yourself), process 1 holds the first N/p elements of the partitioned array, process 2 holds the second N/p batch, etc. The idea of a partitioned array is that each process can determine exactly, given N and p, which process has the element it needs. Assuming the partitioner output is stored in such a partitioned array (i.e., process 1 knows the owner process for the first N/p cells), you do an mpi_alltoall where each process sends to every other process the number of elements it needs from that process's chunk of the partitioned array, followed by an mpi_alltoallv where the actual lists of elements are sent. Finally, a last mpi_alltoallv, this time in reverse (sender and receiver are switched with respect to the first one), is used to send the correct owner lists back to the processes that asked.
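A minimal sketch of the partitioned-array lookup and the first count exchange, assuming N is an exact multiple of the number of ranks and 0-based indices (function names are made up):

Code:
#include <mpi.h>

/* Rank that stores global entry 'gidx' of an array of length N split
 * into contiguous chunks of N/nranks entries per rank. */
int partitioned_owner(long gidx, long N, int nranks)
{
    return (int)(gidx / (N / nranks));
}

/* mpi_alltoall step: every rank tells every other rank how many entries
 * of its chunk it will request (both arrays have one slot per rank). */
void exchange_request_counts(const int *n_requests_to, int *n_requests_from)
{
    MPI_Alltoall((void *)n_requests_to, 1, MPI_INT,
                 n_requests_from, 1, MPI_INT, MPI_COMM_WORLD);
    /* The actual index lists then follow with MPI_Alltoallv, and a final
     * MPI_Alltoallv in the reverse direction returns the owner ranks. */
}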

Now each process knows the owner of its ghost cells, but not the reverse: the owner doesn't know to which processes it has to send the info. To solve this you just use the same pattern used for the partitioned array above: an mpi_alltoall where each process sends to its ghost owners the number of cells that must be sent, followed by an mpi_alltoallv where the actual lists are sent.

If you stored all the exchange information correctly, this actually completes PHASE 2, because now each process knows which process owns its ghost cells, i.e., who will fill its ghost cells, and, for each other process, which cells must be sent to fill their ghosts.

How exactly to store this information is kind of up to you, but note that, for each process, the information related to processes not involved in any exchange is irrelevant here: you simply don't store it or care about it. You could go either with an optimized CSR-like structure (rows indexed by process number, columns by cell indices) or with some more regular data structure, as sketched below. The point is, you now have this info and only this, because this is all you need, so you just use it. If you work on it, you can also fit periodic boundary conditions into the exact same framework (but there a single process can "exchange" with itself); let's keep that for another time.
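As a purely illustrative example of such a CSR-like layout (all names made up):

Code:
/* "Rows" are the neighbor ranks this process actually exchanges with,
 * "columns" are the local cell indices to pack for each of them. */
typedef struct {
    int  n_nbr;      /* number of neighbor ranks                        */
    int *nbr_rank;   /* [n_nbr]      rank of each neighbor              */
    int *offset;     /* [n_nbr + 1]  start of each neighbor's cell list */
    int *cell;       /* [offset[n_nbr]] local cell indices, row by row  */
} ExchangeCSR;
/* Cells to send to neighbor i: cell[offset[i]] .. cell[offset[i+1] - 1]. */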

PHASE 3

If you stored the info exchanged in PHASE 2 correctly, this should be easy. I will assume so and focus on a possible source of trouble in the SEND-RECV part. In practice, each process now has a list of processes to which data must be sent and, as mentioned by arjun, from which data must be received. What you need to do, whenever a parallel exchange is needed, is to have each process loop over ITS OWN list of neighbor processes (which, again, you know from PHASE 2) and do the SEND-RECV part. I see people doing this in a lot of ways.

My favourite one is, still in PHASE 2, to first sort the list of processes to visit in the SEND-RECV part according to a common pattern. As I mentioned in other posts, I use this approach. In this way, a process has its list of neighbor processes sorted in such a way that most communications from its side will be matched from the other side and, in any case, even when not exactly matched (the current communication partner might still be busy with a previous partner), they are guaranteed to be deadlock-free. At this point, and only because of this specific order, you can use any mechanism to do the exchange: blocking, non-blocking, or a coupled sendrecv. However, I suggest sticking to non-blocking.

In practice, you should have a first routine that loops over the neighbor processes, starts a non-blocking recv as soon as possible and, after that, posts the non-blocking send. This routine is effectively non-blocking. Then a second routine should do the wait and completion part of the non-blocking calls, which is actually blocking. If done correctly, you can separate the two with additional work while waiting for the parallel exchange to complete.
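A minimal sketch of this two-routine pattern, with a hypothetical Halo structure holding the per-neighbor lists built in PHASE 2 (only the MPI calls are standard, everything else is illustrative):

Code:
#include <mpi.h>

typedef struct {
    int          n_nbr;        /* neighbor ranks, sorted in PHASE 2    */
    int         *nbr;
    int         *send_count;   /* doubles to send to each neighbor     */
    int         *recv_count;   /* doubles expected from each neighbor  */
    double     **send_buf;
    double     **recv_buf;
    MPI_Request *req;          /* 2 * n_nbr requests                   */
} Halo;

/* Non-blocking start: post all receives first, then all sends. */
void halo_start(Halo *h)
{
    for (int i = 0; i < h->n_nbr; ++i)
        MPI_Irecv(h->recv_buf[i], h->recv_count[i], MPI_DOUBLE,
                  h->nbr[i], 0, MPI_COMM_WORLD, &h->req[i]);
    for (int i = 0; i < h->n_nbr; ++i)
        MPI_Isend(h->send_buf[i], h->send_count[i], MPI_DOUBLE,
                  h->nbr[i], 0, MPI_COMM_WORLD, &h->req[h->n_nbr + i]);
}

/* Blocking completion: useful work can be done between halo_start()
 * and halo_finish() while the exchange is in flight. */
void halo_finish(Halo *h)
{
    MPI_Waitall(2 * h->n_nbr, h->req, MPI_STATUSES_IGNORE);
}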

In conclusion, if PHASE 2 is handled correctly, there is no logical room for sending everything to everyone.

September 16, 2022, 11:12   #5
Alex (flotus1), Super Moderator
Quote:
Originally Posted by sbaffini View Post
Nope, this is not going to work at scale. It's a huge amount of data, and it's a sort of unoptimized all-to-all, which would swamp most systems even with much less data.
100x this.
ANY kind of all-to-all communication will tank performance once the number of processes is large enough, even if most of the data is never communicated or gets discarded. Avoid it like the plague.

I have written a preprocessor that builds the communication data structures for our MPI solver. With a static mesh/decomposition it can be fairly straightforward.
Every MPI process gets a list of the processes it sends values to and a list of the processes it receives data from.
These lists also contain information about the amount of data that is transferred and where the values belong on the receiving process.
Building these data structures is computationally expensive, of course, but it's a single preprocessing step, so we don't have to worry about that.

With that, MPI communication is just loops of non-blocking send/receive between pairs of processes. We got good scaling on 10k+ cores, with fewer than 20k "cells" on each core. That's not possible if all-to-all communication is present.
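As an illustration of the receive side only (made-up names, not the actual solver code), the "where the values belong" information is just an index map used to scatter received values into the right ghost cells:

Code:
void unpack_ghosts(double *phi,            /* field, owned + ghost cells */
                   const double *recv_buf, /* values from one neighbor   */
                   const int *ghost_idx,   /* target cell for each value */
                   int n)
{
    for (int i = 0; i < n; ++i)
        phi[ghost_idx[i]] = recv_buf[i];
}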

Edit:
Quote:
Originally Posted by aerosayan View Post
To keep things simple, I'm thinking of just broadcasting every partition's boundary information to every other partition... Then, after receiving the data, I'm thinking of inserting it directly into the boundary cells of each partition.
If you can do this, you are almost there.
All you need to add is a second step that works with the information about which values actually got used. This lets you build the single-sender, single-receiver communication. And bam, you've got yourself a preprocessor for creating efficient MPI communication. There are many other ways to do this, of course.

September 16, 2022, 11:39   #6
Sayan Bhattacharjee (aerosayan), Senior Member
Quote:
Originally Posted by sbaffini View Post
The idea of a partitioned array is that each process can determine exactly, given N and p, which process has the element it needs.

Haha. Yesterday I derived the same solution: after the cells have been Morton ordered, we can use the neighbor cell indices of each cell to identify the boundary/ghost cells of each partition.

Really elegant solution to an otherwise complicated graph problem.

September 16, 2022, 12:56   #7
Paolo Lampitella (sbaffini), Senior Member
Quote:
Originally Posted by aerosayan View Post
After the cells have been Morton ordered, we can use the neighbor cell indices of each cell to identify the boundary/ghost cells of each partition.
That's indeed the same machinery I use when an external partitioner is not used (because my cells are always ordered along an SFC) and the i-th process just gets the i-th chunk of cells.

A partitioned array has the same bookkeeping formulas needed in the trivial SFC-based partitioner.
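For reference, a sketch of the usual block-distribution bookkeeping when N is not an exact multiple of p (the first N mod p ranks get one extra cell; 0-based indices, illustrative only):

Code:
/* Number of cells owned by 'rank'. */
long block_size(long N, int p, int rank)
{
    return N / p + (rank < N % p ? 1 : 0);
}

/* First global cell index owned by 'rank'. */
long block_start(long N, int p, int rank)
{
    long base = N / p, rem = N % p;
    return rank * base + (rank < rem ? rank : rem);
}

/* Rank that owns global cell 'gidx' (inverse of the two above). */
int block_owner(long gidx, long N, int p)
{
    long base = N / p, rem = N % p, split = (base + 1) * rem;
    return (gidx < split) ? (int)(gidx / (base + 1))
                          : (int)(rem + (gidx - split) / base);
}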
