
What's the MPI communication pattern for exchanging boundary info between partitions?

September 15, 2022, 19:24   #1
Sayan Bhattacharjee (aerosayan), Senior Member
Hi everyone,

I'm trying to figure out the most robust and efficient way to exchange boundary information between partitions.

To keep things simple, I'm thinking of just broadcasting every partition's boundary information to every other partition, i.e. broadcasting the data with a unique tag for each partition and letting the other partitions decide which pieces they need and want to receive.

Then, after receiving the data, I'm thinking of inserting it directly into the boundary cells of each partition.

Is that good enough?

I don't know if it's efficient to broadcast every partition's boundary information to every other partition. Seems kind of sketchy, to be honest.

Thanks!
Attached Images: mesh-decomposition.png (32.0 KB)

September 16, 2022, 01:18   #2
Arjun (arjun), Senior Member
It's a difficult issue.

In Wildkatze it is done by simply using send-receive. That works because if a processor has to send some data to another processor, then that processor also has to receive the same data from it. So if both processors perform the send-receive at the same time, it works out. This can be organised by looping over the processors from 0 to P_total - 1, sending the data and asking the receiving process to receive it.

This has worked well so far in Wildkatze.
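A minimal sketch of such a paired exchange, assuming per-neighbor buffers and counts are already set up (the names below are illustrative, not Wildkatze's actual code):

Code:
/* Paired exchange: every rank that sends to a neighbor also receives
 * from it, so a combined send-receive cannot deadlock. */
#include <mpi.h>

void exchange_with_neighbors(int nranks, int myrank,
                             const int *send_count, double **send_buf,
                             const int *recv_count, double **recv_buf)
{
    for (int p = 0; p < nranks; ++p) {
        if (p == myrank || (send_count[p] == 0 && recv_count[p] == 0))
            continue;
        /* Send to p and receive from p in a single call. */
        MPI_Sendrecv(send_buf[p], send_count[p], MPI_DOUBLE, p, 0,
                     recv_buf[p], recv_count[p], MPI_DOUBLE, p, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}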

September 16, 2022, 05:43   #3
Joern Beilke (JBeilke), Senior Member
Please have a look at:

https://www.openfoam.com/news/main-n...v2206/parallel

There was a detailed article about this research somewhere, but I cannot find it at the moment.

September 16, 2022, 06:52   #4
Paolo Lampitella (sbaffini), Senior Member
Quote:
Originally Posted by aerosayan View Post
To keep things simple, I'm thinking of just broadcasting every partition's boundary information to every other partition... Is that good enough?
Nope, this is not going to work at scale. It's a huge amount of data, and it's a sort of unoptimized all-to-all, which would swamp most systems even with much less data.

Still, I think you need to clarify which part you are referring to and which part is giving you trouble.

In practice, in my view, there are three phases in the code, punctuated by two very specific tasks:

PHASE 1

Before mesh partitioning (which is TASK 1). Here most things are kind of brute force. Either the master process does it and then sends the info to the others, or you use a parallel partitioner that does it all in one go. By the end of it, each process should have a list of its own global cell numbers, nothing more. The info required by the partitioner is necessarily read from the mesh file on the fly, storing only the part the partitioner needs.

PHASE 2

This is the part where the output from the partitioner is used to build the structures needed to do the parallel exchanges efficiently. Assuming this is the part giving you trouble (below I will also cover the case where the trouble is instead in PHASE 3), what you need here is to have each process read the mesh part it owns AND the mesh in the number of ghost layers your algorithm requires. As mentioned in other posts, these cells will be numbered, respectively, from 1 to Nlocal and from Nlocal + 1 to Nlocal + Nghosts. From the mesh-storage perspective, ghost cells are stored just like regular cells, yet they won't have any connectivity with other ghost cells outside the layers you decided to keep (you won't need that, and I suggest exchanging gradients as well to avoid any issue here).
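As an illustration only (0-based indexing, made-up names), the storage layout described above could look like this:

Code:
/* Owned cells in [0, n_local), ghost copies appended in
 * [n_local, n_local + n_ghost). */
#include <stdlib.h>

typedef struct {
    int     n_local;   /* cells owned by this rank                   */
    int     n_ghost;   /* ghost copies of cells owned by other ranks */
    double *phi;       /* one field value per cell, owned + ghost    */
} Field;

Field field_alloc(int n_local, int n_ghost)
{
    Field f;
    f.n_local = n_local;
    f.n_ghost = n_ghost;
    f.phi     = malloc((size_t)(n_local + n_ghost) * sizeof(double));
    return f;
}
/* The solver only updates phi[0..n_local-1]; phi[n_local..] is written
 * exclusively by the parallel exchange that fills the ghost layer. */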

Now, to conclude PHASE 2, you need TASK 2, which is itself made of two parts.

First, each process needs to know, for each of its ghost cells, which process actually owns it. One scalable way, which I forgot to mention in my previous posts, is to store the output of the partitioner as a partitioned array across the processes. That is, for N cells and p processes, assuming for simplicity that N is an integer multiple of p (there is some bookkeeping otherwise, but you can figure that out by yourself), process 1 holds the first N/p elements of the partitioned array, process 2 holds the second N/p batch, etc. The idea of a partitioned array is that each process can determine exactly, given N and p, which process has the element it needs. Assuming the partitioner output is stored in such a partitioned array (i.e., process 1 knows the owner process for the first N/p cells), you do an mpi_alltoall where each process sends to every other process the number of elements it needs from that process's chunk of the partitioned array, followed by an mpi_alltoallv where the actual lists of elements are sent. Finally, a last mpi_alltoallv, this time in reverse (sender and receiver are switched with respect to the first one), is used to send the correct owner lists back to the processes that asked.
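A minimal sketch of the partitioned-array lookup and the first count exchange, assuming N is an exact multiple of the number of ranks and 0-based indices (function names are made up):

Code:
#include <mpi.h>

/* Rank that stores global entry 'gidx' of an array of length N split
 * into contiguous chunks of N/nranks entries per rank. */
int partitioned_owner(long gidx, long N, int nranks)
{
    return (int)(gidx / (N / nranks));
}

/* mpi_alltoall step: every rank tells every other rank how many entries
 * of its chunk it will request (both arrays have one slot per rank). */
void exchange_request_counts(const int *n_requests_to, int *n_requests_from)
{
    MPI_Alltoall((void *)n_requests_to, 1, MPI_INT,
                 n_requests_from, 1, MPI_INT, MPI_COMM_WORLD);
    /* The actual index lists then follow with MPI_Alltoallv, and a final
     * MPI_Alltoallv in the reverse direction returns the owner ranks. */
}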

Now each process knows the owner of its ghost cells, but not the reverse: the owner doesn't know to which processes it has to send the info. To solve this you just use the same pattern used for the partitioned array above: an mpi_alltoall where each process sends to its ghost owners the number of cells that must be sent, followed by an mpi_alltoallv where the actual lists are sent.

If you stored all the exchange information correctly, this actually completes PHASE 2, because now each process knows which process owns its ghost cells, i.e., who will fill its ghost cells, and, for each other process, which cells must be sent to fill their ghosts.

How exactly to store this information is kind of up to you, but note that, for each process, the information related to processes not involved in any exchange is irrelevant here: you simply don't store it or care about it. You could go either with an optimized CSR-like structure (rows indexed by process number, columns by cell indices) or with some more regular data structure, as sketched below. The point is, you now have this info and only this, because this is all you need, so you just use it. If you work on it, you can also fit periodic boundary conditions into the exact same framework (but there a single process can "exchange" with itself); let's keep that for another time.
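As a purely illustrative example of such a CSR-like layout (all names made up):

Code:
/* "Rows" are the neighbor ranks this process actually exchanges with,
 * "columns" are the local cell indices to pack for each of them. */
typedef struct {
    int  n_nbr;      /* number of neighbor ranks                        */
    int *nbr_rank;   /* [n_nbr]      rank of each neighbor              */
    int *offset;     /* [n_nbr + 1]  start of each neighbor's cell list */
    int *cell;       /* [offset[n_nbr]] local cell indices, row by row  */
} ExchangeCSR;
/* Cells to send to neighbor i: cell[offset[i]] .. cell[offset[i+1] - 1]. */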

PHASE 3

If you stored the info exchanged in PHASE 2 correctly, this should be easy. I will assume so and focus on a possible source of trouble in the SEND-RECV part. In practice, each process now has a list of processes to which data must be sent and, as mentioned by arjun, from which data must be received. What you need to do, whenever a parallel exchange is needed, is to have each process loop over ITS OWN list of neighbor processes (which, again, you know from PHASE 2) and do the SEND-RECV part. I see people doing this in a lot of ways.

My favourite one is, still in PHASE 2, to first sort the list of processes to visit in the SEND-RECV part according to a common pattern. As I mentioned in other posts, I use this approach. In this way, a process has its list of neighbor processes sorted in such a way that most communications from its side will be matched from the other side and, in any case, even when not exactly matched (the current communication partner might still be busy with a previous partner), they are guaranteed to be deadlock-free. At this point, and only because of this specific order, you can use any mechanism to do the exchange: blocking, non-blocking, or a coupled sendrecv. However, I suggest sticking to non-blocking.

In practice, you should have a first routine that loops over the neighbor processes, starts a non-blocking recv as soon as possible and, after that, posts the non-blocking send. This routine is effectively non-blocking. Then a second routine should do the wait and completion part of the non-blocking calls, which is actually blocking. If done correctly, you can separate the two with additional work while waiting for the parallel exchange to complete.
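A minimal sketch of this two-routine pattern, with a hypothetical Halo structure holding the per-neighbor lists built in PHASE 2 (only the MPI calls are standard, everything else is illustrative):

Code:
#include <mpi.h>

typedef struct {
    int          n_nbr;        /* neighbor ranks, sorted in PHASE 2    */
    int         *nbr;
    int         *send_count;   /* doubles to send to each neighbor     */
    int         *recv_count;   /* doubles expected from each neighbor  */
    double     **send_buf;
    double     **recv_buf;
    MPI_Request *req;          /* 2 * n_nbr requests                   */
} Halo;

/* Non-blocking start: post all receives first, then all sends. */
void halo_start(Halo *h)
{
    for (int i = 0; i < h->n_nbr; ++i)
        MPI_Irecv(h->recv_buf[i], h->recv_count[i], MPI_DOUBLE,
                  h->nbr[i], 0, MPI_COMM_WORLD, &h->req[i]);
    for (int i = 0; i < h->n_nbr; ++i)
        MPI_Isend(h->send_buf[i], h->send_count[i], MPI_DOUBLE,
                  h->nbr[i], 0, MPI_COMM_WORLD, &h->req[h->n_nbr + i]);
}

/* Blocking completion: useful work can be done between halo_start()
 * and halo_finish() while the exchange is in flight. */
void halo_finish(Halo *h)
{
    MPI_Waitall(2 * h->n_nbr, h->req, MPI_STATUSES_IGNORE);
}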

In conclusion, if PHASE 2 is handled correctly, there is no logical room for sending everything to everyone.

September 16, 2022, 11:12   #5
Alex (flotus1), Super Moderator
Quote:
Originally Posted by sbaffini View Post
Nope, this is not going to work at scale. It's a huge amount of data, and it's a sort of unoptimized all-to-all, which would swamp most systems even with much less data.
100x this.
ANY kind of all-to-all communication will tank performance once the number of processes is large enough, even if most of the data is never communicated or gets discarded. Avoid it like the plague.

I have written a preprocessor that builds the communication data structures for our MPI solver. With a static mesh/decomposition it can be fairly straightforward.
Every MPI process gets a list of the processes it sends values to and a list of the processes it receives data from.
These lists also contain information about the amount of data that is transferred and where the values belong on the receiving process.
Building these data structures is computationally expensive, of course, but it's a single preprocessing step, so we don't have to worry about that.

With that, MPI communication is just loops of non-blocking send/receive between pairs of processes. We got good scaling on 10k+ cores, with fewer than 20k "cells" on each core. That's not possible if all-to-all communication is present.
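As an illustration of the receive side only (made-up names, not the actual solver code), the "where the values belong" information is just an index map used to scatter received values into the right ghost cells:

Code:
void unpack_ghosts(double *phi,            /* field, owned + ghost cells */
                   const double *recv_buf, /* values from one neighbor   */
                   const int *ghost_idx,   /* target cell for each value */
                   int n)
{
    for (int i = 0; i < n; ++i)
        phi[ghost_idx[i]] = recv_buf[i];
}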

Edit:
Quote:
Originally Posted by aerosayan View Post
To keep things simple, I'm thinking of just broadcasting every partition's boundary information to every other partition... Then, after receiving the data, I'm thinking of inserting it directly into the boundary cells of each partition.
If you can do this, you are almost there.
All you need to add is a second step that works with the information about which values actually got used. This lets you build the single-sender, single-receiver communication. And bam, you've got yourself a preprocessor for creating efficient MPI communication. There are many other ways to do this, of course.

September 16, 2022, 11:39   #6
Sayan Bhattacharjee (aerosayan), Senior Member
Quote:
Originally Posted by sbaffini View Post
The idea of a partitioned array is that each process can determine exactly, given N and p, which process has the element it needs.

Haha. Yesterday I derived the same solution: after the cells have been Morton ordered, we can use the neighbor cell indices of each cell to identify the boundary/ghost cells of each partition.

Really elegant solution to an otherwise complicated graph problem.

September 16, 2022, 12:56   #7
Paolo Lampitella (sbaffini), Senior Member
Quote:
Originally Posted by aerosayan View Post
After the cells have been Morton ordered, we can use the neighbor cell indices of each cell to identify the boundary/ghost cells of each partition.
That's indeed the same machinery I use when an external partitioner is not used (because my cells are always ordered along an SFC) and the i-th process just gets the i-th chunk of cells.

A partitioned array has the same bookkeeping formulas needed in the trivial SFC-based partitioner.
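For reference, a sketch of the usual block-distribution bookkeeping when N is not an exact multiple of p (the first N mod p ranks get one extra cell; 0-based indices, illustrative only):

Code:
/* Number of cells owned by 'rank'. */
long block_size(long N, int p, int rank)
{
    return N / p + (rank < N % p ? 1 : 0);
}

/* First global cell index owned by 'rank'. */
long block_start(long N, int p, int rank)
{
    long base = N / p, rem = N % p;
    return rank * base + (rank < rem ? rank : rem);
}

/* Rank that owns global cell 'gidx' (inverse of the two above). */
int block_owner(long gidx, long N, int p)
{
    long base = N / p, rem = N % p, split = (base + 1) * rem;
    return (gidx < split) ? (int)(gidx / (base + 1))
                          : (int)(rem + (gidx - split) / base);
}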
