
What are the data transfer patterns for parallelization of implicit solvers?



August 29, 2022, 19:41   #1
Sayan Bhattacharjee (aerosayan), Senior Member
Hello everyone,

I'm trying to understand how implicit solvers are parallelized.

Specifically, I'd like to learn which tasks are parallelized, and which data are transmitted between processors in every iteration.

In explicit solvers, we mainly need to transfer ghost-cell data between partitions; I'm not sure what happens in implicit codes.

Here's what I currently think might happen:

In 2D, I can store the whole mesh on a single node, since the memory requirement is low. So, to keep my code simple, I'm thinking of MPI-parallelizing only small sections of the code.

For example, the Krylov solver (GMRES) will need to be parallelized, even though I currently don't know how. That is, we form the sparse system on MPI's root processor (rank = 0), then distribute the iterative-solver work to the cluster. I don't know if we can build the matrix in parallel. We probably can, but it would be complicated, so I will try to avoid it for now.

Parallelizing the flux calculations would also be very good, but that might introduce a lot of data transfer between processors, and I don't know the correct and efficient way to do it.

As I see it currently, only the linear solver seems easily parallelizable (there are many papers on the topic), while the flux calculation seems inefficient to parallelize (maybe I'm prematurely optimizing).

PS: I will probably need asynchronous MPI data transfer, and some more advanced MPI data transfer/access routines. Which ones do you think will be important?

Thanks!

August 29, 2022, 22:53   #2
Arjun, Senior Member
In Wildkatze the data structure is such that the cell indexing runs as:

1 ... N-internal-cells, N-internal-cells + 1, N-internal-cells + 2, ..., N-HaloCells, ..., TotalCells-With-Boundaries.

The linear system is constructed on N-HaloCells
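A minimal sketch of what such a local numbering could look like (the names below are mine, not Wildkatze's, and use 0-based indexing instead of the 1-based list above):

Code:
/* Hypothetical local cell numbering on one MPI rank:
 *
 *   [0, n_internal)                   owned (internal) cells
 *   [n_internal, n_internal+n_halo)   halo cells: copies of cells owned by neighbouring ranks
 *   [n_internal+n_halo, n_total)      ghost cells for physical boundaries
 *
 * As discussed later in the thread, the linear-system rows are built for the
 * owned cells, while the columns can also address the halo range.
 */
typedef struct {
    int n_internal;  /* cells owned by this rank             */
    int n_halo;      /* copies of neighbours' cells          */
    int n_boundary;  /* ghost cells for physical boundaries  */
    int n_total;     /* n_internal + n_halo + n_boundary     */
} CellLayout;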

August 30, 2022, 03:46   #3
naffrancois, Senior Member
When I think about MPI for CFD codes I almost always think "domain decomposition MPI". This is, I think, a quite intuitive and efficient way of doing MPI parallelization. Your solver reads in a partitioned mesh, and each core works with a subset of the whole mesh.

In the code I am working on, and I guess many others work the same way, each partition has N-internal-cells + N-halo-cells, as Arjun said. At the beginning of each time step, the halo cells are filled with values coming from adjacent CPUs via MPI (the vector of primitive variables, for example).

In cell-centered FV, the inter-core boundaries lie on faces of the mesh. When you compute the flux at such a face, the left cell belongs to CPU 1 and the right cell to CPU 2. On CPU 1 you update the residual of the left cell and discard the contribution to the right cell; on CPU 2 you discard the contribution to the left cell and update the right cell (see the sketch below). There is some extra work at inter-CPU faces, though.

In implicit solvers it is a bit more complicated, but the concept is the same. You work with augmented vectors and matrices, but you need to fill the halo cells more often, typically in Krylov solvers when you perform matrix-vector products.
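As a rough illustration of that flux loop at inter-CPU faces (a sketch only, with a single scalar flux per face; the array names are placeholders, not code from any particular solver):

Code:
/* Face-based residual assembly on one rank.  Owned cells have local index
 * < n_internal; halo cells come right after them in the numbering. */
void accumulate_residuals(int n_faces, int n_internal,
                          const int *left_cell, const int *right_cell,
                          const double *face_flux, double *residual)
{
    for (int f = 0; f < n_faces; ++f) {
        int L = left_cell[f];
        int R = right_cell[f];
        /* Only owned cells accumulate a residual: a contribution that would
         * go to a halo cell is simply dropped, because the neighbouring rank
         * computes the same face flux and keeps its own side of it. */
        if (L < n_internal) residual[L] -= face_flux[f];
        if (R < n_internal) residual[R] += face_flux[f];
    }
}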

August 30, 2022, 06:00   #4
Paolo Lampitella (sbaffini), Senior Member
Quote:
Originally Posted by aerosayan View Post
You are thinking in a shared-memory fashion. In MPI-parallel PDE solvers, no process holds the whole grid or any global array (there are some partial exceptions for I/O, where performance might be better, up to a certain size and if doable memory-wise, when only a single process writes the output).

Once you start mixing the two approaches up, you have the recipe for a mess.

Still, I see that it is difficult to approach this without a guide. This, for example, is where the power of OOP can be leveraged. I once had to transform a parallel code that was, actually, storing almost everything on every process. A real mess. But if you understand what information is logical to have locally and what is not, you can start to encapsulate the functions that give you that information, even if they are not parallel yet. Then, on the side, you build the machinery that allows you to retrieve that information in parallel. In the end you inject the new machinery into the old encapsulated functions and you're basically done.

As others have mentioned, the halo cells that need to be updated in parallel can be understood in two ways, which kind of exist at the same time. First of all, you store them just after the local cells. This is because you need them close in memory to the actual local cells for the fluxes (cache, etc.), and also because it is typically faster to update (recompute) properties in halo cells than to exchange them as well. In this regard, halo cells are just like regular fluid cells, except that you don't build their residual or update their variables locally. As mentioned by naffrancois, you just update their independent variables at the beginning of an iteration or time step and use them wherever needed.

The second way to think of halo cells is as ghost cells for boundary conditions. I'll take a step further here and add that an almost exact correspondence exists between ghost cells for periodic boundary conditions and halo cells. The only difference is that the values for halos don't come from the same process, but from a separate one.
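Purely as an illustration of that correspondence (the patch types and the routine below are hypothetical, not from any particular code): one pre-iteration pass fills every non-owned cell, and only the source of the values differs.

Code:
/* Ghost cells for physical BCs, ghost cells for periodicity and MPI halo
 * cells are all filled by the same pre-iteration pass; they differ only in
 * where their values come from. */
enum PatchType { PATCH_WALL, PATCH_PERIODIC, PATCH_MPI_HALO };

void fill_ghosts_and_halos(int n_patches, const enum PatchType *patch_type,
                           double *phi /* owned values + ghost/halo slots */)
{
    (void)phi; /* the actual fills are omitted in this sketch */
    for (int p = 0; p < n_patches; ++p) {
        switch (patch_type[p]) {
        case PATCH_WALL:     /* extrapolate/mirror from the adjacent owned cell */ break;
        case PATCH_PERIODIC: /* copy from the matching cell on the same rank    */ break;
        case PATCH_MPI_HALO: /* receive from the rank that owns the cell (MPI)  */ break;
        }
    }
}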

August 30, 2022, 06:40   #5
Sayan Bhattacharjee (aerosayan), Senior Member
Quote:
Originally Posted by arjun View Post
The linear system is constructed on N-HaloCells
Do you mean that you don't create an extremely big global matrix?

Kindly correct me if I'm misunderstanding, but I understood it as: for each mesh partition, we treat the halo cells as a boundary and form a small Ax = b system of linear equations only for that partition.

Then we solve for x locally on that partition, and don't need to parallelize our GMRES (or any other) linear solver. The solver architecture itself would be parallel, and each algorithm would work "serially" on its own partition.

I think I kind of understand it now, and it seems easy, if I understood it correctly. After all, there's no mathematical difference between the global grid and any small partition of the global grid. If we have the boundary conditions, we should be able to form a local matrix and solve it directly/iteratively on that processor.

Then we would just need to share the boundary/halo-cell solution with the neighbouring grids and move on to the next iteration, similar to explicit codes.

August 30, 2022, 06:54   #6
Paolo Lampitella (sbaffini), Senior Member
This is a rough sketch of what you need in a parallel solver:

1a) A single process reads the whole grid and stores only the information needed for the domain decomposition (e.g., METIS), or

1b) Each process reads a chunk of the grid (the n-th process reading the n-th chunk) and uses that information for a parallel domain decomposition (e.g., ParMETIS).

2a) The process that invoked the serial domain decomposition sends the output to the other processes. You could just write a file readable by each process, or explicitly send the information to each of them (the best approach depends on several factors). The output here is just a list of processes, one per cell; that is, each process needs to know which cells are its own and which are not. Or:

2b) Nothing to do with a parallel domain decomposition approach, as it typically returns the correct information to each process directly.

3) Each process reads and stores its own part of the mesh plus the part relative to its halo cells. Note that you need to, somehow, navigate/explore the connectivity while reading, in order to mark your actual halo cells.

4) Each process needs to know who owns its halo cells, that is, which processes will send data to fill them. This is not that different from the initial domain decomposition, and it is typically done in combination with 2 or 3, depending on the specific approach in use. It could be done in a second pass with a ring structure, but that doesn't scale well in general.

5) Each process needs to inform the others of what it needs from them. This is a two-step process: an MPI_ALLTOALL (where each process tells every other one how many cells it will need from them), followed by an MPI_ALLTOALLV (where the actual lists of cells are sent). See the sketch after this list.

6) Now each process knows everything needed to send and receive data (note that the difficulty in parallel AMG is that this changes at each level of each iteration, while otherwise it is done just once at the beginning of the calculation). You just need a routine where you loop over the processes that need your data, pack it for them and send it; then you make a receiving loop. Using non-blocking send/recv here doesn't work any magic by itself, and I still suggest a carefully ordered, yet non-blocking, sequence of sends and recvs (e.g., see here).

7) Make the machinery in 6 reusable for each variable you might need to exchange and you are basically done. You just need to treat the local domain as in the single-process case and remember to exchange in parallel before out-of-date data is needed.

8) Most linear solvers boil down to matrix-vector products. Since you store the halo cells, you can also extend the local CSR matrix to include halo coefficients (they are not exchanged, they are built locally!). Then you're done. You just need to remember to exchange the system solution before each iteration (and any other variable used in the linear solver, if there are several).
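A sketch of steps 5 and 6, assuming global cell ids are exchanged in step 5 and that the lists have already been translated to local indices for step 6 (all names are illustrative, error handling omitted):

Code:
#include <mpi.h>
#include <stdlib.h>

/* Step 5: each rank tells every other rank how many of its cells it needs
 * as halos (MPI_Alltoall), then sends the actual global cell ids
 * (MPI_Alltoallv).  On return, every rank knows which of its own cells it
 * must send to whom. */
void build_exchange_pattern(MPI_Comm comm, int nranks,
                            const int *n_needed_from, /* [nranks]                */
                            const int *needed_gids,   /* concatenated global ids */
                            int *n_wanted_by,         /* out, [nranks]           */
                            int **wanted_gids)        /* out, concatenated       */
{
    MPI_Alltoall(n_needed_from, 1, MPI_INT, n_wanted_by, 1, MPI_INT, comm);

    int *sdispl = malloc(nranks * sizeof(int));
    int *rdispl = malloc(nranks * sizeof(int));
    int stot = 0, rtot = 0;
    for (int r = 0; r < nranks; ++r) {
        sdispl[r] = stot; stot += n_needed_from[r];
        rdispl[r] = rtot; rtot += n_wanted_by[r];
    }
    *wanted_gids = malloc(rtot * sizeof(int));
    MPI_Alltoallv(needed_gids, n_needed_from, sdispl, MPI_INT,
                  *wanted_gids, n_wanted_by, rdispl, MPI_INT, comm);
    free(sdispl); free(rdispl);
}

/* Step 6: reusable halo exchange for one cell variable.  send_idx[r] lists
 * the local indices of my cells wanted by rank r; incoming halo values from
 * rank r land at phi[halo_offset[r]].  Receives are posted first, then the
 * packed sends, all non-blocking, and everything is waited on at the end. */
void exchange_halos(MPI_Comm comm, int nranks,
                    const int *n_send, int *const *send_idx,
                    const int *n_recv, const int *halo_offset,
                    double *phi, double *sendbuf /* scratch for all sends */)
{
    MPI_Request *req = malloc(2 * nranks * sizeof(MPI_Request));
    int nreq = 0, soff = 0;

    for (int r = 0; r < nranks; ++r)
        if (n_recv[r] > 0)
            MPI_Irecv(&phi[halo_offset[r]], n_recv[r], MPI_DOUBLE,
                      r, 0, comm, &req[nreq++]);

    for (int r = 0; r < nranks; ++r) {
        if (n_send[r] == 0) continue;
        for (int i = 0; i < n_send[r]; ++i)   /* pack owned values */
            sendbuf[soff + i] = phi[send_idx[r][i]];
        MPI_Isend(&sendbuf[soff], n_send[r], MPI_DOUBLE,
                  r, 0, comm, &req[nreq++]);
        soff += n_send[r];
    }
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
    free(req);
}
Posting all receives before the sends is just one common ordering; the point above about a carefully chosen non-blocking order of sends and recvs still applies.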

A few notes:

- When I mention reading from a file, it is never just a simple read, or you will engulf the system very easily. My approach is to read in chunks and to skip chunks based on previous information about what I need and where to look for it. If each process reads the whole file line by line, it is game over.

- When processes exchange lists of cells, you need the bookkeeping between the global cell numbering (the one existing in the mesh file) and the local cell numbering (which is always contiguous, from 1 to N_local).

- More generally, you need functions to go from the global cell number to the local one (or, say, -1 if not local) and to the local halo number (or -1 if not a local halo), plus the reverse for the local cells and halos. The latter are just two arrays, but the former requires an initial sorting and binary searches (see the sketch below).
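For the last point, a minimal sketch of the global-to-local lookup, assuming the owned global ids were sorted once at setup (names are illustrative):

Code:
/* Map a global cell id to its local index, or return -1 if this rank does
 * not own it.  global_sorted[] holds the owned global ids in ascending
 * order and local_of[] the matching local indices (both built at setup). */
int global_to_local(int gid, int n_local,
                    const int *global_sorted, const int *local_of)
{
    int lo = 0, hi = n_local - 1;
    while (lo <= hi) {                       /* plain binary search */
        int mid = lo + (hi - lo) / 2;
        if (global_sorted[mid] == gid) return local_of[mid];
        if (global_sorted[mid] <  gid) lo = mid + 1;
        else                           hi = mid - 1;
    }
    return -1;
}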

August 30, 2022, 07:05   #7
Sayan Bhattacharjee (aerosayan), Senior Member
Quote:
Originally Posted by sbaffini View Post
CSR matrix

Thanks for the detailed information. Is there any benefit to using the compressed sparse row (CSR) format rather than compressed sparse column (CSC)? MATLAB uses CSC. But I'm thinking you may be using CSR because the coefficients of each equation in Ax = b lie on the same row, so keeping all of the data of each row close in memory would be good for cache performance. Is that it?

August 30, 2022, 07:25   #8
Paolo Lampitella (sbaffini), Senior Member
Quote:
Originally Posted by aerosayan View Post
Yes, that's what you need for the matrix-vector products. And that's because, if each row is a cell, the matrix is decomposed by rows, so each process has full rows, not full columns.
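For reference, a generic sketch of the three arrays CSR boils down to (not any specific code): row_ptr marks where each cell's row starts, so all coefficients of one equation sit contiguously in memory.

Code:
/* Generic CSR storage for the local rows (one row per owned cell).
 * Column indices address owned cells or local halo cells. */
typedef struct {
    int     n_rows;   /* number of owned cells                        */
    int    *row_ptr;  /* size n_rows+1: start of each row in col/val  */
    int    *col_idx;  /* local column index of each nonzero           */
    double *val;      /* coefficient values, stored row by row        */
} CsrMatrix;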

August 30, 2022, 07:39   #9
Paolo Lampitella (sbaffini), Senior Member
Let me put it straight: the main point is not cache efficiency. It would be if you stored the whole matrix on each process, which you can't do in parallel. As the domain decomposition is naturally per cell, each process stores the matrix per cell, that is, per row. As the matrix is sparse, each process can store the whole row without issues.

But, differently from the residuals (e.g., the RHS), the halo coefficients (local-cell row, local-halo column) have to be updated.

August 30, 2022, 07:56   #10
Sayan Bhattacharjee (aerosayan), Senior Member
Quote:
Originally Posted by sbaffini View Post
As the matrix is sparse, each process can store the whole row without issues.
I asked arjun before, but is it necessary to use an extremely big global matrix?

I was thinking that, since each partition's ghost cells act as the boundary for that partition, it might be possible to create a small matrix and solve the Ax = b system of linear equations locally on each processor.

We would then only need to share the boundary/halo-cell data between neighbouring partitions during each iteration, similar to explicit codes.

I'm not sure if this is how it's done in industrial CFD codes, but it seems feasible.

The advantage would be that our complex code, like linear solvers, preconditioners, etc., would become practically serial in nature and could run on each processor, with the parallelism handled by our code's architecture rather than by each function.

August 30, 2022, 07:58   #11
naffrancois, Senior Member
Quote:
Originally Posted by aerosayan View Post
To complete the detailed answer of sbaffini: on each process you need a local matrix A of size [N-internal-cells, N-internal-cells + N-HaloCells]. The matrix blocks in the halo columns are not exchanged; they are computed from the MPI-updated vector of primitive or conservative variables. Then you realize you cannot just apply the serial iterative linear solver as-is: you need communications inside the solver. When you multiply the matrix by a vector, that vector needs values in its halo cells (see the sketch below). Dot products also need a reduction.
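A sketch of those two ingredients, assuming a CSR layout like the one above and some halo-exchange routine of the kind described earlier in the thread (the exchange function here is only declared, as a placeholder):

Code:
#include <mpi.h>

/* Placeholder for the halo-exchange machinery discussed earlier: it must
 * fill x[n_int .. n_int + n_halo) with the values owned by other ranks. */
void exchange_vector_halos(double *x, MPI_Comm comm);

/* y = A*x for the owned rows, A being n_int x (n_int + n_halo) in CSR. */
void parallel_matvec(int n_int, const int *row_ptr, const int *col_idx,
                     const double *val, double *x, double *y, MPI_Comm comm)
{
    exchange_vector_halos(x, comm);          /* halo values must be current */
    for (int i = 0; i < n_int; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

/* Global dot product: sum over owned entries only (halo entries would be
 * counted twice), followed by the reduction mentioned above. */
double parallel_dot(int n_int, const double *a, const double *b, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_int; ++i) local += a[i] * b[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}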

August 30, 2022, 08:25   #12
Paolo Lampitella (sbaffini), Senior Member
Quote:
Originally Posted by aerosayan View Post
This is what we are trying to tell you: the whole matrix doesn't exist at all. Each process has the rows corresponding to its own cells, just like for any other variable.

Now, if a process has full rows, what it needs to compute its local part of the global matrix-vector product is just the vector values in its local cells and its local halos. So, as mentioned by naffrancois, linear solvers, which mostly do matrix-vector products, need to exchange the vector variables in the halo cells at each iteration.

An MPI code can't be identical to a serial code because, in general, your algorithmic and data-structure choices are strictly dictated by the distributed-memory setting. But besides a few calls here and there, an MPI code is otherwise transparent to the parallel details. If your code is bloated by them, you are certainly doing something wrong.
