How to partition cartesian grid for MPI in fortran

January 16, 2008, 01:11

Hi,

I'm programming in fortran and I'm trying to use a parallel solver known as Hypre, which is a library written in C. I'm working with cartesian grid and I wonder how I should partition my grid to run in parallel.

Which is the best option for a cartesian grid of size_x*size_y, asuming to be divided to 4 grids? :

1. 4 (size_x/4)*size_y grids

2. 4 size_x*(size_y/4) grids

3. 4 (size_x/2)*(size_y/2) grids

Is there some clear advantage of 1 over the other?

Thanks

January 16, 2008, 23:33

If you are of an experimental bent of mind, try all three partitions on your computer system and pick the one which has the smallest turn-around time! If rather you are of an analytical bent of mind, read on.

The general answer is: if it is possible to balance the load this way, then a one-dimensional partition in the direction with the greatest cell count is best. To find out why, read on.

A partition suitable for concurrent computing should be a good try at all of the following: (a) balance the load (equal division of computation volume), (b) minimize the total inter-subdomain boundary surface (minimize total inter-subdomain communication), (c) minimize the maximum shared boundary surface between any two subdomains, and (d) minimize the maximum (over the set of subdomains) number of neighbors.

How many subdomains should you partition the domain into? (Since you did not ask this question, you could skip this para, but I felt the trends were worth retelling). The answer represents a trade-off among the fiscal costs of computation and communication, and the time costs of the same two factors. If the fiscal costs of computation (processors) and communication (links and switches) are the same, then only the relative timings matter. Assume perfect concurrency of communication. In this case, one should partition into a number of subdomains such that the average [volume to surface] ratio of the subdomains is proportional to the average ratio of [communication wallclocktime between two cells on two different processors to the computational wallclocktime for a cell before information has to be exchanged]. Basically, you want as many subdomains as will make the maximum communication time between two subdomains roughly equal to the maximum computational time in a subdomain before it needs to exchange data again. A finer partition would decrease the volume to surface ratio and lead to diminishing returns from the extra processors which could remain unpurchased or be put to other tasks. I will assume fiscal costs for computation and communication are comparable in the following.

In special cases, you may find a single partition that meets all four criteria, but generally there is conflict among these criteria, and then trade-offs must be made.

The trade-off between criterion (a) and criteria (b), (c), and (d) depends on the time cost of computation in a cell (tn) versus that of communication between two cells on different processors (te). If te .gt. tn, then one should emphasize criteria (b), (c), (d) over criterion (a), and vice-versa.

The trade-offs among criteria (b), (c) and (d) would depend on the degree of concurrency of the communication among subdomains. This is dependent on the topology of the network linking your cores/processors. If there is zero concurrency (all communication must travel through a bottleneck master node), criterion (b) would be most important. If there is perfect concurrency at the subdomain level, criteria (c) and (d) would be more important than (b). If there is perfect concurrency at the message level, criterion (d) would be less important than (c).

All three of your candidate partitions satisfy criterion (a) exactly, so there is no distinction among them there. Let nx and ny be the numbers of cells in the x and y directions of your domain mesh. Let us assume periodic boundary conditions in both x and y directions (the conclusions would be somewhat different if the bcs are local, since you are partitioning into such a small number of subdomains, the domain boundaries matter as much as the interior). Then the maximum number of neighbors of among subdomains is the same (two) for all three partitions, and hence criterion (d) does not distinguish among the three, either.

For partition 1, the total inter-subdomain boundary is 4*ny, while the maximum boundary of any single subdomain is 2*ny. By symmetry, for partition 2, the total is 4*nx, while the max single is 2*nx. For partition 3, the total is 2*nx+2*ny, while the max single is nx+ny.

If nx=ny, it is clear that all three partitions have the same total boundary of 4*nx, and the same max single boundary of 2*nx, so there is no distinction among them. If nx .gt. ny, clearly partition 1 has both the least total boundary and the least max single boundary. If nx .lt. ny, it is partition 2 that has the advantage. Thus, your best choices are partition 1 if nx .gt. ny and partition 2 if nx .lt. ny. This can be stated more generally as: if it is possible to balance the load this way, then a one-dimensional partition in the direction with the greatest cell count is best.

January 17, 2008, 02:56

Thank you Ananda for the detailed explanation. My domain is usually where nx>ny and hence I believe partition 1 is the best.

January 17, 2008, 04:38

You are welcome. I forgot to mention that satisfying criterion (d) should generally also allow you to minimize the number of stages of communication needed. In your particular case, all three partitions satisfy the criterion. For example, for partition 1, with contiguous subdomains numbered 1 through 4, the first stage of communication may involve subdomains 1 and 2 talking with each other, at the same time as subs 3 and 4 talk with each other. In the second (and last) stage, subs 1 and 4 talk with each other simultaneously with subs 2 and 3 talking with each other. For more complex partitions in 3-D, with more than two subdomain neighbors, you may use a graph-coloring algorithm to minimize the number of needed non-interfering communication stages.

January 16, 2008, 01:11	How to partition cartesian grid for MPI in fortran	#1
CF Guest Posts: n/a	Hi, I'm programming in fortran and I'm trying to use a parallel solver known as Hypre, which is a library written in C. I'm working with cartesian grid and I wonder how I should partition my grid to run in parallel. Which is the best option for a cartesian grid of size_xsize_y, asuming to be divided to 4 grids? : 1. 4 (size_x/4)size_y grids 2. 4 size_x(size_y/4) grids 3. 4 (size_x/2)(size_y/2) grids Is there some clear advantage of 1 over the other? Thanks

January 16, 2008, 23:33	Re: How to partition cartesian grid for MPI in for	#2
Ananda Himansu Guest Posts: n/a	If you are of an experimental bent of mind, try all three partitions on your computer system and pick the one which has the smallest turn-around time! If rather you are of an analytical bent of mind, read on. The general answer is: if it is possible to balance the load this way, then a one-dimensional partition in the direction with the greatest cell count is best. To find out why, read on. A partition suitable for concurrent computing should be a good try at all of the following: (a) balance the load (equal division of computation volume), (b) minimize the total inter-subdomain boundary surface (minimize total inter-subdomain communication), (c) minimize the maximum shared boundary surface between any two subdomains, and (d) minimize the maximum (over the set of subdomains) number of neighbors. How many subdomains should you partition the domain into? (Since you did not ask this question, you could skip this para, but I felt the trends were worth retelling). The answer represents a trade-off among the fiscal costs of computation and communication, and the time costs of the same two factors. If the fiscal costs of computation (processors) and communication (links and switches) are the same, then only the relative timings matter. Assume perfect concurrency of communication. In this case, one should partition into a number of subdomains such that the average [volume to surface] ratio of the subdomains is proportional to the average ratio of [communication wallclocktime between two cells on two different processors to the computational wallclocktime for a cell before information has to be exchanged]. Basically, you want as many subdomains as will make the maximum communication time between two subdomains roughly equal to the maximum computational time in a subdomain before it needs to exchange data again. A finer partition would decrease the volume to surface ratio and lead to diminishing returns from the extra processors which could remain unpurchased or be put to other tasks. I will assume fiscal costs for computation and communication are comparable in the following. In special cases, you may find a single partition that meets all four criteria, but generally there is conflict among these criteria, and then trade-offs must be made. The trade-off between criterion (a) and criteria (b), (c), and (d) depends on the time cost of computation in a cell (tn) versus that of communication between two cells on different processors (te). If te .gt. tn, then one should emphasize criteria (b), (c), (d) over criterion (a), and vice-versa. The trade-offs among criteria (b), (c) and (d) would depend on the degree of concurrency of the communication among subdomains. This is dependent on the topology of the network linking your cores/processors. If there is zero concurrency (all communication must travel through a bottleneck master node), criterion (b) would be most important. If there is perfect concurrency at the subdomain level, criteria (c) and (d) would be more important than (b). If there is perfect concurrency at the message level, criterion (d) would be less important than (c). All three of your candidate partitions satisfy criterion (a) exactly, so there is no distinction among them there. Let nx and ny be the numbers of cells in the x and y directions of your domain mesh. Let us assume periodic boundary conditions in both x and y directions (the conclusions would be somewhat different if the bcs are local, since you are partitioning into such a small number of subdomains, the domain boundaries matter as much as the interior). Then the maximum number of neighbors of among subdomains is the same (two) for all three partitions, and hence criterion (d) does not distinguish among the three, either. For partition 1, the total inter-subdomain boundary is 4ny, while the maximum boundary of any single subdomain is 2ny. By symmetry, for partition 2, the total is 4nx, while the max single is 2nx. For partition 3, the total is 2nx+2ny, while the max single is nx+ny. If nx=ny, it is clear that all three partitions have the same total boundary of 4nx, and the same max single boundary of 2nx, so there is no distinction among them. If nx .gt. ny, clearly partition 1 has both the least total boundary and the least max single boundary. If nx .lt. ny, it is partition 2 that has the advantage. Thus, your best choices are partition 1 if nx .gt. ny and partition 2 if nx .lt. ny. This can be stated more generally as: if it is possible to balance the load this way, then a one-dimensional partition in the direction with the greatest cell count is best.

January 17, 2008, 02:56	Re: How to partition cartesian grid for MPI in for	#3
CF Guest Posts: n/a	Thank you Ananda for the detailed explanation. My domain is usually where nx>ny and hence I believe partition 1 is the best.

January 17, 2008, 04:38	Re: How to partition cartesian grid for MPI in for	#4
Ananda Himansu Guest Posts: n/a	You are welcome. I forgot to mention that satisfying criterion (d) should generally also allow you to minimize the number of stages of communication needed. In your particular case, all three partitions satisfy the criterion. For example, for partition 1, with contiguous subdomains numbered 1 through 4, the first stage of communication may involve subdomains 1 and 2 talking with each other, at the same time as subs 3 and 4 talk with each other. In the second (and last) stage, subs 1 and 4 talk with each other simultaneously with subs 2 and 3 talking with each other. For more complex partitions in 3-D, with more than two subdomain neighbors, you may use a graph-coloring algorithm to minimize the number of needed non-interfering communication stages.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
grid generation and transferring it into Fortran code	HaKu	Main CFD Forum	1	April 28, 2011 02:11
Grid generation code in Fortran	artemiss1984	Main CFD Forum	3	February 23, 2011 01:19
Framing PPE for complex geometry over cartesian grid	ravikant	Main CFD Forum	0	January 9, 2010 12:45
moving cylinder in a cartesian grid	yusun	Main CFD Forum	1	July 1, 2009 03:48
Cartesian Grid	Harish	FLUENT	0	January 3, 2006 00:09