CFD Online Discussion Forums
Main CFD Forum: How to group cells to distribute onto parallel processors/threads?
(https://www.cfd-online.com/Forums/main/231318-how-group-cells-distribute-onto-parallel-processors-threads.html)

aerosayan October 31, 2020 08:59

How to group cells to distribute onto parallel processors/threads?
 
I'm trying to parallelize my explicit compressible Euler FVM solver using OpenMP.


I want to distribute the cells onto 2 threads. But before distributing the cells onto the 2 threads, I want to group them together into 2 distinct sets.


This is my current result: https://imgur.com/a/GrU9jvy


The cells of the same color are distributed onto the same thread.



As can be seen, my current grouping algorithm needs to be improved. Currently I'm simply dividing the cell list into 2 parts and distributing them to the 2 threads.



However, this produces a scattered distribution as can be seen in the image. This is problematic for me.


I would like to minimize the intersection boundary between the cells of these 2 threads. I would like to make 2 groups of cells that are closely packed together, and not scattered as shown in the image.



Is there any algorithm that you can recommend to do this?

flotus1 October 31, 2020 10:02

Domain decomposition is not the most straightforward load distribution mechanism to go along with OpenMP. That's usually coupled with MPI parallelization. Not saying that it can't be done, or doesn't exist in the wild. Just that it's not the most common combination.

What you are searching for falls under the term graph partitioning. There are quite a few implementations available. Scotch and Metis are two of the more commonly used ones.
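
For reference, a call into METIS for this could look roughly like the following. This is only a minimal sketch assuming the METIS 5.x C API; the CSR arrays xadj/adjncy have to be built from your cell-to-cell adjacency first, and the function name partition_cells is just a placeholder.

Code:

// Minimal sketch: partition a cell-connectivity graph into nparts domains with
// METIS 5.x. xadj/adjncy hold the cell-to-cell adjacency in CSR form.
#include <metis.h>
#include <vector>

std::vector<idx_t> partition_cells(std::vector<idx_t>& xadj,
                                   std::vector<idx_t>& adjncy,
                                   idx_t nparts)
{
    idx_t ncells = (idx_t)xadj.size() - 1;   // number of graph vertices (= cells)
    idx_t ncon   = 1;                        // one balancing constraint (cell count)
    idx_t objval = 0;                        // filled with the edge-cut of the result
    std::vector<idx_t> part(ncells);         // output: domain id of every cell

    METIS_PartGraphKway(&ncells, &ncon, xadj.data(), adjncy.data(),
                        nullptr, nullptr, nullptr,   // no vertex/edge weights
                        &nparts, nullptr, nullptr, nullptr,
                        &objval, part.data());
    return part;
}

The part array then gives the domain id of every cell, and objval reports the edge-cut, i.e. how many adjacency links end up between different domains.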

BTW: if your colored graph reflects the numbering of your grid, there is a lot of performance up for grabs by renumbering the elements. The optimization goal being increased locality. I.e. elements that are direct neighbors spatially should also be as close as possible in their memory index. That increases cache reuse. Doing that first would also result in a much better domain decomposition when simply dividing the mesh along the memory index.

aerosayan October 31, 2020 10:38

Quote:

Originally Posted by flotus1 (Post 786431)
Domain decomposition is not the most straightforward load distribution mechanism to go along with OpenMP. That's usually coupled with MPI parallelization. Not saying that it can't be done, or doesn't exist in the wild. Just that it's not the most common combination.

What you are searching for falls under the term graph partitioning. There are quite a few implementations available. Scotch and Metis are two of the more commonly used ones.

BTW: if your colored graph reflects the numbering of your grid, there is a lot of performance up for grabs by renumbering the elements. The optimization goal being increased locality. I.e. elements that are direct neighbors spatially should also be as close as possible in their memory index. That increases cache reuse. Doing that first would also result in a much better domain decomposition when simply dividing the mesh along the memory index.


Thanks for the reply. I'm new to solver dev, so I don't know all of the terms very well yet. I'm trying to group the cells to improve performance by decreasing cache misses, and as you proposed, I will be renumbering the cells within each group to that end.


And I did want to decompose the domain into multiple groups because I do have plans for implementing MPI support in future.


I was able to come up with a basic decomposition method. I simply sort the nodes based on their x-coordinate and then use a node_to_cell connectivity map to find the cells which are connected to each of the sorted nodes.
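
Roughly, the idea looks like this. This is only a simplified sketch, not the actual solver code; node_to_cell stands for whatever node-to-cell connectivity map the mesh already stores, and the groups are balanced by cell count only.

Code:

// Simplified sketch of the coordinate-based split: sort node indices by
// x-coordinate, then walk the sorted nodes and assign the cells touching each
// node to the current group until that group holds its share of the cells.
#include <algorithm>
#include <vector>

std::vector<int> split_cells_by_x(const std::vector<double>& node_x,
                                  const std::vector<std::vector<int>>& node_to_cell,
                                  int ncells, int ngroups)
{
    std::vector<int> nodes(node_x.size());
    for (size_t i = 0; i < nodes.size(); ++i) nodes[i] = (int)i;
    std::sort(nodes.begin(), nodes.end(),
              [&](int a, int b) { return node_x[a] < node_x[b]; });

    std::vector<int> group(ncells, -1);          // group id of each cell
    int assigned = 0;
    int cells_per_group = (ncells + ngroups - 1) / ngroups;

    for (int n : nodes)                          // nodes in increasing x
        for (int c : node_to_cell[n])            // cells touching this node
            if (group[c] == -1)
            {
                group[c] = std::min(assigned / cells_per_group, ngroups - 1);
                ++assigned;
            }
    return group;
}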


I was able to create 4 and 8 groups of cells using this simple method.


This works pretty well for me :)


RESULTS: https://imgur.com/a/EBa7QBg

sbaffini October 31, 2020 10:46

Space filling curves will work great for both your tasks, the current (shared memory parallelization) and the future one (distributed parallelization).

Still, you have to deeply understand how cache access works in shared memory in order to properly arrange the shared-memory loops (which is where shared-memory parallelization actually happens).
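
To make that concrete: once the cells are renumbered for locality, the usual pattern is to give every thread a contiguous, statically scheduled chunk of the cell array, so that each thread streams through contiguous memory. A generic sketch, not tied to any particular solver:

Code:

// Generic sketch: after renumbering, static scheduling hands each thread a
// contiguous block of cells, so every thread streams through contiguous memory.
#include <omp.h>
#include <vector>

void update_cells(std::vector<double>& u, const std::vector<double>& residual,
                  double dt_over_vol)
{
    const int ncells = (int)u.size();
    #pragma omp parallel for schedule(static)
    for (int c = 0; c < ncells; ++c)
        u[c] -= dt_over_vol * residual[c];   // each thread touches a contiguous range
}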

flotus1 October 31, 2020 15:44

Quote:

Originally Posted by aerosayan (Post 786432)
I was able to come up with a basic decomposition method. I simply sort the nodes based on their x-coordinate and then use a node_to_cell connectivity map to find the cells which are connected to each of the sorted nodes.

This works pretty well for me :)

Approaches like that work great...up to a certain point. Once your geometry gets more complicated, or you need more than a handful of domains, load balancing becomes almost impossible, and communication volume goes through the roof.
Space-filling curves are another interesting approach which works to some extent. But they do a poor job at minimizing, let alone equalizing communication volumes. And they have other limitations, e.g. once you have more than one load-balancing constraint, apart from element count. They are great for ordering elements in a cache-friendly way, but not necessarily the best choice for domain decomposition.
It all depends on where you want to go with this. If the examples you have shown here are as complex as it will get, then domain decomposition based on a spatial coordinate is fine. If you want to go further, you will have to look for alternatives.

aerosayan November 1, 2020 06:25

Quote:

Originally Posted by sbaffini (Post 786434)
Space filling curves will work great for both your tasks, the current (shared memory parallelization) and the future one (distributed parallelization).

Still, you have to deeply understand how cache access works in shared memory in order to properly arrange the shared-memory loops (which is where shared-memory parallelization actually happens).


I took your and flotus1's advice, and used z-order curve based mesh cell renumbering.



RESULTS: https://imgur.com/a/gBQg20X


In case someone else wants to use the same approach, I'm sharing the source...



CODE: https://pastebin.com/AvzmJNJX

EXPLANATION: https://archive.is/Th2xk

sbaffini November 2, 2020 08:03

Quote:

Originally Posted by flotus1 (Post 786445)
Space-filling curves are another interesting approach which works to some extent. But they do a poor job at minimizing, let alone equalizing communication volumes. And they have other limitations, e.g. once you have more than one load-balancing constraint, apart from element count. They are great for ordering elements in a cache-friendly way, but not necessarily the best choice for domain decomposition.

Of course I agree; the SFC suggestion was along the lines of a DIY solution with a great results/effort ratio, especially considering the shared/distributed requirement. However, with a little more coding you can add weights to the cells and take them into account together with the SFC ordering, which is slightly more flexible than simple SFC ordering.
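
The idea is roughly the following (just a sketch with made-up names): sort the cells along the SFC, accumulate a per-cell work estimate while walking the sorted list, and start a new partition every time the running sum passes another 1/nparts of the total weight.

Code:

// Sketch: weighted partitioning along an SFC-sorted cell list. The cells are
// assumed to be already sorted by their SFC key; weight[i] is the estimated
// work for cell i.
#include <vector>

std::vector<int> partition_weighted(const std::vector<double>& weight, int nparts)
{
    double total = 0.0;
    for (double w : weight) total += w;

    std::vector<int> part(weight.size());
    double running = 0.0;
    for (size_t i = 0; i < weight.size(); ++i)
    {
        running += weight[i];
        // the cell goes to the partition whose weight bracket contains the running sum
        int p = (int)(running * nparts / total);
        part[i] = p < nparts ? p : nparts - 1;
    }
    return part;
}

With unit weights this reduces to simply splitting the SFC-sorted array into equal-sized chunks.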

aerosayan November 5, 2020 06:59

Quote:

Originally Posted by flotus1 (Post 786431)
BTW: if your colored graph reflects the numbering of your grid, there is a lot of performance up for grabs by renumbering the elements. The optimization goal being increased locality. I.e. elements that are direct neighbors spatially should also be as close as possible in their memory index. That increases cache reuse. Doing that first would also result in a much better domain decomposition when simply dividing the mesh along the memory index.


https://imgur.com/KuEVYlA


Organized the cells using a space-filling curve.


Organized the boundary cells at the front of the cell list. The boundary cells still follow the order prescribed by the space-filling curve, so they will be accessed in the same efficient order, which improves cache performance.


Is there anything else that can be done to improve performance?

flotus1 November 5, 2020 08:22

Well...you can make a whole career out of optimizing things further :D

If we limit ourselves to grid renumbering:
You probably already do this, but just to make sure I didn't misunderstand your last post: each domain should have its share of boundary cells. If they all end up in the first domain, that would be a serious load balancing problem, and introduce a huge communication overhead.
For the order of cells, z-order is a great starting point, mostly because it's easy to understand and implement. But from my own testing, different space-filling curves can give slightly better performance overall. I am currently using a Hilbert curve for our LBM solvers (a small sketch of a 2D Hilbert index follows at the end of this post). Which one works best definitely depends on the exact algorithm you are using, and most likely on the hardware used to run the code.
And for general purpose domain decomposition, tools like metis do a better job at minimizing the surfaces between different domains. I.e. minimizing communication between cores. For very simple geometries, you can come up with a hand-crafted domain decomposition scheme that might work slightly better. But you will quickly run out of feasible ideas once the geometry gets more complex.
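
In case it is useful, here is a minimal sketch of a 2D Hilbert index (the standard bit-manipulation construction; n is the resolution of the index grid and must be a power of two, and cell centers have to be quantized to integer coordinates in [0, n) before calling it):

Code:

// Standard 2D Hilbert curve index: n is the side length of the index grid
// (a power of two), (x, y) are integer cell-center coordinates in [0, n).
static void hilbert_rot(int n, int& x, int& y, int rx, int ry)
{
    if (ry == 0)
    {
        if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }  // reflect the quadrant
        int t = x; x = y; y = t;                        // swap x and y
    }
}

int hilbert_d(int n, int x, int y)
{
    int d = 0;
    for (int s = n / 2; s > 0; s /= 2)
    {
        int rx = (x & s) > 0;
        int ry = (y & s) > 0;
        d += s * s * ((3 * rx) ^ ry);
        hilbert_rot(n, x, y, rx, ry);
    }
    return d;   // distance along the Hilbert curve, used as the sort key
}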

aerosayan November 26, 2020 17:00

5 Attachment(s)
I made some upgrades: I moved the boundary cells to the end of the list and, based on the Morton ordering, renumbered the edges. This allowed me to find the total number of distinct faces/edges, set up the array that stores the flux value at these faces/edges, and figure out which edges make up each triangular cell.


Using the Octave code shown below, I loaded the 3 integer edge ids which make up each cell and analyzed, for each cell, how far apart the 3 flux values will be in the flux array.



Code:

clc;
clear all;

% Load the per-cell edge ids produced by the renumbering step.
y = load("optimized_edges.grid");

% The 3 edge ids of each triangular cell.
e1 = y(:,1);
e2 = y(:,2);
e3 = y(:,3);

% Largest edge id, used to normalize the plot.
emax = max(max(y))

% Plot |e1 - e2 - e3| / emax for each cell.
plot(abs(e1-e2-e3)/emax)

grid on;
title('Edge Distribution');
xlabel('Cell Index');
ylabel('abs(e1-e2-e3)/emax');

The 5 images attached show the results.


The linear upward progression of the line is expected; however, I'm interested in the spikes shown in the graph. These spikes show how far apart in memory the code has to jump in order to access all 3 flux values for each cell.


There are 234 boundary cells at the end of the list, so huge spikes there are expected. We can also see that, due to the initial Morton ordering of the cells, the spike heights rise and fall periodically, and that there are some regions where the curve is quite linear.


If possible, I would like to remove the spikes by another edge renumbering optimization.

Is there anything that can be done?

Thanks
- syn

aerosayan November 26, 2020 17:11

5 Attachment(s)
Mistakes were made.
I added the wrong figures in the previous post.
These are the correct images:

sbaffini November 26, 2020 17:23

Besides the fact that I don't see why the quantity you plot should have any particular meaning (instead of, say, the maximum among e1-e2, e1-e3 and e2-e3), let me clarify that the way fluxes are done in FV is actually the other way around.

That is, you loop on edges/faces, grab variables from the neighbor cells, compute fluxes (and store them if you like), and distribute the fluxes back to the neighbor cells.

So, if your cells are already ordered, the only problem you might have is that you access faces randomly. A first attempt then could be, instead of renumbering the faces independently, to do it by first looping on the ordered cells and then on their faces.
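
Something along these lines, just to sketch the idea; cell_to_face stands for whatever cell-to-face connectivity you already store.

Code:

// Sketch: renumber faces in the order they are first met while looping over
// the already-ordered cells, so the face loop touches memory roughly in cell order.
#include <vector>

std::vector<int> renumber_faces(const std::vector<std::vector<int>>& cell_to_face,
                                int nfaces)
{
    std::vector<int> new_id(nfaces, -1);       // old face id -> new face id
    int next = 0;
    for (const auto& faces : cell_to_face)     // cells in their optimized order
        for (int f : faces)                    // faces of this cell
            if (new_id[f] == -1)
                new_id[f] = next++;            // first visit fixes the new id
    return new_id;
}

Faces then get their new ids in the order in which the cell loop first encounters them, so the face data ends up laid out roughly in the order the cells access it.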

Again, a profiler can tell you where the cache misses are.

aerosayan November 26, 2020 17:34

Thanks. I will try that.

zx-81 November 27, 2020 13:59

Sort your cells according to their Morton-encoded cell center coordinates (an efficiently computable space-filling curve).



Then simply partition the sorted array.

Code:

// Spread the lower 10 bits of x so that there are 2 zero bits between each of
// them (helper for 3D Morton encoding).
inline uint32_t morton32_split3_( uint32_t x )
{
    x &= 1023;                          // keep 10 bits per coordinate
    x = (x | (x << 16)) & 0x030000FF;
    x = (x | (x << 8))  & 0x0300F00F;
    x = (x | (x << 4))  & 0x030C30C3;
    x = (x | (x << 2))  & 0x09249249;
    return x;
}

// Interleave the bits of x, y, z into a single 30-bit Morton key.
inline uint32_t morton32_encode_( uint32_t x, uint32_t y, uint32_t z )
{
    return morton32_split3_(x) | (morton32_split3_(y)<<1) | (morton32_split3_(z)<<2);
}
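
For example (just an illustration of how the functions above might be used; Vec3 and the quantization over the mesh bounding box are placeholders): map each cell center to 10-bit integer coordinates, encode, and sort the cell indices by the resulting key.

Code:

// Illustration only: quantize cell centers to 10 bits per axis over the mesh
// bounding box and sort the cell indices by their Morton key
// (uses morton32_encode_ from the snippet above).
#include <algorithm>
#include <cstdint>
#include <vector>

struct Vec3 { double x, y, z; };

std::vector<int> sort_by_morton(const std::vector<Vec3>& center, Vec3 lo, Vec3 hi)
{
    auto quantize = [](double v, double vmin, double vmax) {
        return (uint32_t)(1023.0 * (v - vmin) / (vmax - vmin));  // map to [0, 1023]
    };

    std::vector<uint32_t> key(center.size());
    for (size_t i = 0; i < center.size(); ++i)
        key[i] = morton32_encode_(quantize(center[i].x, lo.x, hi.x),
                                  quantize(center[i].y, lo.y, hi.y),
                                  quantize(center[i].z, lo.z, hi.z));

    std::vector<int> order(center.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return key[a] < key[b]; });
    return order;   // cell indices in Morton order; partition by splitting this array
}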




Quote:

Originally Posted by aerosayan (Post 786429)
I'm trying to parallelize my explicit compressible Euler FVM solver using OpenMP.


I want to distribute the cells onto 2 threads. But before distributing the cells onto the 2 threads, I want to group them together into 2 distinct sets.


This is my current result: https://imgur.com/a/GrU9jvy


The cells of the same color are distributed onto the same thread.



As can be seen, my current grouping algorithm needs to be improved. Currently I'm simply dividing the cell list into 2 parts and distributing them to the 2 threads.



However, this produces a scattered distribution as can be seen in the image. This is problematic for me.


I would like to minimize the intersection boundary between the cells of these 2 threads. I would like to make 2 groups of cells that are closely packed together, and not scattered as shown in the image.



Is there any algorithm that you can recommend to do this?


