
Can we group multiple structured blocks together to speed up calculations?



January 18, 2021, 05:20   #1
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
In the pictures attached below, I have multiple structured blocks at different grid levels.


The experts who developed BoxLib and AMReX apparently just solve the different blocks on different CPUs using MPI. That makes sense, since their computations are extremely large and there's no way all of the data could fit on a single machine.


For my case, I don't care about scaling to thousands of cores. My objective is to make it run as fast as possible on a single core. For that objective, it's bad to solve these blocks separately.


If I solve them as separate structured blocks, I will have to enforce the boundary conditions between each block every iteration. That would be bad. Also, I don't want to use MPI; I want to use OpenMP.



I'm greedy about performance. I want it all.


I'm thinking about flattening out the whole level (multiple structured blocks) into a single 1D array, and iterating over the whole thing using i=1,ncells_in_level.

In this method, I will only have to worry about setting the boundary conditions once per iteration (only between the grid levels L and L-1), and the whole freakin' calculation will be vectorized.
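
Roughly, the layout and sweep I have in mind (a minimal sketch; every name except ncells_in_level is made up):
Code:
// Minimal sketch of the flattened level: each block's cells are stored
// back-to-back in one contiguous allocation, so the update loop never
// needs to know which block a cell belongs to.
#include <vector>

struct FlatLevel
{
    int ncells_in_level = 0;        // total cells across all blocks on this level
    std::vector<int> block_offset;  // block b occupies [block_offset[b], block_offset[b+1])
    std::vector<double> U, UN, R, reciprocal_cvols;
};

// Explicit update sweep: one vectorizable loop, no per-block bookkeeping.
void update_level(FlatLevel &lvl, double alpha)
{
    #pragma omp simd
    for (int i = 0; i < lvl.ncells_in_level; ++i)
        lvl.U[i] = lvl.UN[i] + alpha * lvl.reciprocal_cvols[i] * lvl.R[i];
}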


Will that work?

What should I look out for, so that I don't blow off my own foot doing this?



Because if this works : Lord forgive me for the performance boost I'm about to receive!!




EDIT: Upon further investigation, it looks like I will still have to update the data at the boundaries between blocks, but the vectorization potential is still a good outcome.
And since everything's in the form of 1D arrays, I can absolutely distribute the work across 4 of my CPU cores using OpenMP!
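
Concretely, the work-sharing would just be something like this (a sketch, reusing the hypothetical flat arrays from above; 4 threads assumed):
Code:
// Sketch: split the flat loop across 4 cores, keeping SIMD vectorization
// inside each thread's chunk. Goes inside the time-step routine.
#include <omp.h>

omp_set_num_threads(4);                    // one thread per core on my machine
#pragma omp parallel for simd schedule(static)
for (int i = 0; i < ncells_in_level; ++i)
    U_[i] = UN_[i] + alpha * reciprocal_cvols_[i] * R_[i];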





Thanks
~sayan
Attached Images
File Type: jpg AMR_regridding.jpg
File Type: gif cfl3d-grid.gif

Last edited by aerosayan; January 18, 2021 at 07:21. Reason: L+1 to L-1

January 18, 2021, 06:38   #2
Senior Member
andy
Join Date: May 2009
Posts: 270
Is this an implicit or explicit code?

January 18, 2021, 06:54   #3
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Quote:
Originally Posted by andy_
Is this an implicit or explicit code?
Explicit code.

January 18, 2021, 07:22   #4
Senior Member
andy
Join Date: May 2009
Posts: 270
Quote:
Originally Posted by aerosayan
Explicit code.
In this case there is no dependency between blocks when advancing one step. A simple approach is to add halo cells for your "bcs", with a depth given by the order of your code. After each time advance of the internal values, copy the new values to all overlying halo cells and advance in time again. The four sides of each block will require an index into the neighbouring block, and likely a stride if the blocks can vary in size or orientation.
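
For one face of one block, the copy is roughly as follows (purely illustrative names, not from any particular code):
Code:
// Illustrative sketch: refresh the halo cells along one face of a block
// from the interior of its neighbour. 'depth' is the halo width (set by
// the order of the scheme); the two strides walk the neighbour's memory
// in the right directions whatever its size or orientation.
void copy_halo_face(double *halo, const double *neighbour,
                    int nface,         // cells along the shared face
                    int depth,         // halo width = order of the scheme
                    int start,         // flat index of the first donor cell
                    int face_stride,   // memory step along the face
                    int layer_stride)  // memory step perpendicular to the face
{
    for (int layer = 0; layer < depth; ++layer)
        for (int k = 0; k < nface; ++k)
            halo[layer * nface + k] =
                neighbour[start + layer * layer_stride + k * face_stride];
}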

January 18, 2021, 07:31   #5
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Quote:
Originally Posted by andy_
In this case there is no dependency between blocks when advancing one step. A simple approach is to add halo cells...

Yes Andy, that seems like a great way to do it!

January 18, 2021, 13:00   #6
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Quote:
Originally Posted by andy_
In this case there is no dependency between blocks when advancing one step. A simple approach is to add halo cells...

Hey andy,
I thought about this. Can we keep the ghost/halo cells in a separate array?


I have the solution update loop as:
Code:
// Explicit update of the conserved variables: U = UN + alpha * R / V
#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{
    U_[i] = UN_[i] + alpha*reciprocal_cvols_[i]*R_[i];
}
U_, UN_, R_, etc. contain values for both the boundary and internal cells.
Since I will allocate the arrays as U_ = [U_ for block 1][U_ for block 2] ... [U_ for block N], i.e., in a single contiguous memory block, I can't have the ghost/halo cells inside this region.


Is it okay if I keep them in a separate array, like the sketch below? Or will it cause too many problems down the line?
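
A minimal sketch of the separate-array layout I'm picturing (all names made up):
Code:
// Sketch of the separate-array idea: solution data stays contiguous across
// blocks, while the halos live in their own packed array, found through
// per-block, per-side offsets into it.
#include <vector>

struct LevelStorage
{
    std::vector<double> U;            // [block 1][block 2] ... [block N], contiguous
    std::vector<double> halo;         // all halo cells of the level, packed together
    std::vector<int>    halo_offset;  // halo_offset[4*b + side] = start of that run
};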


This is the first time I'm using halo cells, so I don't know much about them.


Thanks


BTW: Are you Andrew Hollingsworth?

January 18, 2021, 15:22   #7
Senior Member
andy
Join Date: May 2009
Posts: 270
Quote:
Originally Posted by aerosayan
Can we keep the ghost/halo cells in a separate array?

This is the first time I'm using halo cells, so I don't know much about them.

BTW : Are you Andrew Hollingsworth?
The halo cells allow the gradients near the boundary to be evaluated using the same (efficient) structured grid differencing as the cells away from the boundary. If the boundary values are not present in the structured grid blocks, then you will have to detect the cells near the boundaries and use some form of indirect addressing for them, which will significantly increase the amount of coding and slow it down. If you take this approach then you might as well drop the boundary arrays altogether and directly address the values in the neighbouring blocks, avoiding any copying.

It is easy enough to check, but the ghost cells approach is likely to be computationally faster with simpler coding, at the cost of a bit more storage.
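
i.e. something along the following lines, purely to illustrate the indirect addressing involved (none of these names are from your code):
Code:
// Illustrative only: without halos, a boundary-adjacent stencil has to
// branch and fetch its missing neighbour through a precomputed index map.
double west_neighbour(int i)
{
    if (!on_west_face[i])
        return U_[i - 1];                // interior: plain structured addressing
    return U_[west_donor_index[i]];      // boundary: indirect lookup next door
}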

I am not Andrew Hollingsworth whoever he may be.

January 19, 2021, 01:49   #8
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Quote:
Originally Posted by andy_
The halo cells allow the gradients near the boundary to be evaluated using the same (efficient) structured grid differencing...

Thanks for your help Andy.



I agree that the indirect referencing will be a little more complicated, and there may be a performance penalty; I will have to see if I can make that efficient.


The flux calculations will be slow by default, since there are so many branches inside them. Adding this indirect referencing will introduce even more branches if I implement it naively.


A naive implementation would be:
Code:
// Naive: a branch inside the hot loop for every boundary-adjacent cell.
if(at_boundary)
{
    data = get_data_from_external_1d_array(i,j);
}
However, in the conventional form of ghost cells you mentioned, writing into the ghost cells will be quite slow too due to cache misses, since many cells at the boundary are located some distance apart in memory from each other (the North and South wall cells, to be specific). And since we'll only be iterating over each level at most 2 times, I will have to do some tests to see if the conventional form is superior. (EDIT: I don't think the slowdown would be negligible, because when we write into the ghost cells they won't have been loaded into cache beforehand, so each write will incur a DRAM or L3 cache access, which is slow.)



My plan was to use a single 1D array that contains all of the ghost cell data, in exactly the order it will be accessed. This is shown in the attached image. In this case, even if the structured block is large, the ghost data will be present in at least the L2 cache, so writing into it will be fast, and reading from it will also be fast.
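
In code, the packing idea is roughly this (hypothetical names throughout, including the fill_side helper):
Code:
// Sketch: all ghost data for the level packed into one 1D scratch array,
// in exactly the order the sweep will consume it, so every ghost read
// during the sweep is a sequential ghost[g++].
#include <vector>

std::vector<double> ghost;   // refilled once per iteration, in access order
ghost.clear();
for (int b = 0; b < nblocks; ++b)
    for (int side = 0; side < 4; ++side)
        fill_side(ghost, b, side);   // hypothetical: appends that side's
                                     // donor values in sweep order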
Attached Images
File Type: jpg ghost-cells.jpg

Last edited by aerosayan; January 19, 2021 at 05:18.

January 19, 2021, 06:46   #9
Senior Member
andy
Join Date: May 2009
Posts: 270
Quote:
Originally Posted by aerosayan
My plan was to...
Tempted as I am to discuss the pros and cons, these would be influenced by details I don't possess, like the numerical scheme, the future of the code, the involvement of others, etc. I will instead point out that you will almost certainly be going about things in a reliable way if you base your main decisions on evidence from testing (not surmising), and if you avoid getting too attached to elaborate algorithms you may have invested significant time, effort and pride in devising.

This tends to mean writing the simplest realistic scheme first (e.g. halo cells in this case) and testing it on both small problems and large realistic ones. If it looks as if a more elaborate scheme might be worthwhile (i.e. achieved performance is less than 50% or so of theoretical), then develop one that addresses the deficiencies that have actually been measured. Typically, when raising the computational performance of an existing numerical code, only a few routines matter, and although most are the ones one would have picked beforehand, quantitative measurement via profiling can change the assumed relative importance.
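
Even a crude wall-clock timing of the kernel in question settles most such arguments; a generic sketch, nothing project-specific assumed:
Code:
// Generic timing harness: measure before believing any argument about speed.
#include <chrono>
#include <cstdio>

auto t0 = std::chrono::steady_clock::now();
advance_one_step();   // hypothetical: whatever kernel is under discussion
auto t1 = std::chrono::steady_clock::now();
std::printf("one step: %.3f ms\n",
            std::chrono::duration<double, std::milli>(t1 - t0).count());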


Tags
solver development, solver optimization




