
Can we group multiple structured blocks together to speed up calculations?



January 18, 2021, 05:20   #1
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
In the pictures attached below, I have multiple structured blocks at different grid levels.


The experts who developed BoxLib and AMReX apparently just solve the different blocks on different CPUs using MPI. That makes sense, since their computations are extremely large and there's no way all of the data could fit on a single machine.


For my case, I don't care about scaling to thousands of cores. My objective is to make it run as fast as possible on a single core. For that objective, it's bad to solve these blocks separately.


If I solve them as separate structured blocks, I will have to enforce the boundary conditions between each block every iteration. That would be bad. Also, I don't want to use MPI; I want to use OpenMP.



I'm greedy about performance. I want it all.


I'm thinking about flattening out the whole level (multiple structured blocks) into a single 1D array, and iterating over the whole thing using i=1,ncells_in_level.

In this method, I will only have to worry about setting the boundary conditions once per iteration (only between the grid levels L and L-1), and the whole freakin' calculation will be vectorized.
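
Roughly, the layout and sweep I have in mind (a minimal sketch; every name except ncells_in_level is made up):
Code:
// Minimal sketch of the flattened level: each block's cells are stored
// back-to-back in one contiguous allocation, so the update loop never
// needs to know which block a cell belongs to.
#include <vector>

struct FlatLevel
{
    int ncells_in_level = 0;        // total cells across all blocks on this level
    std::vector<int> block_offset;  // block b occupies [block_offset[b], block_offset[b+1])
    std::vector<double> U, UN, R, reciprocal_cvols;
};

// Explicit update sweep: one vectorizable loop, no per-block bookkeeping.
void update_level(FlatLevel &lvl, double alpha)
{
    #pragma omp simd
    for (int i = 0; i < lvl.ncells_in_level; ++i)
        lvl.U[i] = lvl.UN[i] + alpha * lvl.reciprocal_cvols[i] * lvl.R[i];
}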


Will that work?

What should I look out for, so that I don't blow off my own foot doing this?



Because if this works : Lord forgive me for the performance boost I'm about to receive!!




EDIT: Upon further investigation, it looks like I will still have to update the data at the boundaries between blocks, but the vectorization potential is still a good outcome.
And since everything's in the form of 1D arrays, I can absolutely distribute the work across 4 of my CPU cores using OpenMP!
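
Concretely, the work-sharing would just be something like this (a sketch, reusing the hypothetical flat arrays from above; 4 threads assumed):
Code:
// Sketch: split the flat loop across 4 cores, keeping SIMD vectorization
// inside each thread's chunk. Goes inside the time-step routine.
#include <omp.h>

omp_set_num_threads(4);                    // one thread per core on my machine
#pragma omp parallel for simd schedule(static)
for (int i = 0; i < ncells_in_level; ++i)
    U_[i] = UN_[i] + alpha * reciprocal_cvols_[i] * R_[i];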





Thanks
~sayan
Attached Images
File Type: jpg AMR_regridding.jpg
File Type: gif cfl3d-grid.gif

Last edited by aerosayan; January 18, 2021 at 07:21. Reason: L+1 to L-1

January 18, 2021, 06:38   #2
Senior Member
andy
Join Date: May 2009
Posts: 270
Is this an implicit or explicit code?

January 18, 2021, 06:54   #3
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Quote:
Originally Posted by andy_
Is this an implicit or explicit code?
Explicit code.

January 18, 2021, 07:22   #4
Senior Member
andy
Join Date: May 2009
Posts: 270
Quote:
Originally Posted by aerosayan
Explicit code.
In this case there is no dependency between blocks when advancing one step. A simple approach is to add halo cells for your "bcs", with a depth given by the order of your code. After each time advance of the internal values, copy the new values to all overlying halo cells and advance in time again. The four sides of each block will require an index into the neighbouring block, and likely a stride if the blocks can vary in size or orientation.
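
For one face of one block, the copy is roughly as follows (purely illustrative names, not from any particular code):
Code:
// Illustrative sketch: refresh the halo cells along one face of a block
// from the interior of its neighbour. 'depth' is the halo width (set by
// the order of the scheme); the two strides walk the neighbour's memory
// in the right directions whatever its size or orientation.
void copy_halo_face(double *halo, const double *neighbour,
                    int nface,         // cells along the shared face
                    int depth,         // halo width = order of the scheme
                    int start,         // flat index of the first donor cell
                    int face_stride,   // memory step along the face
                    int layer_stride)  // memory step perpendicular to the face
{
    for (int layer = 0; layer < depth; ++layer)
        for (int k = 0; k < nface; ++k)
            halo[layer * nface + k] =
                neighbour[start + layer * layer_stride + k * face_stride];
}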

January 18, 2021, 07:31   #5
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Quote:
Originally Posted by andy_
In this case there is no dependency between blocks when advancing one step. A simple approach is to add halo cells...

Yes Andy, that seems like a great way to do it!

January 18, 2021, 13:00   #6
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Quote:
Originally Posted by andy_
In this case there is no dependency between blocks when advancing one step. A simple approach is to add halo cells...

Hey andy,
I thought about this. Can we keep the ghost/halo cells in a separate array?


I have the solution update loop as:
Code:
// Explicit update of the conserved variables: U = UN + alpha * R / V
#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{
    U_[i] = UN_[i] + alpha*reciprocal_cvols_[i]*R_[i];
}
U_, UN_, R_, etc. contain values for both the boundary and internal cells.
Since I will allocate the arrays as U_ = [U_ for block 1][U_ for block 2] ... [U_ for block N], i.e., in a single contiguous memory block, I can't have the ghost/halo cells inside this region.


Is it okay if I keep them in a separate array, like the sketch below? Or will it cause too many problems down the line?
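
A minimal sketch of the separate-array layout I'm picturing (all names made up):
Code:
// Sketch of the separate-array idea: solution data stays contiguous across
// blocks, while the halos live in their own packed array, found through
// per-block, per-side offsets into it.
#include <vector>

struct LevelStorage
{
    std::vector<double> U;            // [block 1][block 2] ... [block N], contiguous
    std::vector<double> halo;         // all halo cells of the level, packed together
    std::vector<int>    halo_offset;  // halo_offset[4*b + side] = start of that run
};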


This is the first time I'm using halo cells, so I don't know much about them.


Thanks


BTW: Are you Andrew Hollingsworth?

January 18, 2021, 15:22   #7
Senior Member
andy
Join Date: May 2009
Posts: 270
Quote:
Originally Posted by aerosayan
Can we keep the ghost/halo cells in a separate array?

This is the first time I'm using halo cells, so I don't know much about them.

BTW : Are you Andrew Hollingsworth?
The halo cells allow the gradients near the boundary to be evaluated using the same (efficient) structured grid differencing as the cells away from the boundary. If the boundary values are not present in the structured grid blocks, then you will have to detect the cells near the boundaries and use some form of indirect addressing for them, which will significantly increase the amount of coding and slow it down. If you take this approach then you might as well drop the boundary arrays altogether and directly address the values in the neighbouring blocks, avoiding any copying.

It is easy enough to check, but the ghost cells approach is likely to be computationally faster with simpler coding, at the cost of a bit more storage.
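
i.e. something along the following lines, purely to illustrate the indirect addressing involved (none of these names are from your code):
Code:
// Illustrative only: without halos, a boundary-adjacent stencil has to
// branch and fetch its missing neighbour through a precomputed index map.
double west_neighbour(int i)
{
    if (!on_west_face[i])
        return U_[i - 1];                // interior: plain structured addressing
    return U_[west_donor_index[i]];      // boundary: indirect lookup next door
}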

I am not Andrew Hollingsworth whoever he may be.

January 19, 2021, 01:49   #8
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Quote:
Originally Posted by andy_
The halo cells allow the gradients near the boundary to be evaluated using the same (efficient) structured grid differencing...

Thanks for your help Andy.



I agree that the indirect referencing will be a little more complicated, and there may be a performance penalty; I will have to see if I can make that efficient.


The flux calculations will be slow by default, since there are so many branches inside them. Adding this indirect referencing will introduce even more branches if I implement it naively.


A naive implementation would be:
Code:
// Naive: a branch inside the hot loop for every boundary-adjacent cell.
if(at_boundary)
{
    data = get_data_from_external_1d_array(i,j);
}
However, in the conventional form of ghost cells you mentioned, writing into the ghost cells will be quite slow too due to cache misses, since many cells at the boundary are located some distance apart in memory from each other (the North and South wall cells, to be specific). And since we'll only be iterating over each level at most 2 times, I will have to do some tests to see if the conventional form is superior. (EDIT: I don't think the slowdown would be negligible, because when we write into the ghost cells they won't have been loaded into cache beforehand, so each write will incur a DRAM or L3 cache access, which is slow.)



My plan was to use a single 1D array that contains all of the ghost cell data, in exactly the order it will be accessed. This is shown in the attached image. In this case, even if the structured block is large, the ghost data will be present in at least the L2 cache, so writing into it will be fast, and reading from it will also be fast.
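
In code, the packing idea is roughly this (hypothetical names throughout, including the fill_side helper):
Code:
// Sketch: all ghost data for the level packed into one 1D scratch array,
// in exactly the order the sweep will consume it, so every ghost read
// during the sweep is a sequential ghost[g++].
#include <vector>

std::vector<double> ghost;   // refilled once per iteration, in access order
ghost.clear();
for (int b = 0; b < nblocks; ++b)
    for (int side = 0; side < 4; ++side)
        fill_side(ghost, b, side);   // hypothetical: appends that side's
                                     // donor values in sweep order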
Attached Images
File Type: jpg ghost-cells.jpg

Last edited by aerosayan; January 19, 2021 at 05:18.

January 19, 2021, 06:46   #9
Senior Member
andy
Join Date: May 2009
Posts: 270
Quote:
Originally Posted by aerosayan
My plan was to...
Tempted as I am to discuss the pros and cons, these would be influenced by details I don't possess, like the numerical scheme, the future of the code, the involvement of others, etc. I will instead point out that you will almost certainly be going about things in a reliable way if you base your main decisions on evidence from testing (not surmising), and if you avoid getting too attached to elaborate algorithms you may have invested significant time, effort and pride in devising.

This tends to mean writing the simplest realistic scheme first (e.g. halo cells in this case) and testing it on both small problems and large realistic ones. If it looks as if a more elaborate scheme might be worthwhile (i.e. achieved performance is less than 50% or so of theoretical), then develop one that addresses the deficiencies that have actually been measured. Typically, when raising the computational performance of an existing numerical code, only a few routines matter, and although most are the ones one would have picked beforehand, quantitative measurement via profiling can change the assumed relative importance.
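
Even a crude wall-clock timing of the kernel in question settles most such arguments; a generic sketch, nothing project-specific assumed:
Code:
// Generic timing harness: measure before believing any argument about speed.
#include <chrono>
#include <cstdio>

auto t0 = std::chrono::steady_clock::now();
advance_one_step();   // hypothetical: whatever kernel is under discussion
auto t1 = std::chrono::steady_clock::now();
std::printf("one step: %.3f ms\n",
            std::chrono::duration<double, std::milli>(t1 - t0).count());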


Tags
solver development, solver optimization




