Can we group multiple structured blocks together to speedup calculations? |
January 18, 2021, 05:20 |
|
#1 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
In the pictures shown below, I have multiple structured blocks at different layers.
Normally experts who developed BoxLib, AMRex, apparently just solve the different blocks in different CPUs using MPI. That makes sense since their computations are extremely large, and there's no way all of the data could be stored onto a single machine. For my case, I don't care about scaling to thousands of cores. My objective is to make it run as fast as possible on a single core. For that objective, it's bad to solve these blocks separately. If I try to solve them on multiple structured blocks, I will have to enforce the bounday conditions betweeneach block, every iteration. That would be bad. Also, I don't want to use MPI, I want to use OpenMP. I'm greedy about performance. I want it all. I'm thinking about flattening out the whole level (multiple structured blocks) into a single 1D array, and iterating over the whole thing using i=1,ncells_in_level. In this method, I will only have to worry about setting the boundary conditions once (only between the grid levels L, L-1) per iteration and the whole freakin' calculation will be vectorized. Will that work? What should I look out for, so that i don't blow off my own foot doing this? Because if this works : Lord forgive me for the performance boost I'm about to receive!! EDIT : Upon further investigation, it looks like I will still have to update the data between the different boundary blocks, but the vectorization potential is still a good outcome. Aaaaannnnddd, since everything's in the form of 1D arrays, I can absolutely distribute the work onto 4 of my CPU cores using OpenMP!!!!!! Thanks ~sayan Last edited by aerosayan; January 18, 2021 at 07:21. Reason: L+1 to L-1 |
|
January 18, 2021, 06:38 |
|
#2 |
Senior Member
andy
Join Date: May 2009
Posts: 281
Rep Power: 18 |
Is this an implicit or explicit code?
|
|
January 18, 2021, 06:54 |
|
#3 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
||
January 18, 2021, 07:22 |
|
#4 |
Senior Member
andy
Join Date: May 2009
Posts: 281
Rep Power: 18 |
In this case there is no dependency between blocks when advancing one step. A simple approach is to add halo cells for your "bcs" of a depth given by the order of your code. After each time advance of the internal values copy the new values to all overlaying halo cells and advance in time again. The 4 sides of each block will require an index into the neighbouring block and likely a stride if the blocks can vary in size or orientation.
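A minimal sketch of the halo copy described above, assuming all blocks (including their halo layers) live in one flat array. `HaloLink` and `exchange_halos` are hypothetical names used for illustration; each link records a start index and a stride on both the donor side and the halo side, so blocks of different row length or reversed orientation (negative stride) can be connected.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one halo strip:
// for k = 0..count-1, U[dst_start + k*dst_stride] = U[src_start + k*src_stride].
struct HaloLink {
    std::size_t    src_start;   // first donor (interior) cell in the neighbouring block
    std::ptrdiff_t src_stride;  // step between donor cells; negative for reversed orientation
    std::size_t    dst_start;   // first halo cell in the receiving block
    std::ptrdiff_t dst_stride;  // step between halo cells
    std::size_t    count;       // number of cells along this side
};

// Copy freshly advanced interior values into every overlapping halo cell.
// Intended to be called once after each time advance.
inline void exchange_halos(std::vector<double>& U, const std::vector<HaloLink>& links) {
    for (const HaloLink& link : links) {
        std::ptrdiff_t s = static_cast<std::ptrdiff_t>(link.src_start);
        std::ptrdiff_t d = static_cast<std::ptrdiff_t>(link.dst_start);
        for (std::size_t k = 0; k < link.count; ++k) {
            assert(s >= 0 && static_cast<std::size_t>(s) < U.size());
            assert(d >= 0 && static_cast<std::size_t>(d) < U.size());
            U[static_cast<std::size_t>(d)] = U[static_cast<std::size_t>(s)];
            s += link.src_stride;
            d += link.dst_stride;
        }
    }
}
```

For example, with two 4x4 padded blocks stored at offsets 0 and 16 (2x2 interior plus one halo layer each), the link `{21, 4, 7, 4, 2}` fills the right halo column of the first block from the first interior column of the second. The links only depend on the block topology, so they can be built once at setup.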
|
|
January 18, 2021, 07:31 |
|
#5 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
Yes Andy, that seems like a great way to do it! |
||
January 18, 2021, 13:00 |
|
#6 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
Hey andy, I thought about this. Can we keep the ghost/halo cells in a separate array? I have the solution update loop as:
Code:
#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i) { // <- loop i
    U_[i] = UN_[i] + alpha*reciprocal_cvols_[i]*R_[i];
} // -> loop i
Since I will allocate the arrays as U_ = [U_ for block 1][U_ for block 2] ... [U_ for block N], i.e. in a single contiguous memory block, I can't have the ghost/halo cells inside this region. Is it okay if I keep them in a separate array? Or will it cause too many problems down the line? This is the first time I'm using halo cells, so I don't know much about them.
Thanks
BTW: Are you Andrew Hollingsworth?
||
January 18, 2021, 15:22 |
|
#7 | |
Senior Member
andy
Join Date: May 2009
Posts: 281
Rep Power: 18 |
Quote:
It is easy enough to check, but the ghost cell approach is likely to be computationally faster, with simpler coding, at the cost of a bit more storage. I am not Andrew Hollingsworth, whoever he may be. |
||
January 19, 2021, 01:49 |
|
#8 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
Thanks for your help Andy. I agree that the indirect referencing will be a little more complicated, and there may be a performance penalty. I will have to see if I can make that efficient. The flux calculations will be slow by default, since there are so many branches inside them. Including this indirect referencing will also add a lot of branches if I implement it naively. A naive implementation would be:
Code:
if(at_boundary) {
    data = get_data_from_external_1d_array(i,j);
}
My plan was to use a single 1D array that will contain all of the ghost cell data, in exactly the order in which it will be accessed. This is shown in the attached image. In this case, even if the structured block is large, the data will be present in at least L2 cache. Writing into it will be fast, and reading from it will also be fast.
Last edited by aerosayan; January 19, 2021 at 05:18.
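One possible way to realise the separate ghost array without an `if(at_boundary)` in the inner loop is sketched below. It is an assumption-laden illustration, not the poster's actual code: `GhostBuffer`, `fill_ghosts` and `row_face_diffs` are hypothetical names, and the simple face difference stands in for a real flux routine. The ghost values are gathered once per iteration through a precomputed index list, and the boundary face is peeled out of the loop so that the interior faces stay branch-free.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical ghost storage: value[k] is a copy of U[src[k]], with k ordered
// in the sequence in which the flux loop will read the ghosts.
struct GhostBuffer {
    std::vector<std::size_t> src;    // flat index into U of the donor cell for ghost k
    std::vector<double>      value;  // ghost values, same order as src
};

// Gather donor values into the ghost buffer; no per-cell boundary test needed.
inline void fill_ghosts(GhostBuffer& g, const std::vector<double>& U) {
    assert(g.value.size() == g.src.size());
    for (std::size_t k = 0; k < g.src.size(); ++k) {
        assert(g.src[k] < U.size());
        g.value[k] = U[g.src[k]];
    }
}

// Stand-in for a flux routine: difference across the left face of each cell
// in one row of a block. The boundary face reads the ghost value once; the
// remaining loop over interior faces contains no branch.
inline void row_face_diffs(const double* U_row, std::size_t ni,
                           double left_ghost, double* dU) {
    assert(ni >= 1);
    dU[0] = U_row[0] - left_ghost;            // boundary face, peeled out of the loop
    for (std::size_t i = 1; i < ni; ++i)
        dU[i] = U_row[i] - U_row[i - 1];      // interior faces
}
```

Whether this beats in-array halo cells is something only measurement on the real solver can settle; the sketch just shows that the separate-array variant does not have to introduce branches in the hot loop.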
||
January 19, 2021, 06:46 |
|
#9 |
Senior Member
andy
Join Date: May 2009
Posts: 281
Rep Power: 18 |
Tempted as I am to discuss the pros and cons, these would be influenced by details I don't possess, like the numerical scheme, the future of the code, the involvement of others, etc. I will instead point out that you will almost certainly be going about things in a reliable way if you base your main decisions on evidence from testing (not surmising) and avoid getting too attached to elaborate algorithms in which you may have invested significant time, effort and pride.
This tends to mean writing the simplest realistic scheme first (e.g. halo cells in this case) and testing it on small problems and on large realistic problems. If it looks as if a more elaborate scheme might be worthwhile (i.e. achieved performance is less than 50% or so of theoretical), then develop a more elaborate scheme to address the deficiencies that have been measured. Typically, when raising the computational performance of an existing numerical code, only a few routines tend to matter, and although most are the ones one would have picked beforehand, quantitative measurement via profiling can change the assumed relative importance. |
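As a rough illustration of "evidence from testing", the update loop from earlier in the thread can be timed and converted into an achieved memory bandwidth, which can then be compared by hand against the machine's theoretical figure. This is a sketch under assumptions: `Timing` and `time_update` are hypothetical names, the array contents are dummy data, and the byte count (four arrays of doubles per cell) ignores effects such as write-allocate traffic and caching.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

struct Timing {
    double seconds;       // wall time for all repeats
    double gbytes_per_s;  // nominal traffic divided by time (0 if the timer saw no elapsed time)
    double checksum;      // sum of the final field, so the work cannot be optimised away
};

// Time `repeats` sweeps of the explicit update over `ncells` cells of dummy data.
inline Timing time_update(std::size_t ncells, int repeats) {
    assert(ncells > 0 && repeats > 0);
    std::vector<double> U(ncells, 0.0), UN(ncells, 1.0), R(ncells, 2.0), rcvol(ncells, 1.0);
    const double alpha = 0.5;

    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        for (std::size_t i = 0; i < ncells; ++i)
            U[i] = UN[i] + alpha * rcvol[i] * R[i];
        UN.swap(U);  // feed the result back in so every sweep depends on the previous one
    }
    const auto t1 = std::chrono::steady_clock::now();

    Timing t{};
    t.seconds = std::chrono::duration<double>(t1 - t0).count();
    for (double v : UN) t.checksum += v;
    // Nominal traffic: read UN, rcvol, R and write U, 8 bytes each, per cell per sweep.
    const double bytes = 4.0 * 8.0 * static_cast<double>(ncells) * repeats;
    t.gbytes_per_s = (t.seconds > 0.0) ? bytes / t.seconds / 1.0e9 : 0.0;
    return t;
}
```

Running this for a small problem that fits in cache and for a large one that does not gives two reference points; a proper profiler is still the better tool for finding which routines dominate in the full solver.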
|