|
How can we optimize the flux calculations in explicit CFD solvers? |
|
May 16, 2021, 00:55 |
How can we optimize the flux calculations in explicit CFD solvers?
|
#1 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
How can we optimize the flux calculations in explicit CFD solvers?
They seem to take the majority of the computational time, and there are so many if-else branches in the code that branch mispredictions cause a noticeable performance loss. Additionally, I haven't been able to vectorize the flux calculations at all. So far I have been able to find only two ways to improve the performance of the flux calculations :
Thanks |
|
May 16, 2021, 03:58 |
|
#2 |
Senior Member
|
I think we already discussed this, but fluxes in FV codes are not computed by looping over the cells.
Instead, you loop over the faces: retrieve the variables on the two sides, compute the flux, and scatter the flux back to the two sides. It is better if your faces are already grouped into zones with uniform conditions or, at least, somehow ordered. In the first case you move the IF up a level, to the loop over such groups of faces. In the second case you still have the IF in the main loop, but if the ordering of the faces has some logic to it (basically the same logic as the grouping mentioned before), branch prediction should be fine (it will miss only once every N faces). |
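To make the idea concrete, here is a minimal sketch of the face loop with the IF lifted to the group level. It is illustrative only, not taken from any specific code, and all the names (group_type, face_left, ...) and the placeholder flux kernels are mine.
Code:
// Face-based flux loop with faces pre-grouped by type, so the branch is
// taken once per group instead of once per face. Names are illustrative.
inline double interior_flux(double ul, double ur) { return 0.5 * (ul + ur); }  // placeholder kernel
inline double wall_flux(double ul)                { return ul; }                // placeholder kernel

enum FaceType { INTERIOR = 0, WALL = 1 /* , inlet, outlet, ... */ };

void accumulate_fluxes(int ngroups,
                       const int*    group_type,
                       const int*    group_start,   // size ngroups+1, faces sorted by group
                       const int*    face_left,     // cell on one side of each face
                       const int*    face_right,    // cell on the other side
                       const double* area,
                       const double* U,             // one scalar per cell, for brevity
                       double*       R)             // residual accumulator per cell
{
    for (int g = 0; g < ngroups; ++g)
    {
        const int f0 = group_start[g], f1 = group_start[g + 1];
        switch (group_type[g])                      // the IF, once per group
        {
        case INTERIOR:
            for (int f = f0; f < f1; ++f)
            {
                const double flux = interior_flux(U[face_left[f]], U[face_right[f]]) * area[f];
                R[face_left[f]]  -= flux;           // spread the flux back to both sides
                R[face_right[f]] += flux;
            }
            break;
        case WALL:
            for (int f = f0; f < f1; ++f)
                R[face_left[f]] -= wall_flux(U[face_left[f]]) * area[f];
            break;
        }
    }
}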
|
May 16, 2021, 04:20 |
|
#3 |
Senior Member
|
What you can do in addition, if you are working on a coupled code, is to store the coupled variables together. That is, not, say, p, u, v, w and T as separate arrays, but a single array var(nvar,ncells). Then, when you loop over the faces for the fluxes, you get them all at once. This, of course, makes sense because the coupled approach needs all of them; it makes no sense for segregated approaches.
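A minimal sketch of that layout in C++ (the C analogue of Fortran's var(nvar,ncells); names are illustrative, not from any specific code):
Code:
// Coupled storage: the nvar values of a cell sit next to each other in memory,
// so a face loop pulls all of them in with one or two cache lines per cell.
#include <cstddef>
#include <vector>

constexpr int NVAR = 5;   // e.g. p, u, v, w, T

inline double& cell_var(std::vector<double>& var, int ivar, int icell)
{
    return var[static_cast<std::size_t>(icell) * NVAR + ivar];
}

// Usage: std::vector<double> var(static_cast<std::size_t>(NVAR) * ncells);
//        cell_var(var, 0, icell) = p; cell_var(var, 1, icell) = u; ...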
|
|
May 16, 2021, 09:29 |
|
#4 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Thank you for your help and effort. You're right that the fluxes are computed by looping over the faces. What I meant to say earlier was that I loop over the cells and, for each cell, over all of its faces.
Quote:
I'm currently stuck on this point. I need to benchmark whether keeping the variables in separate arrays, as [rho1, rho2, rho3, ...][u1, u2, u3, ...], causes any performance loss. I expect some loss on the first iteration due to 4 cache misses when reading the four variables from 4 different arrays. After that, I expect the hardware prefetchers to kick in and preload the required data, since we increment i by 1, so from the second iteration onward there shouldn't be a performance loss. If there's no performance loss from using separate arrays to store the (rho, u, v, p) variables, or at least the run time stays bound by the computational cost of the flux and residual calculations, then the final solution update becomes extremely fast. Code:
// Solution update over all cells //
#pragma omp simd linear(i:1)
for(i=0; i<ncells*4; ++i) { // <- loop i
    U_[i] = UN_[i] + alpha*rcvols_[i]*R_[i];
} // -> loop i
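For comparison, the other scenario I want to benchmark against — (rho, u, v, p) kept in four separate arrays of length ncells — would split the update into four shorter loops, something like the sketch below. Illustrative names only, not code from my solver; rcvols then also needs only ncells entries.
Code:
#include <cstddef>

void update_split(double* rho, const double* rho_n, const double* r_rho,
                  double* u,   const double* u_n,   const double* r_u,
                  double* v,   const double* v_n,   const double* r_v,
                  double* p,   const double* p_n,   const double* r_p,
                  const double* rcvols, double alpha, std::size_t ncells)
{
    // one short loop per variable, each touching only ncells elements at a time
    #pragma omp simd
    for (std::size_t i = 0; i < ncells; ++i) rho[i] = rho_n[i] + alpha * rcvols[i] * r_rho[i];
    #pragma omp simd
    for (std::size_t i = 0; i < ncells; ++i) u[i]   = u_n[i]   + alpha * rcvols[i] * r_u[i];
    #pragma omp simd
    for (std::size_t i = 0; i < ncells; ++i) v[i]   = v_n[i]   + alpha * rcvols[i] * r_v[i];
    #pragma omp simd
    for (std::size_t i = 0; i < ncells; ++i) p[i]   = p_n[i]   + alpha * rcvols[i] * r_p[i];
}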
||
May 17, 2021, 04:15 |
|
#5 | |
Senior Member
|
Quote:
However, in my case I either have a simpler loop, with just 2 variables, or a much more complex one, where many more variables are needed (we are talking about an implicit, preconditioned, density-based code, with porous media, etc.). In the latter case, where the cache is thrashed anyway, I guess coupling the variables just makes it less thrashed. Also, it really depends on how much cache we are talking about: with 3 MB over the 3 levels, you should be able to work on 10 scalars in blocks of around 40k elements. Finally, the idea of the coupling is that you introduce it when the slowest part of the code is one where you indeed need all the variables, which for me is the flux part. If the slowest part is the update then, probably, as you show, coupling seems to be a bad choice. |
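For the record, the back-of-the-envelope arithmetic behind that 40k figure (an estimate, not a measurement):
Code:
// ~3 MB of cache shared by 10 double-precision scalars:
//   3 * 1024 * 1024 bytes / (10 scalars * 8 bytes) ~= 39,000 elements per block
#include <cstddef>
constexpr std::size_t cache_bytes   = 3u * 1024u * 1024u;
constexpr std::size_t bytes_per_elt = 10u * sizeof(double);
constexpr std::size_t block_size    = cache_bytes / bytes_per_elt;   // ~39k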
||
May 17, 2021, 05:35 |
|
#6 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
I think I might be causing undefined behavior due to incorrect use of the __restrict__ keyword.
The __restrict__ keyword is a promise to the compiler that we won't go behind its back and modify a restricted memory location through any other pointer. However, I think I'm breaking that promise by accessing the whole memory range of U1, U2, U3, U4 through U1 alone. Even if the solution is correct now, the code may become unstable in the future. This is probably also why the CPU behaves better when we access the designated memory ranges of U1, U2, U3, U4 and the others only through their own __restrict__ pointers. This is pure hypothesis, but it's the best one I have right now. |
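A stripped-down illustration of what I suspect I'm doing wrong (illustrative names and sizes, not the real solver arrays):
Code:
// Suppose U1 and U2 are __restrict__ views into one big allocation, with
// U2 == U1 + n. Walking past n through U1 touches the same memory that is
// also written through U2, which breaks the no-aliasing promise -> UB.
#include <cstddef>

void bad_update(double* __restrict__ U1, double* __restrict__ U2, std::size_t n)
{
    for (std::size_t i = 0; i < 2 * n; ++i)   // WRONG: reaches into U2's range via U1
        U1[i] += 1.0;
    for (std::size_t i = 0; i < n; ++i)       // ... while U2 also writes that range
        U2[i] *= 2.0;
}

void ok_update(double* __restrict__ U1, double* __restrict__ U2, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)       // each pointer stays inside its own range
    {
        U1[i] += 1.0;
        U2[i] *= 2.0;
    }
}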
|
May 17, 2021, 06:26 |
|
#7 |
Senior Member
|
||
May 17, 2021, 06:44 |
|
#8 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Fortran looks good. It's safe and simple. C++ is being tied up to a nuke, and launched into the sun.
I will have to decide whether I want to write the extremely performance-critical sections of the computational code in Fortran 77, because that's the only option without any major side effects. Unfortunately most Fortran compilers are garbage, except for Intel's. As I've shown before, even GCC generates bad code, and by design it doesn't expose many compiler directives for low-level control. In the meantime I will keep trying to identify the pitfalls of C++ and document them. I need to set up some safety guidelines to keep from blowing my foot off, as I'm doing now. |
|
May 17, 2021, 07:59 |
|
#9 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
@paolo The safest way to use restrict in C and __restrict__ in C++ seems to be on function parameters. I will write a safety guideline after I optimize and verify my code. In C++, with GCC, we can force calls to be inlined using __attribute__((always_inline)). I'm thinking of keeping my array pointers unrestricted in general, and, in the very performance-critical sections, passing them as restricted parameters to a function that will be forcefully inlined.
Code:
#include <cstddef>   // size_t

// Function to update the solution. GCC does not accept the attribute after
// the parameter list of a definition, so it goes in front, and 'inline' is
// added so that always_inline can actually take effect.
__attribute__((always_inline)) inline
void update_soln(double * __restrict__ u,
                 double * __restrict__ un,
                 double * __restrict__ rcvols,
                 double * __restrict__ r,
                 const size_t ncellsx4)
{
    for(size_t i = 0; i < ncellsx4; ++i)
    {
        // similar code as before
    }
}

Since generally less than 5% of the code is responsible for the majority of the execution time, it's not an issue if the remaining 95% operates sub-optimally. So for that 95% I don't mind if the pointers are unrestricted; in fact I want them unrestricted, because that gives me more flexibility and some assurance that we aren't messing things up by accident. In explicit solvers, the hottest sections of the code seem to be the flux calculations and the solution update, so the __restrict__ keyword is absolutely important there to get any kind of respectable performance out of C++. I'm currently reading Mike Acton's write-ups on game engine development, and he describes very good methods to improve performance: https://archive.ph/3zi3M |
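And this is roughly how I intend to call it (a sketch continuing the block above): everywhere else the arrays stay behind plain, unrestricted pointers; the no-aliasing promise exists only inside the force-inlined routine.
Code:
// Sketch of a call site (illustrative): plain pointers everywhere else,
// __restrict__ only inside the force-inlined update_soln() above.
void run_update_stage(double* u, double* un, double* rcvols, double* r,
                      size_t ncells)
{
    update_soln(u, un, rcvols, r, ncells * 4);   // the no-aliasing promise lives in here
}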
|
May 17, 2021, 14:07 |
|
#10 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
@paolo Someone showed me why the L3 cache performance keeps degrading in the code.
It's actually not degrading. It's probably converging to the actual expected performance. When I increase the number of outer iterations, the L3 cache miss rate converges to a steady 25%. This made me very curious, and it's apparently exactly what's expected theoretically.

I have a 3 MB L3 cache. Each of our 4 arrays (u, un, rcvols, r) is around 1 MB in size, so all 4 arrays together can't fit in the 3 MB L3 cache. Since the data can't fit, it keeps getting evicted and re-read continuously. Since the computation itself is not complex, the triple-nested loop, totaling 500*4*130,000 iterations, completes in only 0.7 seconds: 130,000 equals NCELLS*4, 500 is the total number of outer-loop iterations, and 4 is the number of mstage iterations, so one pass of 130,000 iterations of the sample code completes in approximately 0.7/500/4 seconds. The L3 cache therefore keeps getting filled up and evicted repeatedly as we go through the 500*4*130,000 iterations, and the miss rate converges to the expected 25%, since we only have 3 MB of L3 and the data touched by the code is about 4 MB. The extra 1 MB is probably what causes the misses.

And when we reduced the iteration count, iterating over only NCELLS elements at a time, we saw a performance improvement, probably because the data touched by all 4 arrays could then fit in the L3 cache and the expected loss disappeared. This is probably what happens.

Last edited by aerosayan; May 17, 2021 at 14:14. Reason: formatting is hard |
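A back-of-the-envelope check of that 25% figure, using only the numbers above (an estimate, not a new measurement):
Code:
// Working set streamed per pass vs. available L3:
//   4 arrays * 130,000 doubles * 8 bytes ~= 4.16 MB
//   L3 cache = 3 * 1024 * 1024 bytes     ~= 3.15 MB
//   fraction that cannot stay resident   ~= (4.16 - 3.15) / 4.16 ~= 0.24 -> ~25% misses
#include <cstddef>
constexpr std::size_t ncellsx4    = 130000;                          // elements per array
constexpr std::size_t working_set = 4 * ncellsx4 * sizeof(double);   // ~4.16e6 bytes
constexpr std::size_t l3_bytes    = 3u * 1024u * 1024u;              // ~3.15e6 bytes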
|
May 18, 2021, 03:40 |
|
#11 | |
Senior Member
|
Quote:
In your case it is indeed obvious that your first scenario is worse, as you are just working on larger vectors, as opposed to the second scenario, where each vector is 4 times smaller. What I hadn't considered is that in the second scenario you simply have 4 different loops. But, as we said, this is only possible for the update. When you are computing fluxes, or doing linear iterations in implicit cases, you need all the variables at once and you can't really split the loop in 4. It is in this scenario that coupling the variables makes sense; if you look at what I wrote in the Fortran example, that's indeed the scenario I considered.

So, yes, coupling has some limitations:
- larger vectors, which can eventually thrash the cache (gradient computation, for example, is one place where you must be careful)
- a net performance penalty in cases where variables could be used alone (I/O, and some related MPI communications, incur a net penalty, as you typically want to write variables one at a time)

but it has some clear advantages:
- a net speedup of the loops where all the variables are needed together, which are typically the most costly (e.g., fluxes, linear iterations, boundary conditions)
- very concise and reusable code, which I wouldn't trade even for a 50% performance boost

Note, however, that the risk of thrashing the cache with larger vectors is more than compensated in those cases where the loop itself might cause cache misses (as you now do it once instead of N times). This is certainly the case, again, for the fluxes. Of course, again, all of this makes sense only after profiling. |
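To show what the I/O penalty above amounts to, a sketch (illustrative only, not from any specific code): pulling a single variable out of the coupled array is a strided gather into a temporary contiguous buffer before it can be written out.
Code:
// With the coupled layout var(nvar,ncells) -> var[icell*nvar + ivar],
// writing one variable alone means a stride-nvar gather first.
#include <cstddef>

void gather_variable(const double* var, int nvar, int ncells, int ivar,
                     double* out)                     // contiguous buffer for I/O
{
    for (int icell = 0; icell < ncells; ++icell)
        out[icell] = var[static_cast<std::size_t>(icell) * nvar + ivar];
}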
||
May 18, 2021, 03:59 |
|
#12 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Thank you for your time and effort; you have made some really helpful points about how CFD solvers are traditionally written, and they will guide my design.
Quote:
How about an approximately 300% performance boost *per thread*? Yesterday I figured out how to SIMD-vectorize the van Leer flux. I don't know yet whether the practical performance gain will be that high. One battle at a time. |
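One ingredient of that, as a sketch of the general idea (not necessarily the exact kernel I'll ship): the van Leer split Mach number can be written branch-free with a clamp, which is what lets the compiler vectorize the loop over faces.
Code:
// Branch-free van Leer split Mach number and split mass flux (sketch only):
//   M+ = 1/4*(M+1)^2 for |M| < 1,   M+ = M for M >= 1,   M+ = 0 for M <= -1
// The clamp/max form below reproduces all three cases without an if.
#include <algorithm>   // std::clamp (C++17), std::max

inline double mach_plus(double M)
{
    const double Mc = std::clamp(M, -1.0, 1.0);
    return 0.25 * (Mc + 1.0) * (Mc + 1.0) + std::max(M, 1.0) - 1.0;
}

inline double vanleer_mass_flux_plus(double rho, double c, double M)
{
    return rho * c * mach_plus(M);   // F_mass+ ; F_mass- follows analogously with M-
}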
||
Tags |
flux calculation, optimization |
|
|