How can we optimize the flux calculations in explicit CFD solvers?

May 16, 2021, 00:55   #1
aerosayan (Sayan Bhattacharjee)
How can we optimize the flux calculations in explicit CFD solvers?

They seem to take the majority of the computational time, and the many if-else branches in the code cause branch mispredictions and some performance loss. Additionally, I haven't been able to vectorize the flux calculations at all.

I have been able to find only two ways to improve the performance of the flux calculations:
  • Organize the data well (ideally, store the data of all NCELLS cells in contiguous 1D arrays) and use OpenMP/MPI to spread the flux calculation for the NCELLS cells over the CPUs available in the machine. This seems to be the most common approach, and it works very well (a minimal sketch follows this list).
  • If a particular face's flux was already calculated while iterating over the neighbouring cell, use the pre-calculated flux instead of calculating it again. This lookup is very efficient in structured grids, and even in unstructured grids if the cells are ordered in memory using a space-filling curve.
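For the first approach, a minimal sketch (illustrative only; the array names and the per-cell routine flux_of_cell() are hypothetical) of flat 1D arrays combined with an OpenMP-parallel loop:

Code:
// One contiguous 1D array per variable, indexed by cell id. OpenMP splits the
// cell range across threads, so each thread streams through its own block.
#pragma omp parallel for schedule(static)
for(int ic=0; ic<ncells; ++ic)
{
    // flux_of_cell() stands in for whatever per-cell flux accumulation is done
    residual[ic] = flux_of_cell(rho, u, v, p, ic);
}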
Kindly share more if you know.

Thanks

May 16, 2021, 03:58   #2
sbaffini (Paolo Lampitella)
I think we already discussed this, but fluxes in FV codes are not computed by looping on the cells.

You, instead, have to loop over faces, retrieve the variables on the two sides, compute the flux, and spread the flux back to the two sides. It is better if your faces are already grouped by zones with uniform conditions or, at least, somehow ordered. In the first case you lift the IF to a higher level, when looping over such groups of faces. In the second case you still have the IF in the main loop, but if the order of the faces makes sense (basically the same order you would get from the groups mentioned before), the branch prediction should be fine (it will just miss 1 in every N).
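A minimal sketch of this face-based loop (the names owner/neigh and riemann_flux() are illustrative, and the flux is shown as a single scalar for brevity; in practice it is a small vector of conserved quantities):

Code:
// Loop over faces: gather the states on the two sides, compute one flux,
// then scatter it back with opposite signs to the two adjacent cells.
for(int f=0; f<nfaces; ++f)
{
    const int iL = owner[f];   // cell on one side of face f
    const int iR = neigh[f];   // cell on the other side

    const double flux = riemann_flux(U, iL, iR, nx[f], ny[f], area[f]);

    residual[iL] -= flux;      // the flux leaves one cell...
    residual[iR] += flux;      // ...and enters the other
}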

May 16, 2021, 04:20   #3
sbaffini (Paolo Lampitella)
What you can do in addition, if you are working on a coupled code, is to have the coupled variables stored together: not, say, p, u, v, w and T as separate arrays, but a single array var(nvar,ncells), so that, when you loop over the faces for the flux, you get them all at once. This, of course, makes sense because the coupled approach requires them all; it makes no sense for segregated approaches.
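A minimal sketch of that coupled layout in C++ (NVAR and the helper name are illustrative):

Code:
// var is stored as [cell0: v0..v4][cell1: v0..v4]... so the nvar variables of a
// cell sit next to each other in memory and are fetched together in the face loop.
constexpr int NVAR = 5;                          // e.g. p, u, v, w, T

inline const double* cell_vars(const double* var, int ic)
{
    return var + NVAR*ic;                        // pointer to the NVAR values of cell ic
}

// inside a face loop:
//   const double* qL = cell_vars(var, owner[f]);
//   const double* qR = cell_vars(var, neigh[f]);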

May 16, 2021, 09:29   #4
aerosayan (Sayan Bhattacharjee)
Thank you for your help and effort. You're correct that we're iterating over cell faces; what I meant to say is that I iterate over the cells and, within each cell, over all of its faces.



Quote:
Originally Posted by sbaffini View Post
This, of course, makes sense because the coupled approach requires them all; it makes no sense for segregated approaches.

I'm currently stuck on this point. I need to benchmark whether storing the variables in separate arrays, as [rho1, rho2, rho3, ...][u1, u2, u3, ...], causes any performance loss. I expect some loss at the first iteration, due to 4 cache misses when reading the four variables from 4 different arrays, but since we increment i by 1, I expect the hardware prefetchers to kick in and preload the required data, so that after the first iteration there is no further loss.


If there is no performance loss from using separate arrays to store the (rho, u, v, p) variables, or at least the performance is bound by the computational cost of the flux and residual calculations, then the final solution update becomes extremely fast.



Code:
// Solution update over all cells; U_, UN_, rcvols_ and R_ are flat 1D arrays
// of length ncells*4, and i is declared before the loop.
#pragma omp simd linear(i:1)
for(i=0; i<ncells*4; ++i)
{
    U_[i] = UN_[i] + alpha*rcvols_[i]*R_[i];
}
I need to benchmark.

May 17, 2021, 04:15   #5
sbaffini (Paolo Lampitella)
Quote:
Originally Posted by aerosayan View Post
[post #4 quoted in full]
I saw your other post with a benchmark of this (if I understand it correctly). Indeed, there is a catch to this, and testing is always necessary. For example, doing the same for gradients might make things worse rather than better. Still, I find it really surprising that a loop like this one, where you just have 4 vectors, gives you any trouble at all, or that it is your bottleneck in a single iteration/time step.

However, in my case, I either have a simpler loop, with just 2 variables, or a much more complex one, where many more variables are needed (we are talking about an implicit, preconditioned, density-based code, with porous media, etc.). In the latter case, where the cache is thrashed in any case, I guess coupling the variables just makes it less thrashed.

Also, it really depends on how much cache we are talking about. With 3 MB over the 3 levels, you should be able to work on 10 scalars with groups of around 40k elements (10 × 40,000 × 8 bytes ≈ 3.2 MB).

Finally, the idea of the coupling is that it is introduced when the slowest part of the code is one where, indeed, you need all the variables, which for me is the fluxes part. If the slowest part is the update then, as you show, coupling is probably a bad choice.

May 17, 2021, 05:35   #6
aerosayan (Sayan Bhattacharjee)
I think I might be causing undefined behavior due to incorrect use of the __restrict__ keyword.

The __restrict__ keyword means that we give the compiler a guarantee that we won't go behind its back and modify a restricted memory location through any other pointer.

However, I think I might be breaking that promise by accessing the whole memory range of U1, U2, U3, U4 through U1.

Even if the solution is correct now, the code may break in the future.

This is probably why the code behaves better when we access the memory ranges of U1, U2, U3, U4 and the others only through their own designated __restrict__ pointers.
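As an illustration of that rule (hypothetical pointers, not the actual solver code):

Code:
#include <cstddef>

// u1 and u2 are restrict-qualified, so the compiler assumes their ranges never overlap.
void update(double* __restrict__ u1, const double* __restrict__ u2, std::size_t n)
{
    for(std::size_t i=0; i<n; ++i)
        u1[i] += u2[i];    // fine: each range is touched only through its own pointer

    // u1[n] = 0.0;        // NOT fine if that address actually belongs to u2's array:
    //                     // reaching u2's memory through u1 breaks the no-aliasing
    //                     // promise and is undefined behavior.
}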


This is a pure hypothesis, but the best one I have right now.

May 17, 2021, 06:26   #7
sbaffini (Paolo Lampitella)
What about this piece of code?

https://godbolt.org/z/6e3ox4K9x

For me, it just works as expected.

May 17, 2021, 06:44   #8
aerosayan (Sayan Bhattacharjee)
Fortran looks good. It's safe and simple. C++ is being tied to a nuke and launched into the sun.


I will have to decide whether I want to write the most performance-critical sections of the computational code in Fortran 77, because that's the only option without any major side effects.


Unfortunately, most Fortran compilers are garbage, except for Intel's. As I've shown before, even GCC generates poor code and, by design, doesn't expose many compiler directives for low-level control.



I will keep trying to identify the pitfalls of C++, and document them in the meantime.


I need to set up some safety guidelines to keep from blowing my foot off, like I'm doing now.

May 17, 2021, 07:59   #9
aerosayan (Sayan Bhattacharjee)
@paolo The safest way to use restrict in C and __restrict__ in C++ seems to be when they're passed to a function. I will write a safety guideline after I optimize and verify my code. In C++, with the GCC compilers, we have the option to forcefully inline function calls using __attribute__((always_inline)). I'm thinking of keeping my array pointers unrestricted and, in very performance-critical sections, passing those pointers as restricted pointers to a function that will be forcefully inlined.


Code:
// Function to update the solution.
// Note: GCC wants the attribute before the definition (not after the parameter
// list), and always_inline is normally paired with the inline keyword.
__attribute__((always_inline)) inline
void update_soln(double * __restrict__ u, double * __restrict__ un,
                 double * __restrict__ rcvols, double * __restrict__ r,
                 const size_t ncellsx4)
{
  for(size_t i=0; i<ncellsx4; ++i)
  {
     // similar code as before
  }
}
This is just an idea, and I haven't tested it out, but this seems to be the safest option. We get the performance benefit of the restricted pointers, and there shouldn't be performance degradation due to function calls, as the function will be forcefully inlined. Best of both worlds.


Since, in general, less than 5% of the code is responsible for the majority of the execution time, it's not an issue if the remaining 95% of the code operates sub-optimally.


So, for the remaining 95% of the code, I have no issue with the pointers being unrestricted; in fact, I want them to be unrestricted, because that gives me more flexibility and some assurance that we aren't messing things up by accident.


In explicit solvers, the hottest sections of the code seem to be the flux calculations and the solution-update procedure.

So, the __restrict__ keyword is absolutely important to get any kind of respectable performance in C++.

I'm currently reading Mike Acton's write-ups on game engine development, and he mentions very good methods for improving performance: https://archive.ph/3zi3M

May 17, 2021, 14:07   #10
aerosayan (Sayan Bhattacharjee)
@paolo Someone showed me why the L3 cache performance keeps degrading in the code.

It's actually not degrading. It's probably converging to the actual expected performance.

When I increase the number of outer iterations, the L3 cache miss rate converges to a steady 25%. This made me very curious.




This is apparently what theory predicts. I have a 3 MB L3 cache. Each of our 4 arrays (u, un, rcvols, r) is around 1 MB in size, so all 4 arrays can't fit in the 3 MB L3 cache at once.

Since the data can't fit in the L3 cache, it keeps getting evicted and re-read continuously. Since our computation is not complex, our triple-nested loop, totalling 500*4*130,000 iterations, completes in only 0.7 seconds.

130,000 is actually equal to NCELLS*4. 500 is the total number of outer-loop iterations. 4 is the number of mstage iterations.

130,000 iterations of the sample code complete in approximately 0.7/500/4 seconds.

So, the L3 cache keeps getting filled up and evicted, continuously and repeatedly, as we go through the 500*4*130,000 iterations.

And our L3 cache miss rate converges to the expected 25%, since we only have 3 MB of L3 cache and the data used by the code is about 4 MB.

The extra ~1 MB is probably what causes the misses.
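A quick back-of-envelope check (using the sizes above): 4 arrays × 130,000 doubles × 8 bytes ≈ 4.2 MB of data streamed per sweep, against 3 MB of L3, so roughly a quarter of it can never stay resident, which is consistent with the measured ~25% miss rate.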


And when we reduced the iteration count, we saw a performance improvement, probably because all 4 of the arrays could then fit in the L3 cache and we didn't have the expected loss. Which is what we observe by iterating over only NCELLS elements at a time.

This is probably what happens.

Last edited by aerosayan; May 17, 2021 at 14:14. Reason: formatting is hard

May 18, 2021, 03:40   #11
sbaffini (Paolo Lampitella)
Quote:
Originally Posted by aerosayan View Post
[post #10 quoted in full]
I was thinking about this post you wrote, and something didn't sound correct... until I realized how much confusion I had made.

In your case it is, indeed, obvious that your first scenario is worse, as you are just working on larger vectors, as opposed to the second scenario, where each vector is 4 times smaller. What I didn't consider is that in the second scenario you just have 4 different loops. But, as we said, this is only possible for the update case. When you are computing fluxes, or doing linear iterations for implicit cases, you need all the variables at once and you can't really split the loop in 4. It is in this scenario that coupling the variables makes sense.

If you look at what I wrote in the Fortran example, that's indeed the scenario I considered.

So, yes, coupling has some limitations:

- larger vectors, which can eventually thrash the cache (gradient computation, for example, is one place where you must be careful)
- a net performance degradation in cases where variables could be used alone (I/O and some related MPI communications incur a net penalty, as you typically want to write variables one at a time)

but has some clear advantages:

- a net speedup of the loops where all the variables are needed together, which are typically the most costly (e.g., fluxes, linear iterations, boundary conditions)

- very concise and reusable code that I wouldn't exchange even for a 50% performance boost

Note, however, that the risk of thrashing the cache when using larger vectors is more than compensated for in those cases where the loop itself might lead to cache misses (as you now do it once instead of N times). This is, again, certainly the case for fluxes.

Of course, again, all of this makes sense only after profiling.

May 18, 2021, 03:59   #12
aerosayan (Sayan Bhattacharjee)
Thank you for your time and effort; you have made some really helpful points about how CFD solvers are traditionally written. They will be helpful for guiding my designs.

Quote:
Originally Posted by sbaffini View Post
very concise and reusable code that I wouldn't exchange even for a 50% performance boost.

How about for an approximately 300% performance boost *per thread*?

Yesterday I figured out how to SIMD-vectorize the van Leer flux.

I don't know if the practical performance will be high. One battle at a time.
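Not my final code, but a minimal sketch (per-face array names are hypothetical) of how the positive van Leer split mass flux can be written branch-light so the compiler can vectorize it:

Code:
// Positive split mass flux per face: the supersonic/subsonic branches are folded
// into selects, so the simd pragma can vectorize the loop.
// rho, a (sound speed), mach and fmass_p are per-face 1D arrays.
#pragma omp simd
for(int f=0; f<nfaces; ++f)
{
    const double m   = mach[f];
    const double sub = 0.25*rho[f]*a[f]*(m + 1.0)*(m + 1.0);   // |M| <= 1
    const double sup = rho[f]*a[f]*m;                          // M >= 1: full mass flux
    fmass_p[f] = (m >= 1.0) ? sup : (m <= -1.0) ? 0.0 : sub;   // M <= -1: zero
}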

Tags
flux calculation, optimization

