solver-dev : which variables to store in memory and which to compute on the fly?

aerosayan · November 22, 2020, 10:27

Loading data from memory can become costly due to cache misses.

In some cases it makes sense to calculate some variables on the fly and reduce memory pressure.

In your personal experience, which variables do you pre-compute and store in memory and which ones do you compute on the fly?

I'm currently pre-computing the surface normals, cell volumes, face areas, and storing them in memory. Each array contains NCELLS double/single precision values. However this seems like a seriously bad idea since I could've saved space for NCELLS*3 more cells.

What's the right choice?

sbaffini · November 22, 2020, 12:16

A typical approach is to store the face normals with magnitude, so that you don't store the face area separately.

Besides this, I second the store-less-and-compute-more approach. Still, there are not really that many places where you can use it.

It really depends on the effort involved in computing a variable and on how much you reuse it. Areas and volumes are typically stored. Mass flow through faces is typically stored as well.
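As a rough illustration of the "normal with magnitude" idea, here is a minimal Python sketch for a 2D edge. The function names are hypothetical, and it assumes the cell's nodes are ordered counter-clockwise so that the rotated edge vector points outward. The stored vector has the same length as the edge, so the face area never needs its own array.

```python
import math

def edge_normal_with_magnitude(p0, p1):
    """Return the normal of the 2D edge p0 -> p1, scaled by the edge length.

    Rotating the edge vector (dx, dy) by -90 degrees gives (dy, -dx), which
    is perpendicular to the edge and has the same length as the edge.
    For counter-clockwise node ordering (assumed) it points out of the cell.
    """
    dx = p1[0] - p0[0]
    dy = p1[1] - p0[1]
    return (dy, -dx)

def face_area(normal):
    """Recover the face area (edge length in 2D) from the stored normal."""
    return math.hypot(normal[0], normal[1])

n = edge_normal_with_magnitude((0.0, 0.0), (3.0, 4.0))
print(n, face_area(n))
```

Most flux expressions need the product of unit normal and area anyway, so the scaled vector can usually be used directly without ever taking the square root.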

aerosayan · November 22, 2020, 12:58

Quote:

Originally Posted by sbaffini

A typical approach is to store the face normals with magnitude, so that you don't store the face area separately. .... Mass flow through faces is typically stored as well.

Thanks for the help.

How do you generally store the flux values?

To preserve continuity, the flux going through a face will be positive in one cell and negative in the neighbor cell. I haven't thought too much about it, but that is definitely going to cause lots of branch mispredictions when figuring out whether the flux will be positive or negative. So, multiplying with the unit surface normal (where surface normal is from the cell with high index to cell with low index) would actually improve performance.

So, wouldn't it be better to store the face normals simply as +ve or -ve unit vectors and not with their magnitude?

Also, I was planning to store all of my flux values in three 1D arrays (for three faces of each cell) of length NCELLS. However the gather and scatter operations to write/read into the three separate arrays would absolutely kill performance.

I think I can get away with only saving the residual as SUM(face_flux * face_area)

sbaffini · November 22, 2020, 13:47

First of all, let me clarify that I was referring to an unstructured code. The fact that you mention 3 faces (directions?) makes me think you are actually dealing with a structured code. In this case, I have little knowledge of the common approaches to improve performance. Surely, there might be some trick related to areas and volumes, which could be simpler to compute.

Besides this, the mass flux is stored with its own sign, which is the one of the face normal... this just flows smoothly in code, nothing to worry about usually.

Also, you need the mass flow for the convection scheme of other scalars, and there you typically preload variables from both sides of a face and feed the convection scheme subroutine. I wouldn't rely on performance gains from branch prediction of upwind schemes. Also, as I mentioned in a previous post, I feel you are over-optimizing things, while you should rely on profiling the whole code to guide you.

aerosayan · November 22, 2020, 15:01

Quote:

Originally Posted by sbaffini

First of all, let me clarify that I was referring to an unstructured code. The fact that you mention 3 faces (directions?) makes me think you are actually dealing with a structured code. In this case, I have little knowledge of the common approaches to improve performance. Surely, there might be some trick related to areas and volumes, which could be simpler to compute.

Besides this, the mass flux is stored with its own sign, which is the one of the face normal... this just flows smoothly in code, nothing to worry about usually.

Also, you need the mass flow for the convection scheme of other scalars, and there you typically preload variables from both sides of a face and feed the convection scheme subroutine. I wouldn't rely on performance gains from branch prediction of upwind schemes. Also, as I mentioned in a previous post, I feel you are over-optimizing things, while you should rely on profiling the whole code to guide you.

I'm currently working on an unstructured grid (hence the 3 faces of a triangle cell).

Also, I personally prefer not to leave optimization until the end, since most of the mistakes are made in the initial stages of development. For example: Fortran, despite being the language for high-performance numerical computation, still doesn't have any method to enforce memory alignment (AFAIK). We have to call a C function to allocate and align the memory for us.

sbaffini · November 22, 2020, 16:09

Then I don't think I get your reasoning on the face normal. You still need 3 components (in 3D), and the face area is just a sqrt away instead of being stored separately. Also, you typically need n together with the area, so you shouldn't need to compute the face area often, if at all.

Note that the mass flux (the only one I suggest storing, because you reuse it a lot for other equations) is then already of the correct sign and includes the area from n.

But don't store it for cells!!! It belongs to faces, otherwise you end up storing it twice.
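A minimal Python sketch of face-based storage may make this concrete. It assumes a hypothetical owner/neighbor face numbering in which interior faces come first and each normal points from the owner cell to the neighbor cell. None of these names come from the thread. Each flux is stored once per face, and the sign is handled by adding it to one cell and subtracting it from the other, without any sign test.

```python
def accumulate_residuals(n_cells, owner, neighbor, face_flux):
    """Sum signed face fluxes into per-cell residuals.

    owner[f]    : cell the normal of face f points away from (all faces)
    neighbor[f] : cell the normal points into (interior faces only,
                  assumed to be numbered before the boundary faces)
    face_flux[f]: flux through face f, signed along the face normal
    """
    residual = [0.0] * n_cells
    n_interior = len(neighbor)

    # Interior faces: what leaves the owner enters the neighbor.
    for f in range(n_interior):
        flux = face_flux[f]
        residual[owner[f]] += flux
        residual[neighbor[f]] -= flux

    # Boundary faces have an owner only.
    for f in range(n_interior, len(face_flux)):
        residual[owner[f]] += face_flux[f]

    return residual

# Two cells sharing one interior face, plus one boundary face each.
print(accumulate_residuals(2, [0, 0, 1], [1], [2.0, -0.5, 1.0]))
```

With this layout the interior contributions cancel exactly when summed over all cells, which is the discrete conservation property discussed above.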

sbaffini · November 22, 2020, 16:13

At this point it is also important what kind of solver you are developing. Pressure based or density based? Explicit or implicit? Which algorithm?

Because some of them have their peculiarities

aerosayan · November 22, 2020, 17:19

The solver would be a steady state 2D Compressible Euler solver based on Cell Centred Explicit FVM discretization and first order accurate Van Leer flux splitting for solution of the Riemann problem. M-stage update procedure is used along with local CFL condition based time stepping. No turbulence models are implemented.
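For readers unfamiliar with local time stepping, the following Python sketch shows one common form of the CFL-based local time step for a single cell. The thread does not say which definition the solver actually uses, so the formula, the function name, and the arguments are assumptions for illustration only.

```python
import math

def local_time_step(cfl, volume, gamma, rho, u, v, p, face_normals):
    """Local explicit time step of one cell from a CFL condition.

    face_normals holds the cell's face normals scaled by face length.
    Each face contributes the spectral radius |V.n| + c*|n|, where c is
    the speed of sound of an ideal gas.
    """
    c = math.sqrt(gamma * p / rho)
    spectral_sum = 0.0
    for nx, ny in face_normals:
        spectral_sum += abs(u * nx + v * ny) + c * math.hypot(nx, ny)
    return cfl * volume / spectral_sum

# Right triangle with unit legs, fluid at rest, speed of sound equal to 1.
normals = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
print(local_time_step(0.8, 0.5, 1.4, 1.4, 0.0, 0.0, 1.0, normals))
```

Because each cell advances with its own time step, this is only suitable for steady-state problems, where the transient path to convergence does not matter.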

I was initially planning to implement an Implicit solver, but I figured that I should really re-write the explicit solver to use OpenMP parallelization and actually make something that's scalable and performs well for the cases where explicit solvers are a must.

Quote:

Originally Posted by sbaffini

But don't store it for cells!!! It belongs to faces, otherwise you end up storing it twice.

Makes sense. I wasn't thinking correctly about memory consumption as the number of cells scales up.

sbaffini · November 22, 2020, 17:48

Then probably the mass flux is not really needed yet.

But, as an example of what I said about overkill optimizations, how do you store your variables (p,u,v,t I guess)? It may make sense to, say, store them in a single array nvar x ncells instead of nvar distinct vectors of ncells.
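The index arithmetic behind the single nvar x ncells array can be sketched as follows. Python lists are used here only to show the mapping, and the helper names are hypothetical. In Fortran the same layout would correspond to an array dimensioned (nvar, ncells), where the first index varies fastest.

```python
def pack_interleaved(fields):
    """Interleave nvar separate per-cell lists into one flat list.

    Element (var, cell) ends up at flat index cell * nvar + var, so all
    variables belonging to one cell sit next to each other.
    """
    nvar = len(fields)
    ncells = len(fields[0])
    flat = [0.0] * (nvar * ncells)
    for var, field in enumerate(fields):
        for cell, value in enumerate(field):
            flat[cell * nvar + var] = value
    return flat

def cell_state(flat, nvar, cell):
    """Return all variables of one cell as one contiguous slice."""
    return flat[cell * nvar:(cell + 1) * nvar]

p = [1.0, 2.0]
u = [10.0, 20.0]
v = [100.0, 200.0]
q = pack_interleaved([p, u, v])
print(q)
print(cell_state(q, 3, 1))
```

Whether this layout is actually faster depends on the access pattern: a loop that needs every variable of a cell touches one contiguous block, while a loop over a single variable now strides through memory.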

flotus1 · November 22, 2020, 18:21

Do I smell premature performance optimization based on guesses rather than profiling?

On a more serious note:

Quote:

since I could've saved space for NCELLS*3

Does memory consumption really matter for a 2D code? RAM is cheap, your time isn't.

aerosayan · November 23, 2020, 05:39

Quote:

Originally Posted by flotus1

On a more serious note:

Does memory consumption really matter for a 2D code? RAM is cheap, your time isn't.

"...and I took that personally" -- Michael Jordan

I politely disagree. Actually it does matter. The solver only works for triangle grids. In order to get a good solution, the number of cells needs to be cranked up. I work intentionally with only 4GB RAM to force myself to write better code.

The idea of "RAM is cheap" is something that inherently hurts everyone using the software. RAM is cheap when we're running the code on our own machines, but not so cheap when we decide to run it on a paid cloud server, or when we're assigned limited system resources by our university system admin.

sbaffini · November 23, 2020, 06:01

Let me tell you, it's not that you are wrong, in any of your statements or approach, just that it is like putting the cart before the horse.

Notably, there are things, even trivial, that a profiler can't catch. But once the obvious ones are taken into account (column/row major, proper algorithm choice, no bad scaling allocations), it can actually give you a lot of insight on things you wouldn't even know.

It is good to give proper thought to how you write things, but it is wrong to become attached to any of your pieces of code. My current URANS code for unstructured grids is around 25k lines of code, but the commit history says that I changed around 350k lines of code. If code is alive, it will constantly change, and you will have to adapt to it.

Obviously I don't know the state of your code but, if you don't have a working code yet, with all the planned features in place, you are, in my opinion, doing this wrong. If you do, I will post a nude if any of this is your major bottleneck.

For example, you could have spent time on using MPI, which is way simpler than these micro-optimizations.

flotus1 · November 23, 2020, 06:24

Quote:

"...and I took that personally" -- Michael Jordan

My intention really was not to offend you with anything I wrote.
Instead, it was an attempt to make you take a step back and consider the bigger picture. It is far too easy to get lost in all those micro-level performance details. We have all been there at some point.
I could not agree more with sbaffini: with the most obvious performance bottlenecks out of the way, it is time for profiling your code. Assumptions about performance impacts on this level will be wrong, no matter the level of expertise. This will of course force you to write a somewhat functional version of your code first, and then make changes where necessary, instead of getting bogged down by all the "what if I did this instead" decisions before you approach a working solver.

About the memory usage though: if you want to restrict it as much as possible as an exercise, that's entirely up to you.
What I was trying to get across: with a 2D unstructured tria solver, there is room for 10+ million cells in 4GB of RAM. Even if you treat RAM as a cheap resource.
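A back-of-the-envelope estimate supports this figure. The Python sketch below uses assumed storage choices that are not taken from the thread: four state variables plus four residuals, a volume and a local time step per cell, area-weighted normals and owner/neighbor indices per face, and the usual ratios of roughly 1.5 faces and 0.5 nodes per triangle.

```python
def bytes_per_cell(n_state=4, real_bytes=8, int_bytes=4):
    """Rough memory footprint per triangle cell under assumed storage."""
    faces_per_cell = 1.5  # 3 faces per triangle, each shared by 2 cells
    nodes_per_cell = 0.5  # large triangle meshes have about 2 cells per node

    # state + residual + volume + local time step
    cell_data = (2 * n_state + 2) * real_bytes
    # 3 node indices per cell
    cell_connectivity = 3 * int_bytes
    # 2 normal components + owner/neighbor indices per face
    face_data = faces_per_cell * (2 * real_bytes + 2 * int_bytes)
    # x, y coordinates per node
    node_data = nodes_per_cell * 2 * real_bytes

    return cell_data + cell_connectivity + face_data + node_data

def mesh_memory_gb(n_cells, **kwargs):
    """Total footprint in GB (10^9 bytes) for n_cells triangles."""
    return n_cells * bytes_per_cell(**kwargs) / 1e9

print(bytes_per_cell())
print(mesh_memory_gb(10_000_000))
```

Under these assumptions a cell costs on the order of 136 bytes in double precision, so 10 million cells need well under 2 GB, even before any effort is spent on trimming arrays.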

aerosayan · November 23, 2020, 06:31

Quote:

Originally Posted by flotus1

My intention really was not to offend you with anything I wrote.

Not offended at all my good sir. It's a meme.

https://youtu.be/m38XhQSf1oU

Appreciate all of your help.

aerosayan · November 23, 2020, 08:22

Quote:

Originally Posted by sbaffini

I will post a nude if any of this is your major bottleneck

LOL hilarious

arjun · November 23, 2020, 08:42

Quote:

Originally Posted by aerosayan

The idea of "RAM is cheap" is something that inherently hurts everyone using the software.

Why drag everyone into this and speak on their behalf?
For example, I have been writing codes for the last 20 years and have written Navier-Stokes solvers many times. I am not in the category of people who are hurt by the "RAM is cheap" mantra.

In fact, I am of the opinion that one should store something if it saves the cost of calculating it.
Your optimized code is as good as my unoptimized code when I do not compute things and just store most of them. Things like sqrt, pow, exp etc. take time to calculate.

I never bother with optimization beyond what the -O3 flag can do. For most people who write serious code, maintenance of the code is the most important issue. Low-level personal optimization, say at the assembly level, only leads to an unmanageable nightmare.

My 2 cents.

aerosayan · November 23, 2020, 18:24

Quote:

Originally Posted by flotus1

What I was trying to get across: with a 2D unstructured tria solver, there is room for 10+ million cells in 4GB of RAM. Even if you treat RAM as a cheap resource.

You're correct.
