CFD Online Discussion Forums - Main CFD Forum (https://www.cfd-online.com/Forums/main/)
Thread: how to develop code that scales well without having access to supercomputers?
(https://www.cfd-online.com/Forums/main/241170-how-develop-code-scales-well-without-having-access-supercomputers.html)

aerosayan February 11, 2022 14:17

how to develop code that scales well without having access to supercomputers?
 
Hi everyone, hope you had a great start to the new year.

I'm trying to understand how we can actually develop code that scales to large problem sizes (for example, ~100 million cells for LES combustion) without having access to supercomputer clusters.

This is obviously not verifiable right now: without actually doing the scaling test runs, I won't be able to guarantee that my code scales well.

However, I can still follow some "best practices", if you will, so as not to mess things up too badly.

One of the things I try to follow now is to reduce memory consumption. As an example, my grid generation code consumes only 20 MB of RAM for 1 million quad-tree cells at the highest refinement level of the grid. So I can guarantee that 100 million cells will consume about 2,000 MB, which is feasible even on a cheap laptop.

How can we give similarly strong guarantees for computational performance and scaling?

What steps can we take to ensure our code scales well?

Tech stack used: FORTRAN-2008, C++, OpenMP, MPI.

flotus1 February 11, 2022 14:47

The worst offenders that kill strong scaling on distributed memory systems are latency and transfer volume, in that order.

To reduce latency issues as much as possible, make sure you pack/group your MPI transfers. Do as few MPI transfers as possible, and make them as large as possible.
E.g. a cell needs a value from a neighbor on a different MPI rank. Then don't transfer this single value when it's needed; instead, transfer all values of this type for all boundary cells in a single MPI transfer. And use non-blocking MPI routines.
Reducing transfer volume should be straightforward. Only transfer the minimum amount of data necessary. This needs some planning ahead, before the first line of code is written.
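
A minimal sketch of the packed, non-blocking halo exchange described above (C++ with MPI; the Neighbor layout and all names are assumptions for illustration, not code from any real solver):

Code:

// sketch_halo_exchange.cpp : illustrative only, not code from the thread.
// One message per neighboring rank: all boundary-cell values going to that
// rank are packed into a single buffer and sent with non-blocking calls.
#include <mpi.h>
#include <vector>

struct Neighbor {
    int rank;                       // neighboring MPI rank
    std::vector<int> send_cells;    // local cells whose values it needs
    std::vector<double> send_buf;   // packed values to send
    std::vector<double> recv_buf;   // assumed pre-sized to the number of ghost
                                    // values expected from this neighbor
};

void exchange_halo(std::vector<double> &field,
                   std::vector<Neighbor> &neighbors, MPI_Comm comm)
{
    std::vector<MPI_Request> reqs;
    reqs.reserve(2 * neighbors.size());

    for (Neighbor &nb : neighbors)
    {
        // Pack everything this neighbor needs into one contiguous buffer.
        nb.send_buf.resize(nb.send_cells.size());
        for (std::size_t i = 0; i < nb.send_cells.size(); ++i)
            nb.send_buf[i] = field[nb.send_cells[i]];

        reqs.emplace_back();
        MPI_Irecv(nb.recv_buf.data(), (int)nb.recv_buf.size(), MPI_DOUBLE,
                  nb.rank, 0, comm, &reqs.back());
        reqs.emplace_back();
        MPI_Isend(nb.send_buf.data(), (int)nb.send_buf.size(), MPI_DOUBLE,
                  nb.rank, 0, comm, &reqs.back());
    }

    // One wait for all messages; unpacking recv_buf into ghost cells follows.
    MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
}

The same pattern extends to exchanging several variables at once by packing them into the same buffer, which reduces the number of messages even further.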

These are the low-hanging fruit for strong scaling.
If you need even more, you can look into interleaving communication and computation. And of course, hybrid parallelization with both OpenMP and MPI.

Last but not least, the easiest trick for great-looking strong scaling graphs: have a slow code, maybe even sprinkle in some additional computations :D

sbaffini February 11, 2022 17:42

For memory, do not allocate anything that scales with the total number of processes (nor with the total problem size, of course). Likewise, for algorithms, avoid anything whose cost scales linearly or worse with the number of processes (or with the total problem size).

Use non-blocking communications, possibly interleaving communication with computation.

Do not use collective communications in the main loop; ideally, do not use them anywhere at all.

Then optimize local memory access so that the code becomes compute bound.
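
To make the point about collectives concrete, here is a small sketch (all names and numbers are made up, not anyone's actual solver) of keeping a blocking reduction out of every iteration of the main loop: the convergence check runs only every few iterations and uses a non-blocking MPI_Iallreduce, so local work could overlap it.

Code:

// sketch_collectives.cpp : illustrative only, not code from the thread.
#include <mpi.h>
#include <cmath>

// Placeholder for the real per-rank work; returns the local squared residual.
double do_local_sweep() { return 1.0e-3; }

void iterate(MPI_Comm comm)
{
    const int    max_iters   = 1000;
    const int    check_every = 10;     // how often to test convergence
    const double tol         = 1.0e-8;

    for (int it = 0; it < max_iters; ++it)
    {
        double local_res = do_local_sweep();

        if (it % check_every == 0)     // collective only every few iterations
        {
            double global_res = 0.0;
            MPI_Request req;
            MPI_Iallreduce(&local_res, &global_res, 1, MPI_DOUBLE,
                           MPI_SUM, comm, &req);
            // ... more local work could overlap the reduction here ...
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            if (std::sqrt(global_res) < tol) break;
        }
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    iterate(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}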

aerosayan February 12, 2022 05:18

Quote:

Originally Posted by sbaffini (Post 822138)
For memory, do not allocate anything that scales with the total number of processes (nor with the total problem size, of course). Likewise, for algorithms, avoid anything whose cost scales linearly or worse with the number of processes (or with the total problem size).


Could you please explain this?


I kind of understood what you meant: we shouldn't allocate, say, 100M cells' worth of data on a single node. If possible, we should try to reduce memory consumption either by finding a constant-memory implementation of our algorithms, or by using a smaller working buffer that won't be too large.

For example: instead of allocating a temporary array of 100M integers, using it, and then deallocating it, we should try to work on smaller chunks of the data ... say 1M integers.

Did I understand you correctly?

andy_ February 12, 2022 06:06

Quote:

Originally Posted by aerosayan (Post 822130)
What steps can we take to ensure our code scales well?

Overall efficiency varies strongly with problem size, properties of the parallel hardware and the algorithm details. This means that in general one cannot have a single fixed approach that will be highly efficient across all problem sizes and different types of hardware. High overall efficiency requires tuning and adapting to the hardware being used.

If you don't have access to a range of supercomputers then developing code that can adapt well to the actual hardware is going to be difficult due to an inability to test reliably. Nonetheless, if one accepts that adapting will be required, then one can organise the code so that the kernel algorithms change with the degree of parallelisation, message sizes change with the relative CPU and communication speeds, calculations are duplicated rather than results passed when that is faster, etc...

When I was most active in this area in the 80s and 90s I learnt that one cannot trust the manufacturers of parallel computers to delete benchmarking code one might derive by cutting down production code in order to assess real performance. There is a tendency to keep useful benchmarking code and for it to migrate to other manufacturers when people change jobs. After experiencing this first hand I switched to relying more on the sets of NAS parallel benchmarks and learning how their performance translated to the performance of our codes. This latter approach may be useful for those in your position without current access to hardware.

sbaffini February 12, 2022 06:33

Quote:

Originally Posted by aerosayan (Post 822159)
Could you please explain this?

I kind of understood what you meant: we shouldn't allocate, say, 100M cells' worth of data on a single node. If possible, we should try to reduce memory consumption either by finding a constant-memory implementation of our algorithms, or by using a smaller working buffer that won't be too large.

For example: instead of allocating a temporary array of 100M integers, using it, and then deallocating it, we should try to work on smaller chunks of the data ... say 1M integers.

Did I understand you correctly?

First of all, in case of doubt, let me clarify that this really applies to distributed computing. Let's say that the total problem size is N and the total number of processes is P.

Memory-wise, you should not allocate anything of size N, only of size N/P. That is the theory; in practice you could get away with one process per node allocating a single array of size N at a time, but that won't scale in the end.

The same also holds for anything of size P, with a single vector of size P per node still being practically feasible, but again not scalable in the end.

Your algorithms, as well, should have a cost of order N/P. And you should use communication patterns that cost at most log(P), not more.

However, I'm not saying this is easy or obvious, so in the end you probably won't do all of this at once until you actually hit a limit.
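
A toy illustration of these rules (a 1D block decomposition; sizes and names are made up): each rank allocates only its N/P share, and "who owns global cell g" is computed in O(1) instead of being stored in a size-N array on every rank.

Code:

// sketch_partition.cpp : illustrative only.
#include <mpi.h>
#include <vector>
#include <algorithm>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long long N     = 100000000LL;               // total cells, never allocated
    const long long chunk = (N + nprocs - 1) / nprocs;
    const long long first = rank * chunk;                        // first owned cell
    const long long last  = std::min(N, first + chunk);          // one past last owned
    const long long nloc  = std::max(0LL, last - first);

    // Only the local share is allocated: O(N/P) memory per rank.
    std::vector<double> u(nloc, 0.0);

    // Owner of a global cell g in O(1); no size-N lookup table needed.
    auto owner = [&](long long g) { return static_cast<int>(g / chunk); };

    if (rank == 0)
        std::printf("owner of cell 12345678 is rank %d\n", owner(12345678LL));

    MPI_Finalize();
    return 0;
}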

sbaffini February 12, 2022 08:46

Let me give you a practical example.

When you need to write a solution to file you can proceed in 3 different ways (according to my experience):

1) Each process writes its own file; then one single process allocates a single array of size N, fills it by reading the files of the other processes, and finally writes it to a single file. This actually works surprisingly well up to a certain point, then it doesn't anymore. The problem here is that a single process has to read the whole size-N array from P files, hold it in memory, and finally write it back to file.

2) A single process collects all the data directly through mpi_recv, so no intermediate files are involved, and writes the whole array to file. Again, you have communication costs scaling with P and an order-N allocation/write. Surprisingly, this tends to work slightly worse than 1 up to a certain non-negligible size, then it is slightly better.

3) Use MPI IO, which was developed exactly for this. Notably, it has a somewhat large overhead and won't be the preferred choice for small/medium-sized jobs, but it is a clear winner for larger jobs.
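
A minimal sketch of option 3, i.e. a collective MPI IO write where no rank ever holds the full size-N array (the file name, sizes and layout are placeholders, not anyone's actual I/O layer):

Code:

// sketch_mpiio.cpp : illustrative only.
#include <mpi.h>
#include <vector>
#include <algorithm>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long long N      = 1000000;                   // global size (example)
    const long long chunk  = (N + nprocs - 1) / nprocs;
    const long long first  = rank * chunk;
    const long long nlocal = std::max(0LL, std::min(chunk, N - first));

    std::vector<double> u(nlocal, double(rank));        // this rank's block only

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "solution.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Each rank writes its contiguous block at the right byte offset; the
    // *_all variant is collective, letting MPI aggregate the requests.
    MPI_File_write_at_all(fh, (MPI_Offset)(first * sizeof(double)),
                          u.data(), (int)u.size(), MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}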

So, in practice, the ideal code should have a way to test the machine and determine the best approach for each task at hand.

For reading, in contrast, I have found that, with the proper tricks (i.e., skipping whole file blocks that don't contain what a process needs), all the processes can read from the same file with very little overhead. However, MPI IO is there for this as well.

Another scenario where it is common to make mistakes is when a process needs to know which processes own its halo cells. I've seen a lot of horrors here, from each process holding the whole size-N ownership array, to ring communication patterns (where each process handles the requests of every other process). A better approach here is to split the size-N ownership array evenly across all processes and either use one-sided reads or a couple of mpi_alltoall/v calls. Even writing the array to file and letting each process read its part might work up to a certain size.
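
A sketch of that distributed-ownership idea using one-sided reads: the cell-to-owner table is split evenly across ranks, and a process fetches single entries with MPI_Get. The mpi_alltoall/v variant would batch the queries instead; everything below (sizes, names, the dummy table) is purely illustrative.

Code:

// sketch_owner_lookup.cpp : illustrative only.
#include <mpi.h>
#include <vector>
#include <algorithm>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long long N     = 1000;                       // global cell count (toy size)
    const long long chunk = (N + nprocs - 1) / nprocs;
    const long long first = rank * chunk;
    const long long nloc  = std::max(0LL, std::min(chunk, N - first));

    // My slice of the distributed "cell -> owning rank" table. In a real code
    // this comes from the partitioner; here it's filled with a dummy value.
    std::vector<int> my_slice(nloc, rank);

    MPI_Win win;
    MPI_Win_create(my_slice.data(), nloc * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    // Look up the owner of one (hypothetical) halo cell.
    long long ghost = (first + N / 2) % N;              // some remote global id
    int target      = (int)(ghost / chunk);             // rank storing that entry
    MPI_Aint disp   = (MPI_Aint)(ghost - (long long)target * chunk);
    int owner       = -1;

    MPI_Win_fence(0, win);
    MPI_Get(&owner, 1, MPI_INT, target, disp, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    std::printf("rank %d: owner of global cell %lld is rank %d\n",
                rank, ghost, owner);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}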

In the end, the reasoning is very simple: if a process has to do the whole work, store the whole problem, or interact with all the other processes, then it won't scale. But as the best solution might be difficult to implement, you practically won't implement it until you actually have a problem.

aerosayan February 12, 2022 08:48

Thanks! This is very easy to understand now! :)

aerosayan May 21, 2022 18:23

Quote:

Originally Posted by sbaffini (Post 822166)
Memory-wise, you should not allocate anything of size N, only of size N/P. That is the theory; in practice you could get away with one process per node allocating a single array of size N at a time, but that won't scale in the end.

I hit this limit. When the problem size becomes too large, we need to distribute the data to multiple cores.

For explicit codes, it's easy to just partition the grid and distribute each partition to a different core. We would just need to transfer boundary data between the partitions.

Not sure if that is how it's done for parallelizing implicit codes too.

In theory we should be able to do that, right?

Most examples of MPI just show the parallelization of specific algorithms like Gauss-Seidel, and not how real industrial CFD codes are designed. So my assumption about how to code in MPI is probably wrong. Surprisingly, OpenMP and CUDA examples also parallelize specific algorithms like linear solvers, matrix-matrix multiplication, etc., and I don't think the same mental model would work well for MPI.

My current assumption that I think might work for MPI is: for implicit solvers, just treat each partition as a small subdomain (like we do for explicit solvers), then exchange boundary information between partitions, calculate the fluxes, formulate the Ax=b set of linear equations from the cells in each partition, solve the linear system separately on each core, and finally update the solution.

To each MPI process, the code looks single threaded for the most part. We just exchange boundary information. We don't parallelize the linear solvers like those shown in the examples.

Is this the correct approach?

Not sure if this is the correct approach, since I've never done it before, but I think it should work.

aerosayan May 21, 2022 18:30

1 Attachment(s)
Basically what I'm thinking ...

1. fork once
2. do work on each partition completely separately on each core (other than sharing boundary data),
3. join back at the end of the simulation

For example: in the attached image, five independent processes would work as if they were single threaded.

Is this correct?

sbaffini May 21, 2022 18:44

It is the correct approach only in that you start by splitting the domain just like in the explicit case.

You made a wrong assumption in thinking that you can solve each sub-problem independently, because you can't.

But the solution is very simple: you just need to exchange the system solution like your other variables, and you need to do that after every iteration of the linear solver.

This is still not exactly like solving on a single process, so it is typically named block Jacobi (because each partition sees the others at the previous iteration, like in the Jacobi method), but it works.
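
A minimal, self-contained sketch of the block Jacobi idea on a toy 1D tridiagonal system (not anyone's production code): the halo values of the solution are refreshed before every sweep, while the local sweep itself is Gauss-Seidel. A real code would use the packed non-blocking exchange discussed earlier instead of the blocking MPI_Sendrecv used here for brevity.

Code:

// sketch_block_jacobi.cpp : illustrative only. Solves
// -u[i-1] + 2u[i] - u[i+1] = b[i], split row-wise across ranks.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int nloc = 100;                         // rows owned by this rank
    std::vector<double> x(nloc + 2, 0.0);         // x[0], x[nloc+1] are ghosts
    std::vector<double> b(nloc + 2, 1.0);

    const int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    const int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int it = 0; it < 500; ++it)
    {
        // Exchange the current solution at the partition boundaries.
        MPI_Sendrecv(&x[1],        1, MPI_DOUBLE, left,  0,
                     &x[nloc + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&x[nloc],     1, MPI_DOUBLE, right, 1,
                     &x[0],        1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Local Gauss-Seidel sweep on owned rows; the ghost values stay one
        // iteration old, which is exactly what makes this "block Jacobi".
        for (int i = 1; i <= nloc; ++i)
            x[i] = 0.5 * (b[i] + x[i - 1] + x[i + 1]);
    }

    if (rank == 0) std::printf("done, x[1] = %f\n", x[1]);
    MPI_Finalize();
    return 0;
}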

For a Gauss-Seidel-like method, this is basically all you need to know. For Krylov methods like GMRES, the above description typically applies to the preconditioner, while the rest of the algorithm has a straightforward extension to the parallel case, as it just involves matrix-vector products (yet it also involves several allreduce operations, each of which costs just like an mpi_barrier).

Algebraic multigrid methods are on a completely different level of parallel complexity, as you need to exchange different data at different grid levels. But at each level they just work in parallel as described above (you can't just exchange at the finest level, otherwise you lose the main advantage of AMG).

I mention AMG because, if you are building your parallel infrastructure now, it might make sense to design the implementation around the AMG needs, which are generally different from the grid ones (yet it obviously depends on the specific implementation).

sbaffini May 21, 2022 18:54

An additional thing you might want to do, even if it formally introduces variations when running with different numbers of processes, is to split each linear iteration between two groups of cells: those needing ghost values and those that don't. You can then use non-blocking send/recv to interleave the communication with the update of the two sets of cells.
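
Schematically, with placeholder routine names (none of this is from an actual code, just the structure of the interleaving):

Code:

// sketch_overlap.cpp : illustrative only. Cells are split into an interior
// group (no ghost-cell dependence) and a boundary group; the ghost-value
// exchange overlaps the interior update.
#include <mpi.h>
#include <vector>

// Placeholders assumed to exist elsewhere in the solver.
void post_halo_sends_recvs(std::vector<MPI_Request> &reqs);   // non-blocking Isend/Irecv
void update_cells(const std::vector<int> &cells);             // sweep over a cell list

void iterate_with_overlap(const std::vector<int> &interior_cells,
                          const std::vector<int> &boundary_cells)
{
    std::vector<MPI_Request> reqs;

    post_halo_sends_recvs(reqs);          // 1) start the ghost-value exchange
    update_cells(interior_cells);         // 2) interior cells need no ghosts: overlap
    MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
    update_cells(boundary_cells);         // 3) ghost values are now available
}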

aerosayan May 21, 2022 18:56

I'm trying to use LU-SGS or Block LU-SGS, which I think would be similar to block Jacobi.

aerosayan May 21, 2022 19:03

Quote:

Originally Posted by sbaffini (Post 828406)
You made a wrong assumption in thinking that you can solve each sub-problem independently, because you can't.


Okay, I see that. In implicit solvers, the solution depends on the next time step, and it won't be possible to solve the set of linear equations if even a single equation is missing. So we can't solve them in separate parts; we need the whole set of equations.

aerosayan May 21, 2022 19:14

Quote:

Originally Posted by sbaffini (Post 828406)
Algebraic multigrid methods are on a completely different level of parallel complexity, as you need to exchange different data at different grid levels. But at each level they just work in parallel as described above (you can't just exchange at the finest level, otherwise you lose the main advantage of AMG).

I mention AMG because, if you are building your parallel infrastructure now, it might make sense to design the implementation around the AMG needs, which are generally different from the grid ones (yet it obviously depends on the specific implementation).


Could you please briefly explain how the parallelization in AMG would work, what challenges we might face, and/or how to solve them?

I won't understand it all right now, but in the future I'll come back to this answer.

sbaffini May 21, 2022 19:15

Let's leave the case of block systems, where you have multiple coupled equations solved at the same time, out of the terminology, and focus on a single scalar.

Both in serial and in parallel, a single sweep of Jacobi or Gauss-Seidel is identical, except that with Jacobi you must explicitly save the updated solution in a different variable from the current one, while with Gauss-Seidel you simply overwrite the current solution with the new one (which makes Jacobi slower and more costly).

What happens in parallel, with the strategy I depicted in my previous post, is that Jacobi still works the same, because you use variables from the previous iteration both for the cells owned by the process and for those that are not and were exchanged in parallel. The same doesn't happen for Gauss-Seidel, because you still have to treat the parallel-exchanged values in a Jacobi fashion (they are from the previous iteration). This is why a classical parallel implementation of Gauss-Seidel is said to be block Jacobi: each parallel block (i.e., the set of rows owned by a process) sees the others as in the Jacobi method.
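
The same distinction in code, for the toy tridiagonal system used in the earlier sketch (schematic only): Jacobi writes the new values into a separate array so every update uses old values; Gauss-Seidel overwrites in place so later cells already see updated neighbors. In parallel, the ghost entries of x are refreshed only once per sweep, so across partition boundaries both behave like Jacobi.

Code:

// sketch_jacobi_vs_gs.cpp : schematic only.
#include <vector>

void jacobi_sweep(const std::vector<double> &b, std::vector<double> &x, int nloc)
{
    std::vector<double> xnew(x);                      // extra storage needed
    for (int i = 1; i <= nloc; ++i)
        xnew[i] = 0.5 * (b[i] + x[i - 1] + x[i + 1]); // only old values used
    x.swap(xnew);
}

void gauss_seidel_sweep(const std::vector<double> &b, std::vector<double> &x, int nloc)
{
    for (int i = 1; i <= nloc; ++i)
        x[i] = 0.5 * (b[i] + x[i - 1] + x[i + 1]);    // x[i-1] is already updated
}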

sbaffini May 21, 2022 19:25

Quote:

Originally Posted by aerosayan (Post 828409)
Okay, I see that. In implicit solvers, the solution depends on the next time step, and it won't be possible to solve the set of linear equations if even a single equation is missing. So we can't solve them in separate parts; we need the whole set of equations.

There are different, possibly overlapping, reasons for this, which are just as good as your explanation. So yes, in the original system (the one you would solve serially) each equation depends on all the others; if you split it into separate per-partition systems it is no longer the same system, and information can, at most, travel within a single partition, limiting the effectiveness of the approach (and the same reason requires exchanging in parallel at each AMG level).

sbaffini May 21, 2022 19:58

Quote:

Originally Posted by aerosayan (Post 828410)
Could you please briefly explain how the parallelization in AMG would work, what challenges we might face, and/or how to solve them?

I won't understand it all right now, but in the future I'll come back to this answer.

AMG can be implemented in several ways; this must be clear.

What AMG roughly does is combine multiple equations together following certain rules, solve the resulting smaller system, and somehow put the solution back into the larger system. This is done recursively, multiple times, starting from the original system.

The first issue, which is the one I was mostly referring to, is that at each level you still need to solve a system and, as I mentioned, solving it in parallel requires exchanging variables. But at each AMG level your numbering of the equations is different, so you can't simply reuse the machinery you developed for the explicit case, like you could for a simple Krylov or Gauss-Seidel method.

Not only this, but each process must now know, at each level, what the numbering of its ghost cells is. That is: my row i had a coefficient in column j, with row j on another process. After an agglomeration of the cells, my row i of the previous level will end up in row i', and row j of my neighbor will end up in row j'. As they are still coupled, they still need to exchange this information. At each level, and at each iteration if the system changes (as it does for non-linear equations).

So, formally, with AMG you need both a separate parallel exchange mechanism and a separate way to build the cell lists for the parallel exchanges, distinct from those used in the explicit case.

Now, whether you actually end up having these two pairs of separate mechanisms really depends on a lot of factors. It probably makes sense to indeed have them separate, but maybe not.

This is what I was referring to.

But I haven't completely figured it out yet either.

aerosayan May 21, 2022 20:33

1 Attachment(s)
Wrote my first hello-world MPI code after fixing the CMake build script.

My laptop has only 2 CPUs, and here I'm talking about making the code run on 10,000.

