CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   Hardware (https://www.cfd-online.com/Forums/hardware/)
-   -   Best ways to distribute jobs (https://www.cfd-online.com/Forums/hardware/241916-best-ways-distribute-jobs.html)

newOpenfoamUser March 27, 2022 13:34

Best ways to distribute jobs
 
2 Attachment(s)
I have hundreds of jobs to run on a machine with 52 physical cores and 104 logical cores. The number of cells for each job ranges from 200k to 500k. I have heard that there is a point beyond which decomposing a case further gives no additional speedup. Therefore, I am running 13 jobs simultaneously, with each job occupying 8 processors. Basically, I am running "mpirun -np 8 simpleFoam -parallel" with OpenFOAM in a separate terminal for each job. Is there a more computationally effective way to do this? I am attaching the machine specification below (machineA).
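In script form, what I am doing is roughly equivalent to the sketch below (the case_* directory names are made up, and each case has already been decomposed into 8 subdomains with decomposePar); in practice I simply start each run in its own terminal.

Code:

# Keep 13 jobs running at a time, 8 MPI ranks each (hypothetical case_* directories).
ls -d case_* | xargs -P 13 -I{} sh -c \
    'cd {} && mpirun -np 8 simpleFoam -parallel > log.simpleFoam 2>&1'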

I have heard that only physical cores matter for OpenFOAM. I am not sure whether this also holds for multiple jobs running simultaneously. Therefore, I ran an experiment comparing the speed of 13 jobs running simultaneously (each using 8 processors) with 6 jobs running simultaneously (each using 8 processors). There seems to be little difference in speed. Does this mean there is no further improvement in speed once all physical cores are occupied, even with multiple jobs running simultaneously? Is this a problem with OpenFOAM, or is this just how CFD computation works?

My other question is about the speed of different machines. I also have access to a machine with 32 physical cores (machineB), on which I am running 4 jobs simultaneously (each using 8 processors). machineB seems slightly faster than machineA even though machineA has more cores. I am not sure whether this is due to my inappropriate way of distributing jobs, or whether machineB is simply more powerful despite having fewer cores. I have attached the specifications of both machines. I hope someone can point out any attribute of machineA that makes it less powerful.

Thanks!

flotus1 March 27, 2022 16:55

Don't oversubscribe cores. If a machine has 52 physical CPU cores, that's the maximum number of threads for your CFD simulations. This is not unique to OpenFOAM, but true of pretty much every FV-CFD solver out there.
As a matter of fact, SMT/hyperthreading is often disabled on machines that run exclusively CFD or FEA jobs, because it doesn't help in the best case and can cause additional problems in the worst case.

Now, for the most efficient way of distributing hundreds of small jobs: my first guess would be to simply run them single-core, with as many simulations running simultaneously as there are CPU cores, making sure that each run gets pinned to a different core. This of course requires enough memory to fit that many simulations at once.
The reason behind that: parallelization always has an overhead. It can be small, but it is never zero for CFD solvers with domain decomposition. On top of that, you don't need to run a decomposition first, which also saves some time.
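A rough sketch of what I mean, assuming GNU parallel and taskset are available, the cases are set up for serial runs, and logical CPUs 0-51 correspond to the 52 physical cores (check with "lscpu -e"); the case_* names are placeholders:

Code:

# One serial simpleFoam per physical core, each run pinned to its own core.
# {%} is GNU parallel's job slot number (1..52), {} is the case directory.
ls -d case_* | parallel -j 52 \
    'taskset -c $(( {%} - 1 )) sh -c "cd {} && simpleFoam > log.simpleFoam 2>&1"'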

Beyond this simple approach, things can get tricky if you want the absolute best performance. For Intel, you probably want to enable "cluster-on-die", "sub-NUMA-clustering" or whatever it is called these days. And then run as many simulations as there are NUMA-nodes available, each simulation on its own exclusive node.
If you don't want it that complicated, at least make sure to pin your simulations so that each simulation runs entirely on either NUMA node 0 or node 1, and does not span both nodes. Communication across nodes is slower than communication within a node.
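For example, with Open MPI's mpirun, something along these lines should keep one 26-rank job on each node; the exact flags differ between MPI implementations and versions, so check the binding report it prints (core numbering as reported by lscpu, case names made up):

Code:

# One 26-rank simulation per NUMA node, ranks bound to cores within that node.
( cd caseA && mpirun -np 26 --cpu-set 0-25  --bind-to core --report-bindings simpleFoam -parallel > log.simpleFoam 2>&1 ) &
( cd caseB && mpirun -np 26 --cpu-set 26-51 --bind-to core --report-bindings simpleFoam -parallel > log.simpleFoam 2>&1 ) &
wait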

At the very least, you need to take control of core binding when running several simulations simultaneously, to prevent the system from running more than one thread per physical core. Maybe the OS is clever enough to avoid oversubscribing without being told, but you have to check.
Without that, comparing execution speed between different machines is a moot point. Who knows how many threads got pinned to the same cores on each machine, slowing down the whole process.
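A quick way to check, assuming a Linux machine:

Code:

# While the jobs are running, list which logical CPU each solver process is currently on.
ps -eo pid,psr,comm | grep simpleFoam
# If two processes show the same psr value (or SMT siblings of the same core), they share a physical core.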

newOpenfoamUser March 29, 2022 11:55

After spending some time learning how to do core binding, I went back and experimented with different settings on machineA. Here's what I found. It seems that having too many mpirun processes with a small number of cores each hurts performance. I'm not sure whether this holds generally or is specific to this machine. On machineA, which has 2 NUMA nodes with 1 socket per node and 26 physical cores per socket, the best performance is achieved when I run 2 mpirun processes using 26 cores each. Even when I don't do any core binding (the system seems to be clever enough to prevent oversubscription) and thus there is communication across NUMA nodes, the performance does not change. As flotus1 pointed out, communication across NUMA nodes is bad. So I am quite surprised that it doesn't seem to hurt performance. Maybe this is due to the relatively small number of cells in my cases?
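In case it helps, the two 26-core runs were launched roughly like this (case names are placeholders), without any explicit binding:

Code:

( cd caseA && mpirun -np 26 simpleFoam -parallel > log.simpleFoam 2>&1 ) &
( cd caseB && mpirun -np 26 simpleFoam -parallel > log.simpleFoam 2>&1 ) &
wait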

flotus1 March 29, 2022 13:53

Quote:

It seems that having too many mpirun processes with a small number of cores each hurts performance
Just to be sure: the metric we are interested in is throughput, right? I.e. how fast you can get through the large pile of jobs you have lined up.
If you run more and more simulations at the same time, each individual simulation will run slower, but overall throughput should still increase.

Quote:

As flotus1 pointed out, communication across NUMA nodes is bad. So I am quite surprised that it doesn't seem to hurt performance
Not bad in absolute terms, but slower than communicating within a node. It usually doesn't affect solvers like OpenFOAM too much, because they are designed to run even on distributed memory platforms with much slower interconnects between nodes. Still, every little bit helps, and you asked for maximum efficiency.

newOpenfoamUser March 29, 2022 21:56

Quote:

Originally Posted by flotus1 (Post 825058)
Just to be sure: the metric we are interested in is throughput, right? I.e. how fast you can get through the large pile of jobs you have lined up.
If you run more and more simulations at the same time, each individual simulation will run slower, but overall throughput should still increase.

Yes, I am talking about total throughput. Here is what the simulation times look like. If I run 1 simulation with 13 cores (leaving many cores idle), the time taken is 138 s. Similarly, if I run 1 simulation with 26 cores (again leaving many cores idle), the time taken is 84 s. If these times did not change when multiple simulations were run simultaneously, then running multiple 13-core simulations would obviously be better. But that is not the case. When I ran 4 simulations with 13 cores each, the time taken for each case increased to 211 s. On the other hand, with 2 simulations at 26 cores each, the time taken per simulation does not change.
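Putting rough numbers on throughput: 4 simulations at 13 cores each finish 4 jobs in 211 s (about 0.019 jobs/s), while 2 simulations at 26 cores each finish 2 jobs in 84 s (about 0.024 jobs/s), so the 26-core configuration gets through the pile faster.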

