March 27, 2022, 13:34 |
Best ways to distribute jobs
|
#1 |
New Member
Join Date: Sep 2021
Posts: 10
Rep Power: 4 |
I have hundreds of jobs to run on a machine with 52 physical cores and 104 logical cores. The number of cells for each job ranges from 200k to 500k. I heard that there is a limit beyond which you gain no further advantage from decomposing the cells further. Therefore, I am running 13 jobs simultaneously, each job occupying 8 processors. Basically, I am running "mpirun -np 8 simpleFoam -parallel" with OpenFOAM in a separate terminal for each job. Is there a more computationally effective way to do this? I am attaching the machine specification below (machineA).
I heard that only physical cores matter for OpenFOAM. I am not sure whether this also holds for multiple jobs running simultaneously. Therefore, I ran an experiment to compare the speed of running 13 jobs simultaneously (each job using 8 processors) vs running 6 jobs simultaneously (each job using 8 processors). It seems like there is not much difference in speed. Does this mean there is no further improvement in speed once all physical cores have been occupied, even for multiple jobs running simultaneously? Is this a problem with OpenFOAM? Or is this just how CFD computation works? Another question I have is regarding the speed of different machines. I also have access to another machine with 32 physical cores (machineB). There I am running 4 jobs simultaneously (each job using 8 processors). It seems that machineB is slightly faster than machineA even though machineA has more cores. I am not sure whether this is due to my inappropriate way of distributing jobs, or whether machineB is somehow more powerful despite having fewer cores. I attached the specifications of both machines. I hope someone can point out if there are any attributes of machineA that make it less powerful. Thanks! |
|
March 27, 2022, 16:55 |
|
#2 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46 |
Don't oversubscribe cores. If a machine has 52 physical CPU cores, that's the maximum number of threads for your CFD simulations. This is not unique to OpenFOAM; it applies to pretty much every FV-CFD solver out there.
As a matter of fact, SMT/hyperthreading is often disabled on machines that run exclusively CFD or FEA jobs, because it doesn't help in the best case, and can cause additional problems in the worst case.
Now, for the most efficient way of distributing hundreds of small jobs: my first guess would be to simply run them single-core, with as many simulations running simultaneously as there are CPU cores, making sure that each run gets pinned to a different core. This of course requires enough memory to fit this many simulations at once. The reason behind that: parallelization always has an overhead. It can be small, but it is never zero for CFD solvers with domain decomposition. On top of that, you don't need to run a decomposition first, which also saves some time.
Beyond this simple approach, things can get tricky if you want the absolute best performance. For Intel, you probably want to enable "cluster-on-die", "sub-NUMA clustering", or whatever it is called these days, and then run as many simulations as there are NUMA nodes available, each simulation on its own exclusive node. If you don't want it that complicated, at least make sure to pin your simulations so that each one runs on either NUMA node 0 OR node 1, and does not span across both nodes. Communication across nodes is slower than communication within a node.
At the very least, you need to take control of core binding when running several simulations simultaneously, to prevent the system from running more than one thread per physical core. Maybe the OS is clever enough to prevent oversubscription without being told, but you have to check. Without it, comparing execution speed between different machines is a moot point: who knows how many threads got pinned to the same cores on each machine, slowing down the whole process. |
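The single-core approach can be scripted. A minimal sketch, written as a dry run that only prints the pinned launch commands (the case directory names and the assumption that cores 0-51 are the physical cores are hypothetical; check your machine's layout with `lscpu`):

```shell
# Dry run: print one pinned launch command per case.
# Assumption: cores 0-51 are the 52 physical cores on machineA.
NCORES=52
i=0
for case in case_001 case_002 case_003; do   # stand-ins for the real case dirs
  core=$((i % NCORES))                       # round-robin over physical cores
  echo "taskset -c $core simpleFoam -case $case"
  i=$((i + 1))
done
```

Dropping the echo (and backgrounding each job with `&`, plus a periodic `wait` so no more than 52 are in flight) turns this into an actual job farm; `numactl --physcpubind` works similarly where `taskset` is unavailable.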
|
March 29, 2022, 11:55 |
|
#3 |
New Member
Join Date: Sep 2021
Posts: 10
Rep Power: 4 |
After spending some time learning how to do core binding, I went back and experimented with different settings on machineA. Here's what I found: having too many mpirun processes with a small number of cores each seems to hurt performance. I'm not sure whether this holds generally or is due to the specification of the machine. On machineA, with 2 NUMA nodes, 1 socket per node, and 26 physical cores per socket, the best performance is achieved when I run 2 mpirun processes using 26 cores each. Even when I don't do any core binding (the system seems to be clever enough to prevent oversubscription), and thus there is communication across NUMA nodes, the performance does not change. As flotus1 pointed out, communication across NUMA nodes is bad, so I am quite surprised that it doesn't seem to hurt performance. Maybe this is due to the relatively small number of cells in my cases?
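For reference, one way to make that two-job binding explicit on machineA is to confine each mpirun, memory included, to one NUMA node (assuming node 0 holds cores 0-25 and node 1 holds cores 26-51, which `numactl -H` can confirm). Shown here as a dry run that only prints the two commands:

```shell
# Dry run: print one NUMA-confined launch command per node.
# Assumption: one 26-core socket per NUMA node, as on machineA.
for node in 0 1; do
  cmd="numactl --cpunodebind=$node --membind=$node mpirun -np 26 simpleFoam -parallel"
  echo "$cmd"
done
```

`--membind` matters as much as `--cpunodebind` here: it keeps each job's memory allocations local to the node its ranks run on, avoiding cross-node memory traffic.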
|
|
March 29, 2022, 13:53 |
|
#4 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46 |
Running more and more simulations at the same time, each simulation individually will run slower. But overall throughput should still increase.
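This throughput point can be illustrated with toy numbers (the timings below are made up, purely for illustration): suppose each job takes 100 minutes when 6 run at once, and slows to 130 minutes when 13 run at once. Per-job speed drops, yet more jobs finish per unit time:

```shell
# Toy numbers (assumed): per-job wall time in minutes at two levels of sharing.
t6=100    # each job's time with 6 running simultaneously
t13=130   # each job's time with 13 running simultaneously
# Jobs completed per 1000 minutes (integer arithmetic).
tp6=$((1000 * 6 / t6))
tp13=$((1000 * 13 / t13))
echo "6 at a time: $tp6 jobs/1000 min; 13 at a time: $tp13 jobs/1000 min"
```

With these made-up timings, throughput rises from 60 to 100 jobs per 1000 minutes even though each individual job got 30% slower.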
|
March 29, 2022, 21:56 |
|
#5 |
New Member
Join Date: Sep 2021
Posts: 10
Rep Power: 4 |
|