
Best ways to distribute jobs

#1 | newOpenfoamUser (New Member) | March 27, 2022, 13:34
I have hundreds of jobs to run on a machine with 52 physical cores and 104 logical cores. The cell count for each job ranges from 200k to 500k. I have heard that there is a point beyond which decomposing a case further gives no additional speedup. Therefore, I am running 13 jobs simultaneously, with each job occupying 8 processors: basically "mpirun -np 8 simpleFoam -parallel" with OpenFOAM in a separate terminal for each job. Is there a more computationally effective way to do this? I am attaching the machine specification below (machineA).
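In script form, what I am currently doing looks roughly like this (the case directory names are placeholders for my actual cases):

Code:
for d in case{01..13}; do
    # launch each 8-rank job in the background from its own case directory
    ( cd "$d" && mpirun -np 8 simpleFoam -parallel > log.simpleFoam 2>&1 ) &
done
wait    # return once all 13 jobs have finished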

I have heard that only physical cores matter for OpenFOAM. I am not sure whether this also holds for multiple jobs running simultaneously, so I ran an experiment comparing the speed of 13 simultaneous jobs (each using 8 processors) against 6 simultaneous jobs (each using 8 processors). There is not much difference in speed. Does this mean there is no further improvement once the total number of physical cores has been occupied, even with multiple jobs running simultaneously? Is this a problem with OpenFOAM, or is this just how CFD computation works?

Another question I have is about the speed of different machines. I also have access to a machine with 32 physical cores (machineB), on which I am running 4 jobs simultaneously (each using 8 processors). MachineB seems slightly faster than machineA, even though machineA has more cores. I am not sure whether this is due to my inappropriate way of distributing jobs, or whether machineB is simply more powerful despite having fewer cores. I attached the specifications of both machines; I hope someone can point out any attribute of machineA that makes it less powerful.

Thanks!
Attached Images:
machineA.jpg (96.4 KB)
machineB.jpg (100.5 KB)

#2 | flotus1 (Alex, Super Moderator, Germany) | March 27, 2022, 16:55
Don't oversubscribe cores. If a machine has 52 physical CPU cores, that is the maximum number of threads for your CFD simulations. This is not unique to OpenFOAM; it holds for pretty much every finite-volume CFD solver out there.
As a matter of fact, SMT/hyperthreading is often disabled on machines that run exclusively CFD or FEA jobs, because it doesn't help in the best case and can cause additional problems in the worst case.

Now, for the most efficient way of distributing hundreds of small jobs: my first guess would be to simply run them single-core, with as many simulations running simultaneously as there are physical CPU cores, making sure that each run gets pinned to a different core. This of course requires enough memory to fit that many simulations at once.
The reason behind that: parallelization always has an overhead. It can be small, but it is never zero for CFD solvers based on domain decomposition. On top of that, you don't need to run decomposePar first, which also saves some time.
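A minimal sketch of what I mean, assuming the cases sit in directories named case001, case002, ... and that logical CPUs 0-51 map to the 52 physical cores; verify that numbering with "lscpu -e" first, since with hyperthreading the second thread of each core often shows up as CPUs 52-103:

Code:
#!/bin/bash
# One serial run per physical core, pinned with taskset.
# No decomposePar needed, since every job runs on a single core.
cores=52
i=0
for d in case*/; do
    taskset -c $(( i % cores )) \
        bash -c "cd '$d' && simpleFoam > log.simpleFoam 2>&1" &
    (( ++i % cores == 0 )) && wait   # crude batching: start the next 52 once these finish
done
wait

GNU parallel can do the same thing without the batch barrier, but the loop shows the idea.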

Beyond this simple approach, things can get tricky if you want the absolute best performance. On Intel, you probably want to enable "cluster-on-die", "sub-NUMA clustering", or whatever it is called these days, and then run as many simulations as there are NUMA nodes available, each simulation on its own exclusive node.
If you don't want it that complicated, at least make sure to pin your simulations so that each one runs entirely on NUMA node 0 OR node 1, and does not span both. Communication across nodes is slower than communication within a node.
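For example with numactl, which works regardless of the MPI implementation; caseA and caseB are placeholders:

Code:
# One 26-rank job per NUMA node; CPU placement and memory
# allocations both stay local to the node.
( cd caseA && numactl --cpunodebind=0 --membind=0 \
      mpirun -np 26 simpleFoam -parallel > log.simpleFoam 2>&1 ) &
( cd caseB && numactl --cpunodebind=1 --membind=1 \
      mpirun -np 26 simpleFoam -parallel > log.simpleFoam 2>&1 ) &
wait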

At the very least, you need to take control of core binding when running several simulations simultaneously, to prevent the system from running more than one thread per physical core. Maybe the OS is clever enough to avoid oversubscription without being told, but you have to check.
Without that, comparing execution speed between different machines is a moot point: who knows how many threads got pinned to the same cores on each machine, slowing down the whole process.
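Assuming your OpenFOAM installation uses Open MPI, which is the usual default, mpirun can print where each rank actually landed:

Code:
# Prints the rank-to-core map at startup; make sure no two ranks
# from concurrently running jobs share a physical core.
mpirun -np 8 --bind-to core --report-bindings simpleFoam -parallel

Watching the per-core load in htop while everything runs is another quick sanity check.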

#3 | newOpenfoamUser (New Member) | March 29, 2022, 11:55
After spending some time learning how to do core binding, I went back and experimented with different settings on machineA. Here is what I found: having too many mpirun processes with a small number of cores each seems to hurt performance. I am not sure whether this holds generally or is due to the specification of this machine. On machineA, with 2 NUMA nodes, 1 socket per node, and 26 physical cores per socket, the best performance is achieved when I run 2 mpirun processes using 26 cores each. Even when I don't do any core binding (the system seems clever enough to prevent oversubscription), so there is communication across NUMA nodes, the performance does not change. As flotus1 pointed out, communication across NUMA nodes is slow, so I am quite surprised that it doesn't seem to hurt performance. Maybe this is due to the relatively small number of cells in my cases?
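For reference, this is roughly what my 2 x 26 runs look like, using Open MPI's binding options. I am assuming here that node0 owns logical CPUs 0-25 and node1 owns CPUs 26-51; check the actual layout with "lscpu | grep NUMA", because the numbering differs between machines:

Code:
( cd case1 && mpirun -np 26 --cpu-set 0-25 --bind-to core \
      simpleFoam -parallel > log.simpleFoam 2>&1 ) &
( cd case2 && mpirun -np 26 --cpu-set 26-51 --bind-to core \
      simpleFoam -parallel > log.simpleFoam 2>&1 ) &
wait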

#4 | flotus1 (Alex, Super Moderator, Germany) | March 29, 2022, 13:53
Quote:
Originally Posted by newOpenfoamUser
Having too many mpirun processes with a small number of cores each seems to hurt performance
Just to be sure: the metric we are interested in is throughput, right? I.e., how fast you can get through the large pile of jobs you have lined up.
As you run more and more simulations at the same time, each individual simulation will run slower, but the overall throughput should still increase.

Quote:
Originally Posted by newOpenfoamUser
As flotus1 pointed out, communication across NUMA nodes is slow, so I am quite surprised that it doesn't seem to hurt performance
Not bad in absolute terms, but slower than communicating within a node. It usually doesn't affect solvers like OpenFOAM too much, because they are designed to run even on distributed memory platforms with much slower interconnects between nodes. Still, every little bit helps, and you asked for maximum efficiency.

#5 | newOpenfoamUser (New Member) | March 29, 2022, 21:56
Quote:
Originally Posted by flotus1
Just to be sure: the metric we are interested in is throughput, right? I.e., how fast you can get through the large pile of jobs you have lined up.
As you run more and more simulations at the same time, each individual simulation will run slower, but the overall throughput should still increase.
Yes, I am talking about total throughput. Here is what the simulation time looks like. If I run 1 simulation with 13 cores (leaving many cores idle), it takes 138 s; similarly, 1 simulation with 26 cores (again with many cores idle) takes 84 s. If the per-job time stayed the same when multiple simulations run simultaneously, then running multiple 13-core simulations would obviously be better. But that is not the case: when I ran 4 simulations with 13 cores each, the time per case grew to 211 s, whereas with 2 simulations on 26 cores each the time per simulation did not change. So the throughput works out to 4 jobs per 211 s (about 53 s per job) versus 2 jobs per 84 s (42 s per job), and the 26-core runs win.
