OptimisationSwitches for parallel running of multiple cases

#1, dasith0001 (Desh), August 17, 2023, 21:01
Senior Member, Join Date: Mar 2021, Location: Sydney, Posts: 107

Hi Foamers,

I am running multiple cases of chtMultiRegionFoam with OF10 on an AMD Ryzen Threadripper 3990X 64-Core Processor.

The problem is that when I run more than 4 models, each using 8 threads, the simulations slow down drastically.

I've followed these links:
https://www.cfd-online.com/Forums/blogs/wyldckat/596-notes-about-running-openfoam-parallel.html
and
https://www.cfd-online.com/Forums/hardware/100260-mpirun-best-parameters.html#post356954
and I think I have a fair idea of what the limiting factor is.

However, I saw this new post:
https://develop.openfoam.com/Development/openfoam/-/wikis/tuning#parallel-tuning

I am wondering whether there is any potential to increase my speed when running multiple cases, or even a single case. The 'OptimisationSwitches' they recommend looking at are:

PHP Code:
OptimisationSwitches
{
    // Min number of processors to use non-blocking exchange (NBX) algorithm
    //   >0 : enabled
    nbx.min         0.1;

    // Additional non-blocking exchange (NBX) tuning parameters (experimental)
    //    0 : none
    //    1 : initial barrier
    nbx.tuning      1.5;

    // Additional PstreamBuffers tuning parameters (experimental)
    //   -1 : PEX with all-to-all for buffer sizes and point-to-point
    //        for contents (legacy approach)
    //    0 : hybrid PEX with NBX for buffer sizes and point-to-point
    //        for contents (proposed new approach)
    //    1 : full NBX for buffer sizes and contents (very experimental)
    pbufs.tuning    0;

    // https://develop.openfoam.com/Development/openfoam/-/wikis/tuning#parallel-tuning

    // Optional (experimental) feature in lduMatrixUpdate
    // to poll (processor) interfaces for individual readiness
    // instead of waiting for all to complete first.
    //   -1 : wait for any requests to finish and dispatch when possible
    //    0 : non-polling
    //   >0 : number of times to poll for requests (and dispatch) before
    //        reverting to non-polling (deprecated)
    nPollProcInterfaces 0;
}

I am just wondering if anyone has had success tweaking those knobs? Where should I start?

Thank you,
Dasith

#2, LuckyTran (Lucky), August 17, 2023, 21:41
Senior Member, Join Date: Apr 2011, Location: Orlando, FL USA, Posts: 5,675

If you have a fair idea what is limiting it, then would you mind sharing what that is? Otherwise, how else can we recommend anything?

I don't recommend touching OptimisationSwitches when you have a single node. In fact, I don't recommend touching OptimisationSwitches until you have at least 100 nodes. You do not have a communication bottleneck with a single node. You don't have any communication for there to be a bottleneck!

#3, dasith0001 (Desh), August 17, 2023, 22:11

Hi LuckyTran,

Please correct me if I am wrong, but I think the issue is the limited "memory bandwidth and cache sizes", going by this thread: "running multiple parallel cases in openfoam slows the simulation".

Well, there is definitely some bottleneck coming into play, if it's not the bandwidth.

Whenever I use more than 30-34 of the 64 available cores, all the simulations slow down drastically.

It would be much appreciated if someone could provide some insight into where to start my detective work. I am relatively new to the field, so any explanation is appreciated.

#4, LuckyTran (Lucky), August 18, 2023, 02:21

In order for it to be a bandwidth issue, first you must confirm that your RAM is not at capacity. How much RAM is on your machine? How much RAM is utilized when running 1 case, 2 cases, 3 cases, and 4 cases?

It is quite easy to verify that there is a memory bandwidth issue because your core utilization will be at less than 100%. If you are on unix then do top or equivalent. If you are on windows then open the task manager.
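
On Linux, the per-core utilization that top displays can also be derived from two samples of /proc/stat. The following is a minimal sketch, assuming the usual /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal); the helper name `busy_fraction` is hypothetical.

```python
def busy_fraction(before, after):
    """Fraction of time a CPU was busy between two /proc/stat samples.

    Each argument is one line such as 'cpu0 100 0 50 800 50 0 0 0'.
    The numbers after the label are cumulative jiffies in the order
    user, nice, system, idle, iowait, irq, softirq, steal.
    """
    def totals(line):
        # Only the first eight counters are used; guest time is already
        # included in user/nice, so it is left out to avoid double counting.
        fields = [int(x) for x in line.split()[1:9]]
        idle = fields[3] + fields[4]  # idle + iowait
        return sum(fields), idle

    total0, idle0 = totals(before)
    total1, idle1 = totals(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 1.0 - (idle1 - idle0) / elapsed


# Two made-up samples of the same core, taken some time apart
print(busy_fraction("cpu0 100 0 50 800 50 0 0 0",
                    "cpu0 200 0 100 850 50 0 0 0"))  # 0.75
```

To use it on a real machine, read the same `cpuN` line from /proc/stat twice, about a second apart, and pass both lines in.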

#5, dasith0001 (Desh), August 21, 2023, 20:51

Hi LuckyTran,

Apologies for the late reply, but I am keen on figuring this out!

Yep, I've been closely monitoring CPU and RAM usage for the last couple of months.
The machine has 64 GB of RAM (DDR4).

Here's what happens when I launch the first four OF cases, each parallelized over 8 cores:

- Case 1: CPU 10%, RAM 25%
- adding Case 2: CPU 18%, RAM 28%
- adding Case 3: CPU 26%, RAM 30%
- adding Case 4: CPU 35%, RAM 33%

No matter how many models I add to solve simultaneously, I cannot get the RAM usage above 40% or so.

(Please note the percentages are from memory; they are only rough approximations. I just want to show the pattern.)

The first case always seems to increase the RAM usage to 25-30%, but adding more work does not increase it much.

Thank you

Last edited by dasith0001; August 21, 2023 at 20:54. Reason: spelling

#6, LuckyTran (Lucky), August 22, 2023, 00:31

And how do core temps look? Hot enough to cook pasta or a sea breeze?

If you really think it's a memory bandwidth issue then run 1 case on 8 cores but with 4x the mesh density. If you want to see RAM utilization above 40% then make a 1000x denser mesh; you can even go to more cells than your system can handle and hit the real cap. Most likely, when you do these, you will not see the predicted outcome. Each of your cases occupies only 4% of the RAM; these are baby numbers.

Are you running SSDs? How often are you writing to the hard disk? Is it a disk usage bottleneck? All the jobs share the same disk, so you would be read/write limited when running a bunch of jobs simultaneously.

#7, dasith0001 (Desh), August 22, 2023, 19:31

Hi LuckyTran,

The CPU actually was able to cook pasta initially; then I moved to a liquid cooling system plus 4 heavy fans, and now it runs at 78-82 °C.

I don't necessarily want to see the RAM utilization increase; it's just that RAM capacity might not be the bottleneck, as it is far from being exceeded. I am curious, though: why does the RAM usage jump from, say, 6% to 28% or so when I launch my first case?

I write at very long intervals, and the disk utilization never exceeds 2-4%. The simulation I am running writes every 4-5 hours (wall-clock time). In fact, I have 6 different SSDs and I tried writing each case to a separate one. Still the same result: if I exceed '4 cases x 8 cores', everything slows down drastically.

Still not bandwidth? I am very open to trying anything to solve this. Having the CPU power to run a submarine and not being able to fully utilize it drives me crazy!

Thanks

Last edited by dasith0001; August 22, 2023 at 19:38. Reason: added additional information

#8, LuckyTran (Lucky), August 22, 2023, 22:11

I don't care how much RAM someone else is using on their machines; it helps delay global warming when their machines sit idle. Remember Einstein's definition of insanity?

My point is you can force there to be a bandwidth issue by intentionally... using a lot of bandwidth. Right now you have 4 cases that use puny amounts of RAM. If you run 1 case that uses 4x the bandwidth and it doesn't slow down then you know it's not a bandwidth issue. Bandwidth is not some black magic, it has a very simple formula.

I'm obviously skeptical about how you could encounter bandwidth issues with one of the highest bandwidth machines in existence. Unless... who knows, there might be an actual bug like a cockroach living in your RAM. But instead of doing the same 4x8 test over and over again, I encourage you to simply prove that it's not a bandwidth problem by doing the bandwidth test, so you can move on and actually figure out what it might be.

#9, dasith0001 (Desh), August 23, 2023, 00:13

OK, that makes sense, and thank you.

I will do the test and report back.

But if, by some remote chance, it is a bandwidth issue, what solutions should I be looking at? Is it a good idea to follow the second link in my very first post?

#10, LuckyTran (Lucky), August 23, 2023, 07:28

You should look to solutions that are actually related to the problem.


It never occurred to me that you might be confused what bandwidth means, my bad. Memory bandwidth is a hardware spec in GiB/s which is 23.84 GiB/s for a single memory channel, 47.68 GiB/s for dual channel, and 95.37 GiB/s for 4 channels. You have one of these 3 configurations and most people have a dual channel or quad channel configuration. There is nothing you can type into a machine that will make a bandwidth issue become not a bandwidth issue. You need to buy a new computer if that is the case.
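
The three figures quoted above are consistent with the usual peak-bandwidth arithmetic (transfer rate times bus width per channel times number of channels) if one assumes DDR4-3200 on 64-bit channels. The memory speed is an assumption here, since the post does not state it. A short sketch of that calculation, with a hypothetical function name:

```python
def peak_bandwidth_gib_s(mega_transfers_per_s, channels, bus_width_bits=64):
    """Theoretical peak memory bandwidth in GiB/s.

    mega_transfers_per_s: memory speed in MT/s (3200 for DDR4-3200, assumed)
    channels:             number of populated memory channels
    bus_width_bits:       data width of one channel (64 bits for DDR4)
    """
    bytes_per_second = mega_transfers_per_s * 1e6 * (bus_width_bits / 8) * channels
    return bytes_per_second / 2**30


for channels in (1, 2, 4):
    print(channels, round(peak_bandwidth_gib_s(3200, channels), 2))
# 1 23.84
# 2 47.68
# 4 95.37
```

These are theoretical peaks; the bandwidth a solver actually sustains is lower, and it is shared by every core that is active at the same time.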

#11, flotus1 (Alex), August 23, 2023, 15:19
Super Moderator, Join Date: Jun 2012, Location: Germany, Posts: 3,399

General recommendations for CFD hardware [WIP]
From chapter 1 on CPU solver performance: "High-end HEDT parts with 4 memory channels (TR-3960x): A textbook example of a memory bandwidth bottleneck. There are just way too many cores for only 4 memory channels."
You just cannot use all cores of a 3990X efficiently for OpenFOAM. It is not the ideal CPU for CFD, to say the least.

All you can do is make sure the memory is populated correctly. Take a look at the manual if you are not sure, or tell us more.
And avoid over-subscribing cores, especially when running multiple simulations at the same time. The 3990X is made up of 8 chiplets. When running 4 simulations at the same time, it would be ideal to distribute the threads evenly across all of them. This maximizes the L3 cache available for each thread.
With 4 simulations of 8 threads each: Simulation 1 needs to be pinned to cores 0, 2, 4, 6, 8, 10, 12, 14. Simulation 2 starts at core 16, again with stride 2. And so on.
All of this won't make a night-and-day difference, since memory bandwidth will still remain the bottleneck.
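
The pinning scheme described above (8 threads per simulation, stride 2, each simulation starting 16 cores after the previous one) can be written down as a small helper. This is only a sketch: the function name is hypothetical, and it assumes the operating system numbers the 64 physical cores 0 to 63 in chiplet order, which should be verified on the actual machine.

```python
def pinned_cores(sim_index, threads=8, stride=2):
    """Core IDs for one simulation.

    sim_index: 0-based index of the simulation
    threads:   number of MPI ranks/threads in that simulation
    stride:    gap between consecutive core IDs (2 = every other core)
    """
    start = sim_index * threads * stride
    return [start + i * stride for i in range(threads)]


for sim in range(4):
    core_list = ",".join(str(c) for c in pinned_cores(sim))
    print("simulation %d: %s" % (sim + 1, core_list))
# simulation 1: 0,2,4,6,8,10,12,14
# simulation 2: 16,18,20,22,24,26,28,30
# simulation 3: 32,34,36,38,40,42,44,46
# simulation 4: 48,50,52,54,56,58,60,62
```

A comma-separated list like this is the form accepted by, for example, `taskset -c`. How the MPI launcher's own binding options interact with it depends on the launcher, so that part is worth checking in its documentation.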

#12, dasith0001 (Desh), August 24, 2023, 03:15

Quote:
Originally Posted by LuckyTran
You should look to solutions that are actually related to the problem.

Absolutely. Is that the case even if

Quote:
Originally Posted by LuckyTran
There is nothing you can type into a machine that will

resolve the issue?

Thank you for clarifying what 'bandwidth' is. I was completely missing that. Now I see the solution: I should have read
https://www.cfd-online.com/Forums/hardware/234076-general-recommendations-cfd-hardware-wip.html
by Alex before getting my machine.

Good luck.

#13, dasith0001 (Desh), August 24, 2023, 03:42

Hi Alex,

Thank you very much for being super clear and getting right to the point. If I may add my two cents, people like you contribute immensely to keeping this community healthy and exciting.

That being said, I think you have confirmed my fears. I really should have read your post much earlier.

Yeah, I try not to use too many cores; as a rule of thumb, I use 1 core per 50k cells.

My study at the moment involves a lot of calibration, hence the need to run multiple cases at the same time.

Coming back to my original question: can 'OptimisationSwitches' tackle the issue to any degree?

Cheers,
Dasith

#14, flotus1 (Alex), August 24, 2023, 04:00

I can't claim that I fully understand what these switches do. But from the description, they all seem to affect some detail of the MPI communication.
As such, they will mostly affect the strong scaling behaviour of the code. Imagine you are running OpenFOAM on a cluster, and get less-than-ideal speedup when increasing the number of nodes. That's when these switches might have some effect, by reducing parallelization overhead.

This won't change the arithmetic intensity of the code, which is responsible for the memory bandwidth bottleneck you are seeing here. So I don't think they are worth investigating at this point. You sure can play around with them, maybe gain a few percent of performance here and there. But it's not the core issue.

Edit: with memory installed and configured correctly, you could also check whether you can find NUMA settings in the BIOS.
If it is exposed, it should be somewhere in the AMD CBS sub-menu. What we want to do is set it to NPS2 (or even better: NPS4 if available). I am not familiar with what's available for these CPUs.
This will slightly increase maximum total memory bandwidth, and also decrease memory latency within a NUMA node.
If it can be set to NPS4, we can run each of your 4 simulations within the boundaries of a NUMA node. Maybe we get a 10% performance increase out of that. If you are on Linux, "numactl -H" will give you an idea how many NUMA nodes there are, and which cores belong to which node.

And if you haven't done so already: disable SMT, it will make things easier. We can't fully utilize 64 cores, so more than 1 thread per core is definitely not needed.
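
To turn the "numactl -H" listing mentioned above into per-node core lists, its `node N cpus:` lines can be parsed. A minimal sketch follows; the function name is hypothetical, and the sample text is made up for illustration (it is not output from the 3990X discussed in this thread).

```python
import re


def parse_numactl_hardware(text):
    """Map NUMA node ID -> list of CPU IDs from `numactl -H` output."""
    nodes = {}
    for line in text.splitlines():
        # Lines of interest look like: "node 0 cpus: 0 1 2 3"
        match = re.match(r"node (\d+) cpus:(.*)", line.strip())
        if match:
            nodes[int(match.group(1))] = [int(c) for c in match.group(2).split()]
    return nodes


# Illustrative sample only, shortened to two nodes of four CPUs each
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 32000 MB
node 1 cpus: 4 5 6 7
node 1 size: 32000 MB
"""

print(parse_numactl_hardware(SAMPLE))
# {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```

With NPS4 active, one would expect four entries, and each simulation could then be given the core list of one node.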

Last edited by flotus1; August 24, 2023 at 06:13.