
CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   Main CFD Forum (https://www.cfd-online.com/Forums/main/)
-   -   OptimisationSwitches for parallel running of multiple cases (https://www.cfd-online.com/Forums/main/251458-optimisationswitches-parallel-running-multiple-cases.html)

dasith0001 August 17, 2023 21:01

OptimisationSwitches for parallel running of multiple cases
 
Hi Forumers,

I am running multiple cases of chtMultiRegionFoam with OF10 on an AMD Ryzen Threadripper 3990X 64-Core Processor.

The problem is that when I run more than 4 cases, each using 8 threads, the simulations slow down drastically.

I've followed these links:

https://www.cfd-online.com/Forums/blogs/wyldckat/596-notes-about-running-openfoam-parallel.html

and

https://www.cfd-online.com/Forums/hardware/100260-mpirun-best-parameters.html#post356954

and I think I have a fair idea of what's limiting performance.

However, I saw this new post:

https://develop.openfoam.com/Development/openfoam/-/wikis/tuning#parallel-tuning

and I'm wondering if there is any potential to increase my speed when running multiple cases, or even when running a single case. The 'OptimisationSwitches' they recommend looking at are:

PHP Code:

OptimisationSwitches
{
    // Min number of processors to use non-blocking exchange (NBX) algorithm
    //   >0 : enabled
    nbx.min             0.1;

    // Additional non-blocking exchange (NBX) tuning parameters (experimental)
    //    0 : none
    //    1 : initial barrier
    nbx.tuning          1.5;

    // Additional PstreamBuffers tuning parameters (experimental)
    //   -1 : PEX with all-to-all for buffer sizes and point-to-point
    //        for contents (legacy approach)
    //    0 : hybrid PEX with NBX for buffer sizes and point-to-point
    //        for contents (proposed new approach)
    //    1 : full NBX for buffer sizes and contents (very experimental)
    pbufs.tuning        0;

    // https://develop.openfoam.com/Development/openfoam/-/wikis/tuning#parallel-tuning

    // Optional (experimental) feature in lduMatrixUpdate
    // to poll (processor) interfaces for individual readiness
    // instead of waiting for all to complete first.
    //   -1 : wait for any requests to finish and dispatch when possible
    //    0 : non-polling
    //   >0 : number of times to poll for requests (and dispatch) before
    //        reverting to non-polling (deprecated)
    nPollProcInterfaces 0;
}


I am just wondering if anyone has had success tweaking those knobs? Where should I start?

Thank you,
Dasith

LuckyTran August 17, 2023 21:41

If you have a fair idea what is limiting it, then would you mind sharing what that is? Otherwise, how else can we recommend anything?

I don't recommend touching OptimisationSwitches when you have a single node. In fact, I don't recommend touching OptimisationSwitches until you have at least 100 nodes. You do not have a communication bottleneck with a single node. You don't have any inter-node communication for there to be a bottleneck!

dasith0001 August 17, 2023 22:11

Hi LuckyTran,

Please correct me if I am wrong, but I think the issue is limited "memory bandwidth and cache sizes", following this link: https://www.cfd-online.com/Forums/openfoam-solving/158303-running-multiple-parallel-cases-openfoam-slows-simulation.html

Well, there is definitely some bottleneck coming into play, even if it's not the bandwidth.

Whenever I use more than 30-34 of the 64 available cores, all the simulations slow down drastically.

It would be much appreciated if someone could provide insight on where to start my detective work. I am relatively new to the field, so any explanation is appreciated.

LuckyTran August 18, 2023 02:21

In order for it to be a bandwidth issue, first you must confirm that your RAM is not at capacity. How much RAM is on your machine? How much RAM is utilized when running 1 case, 2 cases, 3 cases, and 4 cases?

It is quite easy to verify whether there is a memory bandwidth issue, because your core utilization will be at less than 100%. If you are on Unix then run top or equivalent. If you are on Windows then open Task Manager.
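To make that check concrete: here is a minimal, Linux-only sketch that samples per-core utilization by reading /proc/stat directly (this is essentially what `top` shows per core after you press `1`; `mpstat -P ALL` from the sysstat package gives the same view). It is an illustration, not part of any OpenFOAM tooling:

```python
# Per-core utilization snapshot from /proc/stat (Linux only).
# Cores stuck well below 100% busy while a solver runs are a hint of a
# memory (or other) stall rather than a compute-bound workload.
import time

def cpu_times():
    """Return {cpu_id: (total_jiffies, idle_jiffies)} for each logical core."""
    stats = {}
    with open("/proc/stat") as f:
        for line in f:
            # Match "cpu0", "cpu1", ... but not the aggregate "cpu" line.
            if line.startswith("cpu") and line[3:4].isdigit():
                name, *fields = line.split()
                vals = list(map(int, fields))
                idle = vals[3] + vals[4]  # idle + iowait columns
                stats[name] = (sum(vals), idle)
    return stats

before = cpu_times()
time.sleep(1)
after = cpu_times()

for cpu in sorted(before, key=lambda c: int(c[3:])):
    total = after[cpu][0] - before[cpu][0]
    idle = after[cpu][1] - before[cpu][1]
    busy = 100.0 * (total - idle) / total if total else 0.0
    print(f"{cpu}: {busy:5.1f}% busy")
```

Run it while your 4x8 case set is solving; if many of the pinned cores sit far below 100% busy, the threads are waiting on something rather than computing.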

dasith0001 August 21, 2023 20:51

Hi LuckyTran,

Apologies for the late reply, but I am keen to figure this out!

Yep, I've been closely monitoring CPU and RAM usage for the last couple of months.
The machine has 64 GB of RAM (DDR4).

Here's what happens when I launch the first four OF cases, each with 8-core parallelization:

-Case 1 CPU 10% + RAM 25%
-adding Case 2 CPU 18% + RAM 28%
-adding Case 3 CPU 26% + RAM 30%
-adding Case 4 CPU 35% + RAM 33%

No matter how many models I add to solve simultaneously, I cannot get RAM usage over 40% or so.

(Please note the percentages are from memory; they are only rough approximations. I just want to show the pattern.)

The first case always seems to increase RAM usage to 25-30%, but adding more work does not increase it much.

Thank you

LuckyTran August 22, 2023 00:31

And how do core temps look? Hot enough to cook pasta or a sea breeze?

If you really think it's a memory bandwidth issue, then run 1 case on 8 cores but with 4x the mesh density. If you want to see RAM utilization above 40%, then make a 1000x denser mesh; you can even use more cells than your system can handle and hit the real cap. Most likely, when you do this, you will not see the predicted outcome. Each of your cases occupies only 4% of the RAM; these are baby numbers.


Are you running SSDs? How often are you writing to the hard disk? Is it a disk usage bottleneck? Because all the jobs share the same disk and you would be read/write limited running a bunch of jobs simultaneously.

dasith0001 August 22, 2023 19:31

Hi LuckyTran,

The CPU actually was able to cook pasta initially, and then I moved to a liquid cooling system plus 4 heavy fans; now it runs between 78-82 °C.

I don't necessarily want to see the RAM utilization increase; it's just that RAM capacity might not be the bottleneck, since it is far from exceeded. I am curious, though: why does RAM usage jump from, say, 6% to 28% or so when I launch my first case?

I am writing at very long intervals; disk utilization never exceeds 2-4%. The simulations write at 4-5 hr (wall clock) intervals. In fact, I have 6 different SSDs, and I tried writing the cases separately onto them. Still the same result: if I exceed 4 cases x 8 cores, everything slows down drastically.

Still not bandwidth? I am very open to trying anything to solve this. Having the CPU power to run a submarine and not being able to fully utilize it drives me crazy!

Thanks

LuckyTran August 22, 2023 22:11

I don't care how much RAM someone else is using on their machines; it helps delay global warming when their machines sit idle. Remember Einstein's definition of insanity?

My point is you can force there to be a bandwidth issue by intentionally... using a lot of bandwidth. Right now you have 4 cases that use puny amounts of RAM. If you run 1 case that uses 4x the bandwidth and it doesn't slow down then you know it's not a bandwidth issue. Bandwidth is not some black magic, it has a very simple formula.

I'm obviously skeptical about how you'd encounter bandwidth issues with one of the highest-bandwidth machines in existence. Unless... who knows, there might be an actual bug, like a cockroach living in your RAM. But instead of doing the same 4x8 test over and over again, I encourage you to simply prove that it's not a bandwidth problem by doing the bandwidth test, so you can move on and actually figure out what it might be.

dasith0001 August 23, 2023 00:13

OK, makes sense, and thank you.

I will do the test and update back.

But if, by some remote chance, it is a bandwidth issue, what solutions should I be looking at? Is it a good idea to follow the second link in my very first post?

LuckyTran August 23, 2023 07:28

You should look to solutions that are actually related to the problem.


It never occurred to me that you might be confused about what bandwidth means, my bad. Memory bandwidth is a hardware spec in GiB/s: 23.84 GiB/s for a single memory channel, 47.68 GiB/s for dual channel, and 95.37 GiB/s for 4 channels. You have one of these 3 configurations, and most people have a dual- or quad-channel configuration. There is nothing you can type into a machine that will make a bandwidth issue become not a bandwidth issue. You need to buy a new computer if that is the case.
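For reference, those figures follow from simple arithmetic (this assumes DDR4-3200, i.e. 3200 MT/s, which is what reproduces the numbers quoted above; DDR4 moves 8 bytes per transfer per channel):

```python
# Theoretical peak memory bandwidth:
#   transfer_rate (MT/s) * 8 bytes/transfer * channels,
# converted from bytes/s to GiB/s (1 GiB = 2**30 bytes).
GIB = 2**30

def peak_bandwidth_gib(mt_per_s, channels):
    """Peak DDR bandwidth in GiB/s for a given transfer rate and channel count."""
    return mt_per_s * 1e6 * 8 * channels / GIB

# Assumes DDR4-3200; matches the 23.84 / 47.68 / 95.37 GiB/s figures.
for ch in (1, 2, 4):
    print(f"{ch} channel(s): {peak_bandwidth_gib(3200, ch):.2f} GiB/s")
```

Slower memory (e.g. DDR4-2666) scales the numbers down proportionally, which is why correct memory population and speed matter.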

flotus1 August 23, 2023 15:19

https://www.cfd-online.com/Forums/ha...dware-wip.html
From chapter 1 on CPU solver performance: "High-end HEDT parts with 4 memory channels (TR-3960x): A textbook example of a memory bandwidth bottleneck. There are just way too many cores for only 4 memory channels."
You just cannot use all cores of a 3990X efficiently for OpenFOAM. It is not the ideal CPU for CFD, to say the least.

All you can do is make sure the memory is populated correctly. Take a look at the manual if you are not sure, or tell us more.
And avoid over-subscribing cores, especially when running multiple simulations at the same time. The 3990X is made up of 8 chiplets. When running 4 simulations at the same time, it would be ideal to distribute the threads evenly across all of them. It maximizes the L3 cache available for each thread.
With 4 simulations of 8 threads each: Simulation 1 needs to be pinned to cores 0,2,4,6,8,10,12,14. Simulation 2 starts at core 16, again with stride 2. And so on.
All of this won't make a night-and-day difference, since memory bandwidth will still remain the bottleneck.
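The stride-2 assignment described above can be sketched as follows (a small illustration, not an OpenFOAM utility; the binding command in the comment is a hypothetical example, since affinity flags differ between MPI implementations, so check your MPI's documentation):

```python
# Stride-2 core lists for pinning N simulations of 8 threads each on a
# 64-core 3990X: simulation i starts at core 16*i and takes every 2nd core.
# A pinned run might then look like (hypothetical, Open MPI style):
#   mpirun --cpu-set 0,2,4,6,8,10,12,14 --bind-to core -np 8 <solver> -parallel
def pin_set(sim_index, threads=8, stride=2):
    """Core IDs for one simulation, spread with the given stride."""
    start = sim_index * threads * stride
    return [start + stride * t for t in range(threads)]

for i in range(4):
    print(f"simulation {i + 1}: {','.join(map(str, pin_set(i)))}")
```

This spreads each 8-thread job across chiplets so the threads share as much L3 cache as possible, matching the layout given in the post.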

dasith0001 August 24, 2023 03:15

Quote:

Originally Posted by LuckyTran (Post 855751)
You should look to solutions that are actually related to the problem.

Absolutely. But is that the case even if
Quote:

Originally Posted by LuckyTran (Post 855751)
There is nothing you can type into a machine that will

resolve the issue?

Thank you for clarifying what 'bandwidth' is. I was completely missing that. Now I see it: I should have read

https://www.cfd-online.com/Forums/hardware/234076-general-recommendations-cfd-hardware-wip.html

by Alex before getting my machine.

Good luck.

dasith0001 August 24, 2023 03:42

Hi Alex,

Thank you very much for being super clear and talking right on to the point. If I may add my two cents, people like you contribute immensely to keeping this community healthy and exciting.

That being said, I think you have confirmed my fears; I really should have read your post long before.

Yeah, I try not to use too many cores; as a rule of thumb, I use 1 core per 50k cells.

My study at the moment involves a lot of calibration, hence the need to run multiple cases at the same time.


Coming back to my original question: can 'OptimisationSwitches' tackle the issue to any degree?

Cheers,
Dasith

flotus1 August 24, 2023 04:00

I can't claim that I fully understand what these switches do. But from the description, they all seem to affect some detail of the MPI communication.
As such, they will mostly affect strong scaling behaviour of the code. Imagine you are running OpenFOAM on a cluster, and get less-than-ideal speedup when increasing the number of nodes. That's when these switches might have some effect. Reducing parallelization overhead.

This won't change the arithmetic intensity of the code, which is responsible for the memory bandwidth bottleneck you are seeing here. So I don't think they are worth investigating at this point. You sure can play around with them, maybe gain a few percent of performance here and there. But it's not the core issue.

Edit: with memory installed and configured correctly, you could also check whether your BIOS exposes NUMA settings.
If it is exposed, it should be somewhere in the AMD CBS sub-menu. What we want to do is set it to NPS2 (or even better: NPS4 if available). I am not familiar with what's available for these CPUs.
This will slightly increase maximum total memory bandwidth, and also decrease memory latency within a NUMA node.
If it can be set to NPS4, we can run each of your 4 simulations within the boundaries of a NUMA node. Maybe we get a 10% performance increase out of that. If you are on Linux, "numactl -H" will give you an idea how many NUMA nodes there are, and which cores belong to which node.

And if you haven't done so already: disable SMT, it will make things easier. We can't fully utilize 64 cores, so more than 1 thread per core is definitely not needed.

