OpenFOAM benchmarks on various hardware

flotus1 · May 27, 2022, 14:30

Well, I had a new workstation to play around with. Unfortunately, I can't get the benchmark to run properly.
I tried both compiling 2112 from source, as well as using the OpenFOAM 2112 installation from the OpenSUSE science repository.
The solver runs, but the mesh is not created properly, leading to solver run times of ~16s on a single core. I used the bench_template_v02.zip provided by Simbelmynë.
The problems are the same. Here are the mesh logs from the single-core directory:
blockMesh.txt
decomposePar.txt
snappyHexMesh.txt
surfaceFeatures.txt
Maybe one of you can point me in the right direction.

wkernkamp · May 28, 2022, 00:26

I have run OF v2112. The meshQualityDict in my system directory has this includeEtc:
#includeEtc "caseDicts/meshQualityDict"

You seem to have the one that calls out caseDicts/mesh/generation/meshQualityDict, as shown in your snappyHexMesh.txt file. That may be for OpenFOAM v9 (not sure).

If this doesn't solve it, I will upload my entire basecase directory. Just let me know.

flotus1 · May 28, 2022, 04:39

Thanks, I changed that line in meshQualityDict.
Unfortunately, that didn't do the trick. If you could provide me with a basecase and run script known to work with 2112, that would be great.

wkernkamp · May 28, 2022, 12:46

Here it is. Run it with run.tst. The file has a list of numbers of nodes at the beginning. A little further down, you can set prep=0 to avoid recalculating the mesh if you already have a valid mesh. In the loop that runs OpenFOAM itself, I remove the simpleFoam log files etc. to allow a rerun to proceed. On the first try, these files do not exist yet, so you will see an error message that you can ignore.

flotus1 · May 28, 2022, 17:39

Phew, that finally worked. If you don't mind, I would like to add your script to the first post of this thread, or link to your post. Please let me know if you are ok with that.

Anyway, here is my new toy. Well not actually mine, but I still got to play with it for a while.
Hardware: 2x AMD Epyc 7543, Gigabyte MZ72-HB0, 16x64GB DDR4-3200 (RDIMM, 2Rx4)
BIOS settings: SMT disabled, workload tuning: HPC optimized, power settings: default, ACPI SRAT L3 cache as NUMA domain: enabled (results in 16 NUMA nodes)
Software: OpenSUSE Leap 15.3 with backport kernel, OpenFOAM v2112 compiled via gcc 11.2.1, using march=znver3, OpenMPI 4.1.4, scaling governor: performance, cleared caches before each run using "echo 3 > /proc/sys/vm/drop_caches"

Code:

simpleFoam run times for 100 iterations:
#threads | runtime/s
====================
01       | 471.92
02       | 227.14
04       | 108.51
08       |  52.11
16       |  28.81
32       |  18.11
48       |  15.46
64       |  13.81

Compared to the same OpenFOAM version from the OpenSUSE science repo, this runs a little faster. On 64 cores, that version takes around 14.9s.
Also, using one NUMA node per CCX is still a little faster than the usual recommendation of NPS=4. But of course it would have huge drawbacks for software that isn't NUMA-aware.
Tweaking BIOS settings can be tricky. I got consistently worse performance when tweaking the power settings more towards performance. There is probably still a little more to gain, but I'd rather not overdo it with BIOS settings on someone else's hardware.
I should also note that some of the runs with intermediate thread counts needed some hand-holding. E.g. the threads for the 02 run got mapped to cores on the same memory controller with default settings. Running with "mpirun -np 2 --bind-to core --rank-by core --map-by socket" fixes that.
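
To put the scaling in the table above into numbers, here is a minimal Python sketch that computes speedup and parallel efficiency relative to the single-thread run. The run times are copied from the table; the names `runtimes` and `scaling` are only illustrative.

```python
# simpleFoam run times (s) for 100 iterations, copied from the table above
runtimes = {1: 471.92, 2: 227.14, 4: 108.51, 8: 52.11,
            16: 28.81, 32: 18.11, 48: 15.46, 64: 13.81}

def scaling(times):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread run."""
    base = times[1]
    return {n: (base / t, base / t / n) for n, t in times.items()}

for n, (speedup, eff) in scaling(runtimes).items():
    print(f"{n:2d} threads: speedup {speedup:5.2f}, efficiency {eff:4.2f}")
```

With these numbers, efficiency stays above 1 up to 16 threads and drops to roughly 0.53 at 64 threads (a speedup of about 34).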

wkernkamp · May 28, 2022, 19:29

Go right ahead and post my modification of the original script. Before you do, you might include the mpirun commands you used for certain cases. I have been doing similar things, as you can see from the number of mpirun versions that were commented out. I have a version somewhere that splits it out based on number of cores. The strategy for setting run parameters will be different for each CPU. It is still nice to have it in the script so that people can develop their plan without having to reinvent the wheel.

Nice job evaluating the borrowed machine. I also found that BIOS tweaking does not do much, except that the memory has to be set for performance (obviously). I also don't bother setting the fans to maximum. Some servers are very noisy that way. Plus, the fans will spin as needed.

flotus1 · May 29, 2022, 07:16

Well, the precise mpirun commands for consistent results vary with the number of threads. Someone else might be able to find a single command that works for all thread counts, but then there are still the variables of hardware, NUMA topology and MPI libraries. I don't think there is a "one size fits all" solution here.
I could try to go into more detail about what to look for, but it would end up being a rather lengthy post titled "how to benchmark correctly", which, as pedants in the field may argue, we are all doing wrong anyway by leaving turbo boost enabled for such a short benchmark.

Maybe another day.

wkernkamp · May 29, 2022, 17:53

Agreed, but what I meant was to leave the default as is, but add your special case commented out with a short description explaining the specific use.

flotus1 · May 30, 2022, 02:24

There are many ways to achieve the same result, most of them more elegant than what I did: https://www.open-mpi.org/doc/v4.1/man1/mpirun.1.php

What I ended up using:

Code:

mpirun -np  2 --bind-to core --rank-by core --map-by socket
mpirun -np  4 --bind-to core --rank-by core --cpu-list 0,16,32,48
mpirun -np  8 --bind-to core --rank-by core --cpu-list 0,8,16,24,32,40,48,56
mpirun -np 16 --bind-to core --rank-by core --map-by numa
same from here on

The goal with all of them is to spread the threads out as evenly as possible across the shared CPU resources. I recommend htop for a quick visual confirmation; otherwise, check the output of --report-bindings.
Also use lscpu and lstopo to find out about NUMA topology and shared resources like L3 cache. Which cores reside on a shared IMC needs to be figured out the hard way as far as I know. Reading docs and such...
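
The --cpu-list values used for 4 and 8 ranks above are simply core IDs spaced evenly over the 64 cores. As a rough sketch of that idea, the hypothetical helpers below (`spread_cpu_list` and `mpirun_prefix` are made-up names, not part of Open MPI) reproduce those lists. This assumes that evenly spaced core IDs land on different sockets and CCXs, which depends on how the machine numbers its cores, so check with lstopo before relying on it.

```python
def spread_cpu_list(n_ranks, n_cores):
    """Pick n_ranks core IDs spaced evenly across cores 0..n_cores-1."""
    if n_ranks < 1 or n_cores % n_ranks != 0:
        raise ValueError("n_ranks must be a positive divisor of n_cores")
    stride = n_cores // n_ranks
    return list(range(0, n_cores, stride))

def mpirun_prefix(n_ranks, n_cores=64):
    """Build an mpirun prefix with the binding options quoted in the post."""
    cpus = ",".join(str(c) for c in spread_cpu_list(n_ranks, n_cores))
    # --report-bindings makes Open MPI print where each rank was bound
    return (f"mpirun -np {n_ranks} --bind-to core --rank-by core "
            f"--cpu-list {cpus} --report-bindings")

print(mpirun_prefix(4))
print(mpirun_prefix(8))
```

On a machine with a different core numbering, the stride rule will not necessarily spread the ranks over the memory controllers, so the printed bindings still need to be checked by hand.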

masb · June 1, 2022, 08:05

Quote:

Originally Posted by wkernkamp

This is still the same Supermicro Opteron server with the H8QG6-F motherboard, 32x8GB DDR3-1600 single rank, for 4x Opteron 6376.

I have found since the last post that the ondemand governor yields better results, because the Opterons boost higher when some cores are idle. Furthermore, I made some changes to the default Open MPI process placement for np=2, 12, 24 and 48. The default tended to place processes together on adjacent integer cores. These cores share a single FPU as well as cache, which is good for cache sharing but not for OpenFOAM. (The difference is ~45% for the 2-core case.)

The baseline result before Overclock is:
1 2161.03
2 1045.07
4 506.82
8 249.7
12 193.92
16 145.46
24 110.93
32 93.86
48 87.21
64 85.53

After overclock using a motherboard base clock of 240 MHz instead of 200 MHz, the results are:
1 2112.27
2 1026.49
4 492.64
8 241.08
12 183.19
16 134.26
24 100.11
32 84.72
48 82.74
64 79.54

This overclock was accomplished with the OCNG5.3 BIOS. It is easy to do. Follow instructions here: https://hardforum.com/threads/ocng5-...forms.1836265/

The temperatures did not go high, so the board can still be clocked higher. The ram can also be overclocked. I will try 1866 MHz. In the past the execution time was about inversely proportional to RAM speed.

Hi Wkernkamp, thanks for the info. Would a machine with 4x AMD 16C 6282 also perform as well as your system?

wkernkamp · June 2, 2022, 15:24

Quote:

Originally Posted by masb

Hi Wkernkamp, thanks for the info. Would a machine with 4x AMD 16C 6282 also perform as well as your system?

Probably similar. The 6300 processors are an improvement over the 6200. With the CPUs so cheap, I think you could try with yours and upgrade the CPUs if necessary.

Note that messing with the BIOS is risky. You might cause your machine to no longer boot! Performance without overclock is pretty decent due to the 16 available memory channels.
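
For reference, the effect of the overclock in the quoted post can be worked out from its two run-time tables. A minimal Python sketch; the numbers are copied from that post, and the names `baseline`, `overclocked` and `time_saved_percent` are only illustrative.

```python
# Run times (s) from the quoted post: number of cores -> seconds
baseline = {1: 2161.03, 2: 1045.07, 4: 506.82, 8: 249.7, 12: 193.92,
            16: 145.46, 24: 110.93, 32: 93.86, 48: 87.21, 64: 85.53}
overclocked = {1: 2112.27, 2: 1026.49, 4: 492.64, 8: 241.08, 12: 183.19,
               16: 134.26, 24: 100.11, 32: 84.72, 48: 82.74, 64: 79.54}

def time_saved_percent(before, after):
    """Reduction in run time, as a percentage of the original run time."""
    return 100.0 * (before - after) / before

gains = {n: time_saved_percent(baseline[n], overclocked[n]) for n in baseline}
for n, gain in gains.items():
    print(f"{n:2d} cores: {gain:4.1f}% less run time")
```

The run-time reduction comes out between roughly 2% and 10% depending on the core count, well below the 20% increase in base clock (200 MHz to 240 MHz).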

masb · June 3, 2022, 08:09

AMD Ryzen 4800H under WSL Ubuntu 20.04:

# cores Wall time (s):
------------------------
Meshing Times:
1 1003.94
2 707.64
4 500.12
6 396.02
8 364.08

Flow Calculation:
1 753.92
2 486.19
4 351.89
6 329.93
8 323.98

masb · June 3, 2022, 08:11

AMD Threadripper 1950X under WSL Ubuntu 20.04

# cores Wall time (s):
------------------------
Meshing Times:
1 1056.81
2 701.65
4 496.73
6 393.98
8 381.59
10 360.49
12 339.13
14 323.9
16 343.45

Flow Calculation:
1 822.07
2 498.66
4 350.45
6 326.8
8 324.14
10 319.38
12 314.45
14 315.73
16 324.57

Erdi · June 4, 2022, 07:07

OpenFOAM benchmark run on a laptop (Dell XPS 15) with i7-11800H and 2x8GB (3200 MHz) on WSL2 with Ubuntu 20.04 and OpenFOAM v9

Out of curiosity, I wanted to try the benchmark on my laptop. First I tried the default configuration, but the single-core run took a long time, so I changed the run.sh file and ran with 8 cores directly. It started to thermal throttle a lot (what a surprise), so I then tried 6 cores and got:

Code:

real    7m38.091s
user    45m16.834s
sys     0m11.617s
Run for 6...
# cores   Wall time (s):
------------------------
6 367.37
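
As a quick sanity check on the `time` output above, the ratio of user to real time gives the average number of busy cores. A small Python sketch, with the two stamps copied from the output; `to_seconds` is just an illustrative helper.

```python
import re

def to_seconds(stamp):
    """Convert a bash `time` stamp such as '7m38.091s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return 60 * int(minutes) + float(seconds)

real = to_seconds("7m38.091s")
user = to_seconds("45m16.834s")
print(f"real {real:.1f} s, user {user:.1f} s, user/real = {user / real:.2f}")
```

The ratio is about 5.9, which is what one would expect if six processes were kept busy for nearly the whole run.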

wkernkamp · June 4, 2022, 23:48

Quote:

Originally Posted by Erdi

OpenFOAM benchmark run on a laptop (Dell XPS 15) with i7-11800H and 2x8GB (3200 MHz) on WSL2 with Ubuntu 20.04 and OpenFOAM v9

6 367.37

Your performance is equal to my Dell R710 with dual E5649. That makes sense, because that server has six memory channels running at 1066 MT/s, which is comparable to two at 3200 MT/s.
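
The arithmetic behind that comparison, as a minimal Python sketch: theoretical peak bandwidth is channels times transfer rate times bytes per transfer. The 8 bytes per transfer assumes standard 64-bit DDR3/DDR4 channels, and `peak_bandwidth_gb_s` is an illustrative name. These are theoretical peaks only; sustained bandwidth on real systems can differ considerably.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (64-bit channels assumed)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0

r710 = peak_bandwidth_gb_s(6, 1066)    # dual E5649: six channels at 1066 MT/s
laptop = peak_bandwidth_gb_s(2, 3200)  # i7-11800H: two channels at 3200 MT/s
print(f"R710: {r710:.1f} GB/s, laptop: {laptop:.1f} GB/s")
```

Both come out at about 51.2 GB/s on paper.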

Quote:

Originally Posted by wkernkamp

Dell PowerEdge R710
12x4GB RDIMM 1067 MHz

2x E5649 2.53 GHz, 6 cores per CPU:
Flow Calculation:
1 1486.54
2 880.04
4 422.03
6 342.61
8 317.83
10 333.38
12 307.18

2x X5675 3.07 GHz, 6 cores per CPU:
Flow Calculation:
1 1322.84
2 787.4
4 375.77
6 305.44
8 286.3
12 278.02

Your CPU must be thermal throttling; otherwise you would get 305.44 s like the 2x X5675, or better.

wkernkamp · June 4, 2022, 23:50

I don't know how the benchmark performs on WSL2. I have only run Linux, so that might be another issue, Erdi.

masb · June 6, 2022, 04:31

Hi!

I was wondering whether running two benchmarks simultaneously would be better than running one after another. Surprisingly, these are the results of the two simultaneous runs:

# cores Wall time (s):
------------------------
1 2 4 6 8 10 12 14 16
Meshing Times:
1 1151.36
2 857.94
4 623.2
6 563.06
8 537.3
10 526.86
12 518.92
14 523.49
16 569
Flow Calculation:
1 1034.82
2 763.45
4 550.1
6 523.57
8 542.37
10 600.15
12 625.04
14 668.28
16 710.25

# cores Wall time (s):
------------------------
1 2 4 6 8 10 12 14 16
Meshing Times:
1 1126.39
2 861.46
4 622.28
6 558.93
8 539.72
10 527.49
12 518.96
14 521.65
16 564.03
Flow Calculation:
1 1032.88
2 762.35
4 548.58
6 526.27
8 559.72
10 606.09
12 633.89
14 682.39
16 683.36

Two runs one after another took, in the best case (12 cores each), approximately 630 seconds in total (see my previous posts).

Two runs simultaneously took, in the best case (6 cores each), approximately 526 seconds.

Conclusion: running the two cases simultaneously was about 20% faster.

Any comments?

Simbelmynë · June 6, 2022, 04:53

@masb, I do not understand your post. You have two recent posts, one of which is a 1950X with 16 cores that finishes the benchmark in about 314 seconds. I do not see how this is slower than your latest test.

It should also be noted that 314 seconds is about twice as long as my 1950X needs to finish the benchmark (specs available on the first page of this thread). WSL is not ideal, but if you do not access the file system through frequent saves, then it should be fast enough. My guess is slow memory and/or timings.

Simbelmynë · June 6, 2022, 04:56

Quote:

Originally Posted by wkernkamp

Your performance is equal to my Dell R710 with dual E5649. That makes sense, because that server has six memory channels running at 1066 MT/s, which is comparable to two at 3200 MT/s.

Your CPU must be thermal throttling; otherwise you would get 305.44 s like the 2x X5675, or better.

You cannot make comparisons like that. There is a huge difference between some systems with identical theoretical bandwidth.

masb · June 6, 2022, 06:42

First, I posted the benchmarks for the 1950X and the Ryzen 4800H just for information. In the latest post, I ran two benchmarks simultaneously under WSL on the 1950X. As I have to run lots of cases, I was just trying to compare the performance of both approaches, sequential and simultaneous. The two cases running simultaneously on 6 cores each finished faster than the same two cases running sequentially on 12 cores:

sequentially:

run1: 314.45 seconds
run2: 314.45 seconds

total: run1 + run2 = 629 seconds

simultaneously:

run1 || run2: 526.27 seconds

Is it clear now?
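
The comparison can be spelled out in a few lines of Python. This is only a sketch using the flow-calculation times posted above (meshing is left out); the variable names are illustrative.

```python
sequential_each = 314.45         # s, one case on 12 cores
simultaneous = (523.57, 526.27)  # s, two cases side by side on 6 cores each

t_seq = 2 * sequential_each      # two cases run back to back
t_sim = max(simultaneous)        # both are done when the slower one finishes

time_saved = (t_seq - t_sim) / t_seq
throughput_gain = t_seq / t_sim - 1
print(f"sequential {t_seq:.1f} s, simultaneous {t_sim:.1f} s")
print(f"time saved {time_saved:.1%}, throughput gain {throughput_gain:.1%}")
```

That is about 16% less wall time for the pair of cases, or equivalently roughly 20% more cases per unit time, which is where the "20% faster" figure comes from.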
