CFD Online Discussion Forums
Hardware > OpenFOAM benchmarks on various hardware
(https://www.cfd-online.com/Forums/hardware/198378-openfoam-benchmarks-various-hardware.html)

eric February 5, 2018 06:10

OpenFOAM benchmarks on various hardware
 
** Update 2: I have created a page on the OpenFOAM wiki: https://openfoamwiki.net/index.php/Benchmarks . The updated plot will now be found there as I will eventually not be able to edit this post. But please continue to contribute further benchmarks in this thread! **

** Update: I have now added a plot with minimum time to solution for all hardware posted in this thread! I will try to keep this updated as more results are posted. Thank you for all the contributions! :) **

Hi,

I promised in another thread here to run some OpenFOAM benchmarks on the different Intel hardware I have available, so here they are. They are based on the motorBike tutorial, modified to use more grid cells, fewer iterations and scotch decomposition. You can find the full setup in the attached tar.gz file. To test on your own hardware, just run the run.sh script; the only thing you need to change is the number of cores in the three for loops if you want to run on a different core count. It would be great if more people contributed, so we can build up a modest database of benchmarks here.
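For those who want to adapt the core counts without digging through the archive first, the script's structure is roughly the following (a condensed sketch, not the verbatim run.sh; the three separate loops are collapsed into one here and the meshing steps are omitted):
Code:

#!/bin/bash
# Sketch of the run.sh structure: for each core count, copy the
# base case, decompose it with scotch and run the solver.
for ncores in 1 2 4 6 8 12 16 20; do
    cp -r basecase "run_$ncores"
    cd "run_$ncores"
    # scotch decomposition only needs the subdomain count updated
    sed -i "s/^numberOfSubdomains.*/numberOfSubdomains $ncores;/" \
        system/decomposeParDict
    if [ "$ncores" -eq 1 ]; then
        simpleFoam > log.simpleFoam 2>&1
    else
        decomposePar > log.decomposePar 2>&1
        mpirun -np "$ncores" simpleFoam -parallel > log.simpleFoam 2>&1
    fi
    cd ..
done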

The table below shows runtime in seconds. There is also a graph showing the speedup.

Some observations, most of them fairly obvious :) :
  • There is a clear correlation between speedup and the number of available memory controllers. The old octo-core E7 machine is very slow on a single core but shows great speedup. The other machines show only modest gains past ~2x the number of memory controllers.
  • A fast CPU helps for single-core simulations. The two processors with a 3.7 GHz turbo frequency are the fastest here.
  • If you are buying new hardware: the Gold 6148 does not scale at all past ~16 cores, so the 6130 or 6142 seem like better choices. Of course, this assumes you are limited to Intel; otherwise AMD Epyc looks like the better choice, based on other threads in this forum.
Code:

#   Gold 6148  8x E7-8870  2x E5-2695 v2  2x E5-2643 v3  2x E5-2695 v4
 1      874       2132         1451            883           1084
 2      435       1124          597            468            578
 4      225        476          281            215            273
 6      164        297          205            153            189
 8      136        203          178            126            146
12      111        148          150            101            104
16      101        104          140                            85
20       98         92          137                            76
24                   77         137                            71
36                   64                                        65

https://www.cfd-online.com/Forums/at...1&d=1517828325

https://www.cfd-online.com/Forums/at...1&d=1518557547

---MODERATOR NOTE---
The original bench template requires some tweaks to work with more recent versions of OpenFOAM. For the openfoam.org versions, try using bench_template_v02 instead (courtesy of Simbelmynė).
For the openfoam.com versions (e.g. v2112) this script should work out of the box: https://www.cfd-online.com/Forums/ha...tml#post828825
Attachment 88397

Newer performance charts with many more entries, provided by naffrancois:
Maximum performance: https://ibb.co/MsQh94V
Single-core performance: https://ibb.co/GVnbYP5
MS Excel file with the numbers: Attachment 92318

flotus1 February 7, 2018 02:29

Edit: now with the modified controlDict to get proper results.
mpirun thoroughly disliked my attempts to pin it to certain cores, resulting in abysmal performance in most cases, so these results are with plain mpirun -np N (a sketch of the usual binding options follows after the table).

2x AMD Epyc 7301, 16x16GB 2Rx4 DDR4-2133 reg ECC, of_v1712, openSUSE Tumbleweed, kernel 4.14.14-1
Code:

# cores  Wall time (s)  Speedup
--------------------------------
01         1016.6          1.0
02          480.5          2.1
04          231.9          4.4
08          125.4          8.1
12           79.9         12.7
16           66.4         15.3
20           60.5         16.8
24           52.0         19.6
28           49.1         20.7
32           42.6         23.9
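For anyone who wants to experiment with pinning themselves, these are the usual Open MPI knobs (a sketch only; whether they help depends on the Open MPI version and how well it detects the Epyc NUMA topology):
Code:

# Spread the ranks across NUMA nodes and pin each one to a core, so
# all memory controllers stay busy; --report-bindings prints the
# resulting placement so you can verify it.
mpirun -np 16 --map-by numa --bind-to core --report-bindings \
    simpleFoam -parallel > log.simpleFoam 2>&1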


eric February 7, 2018 05:25

Thanks for running this, flotus. The error happens at the very end of the simulation, so it shouldn't affect the timings much; if you still want to fix it, see below. Impressive performance: it scales much better than the Intel machines. It would be nice to also have a dual-socket Gold machine to compare against.

The error happens when trying to calculate streamlines at the end of the simulation. It looks like a version difference: I see you are using v1712 while I use 5.x. The easiest fix is to disable the streamline calculation. Just open the file basecase/system/controlDict and remove the lines
Code:

#include streamlines
#include wallBoundedStreamlines

You should also delete all the run_* folders before rerunning the run.sh script.
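In practice this amounts to something like the following, run from the benchmark's top-level directory:
Code:

# Drop the two streamline function-object includes from the base case
sed -i '/streamlines/d; /wallBoundedStreamlines/d' basecase/system/controlDict
# Clear the results of any previous attempt before rerunning run.sh
rm -rf run_*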

JBeilke February 7, 2018 05:34

Quote:

Originally Posted by flotus1 (Post 680714)
Bummer...
That benchmark script doesn't seem to run properly on my machine (dual AMD Epyc 7301)
I already get a "core dumped" message during the first serial run of simpleFoam, and then it halts while executing on 16 cores. I aborted that run with Ctrl+C; the rest of the cases then finished, but somehow not all of them gave valid timing results.

I attached the log files and shell output here, maybe you can tell me what went wrong.
Attachment 61229

Extrapolating from the 99th iteration in the log file, you get:

Code:

# cores  Wall time (s)
------------------------
 1         1041.62
 2          595
 4          257
 8          130
12           85
16           62
24           55
36           44

This means superlinear speedup at 16 cores (1041.62 s / 62 s ≈ 16.8x, i.e. 105% parallel efficiency) and 74% at 32 cores. Not bad.

Only the single-core performance is a bit low :-(
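For reference, this kind of extrapolation can be pulled straight from a partial solver log (assuming the standard simpleFoam log format and the benchmark's 100 iterations):
Code:

# Scale the last reported ExecutionTime up to the full 100 iterations.
awk '/^Time =/        {iter = $3}
     /^ExecutionTime/ {t = $3}
     END {printf "extrapolated wall time: %.1f s\n", t / iter * 100}' log.simpleFoam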

JBeilke February 7, 2018 05:47

Code:


#   i7-2600  i7-3960X  E5-1650 v3
1      1085       794         824
2       727       433         440
4                  253         258
6                  212         214

This is a bit strange, since the E5 is normally about 10% faster than the 3960X. The first two configurations (2600, 3960X) were run in a virtual machine (Linux on top of Linux).

It is very interesting to see that the 3960X is the fastest processor here for 1- and 2-core calculations.

flotus1 February 7, 2018 11:20

Thanks for your input. I will run the case again tonight and edit in the results, maybe with some better core-binding options; it seems mpirun or Linux is not fully aware of which cores form a NUMA node.
Until then, here are results from the machine I used to test your suggestion: a single Xeon W3670 (6 cores) with triple-channel DDR3-1333, of_v1712, openSUSE Leap 42.2, kernel 4.4.104-39:
Code:

# cores  Wall time (s)
------------------------
1          1262.5
2           849.8
4           649.6
6           622.7

I think adding some system and software version information would be a good idea when submitting and comparing benchmark results.
The single-core result for Epyc was to be expected; it only reaches a single-core turbo of 2.7 GHz. As I stated in my initial review, AMD missed an opportunity with medium-core-count CPUs at higher clock speeds. A 16-core variant at >=3.5 GHz, or at least with a higher single-core turbo, would have been no problem from a TDP perspective. Forcing you to buy the most expensive SKU with lots of useless cores just to get at least 3.2 GHz on a single core is what Intel would do :rolleyes:

Edit: the AMD Epyc results are now edited into the second post.
Since there were no results for Xeon E5 "v1" yet: dual Xeon E5-2687W, 16x8GB DDR3-1600, of_v1712, openSUSE Leap 42.3, kernel 4.4.103-36
Code:

# cores  Wall time (s)
------------------------
01          898.8
02          502.1
04          235.1
06          169.7
08          141.6
10          128.4
12          119.3
14          116.3
16          112.6


Simbelmynė February 9, 2018 16:04

@eric

While the speedup from added cores is interesting, I think speedup versus other hardware is also of great interest. Since that information is present in this thread, perhaps you could compile and maintain a plot in the first post (if the thread continues to grow, that is)? I guess the metric would be the lowest possible solution time on given hardware, possibly normalized against some system of choice.

I'll join in with a 1950X, 7940X and 8700K soon, so you get some comparison for lower-budget systems ;)

flotus1 February 9, 2018 18:16

The problem with that is that you can no longer edit posts after a few weeks, so maintaining a thread like this becomes impossible. This restriction has kept me from starting one or two related threads in the past.

Simbelmynė February 10, 2018 02:22

That's strange. A thread like this definitely has the potential to be made "sticky".

Oh well, browsing to the last post only requires one extra mouse click :rolleyes:


7940X, 32 (4x8) GB 3200 MHz RAM, CentOS 7.x, kernel 3.10.0
Code:

# cores  Wall time (s)
------------------------
 1         764.36
 2         419.98
 4         233.26
 6         188.29
 8         169
12         160.28
14         168.73

Threadripper 1950X, 32 (4x8) GB 3200 MHz RAM, CentOS 7.x, kernel 4.14.5 (SMT on)
Code:

# cores  Wall time (s)
------------------------
 1         827.21
 2         465.01
 4         235.17
 6         198.81
 8         170.73
12         154.26
16         154.9

8700K, 32 (4x8) GB 3200 MHz RAM, Mint 18.3, kernel 4.13.0
Code:

# cores  Wall time (s)
------------------------
1          531.44
2          312.15
4          249.55
6          247.83

It is also interesting to analyze the meshing time.

For the 8700K system we have:
Code:

# cores  real time:
------------------------
1            16m35s
2            10m56s
4            07m01s
6            05m30s

While the 1950X performs as:
Code:

# cores  real time:
------------------------
1            23m32s
2            16m01s
4            08m44s
6            06m50s
8            05m48s
12          04m38s
16          04m12s

It seems that the meshing part is not as memory-bound as the CFD solver.
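For anyone reproducing the meshing numbers: they appear to be the "real" line of the shell's time builtin around the mesh stage. A minimal sketch, assuming the background mesh has already been built and the case decomposed as in the benchmark:
Code:

# Time only the parallel snappyHexMesh stage; the "real" line
# corresponds to the times listed above.
time mpirun -np 8 snappyHexMesh -overwrite -parallel > log.snappyHexMesh 2>&1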

The_Sle February 10, 2018 05:22

7820X @ 4.6 GHz, 4x8 GB 3400 MHz RAM, Ubuntu 17.10, kernel 4.13.0-32

Code:

# cores  Wall time (s)  Speedup
----------------------------------
1          756.42          1.0
2          376.09          2.0
4          205.46          3.7
6          168.24          4.5
8          160.05          4.7

Could it be that, past 6 cores, the quad-channel memory is becoming a bottleneck?

Code:

# cores  Mesh time    (s)
--------------------------
1         19m37s     1177
2         13m03s      783
4          7m23s      443
6          5m30s      330
8          5m08s      308


wyldckat February 10, 2018 05:26

Greetings to all!

Quote:

Originally Posted by flotus1 (Post 681052)
The problem with that is that you can no longer edit posts after a few weeks, so maintaining a thread like this becomes impossible. This restriction has kept me from starting one or two related threads in the past.

Edit: I forgot to remind people that forum members can only edit their posts for 30 days; after that, posts can only be edited by moderators.

Quote:

Originally Posted by Simbelmynė (Post 681064)
That's strange. A thread like this definitely has the potential to be made "sticky".

There are a few choices to solve this:
  1. The thread can be stickied if people ask a moderator for it (use the report button on the first post if you're feeling lazy about sending a PM to a moderator ;)).
  2. Blog posts can be edited forever by the original author, although there is a limit of 5000 characters, if I remember correctly; you can then post a link to it in the first post (I or any moderator can do that for you if you want).
  3. There is also the CFD-Online wiki: https://www.cfd-online.com/Wiki/Main_Page - this could be added as its own FAQ page.
  4. And in this specific case, since OpenFOAM is being used for benchmarking, it can be documented at openfoamwiki.net: https://openfoamwiki.net/index.php/Benchmarks
And many thanks for kicking off this thread with very valuable information!
Let me know if you want this thread stickied and/or want me to start a wiki page for this!

Best regards,
Bruno

Simbelmynė February 10, 2018 05:28

Quote:

Originally Posted by The_Sle (Post 681073)
7820X @ 4.6 GHz, 4x8 GB 3400 MHz RAM, Ubuntu 17.10, kernel 4.13.0-32

Code:

# cores  Wall time (s)  Speedup
----------------------------------
1          756.42          1.0
2          376.09          2.0
4          205.46          3.7
6          168.24          4.5
8          160.05          4.7

Could it be that, past 6 cores, the quad-channel memory is becoming a bottleneck?


That is really interesting. It seems the 7940X is a terrible price/performance option compared to the 7820X (that was perhaps known, but not that the 7820X is actually as fast as the 7940X regardless of the number of cores used). Is your system overclocked on all cores?

Perhaps you have some other processes running that interfere with the simulation to some extent?

Finally, I do not understand why your system is so slow on 1 core compared to the 8700K, which runs at 4.7 GHz on one core (with slower memory). They should be quite similar.

The_Sle February 10, 2018 08:20

Quote:

Originally Posted by Simbelmynė (Post 681076)
That is really interesting. It seems the 7940X is a terrible price/performance option compared to the 7820X (that was perhaps known, but not that the 7820X is actually as fast as the 7940X regardless of the number of cores used). Is your system overclocked on all cores?

Perhaps you have some other processes running that interfere with the simulation to some extent?

Finally, I do not understand why your system is so slow on 1 core compared to the 8700K, which runs at 4.7 GHz on one core (with slower memory). They should be quite similar.

I reran the tests 3 times; the best 1-core result was 736 seconds. The parallel results didn't show as much variance, only a few seconds either way.

Yes, it's running at 4.6 GHz on all cores. I checked it with turbostat during the runs, and thermals are OK as well. I suppose the newer-generation 8700K is just that much faster in single-threaded workloads. That 8700K is really impressive actually, and the difference between X299 and TR is surprisingly small! :)

eric February 13, 2018 13:10

Thank you for all the contributions! I have made a new plot summarizing all the results and asked Bruno to sticky the thread so that I can keep updating it.

It is interesting to see the performance of the "enthusiast" i7 and Threadripper processors; they look like good choices for workstations for testing/development and pre-/post-processing.

flotus1 February 13, 2018 13:28

Now I feel kind of sorry for adding the Xeon W3670 and messing up the scaling in the diagram :rolleyes:
But seriously, I think the inverse metric (iterations per second) would be better to compare in a diagram; otherwise the huge performance differences at the top end become indistinguishable.
On a side note: it would be helpful if new contributions gave more information about the actual setup. Software versions, memory configuration... but most importantly: clock speeds for overclockable processors.
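Converting the posted wall times into that metric is a one-liner, assuming the benchmark's 100 solver iterations (the sample values below are from the Epyc 7301 results earlier in the thread):
Code:

# Convert "cores  wall time (s)" pairs into iterations per second.
awk '{printf "%3d cores: %.2f iter/s\n", $1, 100 / $2}' <<'EOF'
 1  1016.6
16    66.4
32    42.6
EOF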

eric February 13, 2018 16:36

Quote:

Originally Posted by flotus1 (Post 681367)
Now I feel kind of sorry for adding the Xeon W3670 and messing up the scaling in the diagram :rolleyes:
But seriously, I think the inverse metric (iterations per second) would be better to compare in a diagram; otherwise the huge performance differences at the top end become indistinguishable.
On a side note: it would be helpful if new contributions gave more information about the actual setup. Software versions, memory configuration... but most importantly: clock speeds for overclockable processors.

I agree; I have updated the plot now :)

havref February 14, 2018 02:34

Thank you for starting this thread. I got my hands on a couple of Epyc 7601 processors this week, so I figured I'd run the same tests on them for comparison. I will post results for a dual Epyc 7351 when our server arrives in a couple of weeks, and for a 2x dual Epyc 7351 setup once I've had time to connect them with InfiniBand.

2x Epyc 7601, 16x 8GB DDR4 2666MHz, 1TB SSD, running OpenFOAM 5.0 on Ubuntu 16.04.
Code:

# cores  Wall time (s)  Speedup
--------------------------------
 1         971.64          1
 2         577.18          1.7
 4         234.01          4.2
 6         169.8           5.7
 8         132.41          7.3
12          81.52         11.9
16          59.65         16.3
20          62.56         15.5
24          54.39         17.9
28          45.92         21.2
32          43.42         22.4
36          42.83         22.7
48          40.5          24.0
64          35            27.8

I removed streamlines and wallBoundedStreamlines from the controlDict. The rest of the case is identical to yours. Let me know if you want me to fill in the gaps between 36 and 64 cores.

eric February 14, 2018 14:21

Nice, havref. Looking forward to seeing the 7351 results as well.

It's worth noting that at 64 cores there are only ~30,000 cells per core, so communication may start to become a bottleneck.

chad February 16, 2018 15:26

2x Intel Gold 5118, 12x 8GB DDR4 2400 MHz, M2 SSD, OpenFOAM 4.1, Ubuntu 17.10 Kernel 4.13.0-32

Code:

# cores  Wall time (s)
------------------------
 1         1083.38
 2          558.41
 4          254.74
 8          131.22
16           80.48
20           73.1
24           79.35

While I'm still a novice when it comes to CFD, these results surprised me as being a bit slow. If anyone thinks I may have missed something, let me know and I'll gladly re-run them.

flotus1 February 18, 2018 04:26

You could try running a newer version of OpenFOAM. And since it is mostly the parallel performance beyond 16 cores that seems a bit low, you could check whether the RAM is configured properly. Some of the Skylake-SP dual-socket motherboards have more than 12 DIMM slots, and populating the memory correctly is crucial here.
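A quick way to verify the DIMM population and NUMA layout on a running system (standard Linux tools; dmidecode needs root):
Code:

# List which DIMM slots are populated, with size and speed; on a
# dual Gold 5118 board, all 12 memory channels should hold a module.
sudo dmidecode -t memory | grep -E 'Locator|Size|Speed'
# Cross-check the NUMA topology the kernel sees
numactl --hardware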

