OpenFOAM benchmarks on various hardware

Old   June 6, 2022, 22:59
Default
  #521
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 339
Rep Power: 12
wkernkamp is on a distinguished road
Quote:
Originally Posted by Simbelmynë View Post
You cannot make comparisons like that. There is a huge difference between some systems with identical theoretical bandwidth.

Yes, but only when there is something wrong in the setup so that the possible bandwidth is not achieved. Otherwise the bandwidth is a key factor that translates directly into OpenFOAM performance.


What I was saying is that his performance is in the right ballpark, except that, considering the more modern CPU and higher clock, I would expect a bit better. Maybe it is thermal throttling, maybe WSL2. Maybe his CPU was having a slow day. I don't know.

Old   June 7, 2022, 01:23
Default AMD Threadripper 1950X Ubuntu 20.04, no WSL
  #522
Member
 
Marco Bernardes
Join Date: May 2009
Posts: 57
Rep Power: 17
masb is on a distinguished road
Quote:
Originally Posted by masb View Post
AMD Threadripper 1950X under WSL Ubuntu 20.04

# cores Wall time (s):
------------------------
Meshing Times:
1 1056.81
2 701.65
4 496.73
6 393.98
8 381.59
10 360.49
12 339.13
14 323.9
16 343.45

Flow Calculation:
1 822.07
2 498.66
4 350.45
6 326.8
8 324.14
10 319.38
12 314.45
14 315.73
16 324.57
Ubuntu 20.04, no WSL:

# cores Wall time (s):
------------------------
Meshing Times:
1 1026.86
2 697.82
4 397
6 294.65
8 251.36
10 231.26
12 210.35
14 201.72
16 207.07
Flow Calculation:
1 852.77
2 510.34
4 220.9
6 181.68
8 160.85
10 153.79
12 144.88
14 145.64
16 143.53
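
To make the two runs easier to compare, the flow-calculation wall times above can be turned into speedups and a WSL-to-native ratio. This is only a minimal sketch; the function and variable names are illustrative and not part of the benchmark scripts.

```python
# Flow-calculation wall times (s) copied from the two tables above.
wsl = {1: 822.07, 2: 498.66, 4: 350.45, 6: 326.8, 8: 324.14,
       10: 319.38, 12: 314.45, 14: 315.73, 16: 324.57}
native = {1: 852.77, 2: 510.34, 4: 220.9, 6: 181.68, 8: 160.85,
          10: 153.79, 12: 144.88, 14: 145.64, 16: 143.53}


def speedup(times):
    """Speedup relative to the single-core run of the same series."""
    base = times[1]
    return {n: base / t for n, t in times.items()}


def slowdown(a, b):
    """Ratio of wall times a/b for each core count present in both series."""
    return {n: a[n] / b[n] for n in a if n in b}


if __name__ == "__main__":
    s_wsl, s_nat = speedup(wsl), speedup(native)
    ratio = slowdown(wsl, native)
    print("cores  speedup WSL  speedup native  WSL/native")
    for n in sorted(ratio):
        print(f"{n:5d}  {s_wsl[n]:11.2f}  {s_nat[n]:14.2f}  {ratio[n]:10.2f}")
```

At 1 and 2 cores the two runs are within a few percent of each other; at 16 cores the WSL run takes a bit more than twice as long as bare metal.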

wkernkamp and Crowdion like this.

Old   June 7, 2022, 03:04
Default
  #523
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,400
Rep Power: 47
flotus1 has a spectacular aura about
That's A LOT of performance left on the table with WSL. I wonder if it can be tweaked in any way to yield better results, or if that's just the price of convenience.
masb likes this.

Old   June 7, 2022, 20:08
Default
  #524
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 339
Rep Power: 12
wkernkamp is on a distinguished road
Seems that WSL is OK for 1 or 2 cores, but loses performance as you go beyond that. Is there some limit on the amount of resources that gets allocated to WSL? (It looks like 50% in your case, masb.)
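
One way to start answering that question is to check what the Linux environment actually sees and compare it with what Windows reports for the same machine. The sketch below is an assumption about how one might do that, not something taken from the benchmark; it only uses `os.cpu_count()` and `/proc/meminfo`, and the helper names are made up for illustration.

```python
import os


def parse_meminfo_kib(text, key="MemTotal"):
    """Return the value in KiB for `key` from /proc/meminfo-style text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == key:
            return int(rest.split()[0])
    raise KeyError(key)


def visible_resources():
    """Logical CPUs and total memory (GiB) seen by this Linux environment."""
    cpus = os.cpu_count()
    with open("/proc/meminfo") as f:
        mem_gib = parse_meminfo_kib(f.read()) / 1024**2
    return cpus, mem_gib


if __name__ == "__main__":
    cpus, mem_gib = visible_resources()
    print(f"logical CPUs: {cpus}, total memory: {mem_gib:.1f} GiB")
```

If the numbers are lower inside WSL than on the host, the limits can reportedly be adjusted through the `memory` and `processors` settings in `.wslconfig`; check Microsoft's documentation for the current defaults.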
masb likes this.

Old   June 9, 2022, 17:52
Default
  #525
Senior Member
 
Simbelmynë's Avatar
 
Join Date: May 2012
Posts: 548
Rep Power: 15
Simbelmynë is on a distinguished road
Quote:
Originally Posted by wkernkamp View Post
Yes, but only when there is something wrong in the setup so that the possible bandwidth is not achieved. Otherwise the bandwidth is a key factor that translates directly into OpenFOAM performance.

And I am saying this is not true.


As a general indicator, bandwidth is by far the most important metric for CFD.


However, recent CPUs from AMD (and possibly Intel) have shown that bandwidth is not the entire story.


Check out results from 5800X3D for instance. It is really good in terms of performance per bandwidth.


It started to be visible with Zen 2, most likely since Intel just produced minor upgrades to new desktop CPUs for several years.

Last edited by Simbelmynë; June 10, 2022 at 01:39.

Old   June 12, 2022, 20:03
Default
  #526
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 339
Rep Power: 12
wkernkamp is on a distinguished road
Quote:
Originally Posted by Simbelmynë View Post
5800X3D, 2 x 8 GB DDR4 Rank1 @ 3200 MT/s (14-14-14-14-28,1T)
OFv9, OpenSUSE Tumbleweed, GCC 11.2, kernel 5.17.4

Code:
 cores       Simulation     Meshing
#                (s)      (min.sec)
1             314.21        12m23s
2             201.98        8m21s
4             149.98        5m05s
6             138.55        4m02s
Will update if I manage to push the memory and IF to 1800 MHz.

EDIT:
2 x 8 GB DDR4 Rank1 @3800 MT/s (16-16-16-16-32, 1T)

Code:
cores    Simulation         Meshing
#           (s)             (min.sec)
1            304              12m14
2            188              8m12
4            135              4m58
6            124              3m55
8            122              3m28

The 5800 itself gets almost proportionally better with bandwidth.

Quote:
Originally Posted by wkernkamp View Post
2xE5-2697 v2 16x 8GB DDR-1866 MHz OF v2112

Flow:
20 85.73
22 84.44
24 84.02

The memory bandwidth of the 2xE5-2697v2 is just under twice the bandwidth of your 5800X3D. The performance ratio is 122/84 ≈ 1.45, so there has been an improvement, probably related to cache organization and cache capacity. The more the cache can be utilized, the higher your effective bandwidth. So the improvement you are talking about is roughly 40% in ten years.
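
The "just under twice" figure can be checked with a back-of-the-envelope calculation of theoretical peak bandwidth. This sketch assumes 8 bytes per transfer per channel (a standard 64-bit DDR channel), 8 channels in total for the dual-socket Xeon and 2 channels at 3800 MT/s for the 5800X3D; the function name is illustrative.

```python
BYTES_PER_TRANSFER = 8  # assumption: one 64-bit DDR channel moves 8 bytes per transfer


def peak_bandwidth_gbs(channels, mt_per_s):
    """Theoretical peak memory bandwidth in GB/s (10^9 bytes per second)."""
    return channels * mt_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9


if __name__ == "__main__":
    xeon = peak_bandwidth_gbs(8, 1866)   # 2 sockets x 4 channels at 1866 MT/s
    ryzen = peak_bandwidth_gbs(2, 3800)  # dual channel at 3800 MT/s
    print(f"2x E5-2697 v2: {xeon:.1f} GB/s")
    print(f"5800X3D:       {ryzen:.1f} GB/s")
    print(f"bandwidth ratio: {xeon / ryzen:.2f}")
    print(f"wall-time ratio: {122 / 84.02:.2f}")
```

These are theoretical peaks only; the bandwidth actually achieved during the run is not known from the posted data.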

Old   June 13, 2022, 01:48
Default
  #527
Senior Member
 
Simbelmynë's Avatar
 
Join Date: May 2012
Posts: 548
Rep Power: 15
Simbelmynë is on a distinguished road
Quote:
Originally Posted by wkernkamp View Post
The memory bandwidth of the 2xE5-2697v2 is just under twice the bandwidth of your 5800X3D. The performance ratio is 122/84 ≈ 1.45, so there has been an improvement, probably related to cache organization and cache capacity. The more the cache can be utilized, the higher your effective bandwidth. So the improvement you are talking about is roughly 40% in ten years.

I think more recent CPUs should be compared as well.




Quote:
Originally Posted by Simbelmynë View Post
5800X3D, 2 x 8 GB DDR4 Rank1 @ 3200 MT/s (14-14-14-14-28,1T)
OFv9, OpenSUSE Tumbleweed, GCC 11.2, kernel 5.17.4

Code:
 cores       Simulation     Meshing
#                (s)      (min.sec)
1             314.21        12m23s
2             201.98        8m21s
4             149.98        5m05s
6             138.55        4m02s

Here are some CPUs from 2017. All of them have dual-rank memory (compared to the single-rank memory of the 5800X3D). Looking at the 3200 MT/s results, the first two HEDT CPUs have double the theoretical bandwidth and the 8700K has identical theoretical bandwidth.





Quote:
Originally Posted by Simbelmynë View Post

7940X, 32 (4x8) GB 3200 MHz RAM, CentOS 7.x, kernel 3.10.0
Code:
# cores   Wall time (s):
------------------------
1 764.36
2 419.98
4 233.26
6 188.29
8 169
12 160.28
14 168.73
Threadripper 1950X, 32 (4x8) GB 3200 MHz RAM, CentOS 7.x, kernel 4.14.5 (SMT on)
Code:
# cores   Wall time (s):
------------------------
1 827.21
2 465.01
4 235.17
6 198.81
8 170.73
12 154.26
16 154.9
8700K, 32 (4x8) GB 3200 MHz RAM, Mint 18.3, kernel 4.13.0
Code:
# cores   Wall time (s):
------------------------
1 531.44
2 312.15
4 249.55
6 247.83


Clearly there is a huge improvement that bandwidth alone cannot explain. Memory latency and cache size likely play an important role as well.


If you wish to compare HEDT with HEDT then look at the results from the 3990X. This also gives an indication of how good the architecture is even if it is one gen older compared to the 5800X3D.



Quote:
Originally Posted by Geon-Hong View Post
My testing environment is as follows.

- CPU: AMD Ryzen Threadripper 3990x
- RAM: 128GB (32GB x 4 / DDR4 / 2,666MHz)
- M/B : TRX40 (Gigabyte TRX40 AORUS Pro Wifi)
- SSD: SAMSUNG 1TB M.2
- OF : OpenFOAM-v2006
(function objects for generating stream lines were deactivated)

And the results are here:

Code:
# cores   Wall time (s):
------------------------
1      620.9
2      355.72
4      177.92
8      110.08
16     66
24     66.08
32     62.7
40     63.79
48     63.18
56     63.97
64     63.11
As you can see, the parallel performance was saturated around 16 cores.

Many thanks.

With a similar architecture and a huge cache, bandwidth is king.

Old   June 14, 2022, 03:35
Default
  #528
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 339
Rep Power: 12
wkernkamp is on a distinguished road
I think Geon-Hong misstated his configuration. He must have 8 channels active. There is a comparable Threadripper 3960X in the data. Its single-core performance is better than Geon-Hong's, but it is bandwidth limited at about 93 seconds. That one has four channels:


Quote:
Originally Posted by spwater View Post
Here is my result. Newly configured workstation with Threadripper 3960x, 3.8 GHz 24C, 64 G memory (4 channel)

# cores Wall time (s):
------------------------
1 550.49
2 299.15
4 161.65
6 120.55
8 101.56
12 99.13
16 93.74
20 93.71
24 93.65

Old   June 14, 2022, 05:05
Default
  #529
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 339
Rep Power: 12
wkernkamp is on a distinguished road
I was wrong about the 3990x: it has only 4 memory channels.


8x1866 = 14928 MT/s for E5-2697v2(x2)
4x3200 = 12800 MT/s for 3960x
2x3200 = 6400 MT/s for 5800X3D
4x2666 = 10664 MT/s for 3990x

MT to complete benchmark:
Code:
CPU        DIMM  CH     MT/s    Benchm.     MT
E5-2697v2  1866   8    14928  x   84s  =  1253952
3960x      3200   4    12800  x   93s  =  1190400
5800X3D    3200   2     6400  x  139s  =   889600
3990x      2666   4    10664  x   63s  =   671832

E5-2697v2 = 1.41 x more MT to complete than 5800X3D
3960x = 1.34 x more MT to complete than 5800X3D
3990x = 1.32 x fewer MT to complete than 5800X3D
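
The "MT to complete" figure is simply per-channel transfer rate times channel count times wall time. A minimal sketch of that arithmetic, using the numbers from the table; the function name and the choice of the 5800X3D as reference are illustrative.

```python
# (CPU, per-channel MT/s, channels, flow wall time in s) taken from the table above.
systems = [
    ("E5-2697v2 (x2)", 1866, 8, 84),
    ("3960x", 3200, 4, 93),
    ("5800X3D", 3200, 2, 139),
    ("3990x", 2666, 4, 63),
]


def mt_to_complete(mt_per_s, channels, seconds):
    """Theoretical mega-transfers available during the run (MT/s x channels x s)."""
    return mt_per_s * channels * seconds


if __name__ == "__main__":
    ref = mt_to_complete(3200, 2, 139)  # 5800X3D used as the reference
    for name, rate, ch, t in systems:
        mt = mt_to_complete(rate, ch, t)
        print(f"{name:15s} {mt:9d} MT   {mt / ref:.2f} x 5800X3D")
```

This is a theoretical upper bound on transfers during the run, not a measurement of the memory traffic that actually occurred.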


Level 3 Caches are:
Code:


CPU        Cache    Cores     Cache per      Work per
                    at sat.   core at sat.   core at sat.

E5-2697v2   60 MB     24        2.5 MB         4.2%
3960x      128 MB     16        8   MB         6.2%
5800X3D     96 MB      6       16   MB        16.7%
3990x      256 MB     32        8   MB         3.1%
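
The last two columns follow directly from the first two: L3 per core is total L3 divided by the core count at saturation, and work per core is 100% divided by that core count. The sketch below assumes the work is split evenly across the active cores; names are illustrative.

```python
# (CPU, total L3 in MB, cores at which the flow benchmark saturates) from the table above.
caches = [
    ("E5-2697v2 (x2)", 60, 24),
    ("3960x", 128, 16),
    ("5800X3D", 96, 6),
    ("3990x", 256, 32),
]


def per_core_at_saturation(l3_mb, cores):
    """Return (L3 MB per active core, share of the work per core in percent)."""
    return l3_mb / cores, 100.0 / cores


if __name__ == "__main__":
    for name, l3, cores in caches:
        mb, share = per_core_at_saturation(l3, cores)
        print(f"{name:15s} {mb:5.1f} MB/core  {share:5.1f} % of work per core")
```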


Last edited by wkernkamp; June 20, 2022 at 12:34. Reason: Added x2 for dual E5-2697v2

Old   June 14, 2022, 11:11
Default
  #530
Senior Member
 
Simbelmynë's Avatar
 
Join Date: May 2012
Posts: 548
Rep Power: 15
Simbelmynë is on a distinguished road
@wkernkamp


I like the idea of total MT to run the benchmark. Even if we have no idea what the actual bandwidth usage was during the simulation, this at least gives a relation that is based on theoretical bandwidth as well as actual simulation time. It also illustrates the, sometimes subtle, differences between different architectures.


I was surprised by the large difference between the 3960X and 3990X, since they have the same L3 per core and the same architecture. I would have guessed that the 3960X would be faster because of its faster memory, but other factors may be in play here. My guess is RAM timings, perhaps also memory rank, and the Linux kernel being used.
wkernkamp likes this.

Old   June 14, 2022, 11:28
Default
  #531
Member
 
Kailee
Join Date: Dec 2019
Posts: 35
Rep Power: 6
Kailee71 is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
That's A LOT of performance left on the table with WSL. I wonder if it can be tweaked in any way to yield better results, or if that's just price for convenience.
I did some rough comparisons of VMs and containers. LXC (under Proxmox) was the clear winner, costing only a couple of percent in performance compared to bare metal. Next was VMware, which did a surprisingly good job and was very nicely tweakable through the GUI. Its performance was almost on par with Proxmox/LXC, with a loss of 5-7%. Behind that came TrueNAS Scale (KVM), but this really suffered from the NFS implementation (Ganesha performance really sucks at the moment, but I understand why Scale uses it). The penalty was somewhere around 15%, if I remember correctly.

Way behind (not just on a different field, but in a different park) came WSL. Admittedly, this was about a year ago and I understand stuff probably has moved along, but it was clearly not a viable alternative unless you're just interested in tinkering.

Out of my 60 cores total, 20 live on my VMware (data) server, which runs TrueNAS Core (4 cores) for the data and a compute VM with 16 cores; 32 cores are on a dedicated 4-socket bare-metal compute node, and a further 8 are in my workstation. This is a compromise that works surprisingly well in a 10Gb environment.

Sorry for the anecdotal-only data. I'll try to find actual numbers.

Kai.
flotus1 and wkernkamp like this.

Old   June 14, 2022, 22:10
Default for dual E5 2683 v4
  #532
New Member
 
Alexander Kazantcev
Join Date: Sep 2019
Posts: 23
Rep Power: 6
AlexKaz is on a distinguished road
Quote:
Originally Posted by AlexKaz View Post
Dual e5 2683v4, JGINYUE X99-D8 Server from Aliexpress, DDR4 RDIMM 2133 8x8 default timings
v1806, Linux Mint 19.3

HT on, NUMA off, CoD off
Code:
cores  speedup mesh  speedup flow  mesh sec.  flow sec.  power
1      1             1             1649.57    1256.06    94.77
2      1.48          1.782         1117.49    705.03     97.73
4      2.78          4.034         593.14     311.35     111.14
6      3.63          5.960         454.04     210.75     122.62
8      4.42          7.524         372.84     166.95     129.69
12     5.31          9.708         310.83     129.38     147.65
16     6.07          11.23         271.89     111.89     161.83
20     6.66          11.98         247.87     104.88     175.18
24     7.76          12.52         212.62     100.29     186.94
28     7.96          12.62         207.12     99.53      198.46
30     7.19          12.55         229.57     100.07     203.85
HT off, NUMA on, CoD on
Code:
cores  speedup mesh  speedup flow  mesh sec.  flow sec.
1      1             1             1649.57    1256.06
2
4
6
8
12
16     6.47          14.09         254.92     89.17
20     7.15          15.40         230.72     81.56
24     8.41          16.19         196.11     77.57
28     8.55          16.62         193.05     75.59
30     7.69          15.67         214.58     80.18
Quote:
Originally Posted by AlexKaz View Post
After reset BIOS to default settings, ht on, numa on, cod off, timings 12-11-11-24...

Code:
cores  speedup mesh  speedup flow  mesh sec.  flow sec.  power
1      0.88          0.81          1455.79    1017.61    88.44
2
4
6
8
12
16     5.79          11.37         251.33     89.52      166.46
20     6.40          12.40         227.45     82.08      179.20
24     7.51          13.04         193.94     78.06      191.52
28     7.70          13.21         189.06     77.03      204.56
30     6.99          13.17         208.26     77.26      208.33


After some optimization, the dual 2683v4 runs the 32-thread case in 67-68 seconds. HT on, NUMA on, CoD on, 8 dual-rank 2133 DIMMs, OpenFOAM v1812 (about the same with v2112). I think the main reasons are NUMA being on and using the earliest microcode for CPUID 406F1, 0x0B00000B.
wkernkamp likes this.

Last edited by AlexKaz; June 17, 2022 at 08:31.

Old   June 14, 2022, 22:42
Default
  #533
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 339
Rep Power: 12
wkernkamp is on a distinguished road
Can you publish the full curve for the optimized machine? By the way, you should use 2400 MHz RDIMMs for best performance. I am interested in the result for 24 cores for comparison to the 2xE5-2697v2.

Old   June 15, 2022, 03:19
Default
  #534
New Member
 
Alexander Kazantcev
Join Date: Sep 2019
Posts: 23
Rep Power: 6
AlexKaz is on a distinguished road
Quote:
Originally Posted by wkernkamp View Post
Can you publish the full curve for the optimized machine? By the way, you should use 2400 MHz RDIMMs for best performance. I am interested in the result for 24 cores for comparison to the 2xE5-2697v2.
Sorry, in my case it does not run at 2400 with 8 DIMMs. Only 7 DIMMs work at 2400. Such is the silicon lottery with used CPUs.

Last edited by AlexKaz; June 15, 2022 at 11:24.

Old   June 15, 2022, 11:24
Default
  #535
New Member
 
Alexander Kazantcev
Join Date: Sep 2019
Posts: 23
Rep Power: 6
AlexKaz is on a distinguished road
I can only add times for 2133, 2 rank, 13-12-12-....
# cores  mesh (s)  flow (s)
1 1535.27 1098.81
2 1018.75 550.63
4 573.74 257.45
8 364 135.52
10 339.37 101.29
12 321.41 97.4
14 266.07 94.89
16 258.09 82.39
18 237.39 84.1
20 210.51 75.66
22 236.61 78.39
24 200.13 71.75
26 213.59 76.73
28 186.62 69.07
30 189.12 73.23
31 195.98 70.99
32 182.98 68.03
wkernkamp likes this.

Old   June 15, 2022, 15:53
Default
  #536
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 339
Rep Power: 12
wkernkamp is on a distinguished road
Thanks for posting. Interesting that there is quite a bit of fluctuation up and down as the number of cores goes up.
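
The fluctuation can be made explicit by listing the core counts at which the flow wall time got worse than at the previous core count. A small sketch over the posted flow times; the function name is illustrative.

```python
# Flow wall times (s) by core count, copied from the post above.
flow = {1: 1098.81, 2: 550.63, 4: 257.45, 8: 135.52, 10: 101.29,
        12: 97.4, 14: 94.89, 16: 82.39, 18: 84.1, 20: 75.66,
        22: 78.39, 24: 71.75, 26: 76.73, 28: 69.07, 30: 73.23,
        31: 70.99, 32: 68.03}


def regressions(times):
    """Core counts where wall time is higher than at the previous core count."""
    cores = sorted(times)
    return [b for a, b in zip(cores, cores[1:]) if times[b] > times[a]]


if __name__ == "__main__":
    print("flow time got worse at:", regressions(flow))
```

For the flow times this flags 18, 22, 26 and 30 cores, while every other step is an improvement.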

Old   June 16, 2022, 07:58
Default System76 Galago Ultrapro (2014 Laptop)
  #537
New Member
 
Daniel
Join Date: Jun 2010
Posts: 12
Rep Power: 15
DVSoares is on a distinguished road
Hey guys,

Kudos to all for keeping this thread active. I am looking to (finally) replace my Galago Ultrapro, bought in 2014. I have been using it until it got too close to being fubar, and decided to run the benchmark on it to get a sense of the upgrade available with today's options.

System has an Intel(R) Core(TM) i7-4750HQ (clock 2GHz - 3.2GHz), data from lscpu:

Code:
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
    CPU family:          6
    Model:               70
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            1
    CPU max MHz:         3200.0000
    CPU min MHz:         800.0000
...
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    6 MiB (1 instance)
  L4:                    128 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7

Memory is DDR3 1600MHz 2x4GB in dual channel, supported by that large L4 cache. Bench results are:

Code:
Meshing Times:
1 1522.67
2 971.47
3 740.59
4 584.75
Flow Calculation:
1 914.75
2 512.87
3 236.5
4 363.65

The cache hierarchy plays a central role in guaranteeing that cores are properly fed and saturated with the correct data (increased prefetching performance, etc.). See how this CPU is best fed with 3 threads, showing that no rule is 100% applicable to every CPU in terms of OF performance.

Now moving to some of these DDR5 equipped notebooks with a reasonable gpu and let this guy here rest in pieces

Cheers
Simbelmynë likes this.

Old   June 17, 2022, 05:49
Default
  #538
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,400
Rep Power: 47
flotus1 has a spectacular aura about
Sorry for barging right into the middle of this conversation, but the benchmark running faster on 3 cores than on 4 cores on a laptop can have so many other reasons. "Cache hierarchy" would be way down on my list for checking potential causes.
  • Thread placement, especially since Hyperthreading is enabled
  • Thermal throttling
  • TDP throttling
  • Background processes
  • General variance of benchmark results
  • ...
  • Anything related to CPU caches
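
For the first item on that list, one way to rule out two processes sharing a physical core is to look at the SMT sibling lists that Linux exposes in sysfs and keep one logical CPU per core. The following is only a sketch under that assumption; the helper names are made up, and an MPI launcher may apply its own binding on top of this.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,4' or '0-1' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Keep the lowest-numbered logical CPU of every group of SMT siblings."""
    return sorted({parse_cpu_list(s)[0] for s in sibling_lists})


def read_sibling_lists():
    """Read the sibling list of every logical CPU from sysfs (Linux only)."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    return lists


if __name__ == "__main__":
    physical = one_cpu_per_core(read_sibling_lists())
    print("one logical CPU per physical core:", physical)
    if physical:
        # Restrict this process to those CPUs; child processes inherit the mask.
        os.sched_setaffinity(0, physical)
```

Repeating the 3-core and 4-core runs with and without such a restriction would show whether placement explains the difference.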
DVSoares likes this.

Old   June 17, 2022, 10:07
Default
  #539
New Member
 
Daniel
Join Date: Jun 2010
Posts: 12
Rep Power: 15
DVSoares is on a distinguished road
Hey flotus1, your comments are always most welcome, no need to apologize

I’ve repeated the runs at least 5 times, without even X11 running and separately (in order to control temperatures). Results didn’t vary by more than 5%, so I just took the last run and put it here.
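
A run-to-run variation figure like this can be computed from the repeated wall times as (max - min) / mean. The individual run times were not posted, so the values below are hypothetical placeholders, and the function name is illustrative.

```python
import statistics


def relative_spread(times):
    """(max - min) / mean of repeated wall times, as a percentage."""
    return 100.0 * (max(times) - min(times)) / statistics.mean(times)


if __name__ == "__main__":
    # Hypothetical wall times (s) for five repeats of one case; not measured data.
    runs = [100.0, 102.0, 98.0, 101.0, 99.0]
    print(f"mean {statistics.mean(runs):.1f} s, "
          f"stdev {statistics.stdev(runs):.1f} s, "
          f"spread {relative_spread(runs):.1f} %")
```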

At the end of the day, one has to assess the entire platform (hardware and host software) - Simbelmynë’s last post is all about that too.

I confess that the laptop still serves my coding needs very well (no local compiling/running on it though), but its time has come

Old   June 20, 2022, 12:31
Default
  #540
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 339
Rep Power: 12
wkernkamp is on a distinguished road
Quote:
Originally Posted by DVSoares View Post
Hey guys,

Kudos to all for keeping this thread active. I am looking to (finally) replace my Galago Ultrapro bought in 2014 - have been using it until it gets too close to be fubar, decided to run the benchmark on it to get a sense of upgrade with today's options.

System has an Intel(R) Core(TM) i7-4750HQ (clock 2GHz - 3.2GHz), data from lscpu:........
Cheers

Your machine is very interesting for the current discussion, because it has an exceptionally large cache. If we analyze the number of transactions required to complete the benchmark the same way I did above, we get:
Code:
CPU         DIMM MT/s  Channels  MT/s   Benchm.   MT
i7-4750HQ   1600       2         3200   236.5s    756800
The low value of required transactions to complete (MT) is in line with the modern "large cache" AMD CPUs. Nice confirmation of the effect of cache on benchmark completion from an older CPU.
DVSoares likes this.