
Xeon Gold Cascade Lake vs Epyc Rome - CFX & Fluent - Benchmarks (Windows Server 2019)



Old   February 2, 2020, 01:57
Default Xeon Gold Cascade Lake vs Epyc Rome - CFX & Fluent - Benchmarks (Windows Server 2019)
  #1
SLC
Member
 
Join Date: Jul 2011
Posts: 53
Rep Power: 14
SLC is on a distinguished road
I have been benchmarking two PowerEdge machines from Dell. One 24 core R640 (Intel Cascade Lake) and one 32 core R6525 (Epyc Rome). Both running Windows Server 2019.

Specs:

R640
2 x Intel Xeon Gold 6246 (Cascade Lake) 12c, 4.1 GHz all core turbo
12 x 16GB 2933 MHz RAM (Dual rank)
Sub-NUMA clustering enabled

R6525
2 x Epyc Rome 7302 16c, 3.3 GHz all core turbo
16 x 16GB 3200 MHz RAM (Dual rank)
NPS set to 4

The R6525 machine is 15 % cheaper than the R640 in the above spec. The rest of the specification list between the two machines is identical.

I've run a bunch of the different official Fluent and CFX benchmarks from ANSYS. For CFX I've used Intel MPI, and for Fluent the default IBM MPI (ibmmpi).

Averaged across the different benchmarks I've run:

The Epyc Rome system is:
  • On a core-for-core basis: 6.5 % faster in Fluent and 28 % faster in CFX(!). This is with the R6525 run on 24 cores (so as to compare like-for-like with the Intel machine). It appears CFX is much more dependent on memory bandwidth (and the Epyc's 8 memory channels) than Fluent.
  • On a machine-for-machine basis: 28 % faster in Fluent and a whopping 48 % faster in CFX. This is when running on all 32 cores (compared to the 24-core load on the Intel system). A quick sanity check of what the last 8 cores buy is sketched right after this list.
  • Changing from NPS=1 (default) to NPS=4 on the AMD Epyc gave roughly a 10 % gain in CFD performance. Enabling sub-NUMA clustering on the Intel system gave roughly a 3 % gain.
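Since both comparisons share the same 24-core Intel baseline, you can also back out what the last 8 cores are worth on the Epyc box. A quick sanity-check sketch of that arithmetic (my own illustration, using only the averaged percentages quoted above, not any extra benchmark data):

Code:
# Back-of-the-envelope check using the averaged speedups quoted above.
ratio_cores = 32 / 24                      # ideal scaling going from 24 to 32 cores

speedups = {
    # solver: (24-core AMD vs 24-core Intel, 32-core AMD vs 24-core Intel)
    "Fluent": (1.065, 1.28),
    "CFX":    (1.28, 1.48),
}

for solver, (s24, s32) in speedups.items():
    gain = s32 / s24                       # what the extra 8 cores buy on the AMD box
    efficiency = gain / ratio_cores        # fraction of ideal 1.33x scaling
    print(f"{solver}: 24->32 cores gives {gain:.2f}x, ~{efficiency:.0%} parallel efficiency")
# Fluent: ~1.20x (~90 %), CFX: ~1.16x (~87 %)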


Here's an example of my results.

Fluent
CFX
(See post below, forum spam filter is breaking my balls)

Something that's interesting to note is the scaling on the AMD Epyc - there's a very clear improvement in performance at every multiple of 8 cores. Look at the aircraft_wing_14m Fluent benchmark for example: there are scaling and performance peaks at 16, 24 and 32 cores. You "do not" want to run the AMD system at 26 cores - it is slower than at 24 cores.

I'm guessing this is related to the CPU architecture and the splitting of cores into CCXs.

Other interesting observations: the Intel system runs both hot and power-hungry - approx. 550 W at full load with CPU temps of 80 °C, compared to approx. 400 W at full load and CPU temps of 60 °C for the AMD system.

The decision is clear for me - I'll be building a mini-cluster of four AMD Epyc Rome machines for a total of 128 cores.
The alternative would be to purchase five Intel Xeon Gold Cascade Lake systems (for a total of 120 cores). The Intel setup would be 30 % more expensive and 10 % slower overall! I could also go for six Intel machines, which ought to theoretically match the four AMD machines, but at a dizzying 50 % price premium.

AMD Epyc Rome really is EPIC for CFD applications!
Amiga500, Blanco, evcelica and 5 others like this.

Old   February 2, 2020, 02:00
Default
  #2
SLC
Member
 
Join Date: Jul 2011
Posts: 53
Rep Power: 14
SLC is on a distinguished road
Fluent:

[Attached image: bench - fluent 2019 R3 - aircraft_wing_14m.jpg]

CFX:

[Attached image: bench - cfx 2019 R3 - airfoil_10m.jpg]

Old   February 2, 2020, 07:23
Default
  #3
flotus1 (Alex)
Super Moderator

Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
Those are some very thorough investigations, with interesting results. Thanks for publishing.

Just out of curiosity, could you run one more comparison with the AMD system? It doesn't have to be a full scaling analysis, just one more data point with max cores would suffice.
The change I am interested in: drop down the memory transfer speed to 2933 MT/s. I recently learned that this is the maximum frequency on Epyc Rome CPUs where infinity fabric and memory can run in sync. Compared to 3200 MT/s, you should get a little less bandwidth, but much better memory access times.
Since you are on Windows, it should be easy to check IF speed with HWinfo or CPU-Z.
See bottom of page 10 in this documentation for reference: https://developer.amd.com/wp-content...56745_0.80.pdf
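For a rough sense of the bandwidth side of that trade-off, here is a back-of-the-envelope sketch (theoretical peaks only, assuming 8 DDR4 channels per socket at 8 bytes per transfer; sustained bandwidth is of course lower):

Code:
# Theoretical peak DDR4 bandwidth per socket.
channels = 8
bytes_per_transfer = 8            # one 64-bit DDR4 channel

for mt_per_s in (3200e6, 2933e6):
    peak_gb_s = channels * mt_per_s * bytes_per_transfer / 1e9
    print(f"{mt_per_s / 1e6:.0f} MT/s -> {peak_gb_s:.1f} GB/s peak per socket")
# 3200 MT/s -> 204.8 GB/s, 2933 MT/s -> 187.7 GB/s: roughly 8 % less peak
# bandwidth, traded against synchronous Infinity Fabric and memory clocks.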

Last edited by flotus1; February 2, 2020 at 08:46.

Old   February 2, 2020, 18:52
Default
  #4
Member
 
EM
Join Date: Sep 2019
Posts: 58
Rep Power: 6
gnwt4a is on a distinguished road
Did u run the same executable on each machine or did u use two different executables specifically compiled for each machine?


What compilers were used and what math libraries? Did u use mkl on intel?
What flag options and what optimizations?
--

Old   February 2, 2020, 21:39
Default
  #5
Member
 
Join Date: Nov 2011
Location: Czech Republic
Posts: 97
Rep Power: 14
Sixkillers is on a distinguished road
Quote:
Did u run the same executable on each machine or did u use two different executables specifically compiled for each machine?


What compilers were used and what math libraries? Did u use mkl on intel?
What flag options and what optimizations?
Both Fluent and CFX are commercial products, so you cannot recompile them with different flags.
You've got precompiled binaries for Windows/Linux and that's all.

As far as I know both use MKL, and the CFX solver is compiled with the Intel Fortran Compiler.

It might be interesting to set the environment variable MKL_DEBUG_CPU_TYPE=5 on the AMD system to see if there is any impact on performance.
Details can be found here.
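In case anyone wants to try it, a minimal sketch of setting the variable just for the solver process from a Python wrapper on Windows - the Fluent path and arguments below are placeholders for whatever you normally launch, and you could equally set the variable system-wide through the Windows environment-variable dialog:

Code:
import os
import subprocess

env = os.environ.copy()
env["MKL_DEBUG_CPU_TYPE"] = "5"   # asks older MKL builds for the AVX2 code path
                                  # (newer MKL releases reportedly ignore this variable)

# Placeholder command line - substitute your actual Fluent/CFX launcher and arguments.
cmd = [r"C:\Program Files\ANSYS Inc\v195\fluent\ntbin\win64\fluent.exe",
       "3ddp", "-t32", "-i", "run.jou"]

subprocess.run(cmd, env=env, check=True)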

Thank you for your post!
Habib-CFD likes this.

Old   February 3, 2020, 03:25
Default
  #6
Member
 
EM
Join Date: Sep 2019
Posts: 58
Rep Power: 6
gnwt4a is on a distinguished road
Ok. You do not have special access to these commercial codes.


Here is a suggestion: try nektar++. It comes as a precompiled binary or u can download the source and compile it yourself. Run any 3d case (u have to set it up yourself) - channel/duct/pipe/lid-driven cavity - for (say) 100 steps and, if u can, use ~200 million nodes or the highest u can. Use polynomials of at least order 10 (20 or more would be nice).
--

Old   February 3, 2020, 04:44
Default
  #7
SLC
Member
 
Join Date: Jul 2011
Posts: 53
Rep Power: 14
SLC is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
Those are some very thorough investigations, with interesting results. Thanks for publishing.

Just out of curiosity, could you run one more comparison with the AMD system? It doesn't have to be a full scaling analysis, just one more data point with max cores would suffice.
The change I am interested in: drop down the memory transfer speed to 2933 MT/s. I recently learned that this is the maximum frequency on Epyc Rome CPUs where infinity fabric and memory can run in sync. Compared to 3200 MT/s, you should get a little less bandwidth, but much better memory access times.
Since you are on Windows, it should be easy to check IF speed with HWinfo or CPU-Z.
See bottom of page 10 in this documentation for reference: https://developer.amd.com/wp-content...56745_0.80.pdf

I've just tried, and it appears I am not able to change the memory speed on this PowerEdge R6525. It is locked at 3200 in the BIOS/iDRAC.

Old   February 3, 2020, 05:03
Default
  #8
SLC
Member
 
Join Date: Jul 2011
Posts: 53
Rep Power: 14
SLC is on a distinguished road
Quote:
Originally Posted by Sixkillers View Post
It might be interesting to set the environment variable MKL_DEBUG_CPU_TYPE=5 on the AMD system to see if there is any impact on performance.
Details can be found here.

I just tried this; there was no performance change in either Fluent or CFX.
Sixkillers likes this.

Old   February 3, 2020, 05:04
Default
  #9
SLC
Member
 
Join Date: Jul 2011
Posts: 53
Rep Power: 14
SLC is on a distinguished road
Quote:
Originally Posted by gnwt4a View Post
Ok. You do not have special access to these commercial codes.


Here is a suggestion: try nektar++. It comes as a precompiled binary or u can download the source and compile it yourself. Run any 3d case (u have to set it up yourself) - channel/duct/pipe/lid-driven cavity - for (say) 100 steps and, if u can, use ~200 million nodes or the highest u can. Use polynomials of at least order 10 (20 or more would be nice).
--

Sorry, I'm running Windows and it looks like it's quite a lot of work to compile nektar++.
flotus1 likes this.

Old   February 3, 2020, 05:40
Default
  #10
flotus1 (Alex)
Super Moderator

Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
Quote:
Originally Posted by SLC View Post
I've just tried, and it appears I am not able to change the memory speed on this PowerEdge R6525. It is locked at 3200 in the BIOS/iDRAC.
Bummer. There is usually a setting somewhere in the vicinity that unlocks the memory speed from auto to manual. But maybe Dell's BIOS is even more locked down than what I am used to.

Would still be interesting to see what the Infinity Fabric is clocked at in your system. In CPU-Z, it should be the value of "NB frequency", and HWInfo should have an entry for Infinity Fabric.

Old   February 3, 2020, 07:22
Default
  #11
SLC
Member
 
Join Date: Jul 2011
Posts: 53
Rep Power: 14
SLC is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
Bummer. There is usually a setting somewhere in the vicinity that unlocks the memory speed from auto to manual. But maybe Dell's BIOS is even more locked down than what I am used to.

You were right!

I had to change the power profile from "Maximum Performance" to "Custom", which was on a different page from the memory settings.

Fluent benchmark aircraft_wing_14m @ 32 cores:

3200 MHz - 122.4 s
2932 MHz - 160.8 s

HOWEVER! It would seem there is a bug in the Dell BIOS. When selecting the 2932 memory speed, it actually clocked the memory all the way down to 1600 MT/s (memory clock reported as 800 MHz in CPU-Z/HWinfo, i.e. 1600 MT/s effective, since DDR4 transfers twice per clock).

I've searched high and low and can't find an entry for NB frequency or Infinity Fabric...

Old   February 4, 2020, 16:31
Default
  #12
flotus1 (Alex)
Super Moderator

Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
Nice catch with that bug.

HWInfo needs to be a recent version, maybe even a beta. Don't know if they already implemented this in a release version. Of course, the sensor reading could just fail because it is unfamiliar with your server hardware.
In CPU-Z, you should find it in the memory tab.

[Attached screenshots: cpuz_nb.png, hwinfo_nb.png]

Old   February 11, 2020, 12:01
Default
  #13
Senior Member
 
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,167
Rep Power: 23
evcelica is on a distinguished road
It is interesting that the AMD system performs most efficiently in CFX when running at a core count that is a multiple of the number of memory channels: 8, 16, 24. It always drops in performance at a core count slightly higher than these numbers, as if the memory load has become unbalanced.

Old   February 11, 2020, 12:25
Default
  #14
flotus1 (Alex)
Super Moderator

Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
My take on this: it is caused by the chiplet design.
The 7302 should have 4 active chiplets with 4 cores each (so 8 chiplets across the two sockets). Maybe SLC could verify that with a look at lstopo under Linux...
So running on 16 cores, each chiplet has 2 threads assigned to it, and each thread has access to the same amount of shared resources: L3 cache, chiplet-to-I/O-die bandwidth, memory bandwidth.
Going to 17 cores, one chiplet has to take on 3 threads, which is 50 % more than all the other chiplets, leaving those 3 threads with significantly fewer shared resources than the rest. In addition, boost frequency is determined by the number of threads per chiplet, so the cores on that chiplet might clock lower than the others. Since the slowest thread determines overall performance, this imbalance leads to a drop in performance.
A more traditional dual-socket system using monolithic CPU dies experiences similar contention of shared CPU resources, but the imbalance is much less pronounced.
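To make the imbalance concrete, here is a small sketch (my own illustration; it assumes the dual 7302 box exposes 2 sockets x 4 chiplets x 4 cores = 8 chiplets, and that solver processes get spread as evenly as possible across them) of the per-chiplet thread count at a few core counts:

Code:
# Distribute N solver processes over the assumed 8 chiplets as evenly as
# possible and show the resulting per-chiplet load.
CHIPLETS = 8

def chiplet_load(n_procs):
    base, extra = divmod(n_procs, CHIPLETS)
    return [base + 1 if i < extra else base for i in range(CHIPLETS)]

for n in (16, 17, 24, 26, 32):
    load = chiplet_load(n)
    # The busiest chiplet sets the pace: its threads share L3 cache and
    # chiplet-to-I/O-die bandwidth with more neighbours than the others do.
    print(f"{n:2d} cores -> threads per chiplet {load}")
# 16, 24 and 32 cores load every chiplet evenly; 17 or 26 cores leave some
# chiplets with one extra thread, matching the dips SLC measured.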

Old   February 12, 2020, 07:10
Default
  #15
Member
 
Join Date: Dec 2016
Posts: 44
Rep Power: 9
Duke711 is on a distinguished road
Quote:
Originally Posted by SLC View Post
You were right! Fluent benchmark aircraft_wing_14m @ 32 cores:

3200 MHz - 122.4 s
2932 MHz - 160.8 s

HOWEVER! It would seem there is a bug in the Dell BIOS. When selecting the 2932 memory speed, it actually clocked the memory all the way down to 1600 MT/s (memory clock reported as 800 MHz in CPU-Z/HWinfo).

So really the comparison is:

3200 MHz - 122.4 s
1600 MHz - 160.8 s

The speedup therefore scales with memory frequency to roughly the power 0.4:

2^0.4 = 1.32, which matches 160.8/122.4 = 1.31

I have noticed a similar situation on a DDR3 system: Epyc 7551 vs 6850K in the Fluent benchmarks. The memory speed was 50 % higher (2400 vs 1600 MT/s), but the speedup was only about +30 %, i.e. scaling as roughly 1.5^0.66 with memory bandwidth.
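Restated explicitly (just a rearrangement of the numbers above, nothing new measured): if the speedup is modelled as the memory-speed ratio raised to a power alpha, then:

Code:
from math import log

# speedup ~ (f_high / f_low) ** alpha  =>  alpha = ln(speedup) / ln(f_high / f_low)
alpha_rome = log(160.8 / 122.4) / log(3200 / 1600)   # ~0.39 for the Rome results above
alpha_ddr3 = log(1.30) / log(2400 / 1600)            # ~0.65 for the 7551 vs 6850K case
print(f"alpha (Rome, 1600 vs 3200 MT/s): {alpha_rome:.2f}")
print(f"alpha (7551 vs 6850K, 1600 vs 2400 MT/s): {alpha_ddr3:.2f}")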

Old   June 11, 2020, 10:00
Default single vs dual cpu
  #16
New Member
 
sida
Join Date: Dec 2019
Posts: 6
Rep Power: 6
sida is on a distinguished road
Quote:
Originally Posted by SLC View Post
I have been benchmarking two PowerEdge machines from Dell. One 24 core R640 (Intel Cascade Lake) and one 32 core R6525 (Epyc Rome). Both running Windows Server 2019.

Specs:

R640
2 x Intel Xeon Gold 6246 (Cascade Lake) 12c, 4.1 GHz all core turbo
12 x 16GB 2933 MHz RAM (Dual rank)
Sub-NUMA clustering enabled

R6525
2 x Epyc Rome 7302 16c, 3.3 GHz all core turbo
16 x 16GB 3200 MHz RAM (Dual rank)
NPS set to 4
Thanks for such a clear comparison. Would you please help me decide between 2x EPYC 7302 or one EPYC 7452?

Old   June 11, 2020, 10:07
Default
  #17
SLC
Member
 
Join Date: Jul 2011
Posts: 53
Rep Power: 14
SLC is on a distinguished road
Quote:
Originally Posted by sida View Post
Thanks for such a clear comparison. Would you please help me decide between 2x EPYC 7302 or one EPYC 7452?

2x7302 will be much, much quicker than one 7452.


Like 50 % quicker.
Freewill1 and sida like this.

Old   June 13, 2020, 16:35
Default
  #18
Member
 
Ivan
Join Date: Oct 2017
Location: 3rd planet
Posts: 34
Rep Power: 8
Noco is on a distinguished road
We bought a new 2x 7301 system in 2018 for ANSYS CFX tasks.

We have long calculations, 200+ hours for a single task.

On roughly every tenth task (about a 10 % chance) the CPUs stall and switch off within the first 15-60 hours. The motherboard keeps running, but we lose all progress.

It is not a temperature or BIOS problem - we have reinstalled and rechecked everything many times.

It is very uncomfortable for work when you have deadlines.

Because of this, we want to buy a new two-CPU Xeon-based cluster this year. We are too afraid to buy AMD Rome, even though it looks faster and more cost effective on paper and in general tests.

Very tired of 'dancing with drums' around this 2x 7301.

Old   June 13, 2020, 16:48
Default
  #19
New Member
 
sida
Join Date: Dec 2019
Posts: 6
Rep Power: 6
sida is on a distinguished road
Quote:
Originally Posted by Noco View Post
We bought a new 2x 7301 system in 2018 for ANSYS CFX tasks.
...
Very tired of 'dancing with drums' around this 2x 7301.
Did you use ECC memory?




