
Xeon Gold Cascade Lake vs Epyc Rome - CFX & Fluent - Benchmarks (Windows Server 2019)



Old   February 2, 2020, 01:57
Default Xeon Gold Cascade Lake vs Epyc Rome - CFX & Fluent - Benchmarks (Windows Server 2019)
  #1
SLC
Member
 
Join Date: Jul 2011
Posts: 53
Rep Power: 14
SLC is on a distinguished road
I have been benchmarking two PowerEdge machines from Dell. One 24 core R640 (Intel Cascade Lake) and one 32 core R6525 (Epyc Rome). Both running Windows Server 2019.

Specs:

R640
2 x Intel Xeon Gold 6246 (Cascade Lake) 12c, 4.1 GHz all core turbo
12 x 16GB 2933 MHz RAM (Dual rank)
Sub-NUMA clustering enabled

R6525
2 x Epyc Rome 7302 16c, 3.3 GHz all core turbo
16 x 16GB 3200 MHz RAM (Dual rank)
NPS set to 4

The R6525 machine is 15 % cheaper than the R640 in the above spec. The rest of the specification list between the two machines is identical.

I've run a bunch of the different official Fluent and CFX benchmarks from ANSYS. For CFX I've used Intel MPI, and for Fluent the default IBM MPI (ibmmpi).

Averaged across the different benchmarks I've run:

The Epyc Rome system is:
  • On a core-for-core basis: 6.5 % faster in Fluent and 28 % faster in CFX(!). This is with the R6525 run on 24 cores (so as to compare like-for-like with the Intel machine). It appears CFX is much more dependent on memory bandwidth (and the Epyc's 8 memory channels) than Fluent.
  • On a machine-for-machine basis: 28 % faster in Fluent and a whopping 48 % faster in CFX. This is when running on all 32 cores (compared to the 24-core load on the Intel system). A quick sanity check of what the last 8 cores buy is sketched right after this list.
  • Changing from NPS=1 (default) to NPS=4 on the AMD Epyc gave roughly a 10 % gain in CFD performance. Enabling sub-NUMA clustering on the Intel system gave roughly a 3 % gain.
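Since both comparisons share the same 24-core Intel baseline, you can also back out what the last 8 cores are worth on the Epyc box. A quick sanity-check sketch of that arithmetic (my own illustration, using only the averaged percentages quoted above, not any extra benchmark data):

Code:
# Back-of-the-envelope check using the averaged speedups quoted above.
ratio_cores = 32 / 24                      # ideal scaling going from 24 to 32 cores

speedups = {
    # solver: (24-core AMD vs 24-core Intel, 32-core AMD vs 24-core Intel)
    "Fluent": (1.065, 1.28),
    "CFX":    (1.28, 1.48),
}

for solver, (s24, s32) in speedups.items():
    gain = s32 / s24                       # what the extra 8 cores buy on the AMD box
    efficiency = gain / ratio_cores        # fraction of ideal 1.33x scaling
    print(f"{solver}: 24->32 cores gives {gain:.2f}x, ~{efficiency:.0%} parallel efficiency")
# Fluent: ~1.20x (~90 %), CFX: ~1.16x (~87 %)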


Here's an example of my results.

Fluent
CFX
(See post below, forum spam filter is breaking my balls)

Something that's interesting to note is the scaling on the AMD Epyc - there's a very clear improvement in performance at every multiple of 8 cores. Look at the aircraft_wing_14m Fluent benchmark for example: there are scaling and performance peaks at 16, 24 and 32 cores. You "do not" want to run the AMD system at 26 cores - it is slower than at 24 cores.

I'm guessing this is related to the CPU architecture and the splitting of cores into CCXs.

Other interesting observations: the Intel system runs both hot and power-hungry - approx. 550 W at full load with CPU temps of 80 °C, compared to approx. 400 W at full load and CPU temps of 60 °C for the AMD system.

The decision is clear for me - I'll be building a mini-cluster of four AMD Epyc Rome machines for a total of 128 cores.
The alternative would be to purchase five Intel Xeon Gold Cascade Lake systems (for a total of 120 cores). The Intel setup would be 30 % more expensive and 10 % slower overall! I could also go for six Intel machines, which ought to theoretically match the four AMD machines, but at a dizzying 50 % price premium.

AMD Epyc Rome really is EPIC for CFD applications!
Amiga500, Blanco, evcelica and 5 others like this.

Old   February 2, 2020, 02:00
Default
  #2
SLC
Member
 
Join Date: Jul 2011
Posts: 53
Rep Power: 14
SLC is on a distinguished road
Fluent:

[Attached image: bench - fluent 2019 R3 - aircraft_wing_14m.jpg]

CFX:

[Attached image: bench - cfx 2019 R3 - airfoil_10m.jpg]

Old   February 2, 2020, 07:23
Default
  #3
flotus1 (Alex)
Super Moderator

Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
Those are some very thorough investigations, with interesting results. Thanks for publishing.

Just out of curiosity, could you run one more comparison with the AMD system? It doesn't have to be a full scaling analysis, just one more data point with max cores would suffice.
The change I am interested in: drop down the memory transfer speed to 2933 MT/s. I recently learned that this is the maximum frequency on Epyc Rome CPUs where infinity fabric and memory can run in sync. Compared to 3200 MT/s, you should get a little less bandwidth, but much better memory access times.
Since you are on Windows, it should be easy to check IF speed with HWinfo or CPU-Z.
See bottom of page 10 in this documentation for reference: https://developer.amd.com/wp-content...56745_0.80.pdf
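For a rough sense of the bandwidth side of that trade-off, here is a back-of-the-envelope sketch (theoretical peaks only, assuming 8 DDR4 channels per socket at 8 bytes per transfer; sustained bandwidth is of course lower):

Code:
# Theoretical peak DDR4 bandwidth per socket.
channels = 8
bytes_per_transfer = 8            # one 64-bit DDR4 channel

for mt_per_s in (3200e6, 2933e6):
    peak_gb_s = channels * mt_per_s * bytes_per_transfer / 1e9
    print(f"{mt_per_s / 1e6:.0f} MT/s -> {peak_gb_s:.1f} GB/s peak per socket")
# 3200 MT/s -> 204.8 GB/s, 2933 MT/s -> 187.7 GB/s: roughly 8 % less peak
# bandwidth, traded against synchronous Infinity Fabric and memory clocks.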

Last edited by flotus1; February 2, 2020 at 08:46.

Old   February 2, 2020, 18:52
Default
  #4
Member
 
EM
Join Date: Sep 2019
Posts: 58
Rep Power: 6
gnwt4a is on a distinguished road
Did u run the same executable on each machine or did u use two different executables specifically compiled for each machine?


What compilers were used and what math libraries? Did u use mkl on intel?
What flag options and what optimizations?
--

Old   February 2, 2020, 21:39
Default
  #5
Member
 
Join Date: Nov 2011
Location: Czech Republic
Posts: 97
Rep Power: 14
Sixkillers is on a distinguished road
Quote:
Did u run the same executable on each machine or did u use two different executables specifically compiled for each machine?


What compilers were used and what math libraries? Did u use mkl on intel?
What flag options and what optimizations?
Both Fluent and CFX are commercial products, so you cannot recompile them with different flags.
You've got precompiled binaries for Windows/Linux and that's all.

As far as I know both use MKL, and the CFX solver is compiled with the Intel Fortran Compiler.

It might be interesting to set the environment variable MKL_DEBUG_CPU_TYPE=5 on the AMD system to see if there is any impact on performance.
Details can be found here.
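In case anyone wants to try it, a minimal sketch of setting the variable just for the solver process from a Python wrapper on Windows - the Fluent path and arguments below are placeholders for whatever you normally launch, and you could equally set the variable system-wide through the Windows environment-variable dialog:

Code:
import os
import subprocess

env = os.environ.copy()
env["MKL_DEBUG_CPU_TYPE"] = "5"   # asks older MKL builds for the AVX2 code path
                                  # (newer MKL releases reportedly ignore this variable)

# Placeholder command line - substitute your actual Fluent/CFX launcher and arguments.
cmd = [r"C:\Program Files\ANSYS Inc\v195\fluent\ntbin\win64\fluent.exe",
       "3ddp", "-t32", "-i", "run.jou"]

subprocess.run(cmd, env=env, check=True)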

Thank you for your post!
Habib-CFD likes this.

Old   February 3, 2020, 03:25
Default
  #6
Member
 
EM
Join Date: Sep 2019
Posts: 58
Rep Power: 6
gnwt4a is on a distinguished road
Ok. You do not have special access to these commercial codes.


Here is a suggestion: try nektar++. It comes as a precompiled binary or u can download the source and compile it yourself. Run any 3d case (u have to set it up yourself) - channel/duct/pipe/lid-driven cavity - for (say) 100 steps and, if u can, use ~200 million nodes or the highest u can. Use polynomials of at least order 10 (20 or more would be nice).
--

Old   February 3, 2020, 04:44
Default
  #7
SLC
Member
 
Join Date: Jul 2011
Posts: 53
Rep Power: 14
SLC is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
Those are some very thorough investigations, with interesting results. Thanks for publishing.

Just out of curiosity, could you run one more comparison with the AMD system? It doesn't have to be a full scaling analysis, just one more data point with max cores would suffice.
The change I am interested in: drop down the memory transfer speed to 2933 MT/s. I recently learned that this is the maximum frequency on Epyc Rome CPUs where infinity fabric and memory can run in sync. Compared to 3200 MT/s, you should get a little less bandwidth, but much better memory access times.
Since you are on Windows, it should be easy to check IF speed with HWinfo or CPU-Z.
See bottom of page 10 in this documentation for reference: https://developer.amd.com/wp-content...56745_0.80.pdf

I've just tried, and it appears I am not able to change the memory speed on this PowerEdge R6525. It is locked at 3200 in the BIOS/iDRAC.

Old   February 3, 2020, 05:03
Default
  #8
SLC
Member
 
Join Date: Jul 2011
Posts: 53
Rep Power: 14
SLC is on a distinguished road
Quote:
Originally Posted by Sixkillers View Post
It might be interesting to set the environment variable MKL_DEBUG_CPU_TYPE=5 on the AMD system to see if there is any impact on performance.
Details can be found here.

I just tried this; there was no performance change in either Fluent or CFX.
Sixkillers likes this.

Old   February 3, 2020, 05:04
Default
  #9
SLC
Member
 
Join Date: Jul 2011
Posts: 53
Rep Power: 14
SLC is on a distinguished road
Quote:
Originally Posted by gnwt4a View Post
Ok. You do not have special access to these commercial codes.


Here is a suggestion: try nektar++. It comes as a precompiled binary or u can download the source and compile it yourself. Run any 3d case (u have to set it up yourself) - channel/duct/pipe/lid-driven cavity - for (say) 100 steps and, if u can, use ~200 million nodes or the highest u can. Use polynomials of at least order 10 (20 or more would be nice).
--

Sorry, I'm running Windows and it looks like it's quite a lot of work to compile nektar++.
flotus1 likes this.

Old   February 3, 2020, 05:40
Default
  #10
flotus1 (Alex)
Super Moderator

Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
Quote:
Originally Posted by SLC View Post
I've just tried, and it appears I am not able to change the memory speed on this PowerEdge R6525. It is locked at 3200 in the BIOS/iDRAC.
Bummer. There is usually a setting somewhere in the vicinity that unlocks the memory speed from auto to manual. But maybe Dell's BIOS is even more locked down than what I am used to.

Would still be interesting to see what the Infinity Fabric is clocked at in your system. In CPU-Z, it should be the value of "NB frequency", and HWInfo should have an entry for Infinity Fabric.

Old   February 3, 2020, 07:22
Default
  #11
SLC
Member
 
Join Date: Jul 2011
Posts: 53
Rep Power: 14
SLC is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
Bummer. There is usually a setting somewhere in the vicinity that unlocks the memory speed from auto to manual. But maybe Dell's BIOS is even more locked down than what I am used to.

You were right!

I had to change the power profile from "Maximum Performance" to "Custom", which was on a different page from the memory settings.

Fluent benchmark aircraft_wing_14m @ 32 cores:

3200 MHz - 122.4 s
2932 MHz - 160.8 s

HOWEVER! It would seem there is a bug in the Dell BIOS. When selecting the 2932 memory speed, it actually clocked the memory all the way down to 1600 MT/s (memory clock reported as 800 MHz in CPU-Z/HWinfo, i.e. 1600 MT/s effective, since DDR4 transfers twice per clock).

I've searched high and low and can't find an entry for NB frequency or Infinity Fabric...

Old   February 4, 2020, 16:31
Default
  #12
flotus1 (Alex)
Super Moderator

Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
Nice catch with that bug.

HWInfo needs to be a recent version, maybe even a beta. Don't know if they already implemented this in a release version. Of course, the sensor reading could just fail because it is unfamiliar with your server hardware.
In CPU-Z, you should find it in the memory tab.

[Attached screenshots: cpuz_nb.png, hwinfo_nb.png]

Old   February 11, 2020, 12:01
Default
  #13
Senior Member
 
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,167
Rep Power: 23
evcelica is on a distinguished road
It is interesting that the AMD system performs most efficiently in CFX when running at a core count that is a multiple of the number of memory channels: 8, 16, 24. It always drops in performance at a core count slightly higher than these numbers, as if the memory load has become unbalanced.

Old   February 11, 2020, 12:25
Default
  #14
flotus1 (Alex)
Super Moderator

Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
My take on this: it is caused by the chiplet design.
The 7302 should have 4 active chiplets with 4 cores each (so 8 chiplets across the two sockets). Maybe SLC could verify that with a look at lstopo under Linux...
So running on 16 cores, each chiplet has 2 threads assigned to it, and each thread has access to the same amount of shared resources: L3 cache, chiplet-to-I/O-die bandwidth, memory bandwidth.
Going to 17 cores, one chiplet has to take on 3 threads, which is 50 % more than all the other chiplets, leaving those 3 threads with significantly fewer shared resources than the rest. In addition, boost frequency is determined by the number of threads per chiplet, so the cores on that chiplet might clock lower than the others. Since the slowest thread determines overall performance, this imbalance leads to a drop in performance.
A more traditional dual-socket system using monolithic CPU dies experiences similar contention of shared CPU resources, but the imbalance is much less pronounced.
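To make the imbalance concrete, here is a small sketch (my own illustration; it assumes the dual 7302 box exposes 2 sockets x 4 chiplets x 4 cores = 8 chiplets, and that solver processes get spread as evenly as possible across them) of the per-chiplet thread count at a few core counts:

Code:
# Distribute N solver processes over the assumed 8 chiplets as evenly as
# possible and show the resulting per-chiplet load.
CHIPLETS = 8

def chiplet_load(n_procs):
    base, extra = divmod(n_procs, CHIPLETS)
    return [base + 1 if i < extra else base for i in range(CHIPLETS)]

for n in (16, 17, 24, 26, 32):
    load = chiplet_load(n)
    # The busiest chiplet sets the pace: its threads share L3 cache and
    # chiplet-to-I/O-die bandwidth with more neighbours than the others do.
    print(f"{n:2d} cores -> threads per chiplet {load}")
# 16, 24 and 32 cores load every chiplet evenly; 17 or 26 cores leave some
# chiplets with one extra thread, matching the dips SLC measured.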

Old   February 12, 2020, 07:10
Default
  #15
Member
 
Join Date: Dec 2016
Posts: 44
Rep Power: 9
Duke711 is on a distinguished road
Quote:
Originally Posted by SLC View Post
You were right! Fluent benchmark aircraft_wing_14m @ 32 cores:

3200 MHz - 122.4 s
2932 MHz - 160.8 s

HOWEVER! It would seem there is a bug in the Dell BIOS. When selecting the 2932 memory speed, it actually clocked the memory all the way down to 1600 MT/s (memory clock reported as 800 MHz in CPU-Z/HWinfo).

So really the comparison is:

3200 MHz - 122.4 s
1600 MHz - 160.8 s

The speedup therefore scales with memory frequency to roughly the power 0.4:

2^0.4 = 1.32, which matches 160.8/122.4 = 1.31

I have noticed a similar situation on a DDR3 system: Epyc 7551 vs 6850K in the Fluent benchmarks. The memory speed was 50 % higher (2400 vs 1600 MT/s), but the speedup was only about +30 %, i.e. scaling as roughly 1.5^0.66 with memory bandwidth.
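Restated explicitly (just a rearrangement of the numbers above, nothing new measured): if the speedup is modelled as the memory-speed ratio raised to a power alpha, then:

Code:
from math import log

# speedup ~ (f_high / f_low) ** alpha  =>  alpha = ln(speedup) / ln(f_high / f_low)
alpha_rome = log(160.8 / 122.4) / log(3200 / 1600)   # ~0.39 for the Rome results above
alpha_ddr3 = log(1.30) / log(2400 / 1600)            # ~0.65 for the 7551 vs 6850K case
print(f"alpha (Rome, 1600 vs 3200 MT/s): {alpha_rome:.2f}")
print(f"alpha (7551 vs 6850K, 1600 vs 2400 MT/s): {alpha_ddr3:.2f}")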

Old   June 11, 2020, 10:00
Default single vs dual cpu
  #16
New Member
 
sida
Join Date: Dec 2019
Posts: 6
Rep Power: 6
sida is on a distinguished road
Quote:
Originally Posted by SLC View Post
I have been benchmarking two PowerEdge machines from Dell. One 24 core R640 (Intel Cascade Lake) and one 32 core R6525 (Epyc Rome). Both running Windows Server 2019.

Specs:

R640
2 x Intel Xeon Gold 6246 (Cascade Lake) 12c, 4.1 GHz all core turbo
12 x 16GB 2933 MHz RAM (Dual rank)
Sub-NUMA clustering enabled

R6525
2 x Epyc Rome 7302 16c, 3.3 GHz all core turbo
16 x 16GB 3200 MHz RAM (Dual rank)
NPS set to 4
Thanks for such a clear comparison. Would you please help me decide between 2x EPYC 7302 or one EPYC 7452?

Old   June 11, 2020, 10:07
Default
  #17
SLC
Member
 
Join Date: Jul 2011
Posts: 53
Rep Power: 14
SLC is on a distinguished road
Quote:
Originally Posted by sida View Post
Thanks for such a clear comparison. Would you please help me decide between 2x EPYC 7302 or one EPYC 7452?

2x7302 will be much, much quicker than one 7452.


Like 50 % quicker.
Freewill1 and sida like this.

Old   June 13, 2020, 16:35
Default
  #18
Member
 
Ivan
Join Date: Oct 2017
Location: 3rd planet
Posts: 34
Rep Power: 8
Noco is on a distinguished road
We bought a new 2x 7301 system in 2018 for ANSYS CFX tasks.

We have long calculations, 200+ hours for a single task.

On roughly every tenth task (about a 10 % chance) the CPUs stall and switch off within the first 15-60 hours. The motherboard keeps running, but we lose all progress.

It is not a temperature or BIOS problem - we have reinstalled and rechecked everything many times.

It is very uncomfortable for work when you have deadlines.

Because of this, we want to buy a new two-CPU Xeon-based cluster this year. We are too afraid to buy AMD Rome, even though it looks faster and more cost effective on paper and in general tests.

Very tired of 'dancing with drums' around this 2x 7301.

Old   June 13, 2020, 16:48
Default
  #19
New Member
 
sida
Join Date: Dec 2019
Posts: 6
Rep Power: 6
sida is on a distinguished road
Quote:
Originally Posted by Noco View Post
We bought a new 2x 7301 system in 2018 for ANSYS CFX tasks.
...
Very tired of 'dancing with drums' around this 2x 7301.
Did you use ECC memory?




