CFD Online Discussion Forums - AMD Epyc CFD benchmarks with Ansys Fluent

AMD Epyc CFD benchmarks with Ansys Fluent

Earlier this year, AMD introduced a new CPU architecture called "Zen".
The most interesting CPUs in their lineup from a CFD perspective are definitely the "Epyc" CPUs. They consist of 4 dies connected through an interconnect called "infinity fabric".
Each die has its own dual-channel memory controller resulting in 8 memory channels per CPU. Some of these Epyc CPUs are 2S scalable, which means 16 memory channels and a theoretical memory bandwidth of 341GB/s (DDR4-2666) in a dual-socket node.
Now that these CPUs and motherboards are finally available, it is time to run some CFD benchmarks and compare them to an Intel system.
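
The 341GB/s figure quoted above follows from the DDR4 transfer rate, the 64-bit (8-byte) channel width and the channel count. A minimal sketch of that arithmetic; the function name is made up for illustration, and the Intel line assumes 4 memory channels per Broadwell-EP socket:

```python
def peak_bandwidth_gb_s(transfers_mt_s, channels, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_mt_s: DDR transfer rate in MT/s (e.g. 2666 for DDR4-2666)
    channels:       number of populated memory channels in the node
    bus_bytes:      width of one channel in bytes (64 bit = 8 bytes)
    """
    return transfers_mt_s * 1e6 * bus_bytes * channels / 1e9


# Dual-socket Epyc: 2 CPUs x 4 dies x 2 channels = 16 channels
epyc_2666 = peak_bandwidth_gb_s(2666, channels=16)
epyc_2133 = peak_bandwidth_gb_s(2133, channels=16)
# Dual-socket Broadwell-EP, assuming 4 channels per socket
xeon_2400 = peak_bandwidth_gb_s(2400, channels=8)

print(f"2x Epyc, DDR4-2666: {epyc_2666:.1f} GB/s")  # 341.2 GB/s
print(f"2x Epyc, DDR4-2133: {epyc_2133:.1f} GB/s")  # 273.0 GB/s
print(f"2x Xeon, DDR4-2400: {xeon_2400:.1f} GB/s")  # 153.6 GB/s
```

These are upper bounds only; sustained bandwidth in a real solver is lower.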

System Specifications

System "AMD"
CPU: 2x AMD Epyc 7301 (16 cores, 2.2GHz base, 2.7GHz all-core, 2.7GHz single-core)
RAM: 16x 16GB Samsung 2Rx4 DDR4-2133 reg ECC (use DDR4-2666 if you buy a system)
Mainboard: Supermicro H11DSi
GPU: Nvidia Geforce GTX 960 4GB
SSD: Intel S3500 800GB
PSU: Seasonic Focus Plus Platinum 850W (80+ platinum)

System "INTEL"
CPU: 2x Intel Xeon E5-2650v4 (12 cores, 2.2GHz base, 2.5GHz all-core, 2.9GHz single-core)
RAM: 8x 16GB Samsung 2Rx4 DDR4-2400 reg ECC
Mainboard: Supermicro X10DAX
GPU: Nvidia Quadro 2000 1GB
SSD: Intel S3500 800GB
PSU: Super Flower Golden Green HX 750W (80+ gold)

A note on memory: I would have liked to equip the AMD system with faster RAM, but there is no way I am buying memory at current prices, so I work with what I have. The difference in memory size is irrelevant: all benchmarks shown here fit in memory, and caches were cleared before each run.

Software:
Operating system: CentOS 7
Linux Kernel: 4.14.3-1
Fluent version: 18.2
CPU governor: performance
SMT/Hyperthreading: off

Fluent performance
I used some of the official Fluent benchmarks provided by Ansys. For a detailed description of the cases, see here: http://www.ansys.com/solutions/solut...ent-benchmarks
These benchmark results should be representative of many finite volume solvers with MPI parallelization. The results given are solver wall time in seconds.

1) External Flow Over an Aircraft Wing (aircraft_2m), single precision

AMD, 1 core, 10 iterations: 179.3 s
INTEL, 1 core, 10 iterations: 194.6 s

AMD, 24 cores, 100 iterations: 92.6 s
INTEL, 24 cores, 100 iterations: 121.9 s

AMD, 32 cores, 100 iterations: 78.4 s

2) External Flow Over an Aircraft Wing (aircraft_14m), double precision (note that the default setting for this benchmark is single precision)
This is the benchmark AMD used for their demonstration video: https://www.youtube.com/watch?v=gdYYRRDJDUc
Since they apparently used a two-node setup and different processors, I decided to drop comparability and use double precision to mix things up.

AMD, 24 cores, 10 iterations: 93.8 s
INTEL, 24 cores, 10 iterations: 118.2 s

AMD, 32 cores, 10 iterations: 72.2 s

3) 4-Stroke spray guided Gasoline Direct Injection model (ice_2m), double precision

AMD, 24 cores, 100 iterations: 220.2 s
INTEL, 24 cores, 100 iterations: 258.7 s

AMD, 32 cores, 100 iterations: 172.4 s

4) Flow through a combustor (combustor_12m)

AMD, 24 cores, 10 iterations: 339.6 s
INTEL, 24 cores, 10 iterations: 386.0 s

AMD, 32 cores, 10 iterations: 269.4 s
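
To put the four sets of timings above side by side, the lead of one system over the other can be expressed as a percentage. A small sketch using only the wall times listed above; the dictionary keys and the helper name are arbitrary:

```python
# Solver wall times in seconds, copied from benchmarks 1) to 4) above.
timings = {
    "aircraft_2m":   {"intel_24": 121.9, "amd_24": 92.6,  "amd_32": 78.4},
    "aircraft_14m":  {"intel_24": 118.2, "amd_24": 93.8,  "amd_32": 72.2},
    "ice_2m":        {"intel_24": 258.7, "amd_24": 220.2, "amd_32": 172.4},
    "combustor_12m": {"intel_24": 386.0, "amd_24": 339.6, "amd_32": 269.4},
}


def lead_percent(reference_s, candidate_s):
    """How much faster the candidate is than the reference, in percent."""
    return (reference_s / candidate_s - 1.0) * 100.0


for case, t in timings.items():
    same_cores = lead_percent(t["intel_24"], t["amd_24"])  # 24 vs. 24 cores
    all_cores = lead_percent(t["intel_24"], t["amd_32"])   # 32 vs. 24 cores
    print(f"{case:14s} 24 vs 24: {same_cores:5.1f} %   32 vs 24: {all_cores:5.1f} %")
```

With these numbers the largest lead is in aircraft_14m, where the 32-core AMD run is roughly 64% faster than the 24-core Intel run.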

A note on power consumption
Since the systems differ in terms of GPU and PSU, take my values with a grain of salt. Measuring power draw at the wall (using a Brennenstuhl PM231 E), the systems are actually pretty similar.
AMD, idle: ~115W
INTEL, idle: ~125W

AMD, solving aircraft_2m on 32 cores: ~350W
INTEL, solving aircraft_2m on 24 cores: ~320W
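
Because the two runs differ in both power draw and wall time, energy per run is a useful extra figure. A rough sketch that multiplies the approximate wall power above by the aircraft_2m wall times (100 iterations); it assumes the power draw stays constant over the whole run, which is a simplification:

```python
def energy_kj(power_w, wall_time_s):
    """Energy in kJ for a run at constant power draw (simplifying assumption)."""
    return power_w * wall_time_s / 1000.0


# aircraft_2m, 100 iterations: approximate wall power and solver wall time
amd_kj = energy_kj(350, 78.4)     # AMD, 32 cores
intel_kj = energy_kj(320, 121.9)  # INTEL, 24 cores

print(f"AMD:   {amd_kj:.1f} kJ")
print(f"INTEL: {intel_kj:.1f} kJ")
```

Given the rough power readings, the result (about 27 kJ versus about 39 kJ) should be read as an estimate only.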

The Verdict
A quite compelling comeback for AMD in terms of CFD performance. Their new Epyc lineup delivers exactly what Intel has only increased incrementally over the past few years: memory bandwidth.
Although the AMD system in this benchmark ran slower DDR4-2133 instead of the maximum supported 2666MT/s, it beats the Intel-system even in terms of per-core performance. Using all its cores, it pulls ahead of Intel by up to 63%.
Quite surprisingly, AMD even takes the lead in single-core performance. This might have to do with relatively large caches (512KB L2 per core instead of 256KB) and partially low latencies for cache access. However, with a different single-core in-house code that is more compute-bound (results not shown above) Intel pulls slightly ahead thanks to its higher clock speed.
Speaking of clock speed: in my opinion AMD is missing a spot in its lineup, namely a medium core-count CPU with higher clock speeds. The 16-core CPUs don't seem to be using their TDP entirely, so there should have been headroom for a higher-clocked variant to tackle Intel in the "per-core performance" sector. That would have made sense because Intel has not been idle in the meantime: their new Skylake-SP architecture offers 6 DDR4-2666 memory channels per CPU, variants with high clock speed and 4S scalability and beyond. So for users with high costs for per-core licenses, Intel probably still has an edge over AMD.
Which brings us to the costs: Epyc 7301 CPUs cost ~920€ - if they are available which is still a problem. For that kind of money all Intel has to offer are Xeon Silver with 12 cores and support for DDR4-2400. So when you are on a limited budget for a CFD workstation or need cost-efficient cluster nodes, you should consider AMD. They are back!

awesome!
many thanks for your work and the great documentation of details on what is compared here :-)

would be very interesting to see if Skylake (Xeon SP) brings big changes...

broadwell (v4) -> skylake (v5)
- 4 -> 6 memory channels
- ring bus -> mesh connection: for cores, L3, memory, i/o
- different L2/L3 architecture

It definitely would have been nice to have a Skylake-SP platform for comparison. But I won't get my hands on one any time soon. So all we can do is extrapolate their performance based on specifications, different benchmarks and the numbers that Intel is advertising:
https://www.intel.com/content/www/us...pc-fluent.html
Here they claim "up to 60% improvement" over the Haswell-EP (v3) platform. This translates to roughly 42% improvement compared to Broadwell-EP (v4).

I'm curious, did you try setting the core affinity on the 24 core EPYC simulation to ensure that it is using 6 cores per CCX? If the system decided to use 8 cores on one CCX, then you wouldn't be fully utilizing the memory bandwidth.

There is more improvement from 24 cores to 32 cores than I would have expected.

I did not mess with affinity settings or verify which of the 32 cores Fluent was using while solving on 24 cores.
But to me the values look quite ok. For the aircraft_2m benchmark, parallel efficiency is a whopping 81% on 24 cores and drops to 71% on 32 cores. Remember, running on 32 cores here means only two cores per memory channel. So there is some room for improvement even on high core counts.
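
The efficiency figures above can be reproduced from the aircraft_2m timings. Since the single-core run used 10 iterations and the parallel runs 100, the sketch below compares time per iteration; the function name is only illustrative:

```python
def parallel_efficiency(t_single_s, iters_single, t_parallel_s, iters_parallel, cores):
    """Parallel efficiency = speedup / cores, based on time per iteration."""
    per_iter_single = t_single_s / iters_single
    per_iter_parallel = t_parallel_s / iters_parallel
    speedup = per_iter_single / per_iter_parallel
    return speedup / cores


# aircraft_2m on the AMD system: 1 core (10 iterations) vs. 24 and 32 cores (100 iterations)
eff_24 = parallel_efficiency(179.3, 10, 92.6, 100, cores=24)
eff_32 = parallel_efficiency(179.3, 10, 78.4, 100, cores=32)

print(f"24 cores: {eff_24:.1%}")  # about 81 %
print(f"32 cores: {eff_32:.1%}")  # about 71 %
```

The same function applied to the Intel timings (194.6 s on 1 core, 121.9 s on 24 cores) gives roughly 67%.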

I am mainly curious to see if going for a 32 core EPYC was really worth the extra cost over a 24 core chip. Your analysis certainly suggests that it is, but such a huge improvement for the 32 over the 24 doesn't smell right to me (even recognizing that the parallel efficiency drops quite a bit). You are only adding cores and not memory bandwidth, so I would expect the difference to be much smaller.

I'm hypothesizing that the 24 core tests could be improved by setting the core affinity, bringing those results closer to the 32 core.

Edit - I just realized you're using 2x 16 core chips, not a single 32 core. In this case there is no way to use the memory bandwidth efficiently with 24 threads since you will have some cores share a memory controller and others will have their own. Hopefully someone else gets their hands on some of the bigger chips. A 16 thread benchmark would be interesting to see on your setup.

There is a total of 8 memory controllers in the system. One for each die.
So it is no problem to use the full memory bandwidth efficiently with 24 cores active: 3 cores per die. Both CCX on a die have access to the die's memory controller with no performance penalty.
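
To illustrate the "3 cores per die" placement, here is a hypothetical sketch that spreads a number of processes evenly over the dies. It assumes 8 dies with 4 cores each and that core IDs are numbered contiguously die by die; the real numbering depends on the BIOS and kernel and would have to be checked on the machine. Nothing here reflects how Fluent itself assigns cores:

```python
def spread_cores(n_procs, n_dies=8, cores_per_die=4):
    """Pick core IDs so that processes are spread evenly over all dies.

    Assumes core IDs are contiguous per die (die 0 = cores 0..3, die 1 = 4..7, ...),
    which is an illustrative assumption, not a property of every system.
    """
    if n_procs > n_dies * cores_per_die:
        raise ValueError("more processes than cores")
    per_die, extra = divmod(n_procs, n_dies)
    core_ids = []
    for die in range(n_dies):
        take = per_die + (1 if die < extra else 0)  # first dies absorb any remainder
        first = die * cores_per_die
        core_ids.extend(range(first, first + take))
    return core_ids


cores_24 = spread_cores(24)
print(cores_24)  # three consecutive core IDs from each of the 8 dies
```

On Linux, a list like this could be passed to Python's `os.sched_setaffinity` for a test process, or translated into whatever binding options the MPI launcher in use provides.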

Did the benchmark aircraft_14m with double precision on our 32-core cluster:
- 4 x (dual E5-2637v3 (4-core, 3.5 GHz), 64GB DDR4-2133)
- interconnect = FDR infiniband
- Red Hat Enterprise Linux 6.7

10 iterations took 74.8 sec.

I would never have bet that AMD would match this (it actually beats it by a bit, with room left for DDR4-2666). Pretty good news.

That is indeed an interesting result. Based on the specifications of your cluster, I would have expected that it performs a bit better than my Epyc workstation. Mainly because it is a pretty perfect setup for CFD and I would expect parallel efficiency to be above 100% with that kind of hardware. Did you clear caches before running the benchmark? I found this to be essential for consistent results. If you have the time you could try running the benchmark again on a single core.

Yes I did clear the cache with (flush-cache).

Didn't have time for a single core run, but did a single node run with 8 cores: 477 sec. That was using 50GB of RAM out of the 64 available on the node.

Now a more interesting result would be a comparison with Xeon Scalable, most notably the 6144, which might be the fastest one for Fluent.

So scaling is in fact super-linear but the individual nodes are a little on the slow side...
I second that a direct comparison with a Skylake-SP Xeon would be interesting. But a single Xeon Gold 6144 costs about as much as I paid for the whole Epyc workstation. So I am not the one running these tests :D
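
The super-linear scaling can be checked from the two cluster timings reported above. This sketch assumes the single-node run (477 s on 8 cores) also covered 10 iterations of aircraft_14m, which the post does not state explicitly:

```python
def scaling_efficiency(t_base_s, cores_base, t_scaled_s, cores_scaled):
    """Efficiency of the scaled run relative to the base run (1.0 = linear scaling)."""
    speedup = t_base_s / t_scaled_s
    ideal = cores_scaled / cores_base
    return speedup / ideal


# 1 node (8 cores) vs. 4 nodes (32 cores), aircraft_14m, double precision
eff = scaling_efficiency(477.0, 8, 74.8, 32)
print(f"speedup: {477.0 / 74.8:.2f}x on 4x the cores, efficiency: {eff:.0%}")
```

A value above 100% is plausible here: each extra node adds memory bandwidth and cache, not just cores.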

Many thanks, flotus1, for sharing your results; they give a good idea of the CPU's capabilities. Did you have the opportunity to test it further with other CFD software? I am considering ordering these as well in the coming months. In the end, are you convinced?

I am completely convinced that AMD Epyc is currently the best processor for parallel CFD, at least with a price/performance ratio in mind.
I did not run any other commercial CFD codes for testing. Do you have anything specific in mind?
Our in-house OpenMP parallel LB code also runs pretty well, more than 50% faster than on the Intel platform.
The Palabos benchmark results for AMD (higher is better):

Code:

#threads          msu_100   msu_400   msu_1000
01(1 die)           9.369    12.720      7.840
02(2 dies)         17.182    24.809     19.102
04(4 dies)         33.460    48.814     49.291
08(8 dies)         56.289    95.870    105.716
16(8 dies)        102.307   158.212    158.968
32(8 dies)        169.955   252.729    294.178

And Intel

Code:

#threads   msu_100   msu_400
01           8.412    11.747
24          88.268   154.787

The only thing you need to be aware of is the "unorthodox" CPU architecture. The small NUMA nodes on AMD Epyc hinder performance for low core count jobs that use a lot of memory. You can see it in the results for problem size 1000. It uses >100GB of memory which does not fit into the memory region of one single NUMA node (32GB in my case). Hence the poor results for low core counts. I made the same observation with my single-core grid generator. And it will probably be the same problem for most grid generators that are not MPI parallel.
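
The NUMA effect described above can be made concrete with a little arithmetic on the system specification and the AMD table. A sketch assuming 16x 16GB DIMMs spread evenly over 8 NUMA nodes (one per die) and taking 100GB as a lower bound for the problem-size-1000 case:

```python
import math

total_memory_gb = 16 * 16  # 16 DIMMs of 16GB each
numa_nodes = 8             # one NUMA node per die, 4 dies per CPU, 2 CPUs
node_memory_gb = total_memory_gb / numa_nodes

problem_gb = 100           # lower bound; the post says ">100GB"
min_nodes_touched = math.ceil(problem_gb / node_memory_gb)

print(f"memory per NUMA node: {node_memory_gb:.0f} GB")
print(f"NUMA nodes needed for {problem_gb} GB: at least {min_nodes_touched}")

# Throughput of problem size 1000 relative to size 400, from the AMD table above
ratio_1_thread = 7.840 / 12.720
ratio_32_threads = 294.178 / 252.729
print(f"size 1000 vs 400, 1 thread:   {ratio_1_thread:.2f}")
print(f"size 1000 vs 400, 32 threads: {ratio_32_threads:.2f}")
```

With one thread, the large case runs at about 0.62 times the throughput of the medium case because most of its memory sits on remote NUMA nodes; with all 32 threads active the ratio rises above 1.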

Thanks for sharing these results, flotus, impressive performance for sure. Some OpenFOAM benchmark cases would also be interesting to see. I have just ordered a new workstation myself and had to get an Intel-based system for a variety of reasons, so it would be nice to compare the difference.

I would need rather specific directions on what to test with OpenFOAM and how to do it exactly. I never really used it apart from running some tutorials a few years ago. So I don't feel confident that I could provide reliable results.

I will try to run some benchmarks after I receive my workstation, then I will post the results along with the setup here.

What would be the current optimum price/performance configuration to buy?

My understanding:

System "AMD"
CPU: 2x AMD Epyc 7601 (32 cores, 2.2GHz base, 3.2GHz turbo) - will 32 cores give me extra performance worth the money? Or is 16 cores more or less the optimum because of the DDR channels?
RAM: 16x 16GB Samsung 2Rx4 DDR4-2666 reg ECC (maybe some DDR4-3600 if it exists for the Supermicro H11DSi-NT motherboard). What amount and speed do I need for an optimum with 2x16 and 2x32 cores?
Mainboard: Supermicro H11DSi-NT (for Ethernet speed, to add some more computers using the Ansys CFX parallel solver)
GPU: No
SSD: Samsung 850, 512GB, for the system
HD: 4x 8TB Seagate Enterprise SATA III 3.5" in RAID 5 or 6 to make it safe (if one HD fails, you can recover with this type of RAID)
PSU: be quiet! 1200W

As alternative:

Maybe try to invest in 1 CPU (72 cores, 1.5GHz, 1.7GHz turbo) Intel Xeon Phi 7290F? Does anyone run/own such a computer with CFX/Fluent?

https://www.intel.com/content/www/us...548.1516002678

https://www.youtube.com/watch?v=I0U6ZMeVrB4

Quote:

CPU: 2x AMD Epyc 7601 (32 cores, 2.2GHz base, 3.2GHz turbo) - will 32 cores give me extra performance worth the money? Or is 16 cores more or less the optimum because of the DDR channels?
RAM: 16x 16GB Samsung 2Rx4 DDR4-2666 reg ECC (maybe some DDR4-3600 if it exists for the Supermicro H11DSi-NT motherboard). What amount and speed do I need for an optimum with 2x16 and 2x32 cores?
Mainboard: Supermicro H11DSi-NT (for Ethernet speed, to add some more computers using the Ansys CFX parallel solver)
GPU: No
SSD: Samsung 850, 512GB, for the system
HD: 4x 8TB Seagate Enterprise SATA III 3.5" in RAID 5 or 6 to make it safe (if one HD fails, you can recover with this type of RAID)
PSU: be quiet! 1200W

Price/Performance: AMD Epyc 7301 over anything else.
It costs less than 900$ compared to nearly 4000$ for the 32-core variant. You could build two systems with the same total amount of cores and much higher total performance.
You need 16 DIMMs for this platform. Overclocking memory is no longer a thing with server platforms, so stick to DDR4-2666 maximum. There is no faster reg ECC available anyway.
Unless this is supposed to be a headless node, put in at least a small GPU like a GTX 1050 Ti.
A 1200W power supply is a bit on the high side, the system as configured will never draw more than 400W. My power supply is rated for 850W (Seasonic Focus Plus Platinum) only because it has more connectors than the 750W variant.
Speaking of 10G Ethernet: You could give it a try, but in the end you might want to switch to infiniband if you connect more nodes.

Xeon Phi is not an alternative unless you are running code developed specifically for this platform. Commercial software like Fluent and CFX does not make full use of the potential of this architecture; this is still under development. And even if it did, I highly doubt that it would outperform dual Epyc for CFD workloads.
There may not be many CFD benchmarks available for this type of processor, but this already tells a lot: If it were actually faster than normal platforms for CFD, Ansys and Intel marketing would not stop bragging about it.

Thank you!

Some small issues:
1. Do I need water cooling?
2. Server-like horizontal case or large vertical tower?
3. If water cooling: four 120 mm radiators per CPU?

My 18-core i9 with two 120 mm water-cooling radiators reaches up to 110 °C after 1 hour of solving.

I am using Noctua NH-U14S TR4-SP3 air coolers. The CPUs themselves run pretty cool thanks to the large surface area (and the soldered heatspreader, which Intel's i9 CPUs lack), so water cooling is completely unnecessary from a thermal point of view. If you do it for aesthetics or some other reason, go ahead :D

The type of case is up to you; it depends on whether you prefer rackmount or workstation. I have a normal E-ATX workstation case, currently a Nanoxia Deep Silence 2, but I am switching to a Fractal Design Define XL R2 for better build quality.