CFD Online Discussion Forums - Hardware
AMD Epyc CFD benchmarks with Ansys Fluent (https://www.cfd-online.com/Forums/hardware/196400-amd-epyc-cfd-benchmarks-ansys-fluent.html)

flotus1 December 3, 2017 16:50

AMD Epyc CFD benchmarks with Ansys Fluent
 
Earlier this year, AMD introduced a new CPU architecture called "Zen".
The most interesting CPUs in their lineup from a CFD perspective are definitely the "Epyc" CPUs. They consist of 4 dies connected through an interconnect called "Infinity Fabric".
Each die has its own dual-channel memory controller, resulting in 8 memory channels per CPU. Some of these Epyc CPUs are 2S scalable, which means 16 memory channels and a theoretical memory bandwidth of 341GB/s (DDR4-2666) in a dual-socket node.
Now that these CPUs and motherboards are finally available, it is time to run some CFD benchmarks and compare them to an Intel system.
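For reference, the 341GB/s figure above is simply channel count times transfer rate times channel width:
Code:

channels = 16                # 8 channels per CPU, 2 CPUs
transfer_rate = 2666e6       # DDR4-2666: 2666 million transfers per second
bytes_per_transfer = 8       # 64-bit wide channel
print(channels * transfer_rate * bytes_per_transfer / 1e9, "GB/s")  # -> ~341 GB/s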

System Specifications

System "AMD"
CPU: 2x AMD Epyc 7301 (16 cores, 2.2GHz base, 2.7GHz all-core, 2.7GHz single-core)
RAM: 16x 16GB Samsung 2Rx4 DDR4-2133 reg ECC (use DDR4-2666 if you buy a system)
Mainboard: Supermicro H11DSi
GPU: Nvidia Geforce GTX 960 4GB
SSD: Intel S3500 800GB
PSU: Seasonic Focus Plus Platinum 850W (80+ platinum)

System "INTEL"
CPU: 2x Intel Xeon E5-2650v4 (12 cores, 2.2GHz base, 2.5GHz all-core, 2.9GHz single-core)
RAM: 8x 16GB Samsung 2Rx4 DDR4-2400 reg ECC
Mainboard: Supermicro X10DAX
GPU: Nvidia Quadro 2000 1GB
SSD: Intel S3500 800GB
PSU: Super Flower Golden Green HX 750W (80+ gold)

A note on memory: I would have liked to equip the AMD system with faster RAM, but there is no way I am buying memory at the current prices. So I work with what I have. The difference in memory size is irrelevant: all benchmarks shown here fit in memory, and caches were cleared before each run.

Software:
Operating system: CentOS 7
Linux Kernel: 4.14.3-1
Fluent version: 18.2
CPU governor: performance
SMT/Hyperthreading: off
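For anyone trying to reproduce the setup: the governor setting and the cache clearing mentioned above boil down to something like this on CentOS (just a sketch of one way to do it, needs root):
Code:

import glob, subprocess

# set the "performance" cpufreq governor on all cores
for path in glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor"):
    with open(path, "w") as f:
        f.write("performance")

# flush dirty pages and drop the Linux page cache before a benchmark run
subprocess.run(["sync"], check=True)
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3")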

Fluent performance
I used some of the official Fluent benchmarks provided by Ansys. For a detailed description of the cases, see here: http://www.ansys.com/solutions/solut...ent-benchmarks
These benchmark results should be representative of many finite-volume solvers with MPI parallelization. The results given are solver wall-time in seconds.

1) External Flow Over an Aircraft Wing (aircraft_2m), single precision

AMD, 1 core, 10 iterations: 179.3 s
INTEL, 1 core, 10 iterations: 194.6 s

AMD, 24 cores, 100 iterations: 92.6 s
INTEL, 24 cores, 100 iterations: 121.9 s

AMD, 32 cores, 100 iterations: 78.4 s


2) External Flow Over an Aircraft Wing (aircraft_14m), double precision (note that the default setting for this benchmark is single precision)
This is the benchmark AMD used for their demonstration video: https://www.youtube.com/watch?v=gdYYRRDJDUc
Since they apparently used a two-node setup and different processors, I decided to drop comparability and use double precision to mix things up.

AMD, 24 cores, 10 iterations: 93.8 s
INTEL, 24 cores, 10 iterations: 118.2 s

AMD, 32 cores, 10 iterations: 72.2 s


3) 4-Stroke spray guided Gasoline Direct Injection model (ice_2m), double precision

AMD, 24 cores, 100 iterations: 220.2 s
INTEL, 24 cores, 100 iterations: 258.7 s

AMD, 32 cores, 100 iterations: 172.4 s


4) Flow through a combustor (combustor_12m)

AMD, 24 cores, 10 iterations: 339.6 s
INTEL, 24 cores, 10 iterations: 386.0 s

AMD, 32 cores, 10 iterations: 269.4 s


A note on power consumption
Since the systems differ in terms of GPU and PSU, take my values with a grain of salt. Measuring power draw at the wall (using a Brennenstuhl PM231 E), the systems are actually pretty similar.
AMD, idle: ~115W
INTEL, idle: ~125W

AMD, solving aircraft_2m on 32 cores: ~350W
INTEL, solving aircraft_2m on 24 cores: ~320W


The Verdict
A quite compelling comeback for AMD in terms of CFD performance. Their new Epyc lineup delivers exactly what Intel has only increased incrementally over the past few years: memory bandwidth.
Although the AMD system in this benchmark ran with slower DDR4-2133 instead of the maximum supported 2666MT/s, it beats the Intel system even in terms of per-core performance. Using all its cores, it pulls ahead of Intel by up to 63%.
Quite surprisingly, AMD even takes the lead in single-core performance. This might have to do with the relatively large caches (512KB of L2 per core instead of 256KB) and, in part, with low cache-access latencies. However, with a different single-core in-house code that is more compute-bound (results not shown here), Intel pulls slightly ahead thanks to its higher clock speed.
Speaking of clock speed: in my opinion AMD is missing a spot in its lineup: a medium core-count CPU with higher clock speeds. The 16-core CPUs don't seem to be using their TDP entirely, so there should have been headroom for a higher-clocked variant to tackle Intel in the "per-core performance" sector. That would have made sense because Intel has not been idle in the meantime: their new Skylake-SP architecture offers 6 DDR4-2666 memory channels per CPU, variants with high clock speeds, and 4S scalability and beyond. So for users paying high per-core license costs, Intel probably still has an edge over AMD.
Which brings us to cost: Epyc 7301 CPUs cost ~920€ - if they are available, which is still a problem. For that kind of money, all Intel has to offer are Xeon Silver CPUs with 12 cores and support for DDR4-2400. So if you are on a limited budget for a CFD workstation or need cost-efficient cluster nodes, you should consider AMD. They are back!
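For the record, the "up to 63%" comes from comparing the full-machine runs above (AMD on 32 cores vs. Intel on 24 cores):
Code:

results = {"aircraft_2m": (121.9, 78.4), "aircraft_14m": (118.2, 72.2),
           "ice_2m": (258.7, 172.4), "combustor_12m": (386.0, 269.4)}
for case, (t_intel, t_amd) in results.items():
    print(f"{case}: AMD ahead by {(t_intel / t_amd - 1) * 100:.1f} %")
# -> 55.5 %, 63.7 %, 50.1 %, 43.3 %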

hpvd December 4, 2017 04:49

awesome!
many thanks for your work and the great documentation of details on what is compared here :-)

would be very interesting to see if Skylake (Xeon SP) brings big changes...

broadwell (v4) -> skylake (v5)
- 4 -> 6 memory channels
- ring bus -> mesh connection: for cores, L3, memory, i/o
- different L2/L3 architecture

flotus1 December 4, 2017 11:37

It definitely would have been nice to have a Skylake-SP platform for comparison. But I won't get my hands on one any time soon. So all we can do is extrapolate their performance based on specifications, different benchmarks and the numbers that Intel is advertising:
https://www.intel.com/content/www/us...pc-fluent.html
Here they claim "up to 60% improvement" over the Haswell-EP (v3) platform. This translates to roughly 42% improvement compared to Broadwell-EP (v4).
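To make that conversion explicit: the ~13% Haswell-to-Broadwell step below is an assumption, roughly what the 60%-to-42% relation implies.
Code:

skylake_vs_haswell = 1.60    # Intel's "up to 60%" claim over Haswell-EP (v3)
broadwell_vs_haswell = 1.13  # assumed v3 -> v4 generational step of ~13%
print(f"Skylake vs Broadwell: +{(skylake_vs_haswell / broadwell_vs_haswell - 1) * 100:.0f} %")
# -> about +42 %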

kyle December 4, 2017 11:52

I'm curious, did you try setting the core affinity on the 24-core EPYC simulation to ensure that it is using 6 cores per CCX? If the system decided to use 8 cores on one CCX, then you wouldn't be fully utilizing the memory bandwidth.

There is more improvement from 24 cores to 32 cores than I would have expected.

flotus1 December 4, 2017 13:09

I did not mess with affinity settings or verify which of the 32 cores Fluent was using while solving on 24 cores.
But to me the values look quite ok. For the aircraft_2m benchmark, parallel efficiency is a whopping 81% on 24 cores and drops to 71% on 32 cores. Remember, running on 32 cores here means only two cores per memory channel. So there is some room for improvement even on high core counts.
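The arithmetic behind those numbers, using the aircraft_2m timings from the first post (single-core result scaled from 10 to 100 iterations):
Code:

t_serial = 179.3 * 10                 # 1 core, scaled to 100 iterations
for cores, t in [(24, 92.6), (32, 78.4)]:
    print(cores, "cores:", round(t_serial / t / cores * 100), "%")
# -> 24 cores: 81 %, 32 cores: 71 %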

kyle December 4, 2017 14:01

I am mainly curious to see if going for a 32-core EPYC was really worth the extra cost over a 24-core chip. Your analysis certainly suggests that it is, but such a huge improvement for the 32 over the 24 doesn't smell right to me (even recognizing that the parallel efficiency drops quite a bit). You are only adding cores and not memory bandwidth, so I would expect the difference to be much smaller.

I'm hypothesizing that the 24 core tests could be improved by setting the core affinity, bringing those results closer to the 32 core.

Edit - I just realized you're using 2x 16 core chips, not a single 32 core. In this case there is no way to use the memory bandwidth efficiently with 24 threads since you will have some cores share a memory controller and others will have their own. Hopefully someone else gets their hands on some of the bigger chips. A 16 thread benchmark would be interesting to see on your setup.

flotus1 December 4, 2017 14:23

There is a total of 8 memory controllers in the system, one for each die.
So it is no problem to use the full memory bandwidth efficiently with 24 cores active: 3 cores per die. Both CCX on a die have access to the die's memory controller with no performance penalty.
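If you want to check how the ranks end up spread over the dies, the core-to-NUMA-node mapping can be read straight from sysfs, for example:
Code:

import glob, os

# print which cores belong to which NUMA node (= die on Epyc)
for path in sorted(glob.glob("/sys/devices/system/node/node*/cpulist")):
    node = os.path.basename(os.path.dirname(path))
    with open(path) as f:
        print(node, f.read().strip())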

Micael December 5, 2017 15:40

Did the benchmark aircraft_14m with double precision on our 32-core cluster:
- 4 x (dual E5-2637v3 (4-core, 3.5 GHz), 64GB DDR4-2133)
- interconnect = FDR infiniband
- Red Hat Enterprise Linux 6.7

10 iterations took 74.8 sec.

Would never have bet that AMD would match this (it actually beats it by a bit, with still room for DDR4-2666), pretty good news.

flotus1 December 6, 2017 04:23

That is indeed an interesting result. Based on the specifications of your cluster, I would have expected that it performs a bit better than my Epyc workstation. Mainly because it is a pretty perfect setup for CFD and I would expect parallel efficiency to be above 100% with that kind of hardware. Did you clear caches before running the benchmark? I found this to be essential for consistent results. If you have the time you could try running the benchmark again on a single core.

Micael December 6, 2017 10:34

Yes I did clear the cache with (flush-cache).

Didn't have time for a single-core run, but did a single-node run on 8 cores: 477 sec. That was using 50GB of RAM out of the 64 available on the node.

Now a more interesting result would be a comparison with Scalable Xeons, most notably the 6144, which might be the fastest one for Fluent.

flotus1 December 6, 2017 10:51

So scaling is in fact super-linear but the individual nodes are a little on the slow side...
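Putting numbers on that, from the two data points you posted:
Code:

t_1node, t_4nodes = 477.0, 74.8        # aircraft_14m DP, 10 iterations each
speedup = t_1node / t_4nodes
print(round(speedup, 1), "x on 4x the nodes ->", round(speedup / 4 * 100), "% efficiency")
# -> 6.4 x on 4x the nodes -> 159 % efficiency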
I second that a direct comparison with a Skylake-SP Xeon would be interesting. But a single Xeon gold 6144 costs about as much as I paid for the whole Epyc workstation. So I am not the one running these tests :D

naffrancois December 13, 2017 15:52

Many thanks flotus1 for sharing your results, it gives a good idea of the CPU's capabilities. Did you have the opportunity to test it further with other CFD software? I am considering ordering these in the next months as well; in the end, are you convinced?

flotus1 December 14, 2017 04:27

I am completely convinced that AMD Epyc is currently the best processor for parallel CFD, at least with a price/performance ratio in mind.
I did not run any other commercial CFD codes for testing. Do you have anything specific in mind?
Our in-house OpenMP parallel LB code also runs pretty well, more than 50% faster than on the Intel platform.
The Palabos benchmark results for AMD (higher is better):
Code:

#threads      msu_100  msu_400  msu_1000
01 (1 die)      9.369   12.720     7.840
02 (2 dies)    17.182   24.809    19.102
04 (4 dies)    33.460   48.814    49.291
08 (8 dies)    56.289   95.870   105.716
16 (8 dies)   102.307  158.212   158.968
32 (8 dies)   169.955  252.729   294.178

And Intel
Code:

#threads    msu_100   msu_400
01            8.412    11.747
24           88.268   154.787

The only thing you need to be aware of is the "unorthodox" CPU architecture. The small NUMA nodes on AMD Epyc hinder performance for low core count jobs that use a lot of memory. You can see it in the results for problem size 1000. It uses >100GB of memory which does not fit into the memory region of one single NUMA node (32GB in my case). Hence the poor results for low core counts. I made the same observation with my single-core grid generator. And it will probably be the same problem for most grid generators that are not MPI parallel.

eric December 14, 2017 10:12

Thanks for sharing these results, flotus, impressive performance for sure. Some OpenFOAM benchmark cases would also be interesting to see. I have just ordered a new workstation myself and had to get an Intel-based system due to a variety of reasons, it would be nice to compare the difference.

flotus1 December 14, 2017 10:26

I would need rather specific directions what to test with OpenFOAM and how to do it exactly. I never really used it apart from running some tutorials a few years ago. So I don't feel confident to provide reliable results.

eric December 18, 2017 03:53

I will try to run some benchmarks after I receive my workstation, then I will post the results along with the setup here.

Noco January 15, 2018 03:04

What will be the current optimum price/performance configuration to buy?

My understanding:

System "AMD"
CPU: 2x AMD Epyc 7601 (32 cores, 2.2GHz base, 3.2GHz turbo) - will 32 cores give me extra power worth the money, or are 16 cores more or less the optimum because of the DDR channels?
RAM: 16x 16GB Samsung 2Rx4 DDR4-2666 reg ECC (maybe some DDR4-3600 if it exists for the Supermicro H11DSi-NT motherboard). Which amount and speed do I need for optimum performance with 2x16 and 2x32 cores?
Mainboard: Supermicro H11DSi-NT (for Ethernet speed, to add some more computers using the Ansys CFX parallel solver)
GPU: No
SSD: Samsung 850 - 512GB for the system
HDD: 4x 8TB Seagate Enterprise SATA III 3.5" - RAID 5 or 6 to make it safe (if one drive goes down you can recover with this type of RAID)
PSU: Be Quiet! 1200W

As an alternative:

Maybe try to invest in a single CPU (72 cores, 1.5GHz base, 1.7GHz turbo) Intel Xeon Phi 7290F? Does anyone run/own such a computer with CFX/Fluent?

https://www.intel.com/content/www/us...548.1516002678

https://www.youtube.com/watch?v=I0U6ZMeVrB4

flotus1 January 15, 2018 03:48

Quote:

CPU: 2x AMD Epyc 7601 (32 cores, 2.2GHz base, 3.2GHz turbo) - will 32 cores give me extra power worth the money, or are 16 cores more or less the optimum because of the DDR channels?
RAM: 16x 16GB Samsung 2Rx4 DDR4-2666 reg ECC (maybe some DDR4-3600 if it exists for the Supermicro H11DSi-NT motherboard). Which amount and speed do I need for optimum performance with 2x16 and 2x32 cores?
Mainboard: Supermicro H11DSi-NT (for Ethernet speed, to add some more computers using the Ansys CFX parallel solver)
GPU: No
SSD: Samsung 850 - 512GB for the system
HDD: 4x 8TB Seagate Enterprise SATA III 3.5" - RAID 5 or 6 to make it safe (if one drive goes down you can recover with this type of RAID)
PSU: Be Quiet! 1200W
Price/Performance: AMD Epyc 7301 over anything else.
It costs less than 900$ compared to nearly 4000$ for the 32-core variant. You could build two systems with the same total amount of cores and much higher total performance.
You need 16 DIMMs for this platform. Overclocking memory is no longer a thing with server platforms, so stick to DDR4-2666 maximum. There is no faster reg ECC memory available anyway.
Unless this is supposed to be a headless node, put in at least a small GPU like a GTX 1050TI.
A 1200W power supply is a bit on the high side, the system as configured will never draw more than 400W. My power supply is rated for 850W (Seasonic Focus Plus Platinum) only because it has more connectors than the 750W variant.
Speaking of 10G Ethernet: You could give it a try, but in the end you might want to switch to infiniband if you connect more nodes.

Xeon Phi is not an alternative unless you are running code developed specifically for this platform. Commercial software like Fluent and CFX does not make full use of the potential of this architecture; this is still under development. And even if it did, I highly doubt that it would outperform dual-Epyc for CFD workloads.
There may not be many CFD benchmarks available for this type of processor, but that already tells you a lot: if it were actually faster than normal platforms for CFD, Ansys and Intel marketing would not stop bragging about it.

Noco January 15, 2018 04:08

Thank you!

Some small issues:
1. Do I need water cooling?
2. Server-like horizontal case or a large vertical tower?
3. If water cooling: 4x 120mm radiators per CPU?

My 18-core i9 with 2x 120mm water-cooling radiators goes up to 110°C after 1 hour of solving.

flotus1 January 15, 2018 04:18

I am using Noctua NH-U14S TR4-SP3 air coolers. The CPUs themselves run pretty cool thanks to the large surface area (and the soldered heat spreader, which the Intel i9 lacks), so water cooling is completely unnecessary from a thermal point of view. If you do it for aesthetics or some other reason, go ahead :D

The type of case is up to you, depending on whether you prefer rackmount or workstation. I have a normal E-ATX workstation case. Currently a Nanoxia Deep Silence 2, but I am switching to a Fractal Design Define XL R2 for better build quality.

Noco January 16, 2018 01:05

We found this benchmark:

http://www.ansys.com/solutions/solut...ntrifugal-pump

So as I understand it, the 16-core AMD Epyc CPU has the best efficiency (100%) for CFD. No need to buy a CPU with more cores; better to buy a second computer.

For Intel, it is better to buy 32-core CPUs.

But I do not understand why there are no 6-, 8- or 12-core CPUs in this benchmark.

flotus1 January 16, 2018 04:50

You might have misinterpreted the benchmark results.
Those "100%" are just the baseline parallel efficiency. The other results are normalized with the performance at this data point. Parallel efficiency less than 100% is a normal result for scaling on a single node. Look at the core solver rating for a less confusing performance metric. Higher numbers here mean better performance, comparable across all data points.
All you can take from the parallel efficiency: if you pay for your licenses, do not buy the high core count CPU models, neither from Intel nor from AMD. Instead, get more nodes with low to medium core count CPUs.
Overall, I would take this benchmark result with a grain of salt. The platform description is pretty minimalistic to say the least.
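In case it helps interpretation: as far as I know, the rating in those tables is simply "benchmark runs per 24-hour day", and the efficiency column is the rating gain normalized by the core-count gain relative to the baseline row, i.e. something like:
Code:

def solver_rating(wall_time_s):
    # benchmark runs that fit into one day
    return 86400.0 / wall_time_s

def parallel_efficiency(rating, cores, baseline_rating, baseline_cores):
    # e.g. 1.5x the rating on 2x the cores -> 75% parallel efficiency
    return (rating / baseline_rating) / (cores / baseline_cores)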

JBeilke March 7, 2018 05:04

Hi Alex,

are there new observations when running this machine?

I plan to buy a 2-processor Epyc 7351 machine. What is your experience with the noise? Is it a machine you can place under the desk?

flotus1 March 7, 2018 05:25

I would say "so far, so good". I have nothing negative to say about AMD Epyc in general. Of course with the exception that the CPU architecture has one or two drawbacks aside from its benefits. single-threaded workloads that require more RAM than one NUMA-Node has will run rather slow. This is generally the same on dual-socket Intel machines, but here one NUMA node spans half of the total memory, for AMD it is only one eight. An issue you should be aware of before deciding which CPU to buy.

In terms of noise, you get what you make of it. It can be just as silent or annoying as any other workstation. I prefer quiet, so I picked the largest CPU coolers from Noctua and also their highest-quality 140mm case fans. The machine is less noisy than any of our pre-built Dell and HP workstations in the office.
There are some issues with Supermicro boards and slow-spinning fans. If the fan rpm drops below 500, the board detects it as "stalled" and revs up all the fans to maximum in a cycle of a few seconds. There is no solution from Supermicro for this (other than the recommendation to buy high-rpm fans from Supermicro :rolleyes:), but there is a workaround that lets you lower the fan thresholds: https://calvin.me/quick-how-to-decre...fan-threshold/
Worked for me.
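The gist of that workaround is lowering the lower fan thresholds over IPMI, roughly like this (sensor names and RPM values are just examples, check the output of "ipmitool sensor" for your board first):
Code:

import subprocess

def lower_fan_thresholds(sensor, lnr, lcr, lnc):
    # lower non-recoverable / lower critical / lower non-critical thresholds, in RPM
    subprocess.run(["ipmitool", "sensor", "thresh", sensor, "lower",
                    str(lnr), str(lcr), str(lnc)], check=True)

for fan in ("FAN1", "FAN2", "FANA"):   # example sensor names
    lower_fan_thresholds(fan, 0, 100, 200)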
By the way: I am planning to replace the motherboard as soon as any other brand releases dual-socket SP3 boards. ASRock Rack appears to be working on one. The reason being that I have had some issues with it (see above), and the experiences I had with their end-user support were pretty unpleasant. And then there are some ridiculous design decisions, like using only 2 PCIe lanes for the M.2 port.
But don't let that discourage you, I tend to have negative experiences with customer support from many companies when they cannot go off-script.

JBeilke March 7, 2018 05:50

Many thanks for the info. I'm in contact with Delta Computer. Let's see what they can put together.

I can still use the i7-3960X for single-core jobs. Even after so many years it is still a very fast machine. I also replaced all the water coolers with Noctuas some years ago. It's much, much better in terms of reliability and noise.

Best regards
Jörn

mpoit April 7, 2018 18:29

Hello,

Thanks Flotus1 for this interesting topic.

I note that the "Samsung 2Rx4 DDR4-2133 reg ECC" is quite expenssive.
Do you know the performance with 8*16Go RAM ? I know it would be better to install 16*8Go than 8*16Go for a total of 128 Go on 16 slots but the second option let the choice to upgrade easily .

mathieu

flotus1 April 8, 2018 04:39

In the first post I wrote that I do not recommend buying this particular memory, but DDR4-2666 instead. The reason I used it was simply that I already had this RAM, bought it back when prices were lower.
Any DDR4 is quite expensive nowadays, but you will get significantly lower parallel performance with only 8 DIMMs installed.

mpoit April 8, 2018 09:22

Ok thanks for your answer.
I suppose this is particularly true here because the Epyc 7301 has 8 memory channels and not 4. I have a workstation with a 4-socket motherboard and 4 AMD Opteron 6380 CPUs, which are quad-channel. For each CPU I have 4x16GB RAM (there are 8 RAM slots per CPU), but I suppose my performance on this machine will not increase if I put in 8x16GB, because my CPUs are quad-channel and not octo-channel like Epyc. Do you agree?

Another question: do you think performance with your Epyc 7301 would decrease with a 16x4GB RAM configuration (because that is only 2GB of RAM per core, whereas it is recommended to have 4 to 8GB per core)?

thanks for help
mathieu

flotus1 April 8, 2018 09:34

Quote:

I suppose my performance on this machine will not increase if I put in 8x16GB, because my CPUs are quad-channel and not octo-channel like Epyc. Do you agree?
I agree. Unless your memory is not populated properly, in which case just filling all the slots will solve this issue ;)
Then again, fully populating all slots on this particular machine with dual (or quad?) rank DIMMs will probably decrease performance because it reduces the memory speed. See the manual of your motherboard.

Quote:

do you think performance with your Epyc 7301 would decrease with a 16x4GB RAM configuration (because that is only 2GB of RAM per core, whereas it is recommended to have 4 to 8GB per core)?
I am not a huge fan of this "use at least X GB of RAM per core" recommendation. I wonder who came up with it in the first place. All you need is enough total memory so your simulations fit into RAM.
But one issue specific to AMD Epyc you should be aware of: single-threaded workloads will run slower if they require more memory than available on one NUMA node. So putting only 64GB total on a dual-socket Epyc workstation, you will run into this problem more often because one NUMA node only addresses 8GB of RAM.
I really would not drop down to 4GB DIMMs because they are significantly more expensive per GB than 8GB or 16GB DIMMs.

mpoit April 8, 2018 11:26

Thank you for your answer.

Quote:

I agree. Unless your memory is not populated properly, in which case just filling all the slots will solve this issue ;)
Then again, fully populating all slots on this particular machine with dual (or quad?) rank DIMMs will probably decrease performance because it reduces the memory speed. See the manual of your motherboard.
Yes, the manual recommends 4 RAM sticks per CPU, but the performance is so disappointing that I wonder if I'm not missing something. Anyway, that's another subject I have to deal with :)...

Quote:

single-threaded workloads will run slower if they require more memory than available on one NUMA node. So putting only 64GB total on a dual-socket Epyc workstation, you will run into this problem more often because one NUMA node only addresses 8GB of RAM
Ok, good to know, but what is the difference with a Xeon CPU? If you have to run a 40M-cell case on your dual Epyc 7301, I understand that each of the eight NUMA nodes manages 5M cells, and 8GB of RAM per NUMA node should be enough for that? Am I wrong?

Quote:

I really would not drop down to 4GB DIMMs because they are significantly more expensive per GB than 8GB or 16GB DIMMs.
I see good price linearity between 4, 8 and 16GB DDR4 dual-rank RAM on the internet: respectively about 50, 100 and 200 euros.

Another question: do you know the scalability of a cluster made of two nodes (2x (2x Epyc 7301)) and the type of connection you would use (min 10Gb/s I suppose)?

Thanks
Mathieu

flotus1 April 8, 2018 11:38

Quote:

Yes, the manual recommends 4 RAM sticks per CPU, but the performance is so disappointing that I wonder if I'm not missing something. Anyway, that's another subject I have to deal with...
Not only 4 sticks per CPU, but it is also important which slots to use. https://www.cfd-online.com/Forums/ha...ation-cfd.html

Quote:

Ok, good to know, but what is the difference with a Xeon CPU? If you have to run a 40M-cell case on your dual Epyc 7301, I understand that each of the eight NUMA nodes manages 5M cells, and 8GB of RAM per NUMA node should be enough for that? Am I wrong?
Again, the difference is not in parallel workloads, but in single-threaded workloads.
Compare two dual-socket machines with Xeon and Epyc, both with 128GB of RAM total.
On the Xeon machine each of the 2 NUMA nodes has 64GB of RAM. So no worries running a single-threaded workload that requires 40GB of RAM. On Epyc, each of the 8 NUMA nodes has 16GB of RAM. Running a single-threaded 40GB job here will result in high inter-node communication, slowing down the process significantly due to lower bandwidth and higher latency.

Quote:

Another question: do you know the scalability of a cluster made of two nodes (2x (2x Epyc 7301)) and the type of connection you would use (min 10Gb/s I suppose)?
I really would not bother with Ethernet here. All you need are two Infiniband cards and a cable; these can be bought dirt-cheap on eBay. With this setup, scalability in terms of hardware should be perfect. The only thing that could hold you back is software that does not scale well or simulations that are too small to scale properly on 64 cores.

By the way, at the current price Epyc 7281 is a very attractive option if money is tight. Before sacrificing memory channels on Epyc 7301, I would consider this CPU instead.

mpoit April 8, 2018 12:40

Quote:

Not only 4 sticks per CPU, but it is also important which slots to use. https://www.cfd-online.com/Forums/ha...ation-cfd.html
Yes, thanks, I have seen this topic and I have the same problem. The slots used are those recommended in the manual, so that's why I'm interested in selling my components to start over with an Epyc 7301 configuration.

Quote:

Again, the difference is not in parallel workloads, but in single-threaded workloads.
Compare two dual-socket machines with Xeon and Epyc, both with 128GB of RAM total.
On the Xeon machine each of the 2 NUMA nodes has 64GB of RAM. So no worries running a single-threaded workload that requires 40GB of RAM. On Epyc, each of the 8 NUMA nodes has 16GB of RAM. Running a single-threaded 40GB job here will result in high inter-node communication, slowing down the process significantly due to lower bandwidth and higher latency.
Ok, thanks for the explanation. From my point of view, I will not run single-threaded workloads if I can run on more cores without extra licence cost (with OpenFOAM software for example). But you're right, it can be a problem with a commercial code.

Quote:

I really would not bother with Ethernet here. All you need are two Infiniband cards and a cable; these can be bought dirt-cheap on eBay. With this setup, scalability in terms of hardware should be perfect. The only thing that could hold you back is software that does not scale well or simulations that are too small to scale properly on 64 cores.
Ok, good to know :). I thought Infiniband was expensive. Do you have an example of which hardware/brand to buy for these cards and the cable?

Quote:

By the way, at the current price Epyc 7281 is a very attractive option if money is tight. Before sacrificing memory channels on Epyc 7301, I would consider this CPU instead.
I noted only a small price difference between the 7281 and the 7301, so I don't know if I would buy the 7281, which looks a bit less performant.

If my simulations need a lot of CPU and not too much RAM (long unsteady simulations for example), I calculated that a 2-node cluster with 64GB of RAM per node is only 20-30% more expensive than 1 node with 256GB of RAM.

Thanks
Mathieu

flotus1 April 8, 2018 13:11

Infiniband cards: https://www.ebay.de/itm/40Gbps-low-p...-/152380156089
Cable: https://www.ebay.de/itm/Mellanox-MC2...gAAOSwWG5aoBMl
This is the first cable I found; there might be cheaper ones if you search a little bit more.

The price difference between the Epyc 7301 and 7281 has become pretty substantial. It was less than 100€ when I bought mine; now it is more than 300€.
https://geizhals.eu/amd-epyc-7301-ps...-a1743454.html
https://geizhals.eu/amd-epyc-7281-ps...-a1743436.html
Apart from the lower amount of L3 cache, these CPUs are identical. If I had to buy now, I would be very tempted to get the 7281 instead.

If your jobs are really small in terms of cell count, I would start with one workstation first and see if it scales properly on 32 cores. Then you can decide if a second workstation is worth it.

mpoit April 9, 2018 04:29

Quote:

Infiniband cards: https://www.ebay.de/itm/40Gbps-low-p...-/152380156089
Cable: https://www.ebay.de/itm/Mellanox-MC2...gAAOSwWG5aoBMl
This is the first cable I found; there might be cheaper ones if you search a little bit more.
Ok, thanks a lot for your advice :). I thought Infiniband was expensive because some people hesitate between Infiniband and Ethernet, but maybe it starts to cost a lot when you have more than 2 nodes because you need other expensive components (a switch or something else... I'm not used to cluster architecture)? Do you have links or advice for configuring and running such a 2-node cluster?

Quote:

The price difference between the Epyc 7301 and 7281 has become pretty substantial. It was less than 100€ when I bought mine; now it is more than 300€.
https://geizhals.eu/amd-epyc-7301-ps...-a1743454.html
https://geizhals.eu/amd-epyc-7281-ps...-a1743436.html
Apart from the lower amount of L3 cache, these CPUs are identical. If I had to buy now, I would be very tempted to get the 7281 instead.
Yes, I see, that's right. I'm so disappointed with my Opteron configuration that I really want to be sure my future configuration will perform well. So that's a dilemma, because I know the 7301 works well thanks to you !!

Quote:

If your jobs are really small in terms of cell count, I would start with one workstation first and see if it scales properly on 32 cores. Then you can decide if a second workstation is worth it.
Yes, good approach, I'm going to do that.

flotus1 April 9, 2018 05:05

There are quite a few tutorials on how to setup Infiniband interconnects with various operating systems. Just use the forum search.

I think people hesitate to go Infiniband for three reasons
  1. new Infiniband hardware is expensive indeed
  2. not everyone is comfortable with used hardware
  3. setting it up properly can be tedious without a guide to follow

If you are worried about hardware failures when buying used: my opinion is that you could easily buy additional spare parts and store them in a drawer for quick replacement. Still much cheaper and faster than waiting for a warranty replacement part.

mpoit April 9, 2018 12:21

Ok thanks for your answer.

Concerning the mainboard, is there a difference between your Supermicro H11DSi and the H11DSi-NT?

Is the second one better if I want to upgrade to a second node later?

Thanks

flotus1 April 9, 2018 12:42

NT has built-in 10Gigabit Ethernet instead of 1Gigabit.
If you want to give Ethernet a try before going Infiniband, NT is the version you want.

mpoit April 9, 2018 15:39

Quote:

Originally Posted by flotus1 (Post 688151)
NT has built-in 10Gigabit Ethernet instead of 1Gigabit.
If you want to give Ethernet a try before going Infiniband, NT is the version you want.

Ok good to know ! Thank you very much for help :cool:

SLC April 27, 2018 07:19

I've purchased a new compute setup.

It consists of two nodes, each with dual Intel Xeon Gold 6146 CPUs.

I don't normally run Fluent, but I want to benchmark the CPUs to give a comparison.

See below:

System
CPU: 2x Intel Xeon Gold 6146 (12 cores, 3.9 GHz all-core turbo, 4.2 GHz single-core turbo)
RAM: 12 x 8GB DDR4-2666 ECC (single rank)
Interconnect: 10 GbE
OS: Windows 10 Pro
Fluent: 19.0

1) External Flow Over an Aircraft Wing (aircraft_2m), single precision

INTEL Single Node, 1 core, 10 iterations: 234 s

INTEL Single Node, 24 cores, 100 iterations: 107 s

INTEL Dual Node, 32 cores, 100 iterations: 87 s


2) External Flow Over an Aircraft Wing (aircraft_14m), double precision

INTEL Dual Node, 24 cores, 10 iterations: 101 s

INTEL Dual Node, 32 cores, 10 iterations: 84 s

INCORRECT BENCHMARKS, SEE UPDATED POST https://www.cfd-online.com/Forums/ha...tml#post691234




I'm a little surprised by the poor single-core performance in the aircraft_2m benchmark. Could this be a result of using single-rank memory...? The systems are Dell Precision 7920 racks, and unfortunately Dell could only deliver dual-rank memory in 32GB sticks (stupidly expensive!). The memory sticks are properly distributed/installed across the memory slots.

As far as I can tell from benchmarking, the system is performing well for CFX, both compared to my old compute setup and compared to published CFX benchmark results.

What do you guys think?

flotus1 April 27, 2018 10:45

To be honest, these numbers are lower (i.e. higher execution times) than I would expect for the kind of hardware you have. Both single-core and parallel. At least when comparing against the results in the initial post here. We used different operating systems and different software versions, so there is that...

I don't think the poor single-threaded performance is linked to the choice of memory. A single thread usually cannot saturate the memory bandwidth on such a system. And even if this were the cause of the issue, the performance difference between single- and dual-rank memory is less than 10%.

Checklist
  • disable SMT in bios
  • also in bios: rank interleaving enabled (edit: well kind of pointless for single-rank DIMMs); channel interleaving enabled; socket interleaving disabled
  • do the processors run at their expected frequencies both in single- and multi-threaded workloads, even for longer periods of time? In Windows you can use CPU-Z for this
  • do the systems reach their expected performance in synthetic benchmarks? I would recommend AIDA64 memory benchmark and Cinebench R15 respectively.
Scaling on two nodes is a different topic that would have to be addressed once we are sure that each node individually performs as expected.

