CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   Hardware (https://www.cfd-online.com/Forums/hardware/)
-   -   AMD Epyc CFD benchmarks with Ansys Fluent (https://www.cfd-online.com/Forums/hardware/196400-amd-epyc-cfd-benchmarks-ansys-fluent.html)

Noco January 16, 2018 02:05

We found this benchmark:

http://www.ansys.com/solutions/solut...ntrifugal-pump

As I understand it, the 16-core AMD Epyc CPU has the best efficiency: 100% for CFD. There is no need to buy a CPU with more cores; it is better to buy a second computer.

For Intel, it is better to buy 32-core CPUs.

But I do not understand why there are no 6-, 8- or 12-core CPUs in this benchmark.

flotus1 January 16, 2018 05:50

You might have misinterpreted the benchmark results.
Those "100%" are just the baseline parallel efficiency. The other results are normalized with the performance at this data point. Parallel efficiency less than 100% is a normal result for scaling on a single node. Look at the core solver rating for a less confusing performance metric. Higher numbers here mean better performance, comparable across all data points.
All you can take from the parallel efficiency: if you pay for your licenses, do not buy the high core count CPU models, neither from Intel nor from AMD. Instead, get more nodes with low to medium core count CPUs.
Overall, I would take this benchmark result with a grain of salt. The platform description is pretty minimalistic to say the least.
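
To illustrate how a normalized parallel efficiency comes about, here is a minimal Python sketch. The ratings in it are made-up placeholder numbers, not values from the Ansys benchmark page, and the function name is only for illustration. The point is that the baseline data point is 100% by construction, whatever its absolute performance.

```python
def parallel_efficiency(ratings, baseline_cores):
    """Return {cores: efficiency}, normalized to the baseline data point.

    ratings maps core count -> solver rating (higher is better).
    """
    base_per_core = ratings[baseline_cores] / baseline_cores
    return {
        cores: (rating / cores) / base_per_core
        for cores, rating in sorted(ratings.items())
    }


# Hypothetical ratings, NOT taken from the benchmark discussed above.
example_ratings = {16: 400.0, 32: 720.0, 64: 1150.0}
efficiency = parallel_efficiency(example_ratings, baseline_cores=16)

for cores, eff in efficiency.items():
    print(f"{cores:3d} cores: rating {example_ratings[cores]:7.1f}, efficiency {eff:.0%}")
```

In this made-up data set the 64-core point has the lowest efficiency but still the highest rating, which is why the rating is the better number for comparing absolute performance.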

JBeilke March 7, 2018 06:04

Hi Alex,

Are there any new observations from running this machine?

I plan to buy a dual-processor Epyc 7351 machine. What is your experience with the noise? Is it a machine you can place under the desk?

flotus1 March 7, 2018 06:25

I would say "so far, so good". I have nothing negative to say about AMD Epyc in general, except that the CPU architecture has one or two drawbacks alongside its benefits. Single-threaded workloads that require more RAM than one NUMA node has will run rather slowly. This is generally the same on dual-socket Intel machines, but there one NUMA node spans half of the total memory; for AMD it is only one eighth. This is an issue you should be aware of before deciding which CPU to buy.

In terms of noise, you get what you make of it. It can be just as silent or annoying as any other workstation. I prefer quiet, so I picked the largest CPU coolers from Noctua and also their highest-quality 140mm case fans. The machine is less noisy than any of our pre-built Dell and HP workstations in the office.
There are some issues with Supermicro boards and slow-spinning fans. If the fan rpm drops below 500, the board detects it as "stalled" and revs up all the fans to maximum in a cycle of a few seconds. There is no solution from Supermicro for this (other than the recommendation of buying high-rpm fans from Supermicro :rolleyes:), but there is a workaround that lets you lower the fan thresholds: https://calvin.me/quick-how-to-decre...fan-threshold/
Worked for me.
By the way: I am planning to replace the motherboard as soon as any other brand releases dual-socket SP3 boards. ASRock Rack appears to be working on it. The reason is that I have had some issues with the Supermicro board (see above), and my experiences with their end-user support were pretty unpleasant. And then there are some ridiculous design decisions, like using only 2 PCIe lanes for the M.2 port.
But don't let that discourage you. I tend to have negative experiences with customer support from many companies when they cannot go off-script.
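
For readers who want to script the fan-threshold workaround: a minimal Python sketch, assuming the workaround relies on the `sensor thresh` subcommand of `ipmitool` (the linked article is not reproduced here, so this is an assumption). The sensor name `FAN1` and the rpm values are placeholders; check `ipmitool sensor` on your own board for the real sensor names. The sketch only builds and prints the command, it does not run it.

```python
import shlex


def fan_threshold_command(sensor, lnr, lcr, lnc):
    """Build an ipmitool call that sets the three lower thresholds of a fan sensor.

    lnr, lcr, lnc = lower non-recoverable, lower critical, lower non-critical (rpm).
    """
    if not (lnr <= lcr <= lnc):
        raise ValueError("expected lnr <= lcr <= lnc")
    return ["ipmitool", "sensor", "thresh", sensor, "lower",
            str(lnr), str(lcr), str(lnc)]


# Placeholder sensor name and rpm values, chosen only for illustration.
cmd = fan_threshold_command("FAN1", 100, 200, 300)
print(shlex.join(cmd))
```

Keeping the thresholds well below the slowest speed your fans actually reach is the idea; the exact numbers depend on the fans.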

JBeilke March 7, 2018 06:50

Many thanks for the info. I'm in contact with Delta Computer. Let's see what they can put together.

I can still use the i7-3960X for single-core jobs. Even after so many years it is still a very fast machine. I also replaced all water coolers with Noctuas some years ago. It's much, much better in terms of reliability and noise.

Best regards
Jörn

mpoit April 7, 2018 19:29

Hello,

Thanks Flotus1 for this interesting topic.

I note that the "Samsung 2Rx4 DDR4-2133 reg ECC" is quite expensive.
Do you know the performance with 8 x 16 GB of RAM? I know it would be better to install 16 x 8 GB than 8 x 16 GB for a total of 128 GB on 16 slots, but the second option leaves room to upgrade easily.

mathieu

flotus1 April 8, 2018 05:39

In the first post I wrote that I do not recommend buying this particular memory, but DDR4-2666 instead. The reason I used it was simply that I already had this RAM, bought it back when prices were lower.
Any DDR4 is quite expensive nowadays, but you will get significantly lower parallel performance with only 8 DIMMs installed.
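
A rough way to see why half-populated memory hurts: theoretical peak bandwidth scales with the number of populated channels. The sketch below assumes 8 memory channels per Epyc socket, one DIMM per channel, and the standard 64-bit (8-byte) DDR4 channel width. With only 8 DIMMs in a dual-socket board, each socket would have 4 of its 8 channels populated. These are upper bounds on paper, not measured Fluent performance.

```python
BYTES_PER_TRANSFER = 8  # one DDR4 channel is 64 bits wide


def peak_bandwidth_gb_s(channels_populated, mega_transfers_per_s):
    """Theoretical peak memory bandwidth of one socket in GB/s."""
    return channels_populated * mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9


# Assumed layouts: 16 DIMMs -> 8 channels per socket, 8 DIMMs -> 4 channels per socket.
full_2666 = peak_bandwidth_gb_s(8, 2666)
half_2666 = peak_bandwidth_gb_s(4, 2666)
full_2133 = peak_bandwidth_gb_s(8, 2133)

print(f"8 channels, DDR4-2666: {full_2666:.1f} GB/s per socket")
print(f"4 channels, DDR4-2666: {half_2666:.1f} GB/s per socket")
print(f"8 channels, DDR4-2133: {full_2133:.1f} GB/s per socket")
```

On paper, dropping to half the channels costs more bandwidth than running all channels at DDR4-2133 instead of DDR4-2666.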

mpoit April 8, 2018 10:22

Ok thanks for your answer.
I suppose this is particularly true here because the Epyc 7301 has 8 memory channels rather than four. I have a workstation with a 4-socket motherboard and four AMD Opteron 6380 CPUs, which are quad-channel. For each CPU I have 4 x 16 GB of RAM (there are 8 RAM slots per CPU), but I suppose my performance on this machine will not increase if I install 8 x 16 GB, because my CPUs are quad-channel and not octa-channel like Epyc. Do you agree?

Another question: do you think performance with your Epyc 7301 will decrease with a 16 x 4 GB RAM configuration (because that is only 2 GB of RAM per core, whereas 4 to 8 GB per core is recommended)?

Thanks for your help
mathieu

flotus1 April 8, 2018 10:34

Quote:

I suppose my performance on this machine will not increase if I install 8 x 16 GB, because my CPUs are quad-channel and not octa-channel like Epyc. Do you agree?
I agree. Unless you have your memory not populated properly in which case just filling all the slots will solve this issue ;)
Then again, fully populating all slots on this particular machine with dual (or quad?) rank DIMMs will probably decrease performance because it reduces the memory speed. See the manual of your motherboard.

Quote:

do you think performance with your Epyc 7301 will decrease with a 16 x 4 GB RAM configuration (because that is only 2 GB of RAM per core, whereas 4 to 8 GB per core is recommended)?
I am not a huge fan of this "use at least X GB of RAM per core" recommendation. I wonder who came up with it in the first place. All you need is enough total memory so your simulations fit into RAM.
But one issue specific to AMD Epyc you should be aware of: single-threaded workloads will run slower if they require more memory than available on one NUMA node. So putting only 64GB total on a dual-socket Epyc workstation, you will run into this problem more often because one NUMA node only addresses 8GB of RAM.
I really would not drop down to 4GB DIMMs because they are significantly more expensive per GB than 8GB or 16GB DIMMs.

mpoit April 8, 2018 12:26

Thank you for your answer.

Quote:

I agree. Unless you have your memory not populated properly in which case just filling all the slots will solve this issue ;)
Then again, fully populating all slots on this particular machine with dual (or quad?) rank DIMMs will probably decrease performance because it reduces the memory speed. See the manual of your motherboard.
Yes, the manual recommends 4 RAM sticks per CPU, but performance is so disappointing that I wonder if I'm missing something. Anyway, that's another subject I have to deal with :)...

Quote:

single-threaded workloads will run slower if they require more memory than available on one NUMA node. So putting only 64GB total on a dual-socket Epyc workstation, you will run into this problem more often because one NUMA node only addresses 8GB of RAM
OK, good to know, but what is the difference with a Xeon CPU? If you have to run a 40M-cell case on your dual Epyc 7301, I understand that each of the eight NUMA nodes handles 5M cells, and 8 GB of RAM per NUMA node should be enough for that? Am I wrong?

Quote:

I really would not drop down to 4GB DIMMs because they are significantly more expensive per GB than 8GB or 16GB DIMMs.
I see fairly linear pricing between 4, 8 and 16 GB DDR4 dual-rank DIMMs online: about 50, 100 and 200 euros respectively.

Another question: do you know how well a cluster made of two nodes (2*(2*Epyc 7301)) scales, and which type of interconnect you would use (min 10Gb/s I suppose)?

Thanks
Mathieu

flotus1 April 8, 2018 12:38

Quote:

Yes, the manual recommends 4 RAM sticks per CPU, but performance is so disappointing that I wonder if I'm missing something. Anyway, that's another subject I have to deal with ...
Not only 4 sticks per CPU, but it is also important which slots to use. https://www.cfd-online.com/Forums/ha...ation-cfd.html

Quote:

OK, good to know, but what is the difference with a Xeon CPU? If you have to run a 40M-cell case on your dual Epyc 7301, I understand that each of the eight NUMA nodes handles 5M cells, and 8 GB of RAM per NUMA node should be enough for that? Am I wrong?
Again, the difference is not in parallel workloads, but in single-threaded workloads.
Compare two dual-socket machines with Xeon and Epyc, both with 128GB of RAM total.
On the Xeon machine, each of the 2 NUMA nodes has 64GB of RAM. So no worries running a single-threaded workload that requires 40GB of RAM. On Epyc, each of the 8 NUMA nodes has 16GB of RAM. Running a single-threaded 40GB job here will result in high inter-node communication, slowing down the process significantly due to lower bandwidth and higher latency.
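
The arithmetic behind this comparison, as a small Python sketch. It assumes memory is spread evenly over the NUMA nodes and ignores what the operating system itself uses, so the real limit is somewhat lower. The function names are only for illustration.

```python
def memory_per_numa_node_gb(total_gb, numa_nodes):
    """RAM attached to one NUMA node, assuming an even split."""
    return total_gb / numa_nodes


def fits_in_one_node(job_gb, total_gb, numa_nodes):
    """True if a single-threaded job can stay within one NUMA node's local RAM."""
    return job_gb <= memory_per_numa_node_gb(total_gb, numa_nodes)


# NUMA node counts as described in this thread for dual-socket machines.
machines = {"dual Xeon": 2, "dual Epyc": 8}
total_gb, job_gb = 128, 40

for name, nodes in machines.items():
    local = memory_per_numa_node_gb(total_gb, nodes)
    verdict = "fits locally" if fits_in_one_node(job_gb, total_gb, nodes) else "spills to other nodes"
    print(f"{name}: {local:.0f} GB per NUMA node, {job_gb} GB single-threaded job {verdict}")
```

The same function gives the 8GB per node mentioned earlier for a dual-socket Epyc with only 64GB in total.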

Quote:

Another question: do you know how well a cluster made of two nodes (2*(2*Epyc 7301)) scales, and which type of interconnect you would use (min 10Gb/s I suppose)?
I really would not bother with Ethernet here. All you need are two Infiniband cards and a cable, these can be bought dirt-cheap on ebay. With this setup, scalability in terms of hardware should be perfect. The only thing that could hold you back is software that does not scale well or simulations that are too small to scale properly on 64 cores.

By the way, at the current price Epyc 7281 is a very attractive option if money is tight. Before sacrificing memory channels on Epyc 7301, I would consider this CPU instead.

mpoit April 8, 2018 13:40

Quote:

Not only 4 sticks per CPU, but it is also important which slots to use. https://www.cfd-online.com/Forums/ha...ation-cfd.html
Yes, thanks, I have seen this topic and I have the same problem. The slots used are those recommended in the manual, which is why I'm interested in selling my components and starting over with an Epyc 7301 configuration.

Quote:

Again, the difference is not in parallel workloads, but in single-threaded workloads.
Compare two dual-socket machines with Xeon and Epyc, both with 128GB of RAM total.
On the Xeon machine, each of the 2 NUMA nodes has 64GB of RAM. So no worries running a single-threaded workload that requires 40GB of RAM. On Epyc, each of the 8 NUMA nodes has 16GB of RAM. Running a single-threaded 40GB job here will result in high inter-node communication, slowing down the process significantly due to lower bandwidth and higher latency.
OK, thanks for the explanation. From my point of view, I will not run a single-threaded workload if I can run on more cores without extra license cost (with OpenFOAM, for example). But you are right, it can be a problem with a commercial code.

Quote:

I really would not bother with Ethernet here. All you need are two Infiniband cards and a cable, these can be bought dirt-cheap on ebay. With this setup, scalability in terms of hardware should be perfect. The only thing that could hold you back is software that does not scale well or simulations that are too small to scale properly on 64 cores.
OK, good to know :). I thought Infiniband was expensive. Do you have an example of which hardware/brand to buy for these cards and the cable?

Quote:

By the way, at the current price Epyc 7281 is a very attractive option if money is tight. Before sacrificing memory channels on Epyc 7301, I would consider this CPU instead.
I noted only a small price difference between the 7281 and the 7301, so I don't know if I would buy the 7281, which looks a bit slower.

If my simulations need a lot of CPU and not too much RAM (long unsteady simulations, for example), I calculated that a 2-node cluster with 64 GB of RAM per node is only 20-30% more expensive than 1 node with 256 GB of RAM.

Thanks
Mathieu

flotus1 April 8, 2018 14:11

Infiniband cards: https://www.ebay.de/itm/40Gbps-low-p...-/152380156089
Cable: https://www.ebay.de/itm/Mellanox-MC2...gAAOSwWG5aoBMl
This is the first cable I found, there might be cheaper ones if you search a little bit more.

The price difference between Epyc 7301 and 7281 has become pretty substantial. It was less than 100€ when I bought, now it is more than 300€.
https://geizhals.eu/amd-epyc-7301-ps...-a1743454.html
https://geizhals.eu/amd-epyc-7281-ps...-a1743436.html
Apart from the lower amount of L3 cache, these CPUs are identical. If I had to buy now, I would be very tempted to get the 7281 instead.

If your jobs are really small in terms of cell count, I would start with one workstation first and see if it scales properly on 32 cores. Then you can decide if a second workstation is worth it.

mpoit April 9, 2018 05:29

Quote:

Infiniband cards: https://www.ebay.de/itm/40Gbps-low-p...-/152380156089
Cable: https://www.ebay.de/itm/Mellanox-MC2...gAAOSwWG5aoBMl
This is the first cable I found, there might be cheaper ones if you search a little bit more.
OK, thanks a lot for your advice :). I thought Infiniband was expensive because some people hesitate between Infiniband and Ethernet, but maybe it starts to cost a lot when you have more than 2 nodes, because you need other expensive components (a switch or similar; I'm not familiar with cluster architecture)? Do you have a link or advice for configuring and operating such a 2-node cluster?

Quote:

The price difference between Epyc 7301 and 7281 became pretty substantial. It was less than 100€ when I bought, now it is more than 300€
https://geizhals.eu/amd-epyc-7301-ps...-a1743454.html
https://geizhals.eu/amd-epyc-7281-ps...-a1743436.html
Apart from the lower amount of L3 cache, these CPUs are identical. If I had to buy now, I would be very tempted to get the 7281 instead.
Yes, I see, that's right. I'm so disappointed with my Opteron configuration that I really want to be sure that my future configuration will perform well. So that's a dilemma, because I know the 7301 works well thanks to you!!

Quote:

If your jobs are really small in terms of cell count, I would start with one workstation first and see if it scales properly on 32 cores. Then you can decide if a second workstation is worth it.
Yes, good approach, I'm going to do that.

flotus1 April 9, 2018 06:05

There are quite a few tutorials on how to set up Infiniband interconnects with various operating systems. Just use the forum search.

I think people hesitate to go Infiniband for three reasons
  1. new Infiniband hardware is expensive indeed
  2. not everyone is comfortable with used hardware
  3. setting it up properly can be tedious without a guide to follow

If you are worried about hardware failures when buying used: my opinion is that you could easily buy additional spare parts and store them in a drawer for quick replacement. Still much cheaper and faster than waiting for a warranty replacement part.

mpoit April 9, 2018 13:21

Ok thanks for your answer.

Concerning the mainboard, is there a difference between your Supermicro H11DSi and the H11DSi-NT?

Is the second one better if I want to upgrade to a second node later?

Thanks

flotus1 April 9, 2018 13:42

NT has built-in 10Gigabit Ethernet instead of 1Gigabit.
If you want to give Ethernet a try before going Infiniband, NT is the version you want.

mpoit April 9, 2018 16:39

Quote:

Originally Posted by flotus1 (Post 688151)
NT has built-in 10Gigabit Ethernet instead of 1Gigabit.
If you want to give Ethernet a try before going Infiniband, NT is the version you want.

OK, good to know! Thank you very much for your help :cool:

SLC April 27, 2018 08:19

I've purchased a new compute setup.

It consists of two nodes, each with dual Intel Xeon Gold 6146 CPUs.

I don't normally run Fluent, but I want to benchmark the CPUs to give a comparison.

See below:

System
CPU: 2x Intel Xeon Gold 6146 (12 cores, 3.9 GHz all-core turbo, 4.2 GHz single-core turbo)
RAM: 12 x 8GB DDR4-2666 ECC (single rank)
Interconnect: 10 GbE
OS: Windows 10 Pro
Fluent: 19.0

1) External Flow Over an Aircraft Wing (aircraft_2m), single precision

INTEL Single Node, 1 core, 10 iterations: 234 s

INTEL Single Node, 24 cores, 100 iterations: 107 s

INTEL Dual Node, 32 cores, 100 iterations: 87 s


2) External Flow Over an Aircraft Wing (aircraft_14m), double precision

INTEL Dual Node, 24 cores, 10 iterations: 101 s

INTEL Dual Node, 32 cores, 10 iterations: 84 s

INCORRECT BENCHMARKS, SEE UPDATED POST https://www.cfd-online.com/Forums/ha...tml#post691234




I'm a little surprised by the poor single-core performance in the aircraft_2m benchmark. Could this be a result of using single-rank memory...? The systems are Dell Precision 7920 racks, and unfortunately Dell could only deliver dual-rank memory in 32GB sticks (stupidly expensive!). The memory sticks are properly distributed/installed across the memory slots.

As far as I can tell from benchmarking, the system is performing well for CFX, both compared to my old compute setup and compared to published CFX benchmark results.

What do you guys think?
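
For anyone who wants to turn such timings into comparable numbers: a minimal Python sketch that normalizes to seconds per iteration and derives speedup and parallel efficiency. It uses the aircraft_2m timings listed above at face value, even though the post itself marks them as superseded, so read it as an illustration of the arithmetic, not as a statement about the hardware. The 1-core run used 10 iterations and the parallel runs 100, which is why the normalization step matters.

```python
def seconds_per_iteration(wall_seconds, iterations):
    """Normalize a run to wall-clock seconds per solver iteration."""
    return wall_seconds / iterations


def speedup_and_efficiency(t_serial, t_parallel, cores):
    """Return (speedup, parallel efficiency) relative to the 1-core run."""
    speedup = t_serial / t_parallel
    return speedup, speedup / cores


# aircraft_2m, single precision, timings as posted above.
t_1 = seconds_per_iteration(234, 10)     # single node, 1 core
t_24 = seconds_per_iteration(107, 100)   # single node, 24 cores
t_32 = seconds_per_iteration(87, 100)    # dual node, 32 cores

for cores, t in ((24, t_24), (32, t_32)):
    speedup, eff = speedup_and_efficiency(t_1, t, cores)
    print(f"{cores} cores: {t:.2f} s/iteration, speedup {speedup:.1f}, efficiency {eff:.0%}")
```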

flotus1 April 27, 2018 11:45

To be honest, these numbers are lower (i.e. higher execution times) than I would expect for the kind of hardware you have. Both single-core and parallel. At least when comparing against the results in the initial post here. We used different operating systems and different software versions, so there is that...

I doubt that the poor single-threaded performance is linked to the choice of memory. A single thread usually cannot saturate memory bandwidth on such a system. And even if this were the cause of the issue, the performance difference between single- and dual-rank is less than 10%.

Checklist
  • disable SMT in the BIOS
  • also in the BIOS: rank interleaving enabled (edit: well, kind of pointless for single-rank DIMMs); channel interleaving enabled; socket interleaving disabled
  • do the processors run at their expected frequencies both in single- and multi-threaded workloads, even for longer periods of time? In Windows you can use CPU-Z for this
  • do the systems reach their expected performance in synthetic benchmarks? I would recommend AIDA64 memory benchmark and Cinebench R15 respectively.
Scaling on two nodes is a different topic that would have to be addressed once we are sure that each node individually performs as expected.

