CFD Online Discussion Forums - Weak parallel efficiency of TR3990X-based workstation with Star-CCM+

- - Weak parallel efficiency of TR3990X-based workstation with Star-CCM+ (https://www.cfd-online.com/Forums/hardware/235354-weak-parallel-efficiency-tr3990x-based-workstation-star-ccm.html)

kiteguy

April 12, 2021 06:01

Weak parallel efficiency of TR3990X-based workstation with Star-CCM+

2 Attachment(s)

Hello all,

I recently ran some benchmark tests with Star-CCM+ on my department's workstation and noticed that the parallel efficiency scales painfully badly with the number of cores (see the attached specifications and benchmark results).

I have read in other posts that the Threadripper lineup is not ideal for CFD purposes due to the relatively small number of available memory lanes (4). However, does this explain why the parallel efficiency drops to as little as 65% with 8 cores already?
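For reference, parallel efficiency is taken here in its usual sense: the speedup over the single-core run divided by the number of cores. The sketch below shows that arithmetic. The timings are hypothetical values chosen so that the 8-core case lands near 65%; they are not read from the attached benchmark.

```python
def parallel_efficiency(t_serial: float, t_parallel: float, n_cores: int) -> float:
    """Return speedup divided by core count (1.0 means ideal linear scaling)."""
    if n_cores < 1 or t_serial <= 0 or t_parallel <= 0:
        raise ValueError("times must be positive and n_cores must be >= 1")
    speedup = t_serial / t_parallel
    return speedup / n_cores


# Hypothetical timings in seconds per iteration, NOT taken from the attachment.
timings = {1: 10.0, 2: 5.4, 4: 3.0, 8: 1.92}
t1 = timings[1]
for n, t in timings.items():
    eff = parallel_efficiency(t1, t, n)
    print(f"{n:2d} cores: speedup {t1 / t:5.2f}, efficiency {eff:6.1%}")
```

Plotting efficiency rather than raw run time makes it easier to see at which core count the scaling starts to flatten.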

I would really appreciate any suggestions on how to find and possibly fix the bottleneck in the setup.

Best regards!

Attachments:
- Screenshot of representative benchmark test with Star-CCM+
- Workstation specifications

CPU: AMD Ryzen Threadripper 3990X 64-Core Processor
mem: Corsair Vengeance LPX, DDR4-3200, CL16 - 64 GB Dual Kit (128GB total)
graphics: Gigabyte GeForce GTX 1660 Ti OC 6G, 6144MB GDDR6
SSD: Gigabyte Aorus NVMe SSD, PCIe 4.0 M.2 Type 2280

flotus1

April 12, 2021 10:57

It is not entirely unexpected that you got less-than-linear scaling with 8 threads.
Whether 65% parallel efficiency is too low or not, I would not want to judge.

Maybe we can narrow things down by answering a few questions:
Are you running a double precision solver version?
What about the absolute run time of your benchmark? Can you compare it to some other machine?
Memory is sitting in the right slots according to the motherboard manual? And it's actually running at DDR4-3200?
Have you tried checking which physical cores these 8 threads are pinned to? Tools like htop and lstopo come in handy. Ideally, it should be one core from each of the 8 compute dies.
Have you run any other popular benchmarks? This would allow you to stress-test your system (also keep an eye on temperatures and frequency), and compare to known good results.
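Besides htop and lstopo, the pinning question can be approximated from Linux sysfs without extra tools. The sketch below groups logical CPUs by the L3 cache they share and reports which groups a given set of allowed CPUs touches. It assumes `cache/index3` is the L3 cache, which is typical on x86 but not guaranteed. As far as I know, on Zen 2 an L3 is shared by a 4-core CCX and each compute die holds two of them, so these groups are finer than dies. The function names are made up for this example.

```python
from collections import defaultdict
from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Expand a kernel CPU list such as '0-3,64-67' into individual CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def l3_groups(sysfs_root: str = "/sys/devices/system/cpu") -> dict[str, list[int]]:
    """Group logical CPUs by the CPU list they share an L3 cache with."""
    groups = defaultdict(list)
    for cpu_dir in Path(sysfs_root).glob("cpu[0-9]*"):
        # Assumption: index3 is the L3 cache on this machine.
        shared = cpu_dir / "cache" / "index3" / "shared_cpu_list"
        if shared.exists():
            groups[shared.read_text().strip()].append(int(cpu_dir.name[3:]))
    return {key: sorted(cpus) for key, cpus in groups.items()}


def affinity_by_l3(allowed: set[int], groups: dict[str, list[int]]) -> dict[str, list[int]]:
    """Report which of the allowed CPUs fall into each L3 group."""
    return {
        key: [c for c in cpus if c in allowed]
        for key, cpus in groups.items()
        if allowed & set(cpus)
    }
```

A possible use is `affinity_by_l3(os.sched_getaffinity(pid), l3_groups())` for each solver process id. Note that the affinity mask only shows where a process is allowed to run; if the solver does not pin its processes, the mask will simply contain every core.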

kiteguy

April 12, 2021 12:00

1 Attachment(s)

thanks for the quick reply, Alex! I appreciate the help.

(1) I am running the double precision version, which I now understand may entail compromises on the performance. To be honest, I never considered this and will have the mixed version installed!
(2) I compared the performance to another machine with similar specs (other CPU but same amount of RAM, see the attached screenshot) which requires about twice the time on a single core but scales linearly with the number of cores and thus becomes as good as our problem child or better with 8 cores and more.
(3) We opened up the workstation and checked: 4*32 GB of memory sticks are installed (1 per memory lane)
(4) Hyperthreading is deactivated, so the tasks must be running on physical cores. Unfortunately I don't have the necessary admin rights to check whether all compute dies are used, but I will look into this with our IT department.
(5) We haven't looked into other benchmark tests yet, but our IT department is planning to run a Cinebench on it.

Upon reading through a spotlight presentation by the developers of Star-CCM+ outlining hardware requirements, I noticed they recommend 2 memory sticks per lane. Could this explain our issue?

gnwt4a

April 12, 2021 12:46

using the gnu compiler on the v3 system may be disadvantageous to intel chips. the intel compilers and mkl libs are now free for use in private and academia.

kiteguy

April 12, 2021 12:55

Quote:

Originally Posted by gnwt4a (Post 801306)

using the gnu compiler on the v3 system may be disadvantageous to intel chips. the intel compilers and mkl libs are now free for use in private and academia.

thanks for the info! But if this is an Intel issue it probably does not apply to the AMD TR3990X which is installed in our machine, does it?

gnwt4a

April 12, 2021 13:26

right. u may be underestimating the v3 performance - that is all. fwiw, a couple of years ago i compared the performance of the TR gen 1 (16 core) against a 3930k using a fortran dns code, and the tr was about 10% faster than the 6-core intel chip. i expect things to be better with amd now, but i will not be buying an amd chip which is a rejig of zen-1. wait until zen-4.

flotus1

April 12, 2021 14:54

Quote:

(1) I am running the double precision version, which I now understand may entail compromises on the performance. To be honest, I never considered this and will have the mixed version installed!

In a situation like this with a pretty severe memory bandwidth bottleneck, the single precision solver will be significantly faster. Only use DP if you really need it.

Quote:

(2) I compared the performance to another machine with similar specs (other CPU but same amount of RAM, see the attached screenshot) which requires about twice the time on a single core but scales linearly with the number of cores and thus becomes as good as our problem child or better with 8 cores and more.

It's a bit odd that the v3 Xeon catches up at 8 cores. Which Linux version are you running, and which kernel version?

Quote:

(3) We opened up the workstation and checked: 4*32 GB of memory sticks are installed (1 per memory lane)

Crack open the manual for your motherboard. It will contain a recommendation for which exact DIMM slots need to be populated when using 4 DIMMs.
Also: check which transfer rate the memory is actually running at. Just because you bought memory rated for up to DDR4-3200 doesn't mean that it is running at that speed.
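One way to check the actual transfer rate on Linux is `dmidecode -t memory`, which needs root and reports both the rated and the configured speed per DIMM. The sketch below parses text in that style. The sample excerpt is illustrative only: its values are made up and do not come from the machine in this thread, and the exact field names can vary between dmidecode versions, which is why both "Configured Memory Speed" and "Configured Clock Speed" are accepted.

```python
import re

# Illustrative excerpt in the style of `dmidecode -t memory`.
# The values are invented for the example, NOT from the machine in this thread.
SAMPLE = """\
Memory Device
    Locator: DIMM 0
    Speed: 3200 MT/s
    Configured Memory Speed: 2133 MT/s
Memory Device
    Locator: DIMM 1
    Speed: 3200 MT/s
    Configured Memory Speed: 2133 MT/s
"""


def memory_speeds(text):
    """Return (locator, rated, configured) per populated DIMM, speeds as integers."""
    results = []
    for block in text.split("Memory Device")[1:]:
        loc = re.search(r"^\s*Locator:\s*(.+)$", block, re.M)
        rated = re.search(r"^\s*Speed:\s*(\d+)", block, re.M)
        conf = re.search(r"^\s*Configured (?:Memory|Clock) Speed:\s*(\d+)", block, re.M)
        # Empty slots report "Unknown" speeds and are skipped here.
        if loc and rated and conf:
            results.append((loc.group(1).strip(), int(rated.group(1)), int(conf.group(1))))
    return results


for locator, rated, configured in memory_speeds(SAMPLE):
    note = "OK" if configured >= rated else "running below rated speed"
    print(f"{locator}: rated {rated}, configured {configured} ({note})")
```

If the configured value is below the rated one, the XMP/DOCP profile is probably not enabled in the BIOS.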

Quote:

Upon reading through a spotlight presentation by the developers of Star-CCM+ outlining hardware requirements, I noticed they recommend 2 memory sticks per lane. Could this explain our issue?

No, one DIMM per channel is enough to get very close to peak performance. As long as they are in the correct slots and running at the advertised transfer rate. Again: check both.
What they are probably referring to is the number of ranks per channel. There can be a performance difference on the order of 10% with one rank per channel vs. two. But since the DIMMs you bought are already dual-rank, you automatically have two ranks per channel. Again, provided they are in the right slots.

kiteguy

April 13, 2021 08:04

Hey Alex, thanks so much! Very helpful once again.

(1) the local machine runs 'Ubuntu 20.04.2 LTS' and kernel version '5.8.0-48-generic'. The machine used for comparison hosts 'Scientific Linux release 7.7 (Nitrogen)' and kernel version '3.10.0-1160.15.2.el7.x86_64'

(2) The four DIMMs were indeed installed in the correct slots on the motherboard, we will check the actual transfer rate as soon as possible.

I read that the developers of Star-CCM+ advise setting the NUMA nodes per socket (NPS) to 4 for AMD Epyc processors. Do you think this should also be the case for the TR3990X? (We use Power-On-Demand licenses, so the number of nodes should not be an issue)

I will post some performance updates once the IT department has the mixed precision version installed!

flotus1

April 13, 2021 08:39

I was hoping that part of the problem might be an old Linux kernel. But that doesn't seem to be the case.

For Epyc Rome CPUs, NPS=4 is indeed the best setting. But the performance difference to NPS=1 is not huge, in the order of 10%.
Not sure if a Zen2 Threadripper CPU/mobo has the same option available. It might only go up to NPS=2.
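Whether an NPS setting took effect can be seen from the operating system: each NUMA node the kernel exposes appears as a `nodeN` directory under `/sys/devices/system/node`. A minimal sketch, assuming a Linux system with that sysfs layout (the function name is hypothetical):

```python
from pathlib import Path


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Map each NUMA node id to the kernel's CPU list string for that node."""
    nodes = {}
    for node_dir in Path(sysfs_root).glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            nodes[int(node_dir.name[4:])] = cpulist.read_text().strip()
    return dict(sorted(nodes.items()))


if __name__ == "__main__":
    for node, cpus in numa_nodes().items():
        print(f"node {node}: CPUs {cpus}")
```

With NPS=1 a single node is expected, with NPS=2 two nodes, and so on. `numactl --hardware` reports the same information where it is installed.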

kiteguy

April 14, 2021 10:28

quick update:

- switching to the most recent mixed-solver version decreased the single-core runtime by approx. 29%
- we changed the NPS setting from Auto to 2, which decreased the efficiency a little at small core counts but seems to be quite beneficial for 16+ cores (up to 9% increase). NPS=4 was also possible but led to worse performance at core counts between 2 and 32.

So all in all, the scaling is still not great but the run-times already look much better than a few days ago. We're hoping to get a further performance boost by installing the additional 4 memory sticks in order to operate two per memory lane.
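As a side note on reading these numbers: a runtime decrease of 29% is not the same as a 29% speedup. A short sketch of the conversion, using only the figure quoted above:

```python
def speedup_from_runtime_reduction(reduction: float) -> float:
    """Convert a fractional runtime reduction (0.29 = 29% less time) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# 29% shorter single-core runtime after switching to the mixed-precision solver.
print(f"speedup: {speedup_from_runtime_reduction(0.29):.2f}x")
```

So the switch to mixed precision corresponds to the single-core run being roughly 1.4 times as fast.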

Simbelmynë

April 14, 2021 11:57

Why is it not possible to operate the memory at 3200 MT/s? This seems a bit odd to me.

If you have control of the BIOS you could also try to tweak the memory settings. Zen2 and Zen3 can see huge increases with the proper memory timings. I have not seen reports from the Threadripper series yet on this forum and the only Threadripper I have access to is first generation which has a rather crappy memory controller, so I cannot test it myself.

The Ryzen DRAM calculator has options for Threadripper, so you could try that out. Seeing that you have DDR4-3200 CL16 memory, perhaps you should not expect much improvement, but I think it is worth a try.

EDIT: Looking at the memory support page of your MB vendor, it lists a large number of RAM kits that have been validated at 3600 MT/s. For instance, this kit "F4-3600C16Q-64GTZR" is dual rank @ CL16.

The RAM support document even specifies the memory type (Samsung B-die, Hynix etc.). Running the infinity fabric 1:1 and the memory @ 3600 MT/s is likely the sweet spot for the 3990X also (for Ryzen it is).

https://download.gigabyte.com/FileLi...eme_200304.pdf

cwl

April 25, 2021 18:18

Kiteguy, you haven't actually described your simulation case, so I'll just draw your attention to the fact that parallel efficiency depends strongly on the number of Boundaries within the Region and also on the number of Interfaces (if any).

It is mentioned in the official Siemens Best Practices video (https://youtu.be/U9WUPEdX-6A) at 45:00.
I guess your number of Boundaries is much lower than the hundreds mentioned in the video, yet I remember even a dozen being reported as a slowing factor in the Star-CCM+ section of the forum.

Maybe that could be a reason?

wkernkamp

April 27, 2021 15:53

8 channels versus 4 channels

Quote:

Originally Posted by kiteguy (Post 801304)

thanks for the quick reply, Alex! I appreciate the help.

(2) I compared the performance to another machine with similar specs (other CPU but same amount of RAM, see the attached screenshot) which requires about twice the time on a single core but scales linearly with the number of cores and thus becomes as good as our problem child or better with 8 cores and more.

Your comparison system is a dual-CPU configuration with a total of 8 memory channels versus 4 channels for the Threadripper. This explains why the dual Xeons still scale linearly at 10 cores, while the Threadripper is already falling off.
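The channel argument can be put into rough numbers. Theoretical peak DRAM bandwidth is channels × transfer rate × 8 bytes, since a DDR4 channel is 64 bits wide. In the sketch below the Threadripper figures come from the thread (4 channels, DDR4-3200); the DDR4-2133 rate for the dual Xeon v3 system is an assumption, because the thread does not state its memory speed.

```python
def peak_bandwidth_gbs(channels: int, transfer_rate_mts: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (a 64-bit channel moves 8 bytes per transfer)."""
    return channels * transfer_rate_mts * bus_bytes / 1000.0


# Threadripper 3990X: 4 channels of DDR4-3200, as stated in the thread.
tr = peak_bandwidth_gbs(channels=4, transfer_rate_mts=3200)
# Dual Xeon v3: 8 channels in total; DDR4-2133 is an ASSUMED rate, not from the thread.
xeon = peak_bandwidth_gbs(channels=8, transfer_rate_mts=2133)

for cores in (8, 16, 64):
    print(f"{cores:2d} cores: Threadripper {tr / cores:5.1f} GB/s per core, "
          f"dual Xeon (assumed) {xeon / cores:5.1f} GB/s per core")
```

These are upper bounds; sustained bandwidth is lower in practice. The per-core share is what matters for a bandwidth-bound solver, and it shrinks quickly once many cores share only 4 channels.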

All times are GMT -4.