CFD Online Discussion Forums


havref January 19, 2018 04:16

Poor scaling of dual-Xeon E5-2697A-V4
 
2 Attachment(s)
After getting advice on this forum, I decided to run some scaling tests on our Dell R630 server, equipped with 2x Xeon E5-2697A V4, 8x 16GB 2400 MHz RAM and a 480GB SSD. In addition, to benchmark against some larger cases, I tested the cavity case previously run on an HPC system at NTNU (OpenFOAM Performance on Vilje).

A .pdf of the results and a .txt file of the memory setup are attached. In the first part (the scaling tests), all cases are confidential hull designs, so I can't share much more information about those meshes. The cache was cleared between each of the benchmarks against Vilje using the command: sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

As you can see in the .pdf, the performance using only 16 cores is much higher than the performance with 32 cores, even with as many as ~27 million cells. Any idea why this is happening? I was expecting 32 cores to outperform 16 by far, even in the interDyMFoam test with 9 million cells. Is this assumption wrong?

Because of these results I suspect there's something wrong with our setup, but I have no idea where to start looking. Any comments or recommendations are greatly appreciated.

ErikAdr January 19, 2018 08:12

Re: Poor scaling of dual-Xeon E5-2697A-V4
 
My guess is that the performance of your test case is limited by memory bandwidth rather than CPU power. You have two strong CPUs, but only 8 memory channels. For CFD, two cores are usually enough to saturate one memory channel, and the best performance in your case is achieved at that ratio of cores to memory channels. It doesn't surprise me that the performance drops when using 32 cores, though I don't believe the drop is due solely to too many cores competing for the limited memory bandwidth. For a larger test case 32 cores could do better, but I think 16 cores are optimal.
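As a rough back-of-envelope illustration (assumed numbers for DDR4-2400 with 8 channels, not measurements from your system), the little script below estimates the theoretical peak bandwidth and how the per-core share shrinks as cores are added:
Code:

#!/usr/bin/env bash
# Back-of-envelope only: theoretical peak; real sustained bandwidth is lower.
channels=8    # 2 sockets x 4 channels, one DIMM per channel
mts=2400      # DDR4-2400 -> 2400 MT/s per channel
bytes=8       # 64-bit data bus per channel
peak=$(( channels * mts * bytes ))               # MB/s
echo "Theoretical peak  : ${peak} MB/s"          # 153600 MB/s ~ 153.6 GB/s
echo "Per core, 16 cores: $(( peak / 16 )) MB/s" # ~9.6 GB/s per core
echo "Per core, 32 cores: $(( peak / 32 )) MB/s" # ~4.8 GB/s per core

Going from 16 to 32 cores halves the bandwidth available to each core, which is why a bandwidth-bound solver gains little from the extra cores.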

havref January 22, 2018 03:38

Thank you for the response. It makes sense when you put it like that. I'll do some tests using 24 cores etc. as well, to see where the peak performance is for these cases.
Quote:

Originally Posted by ErikAdr (Post 678703)
For cfd two cores are usually enough to saturate one memory channel. The best performance in your case is achieved with that ratio between cores and memory channels.

So I originally performed these scaling tests to see whether our server's performance scaled well before purchasing a second server. We were looking into the Xeon Gold 6154 3.0 GHz (18 cores), 6146 3.2 GHz (12 cores) and 6144 3.5 GHz (8 cores). As these new Xeon Scalable processors have 6 memory channels per processor, can I assume that both the 8- and 12-core processors are good choices purely based on the number of cores per memory channel? Or do you think that 12 cores will still saturate the memory channels to such a degree that performance decreases?

ErikAdr January 22, 2018 04:53

Re: Poor scaling of dual-Xeon E5-2697A-V4
 
I have no experience with OpenFOAM, so I don't feel qualified to give a definitive answer. But I would not choose fewer than 12 cores, and I would feel safer with 16. For my own in-house CFD code I would go for an EPYC 7281 or EPYC 7301, which both have 8 memory channels and cost just a fraction of Intel's Gold processors. If I had to select an Intel processor, I would go for the cheaper models like the Silver 4116 or Gold 6130, and then buy some more systems to 'fill the budget'. Perhaps you are able to get the high-end processors at a better price than I can. Look at https://www.spec.org/cpu2017/results to see benchmarks for various systems. I usually look at the results for bwaves_r and wrf_r, where the former is the most memory-intensive. Please also look at the other threads on this forum on the topic, e.g. 'Epyc vs Xeon Skylake SP'.

flotus1 January 22, 2018 04:57

I have no experience with OpenFOAM, but such a drastic decrease in performance seems rather unusual.
To check whether there is something wrong with your setup, you could run this benchmark; I have numbers for the same platform to compare against:
http://www.palabos.org/software/download
You will find the benchmark in "examples/benchmarks/cavity3d". I would recommend running problem sizes 100 and 400 with 1, 2, 4, 8... cores.
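In case it is useful, this is roughly how I build and run it. Take it as a sketch only; the directory layout, Makefile and the way the problem size is passed may differ between Palabos releases:
Code:

# assumes an MPI library and C++ compiler are installed; adjust to your Palabos version
cd palabos-*/examples/benchmarks/cavity3d
make                                  # builds the cavity3d executable with mpicxx
for n in 1 2 4 8 16 32; do
    mpirun -np "$n" ./cavity3d 400    # problem size N=400; repeat with N=100
done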

Concerning Skylake-SP, I would strongly advise against it for non-commercial CFD software. You don't have to pay a per-core license, so AMD Epyc would give you much better performance/$.

havref January 22, 2018 05:28

Thank you Alex, I'll install and run that benchmark shortly.

For the second server we were considering a quad-socket setup with one of the Skylake-SP processors. However, if two servers with dual Epyc 7351 (or similar) would give better performance/$, we will definitely consider that. Even a single server with dual Epyc would probably be sufficient if it performs much better than the server we have now.

flotus1 January 22, 2018 05:53

I just realized that I did not run a full set of benchmarks on our newest Intel system, but it should be sufficient for a quick comparison:

2x AMD Epyc 7301
Code:

#threads      msu_100   msu_400   msu_1000
01 (1 die)      9.369    12.720      7.840
02 (2 dies)    17.182    24.809     19.102
04 (4 dies)    33.460    48.814     49.291
08 (8 dies)    56.289    95.870    105.716
16 (8 dies)   102.307   158.212    158.968
32 (8 dies)   169.955   252.729    294.178

2x Intel Xeon E5-2650v4
Code:

#threads  msu_100  msu_400
01          8.412    11.747
24          88.268  154.787

A full description of the systems can be found here: https://www.cfd-online.com/Forums/ha...ys-fluent.html
Yours should slightly outperform my Intel setup with 32 and 16 cores active. Testing your setup with 24 cores might give worse results; this benchmark does not run too well when the number of active cores does not evenly divide the total core count or is not a power of 2.

havref January 22, 2018 10:49

Here are the results. I ran each benchmark 3 times; average values are given in the table on the left, with the exact values to the right. As you can see, there's a huge decrease in MSU when increasing to 32 cores. It should also be noted that the variation between runs is much larger when using 32 cores compared to the rest. The operating system is Ubuntu 16.04 LTS.
Code:

2x Intel Xeon E5-2697A V4

        averages           | individual runs
#Cores  msu_100   msu_400  | msu_100                     | msu_400
1        10.97     12.51   | 10.9552  10.9675  10.9739   | 12.4802  12.5846  12.4516
2        18.20     23.31   | 18.1663  18.022   18.4169   | 23.4069  23.208   23.3125
4        29.26     39.95   | 29.21    29.3525  29.222    | 39.8766  39.566   40.3934
8        52.70     76.58   | 52.6798  52.5828  52.8315   | 76.3575  76.0869  77.3047
16       76.97    123.01   | 76.892   77.2672  76.7622   | 123.351  123.381  122.295
24       84.23    141.68   | 84.6862  83.7586  84.2461   | 140.979  141.109  142.966
32       39.61    113.68   | 36.5778  43.7236  38.5174   | 119.524  116.61   104.91

Obviously there's something strange going on here, but I have no idea what to look for. Got any ideas? Let me know if you need more hardware info.

flotus1 January 22, 2018 11:01

I have never seen such a drastic decrease in performance with this benchmark on any system when using all cores. In fact, I have never observed lower performance with all cores in use compared to a lower core count.
You might want to check system temperatures (sensors) and frequencies (sudo turbostat) while running the benchmark.
I assume hyperthreading is turned off and the memory is populated correctly, i.e. 4 DIMMs per socket in the right slots?
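Something along these lines is what I have in mind, just as a sketch (assuming lm-sensors, turbostat from linux-tools and dmidecode are installed):
Code:

# in a second terminal while the 32-core benchmark is running:
watch -n 2 sensors                      # package/core temperatures
sudo turbostat --interval 5             # per-core Busy% and average MHz

# quick sanity checks for hyperthreading and DIMM population:
lscpu | grep -E 'Socket|Core|Thread'    # "Thread(s) per core" should be 1 with HT off
sudo dmidecode -t memory | grep -E 'Locator|Size|Speed'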

havref January 22, 2018 11:22

5 Attachment(s)
I'll check the memory configuration once again to be sure, but I'm quite sure it is correct. Hyper-threading is turned off in BIOS and only 32 cores are visible from the Ubuntu terminal.

The turbostat output and temperature monitoring are attached. I looked at the temperature over time and after a short while it stabilized at the values seen in the .txt file. I did two runs, so I have attached the temperature and turbostat output from both.

flotus1 January 22, 2018 11:33

Weird. If that turbostat output was taken during a run with 32 cores, all cores should be near 100% load (despite the memory bottleneck) and running at 2900 MHz (edit: 3100 MHz). Something is holding them back, and it is not CPU temperature or power draw.
What Linux kernel are you using and which MPI library?

havref January 22, 2018 11:36

I agree. I attached another set of files to the post above. These are from the middle of the test.

Linux Kernel: 4.13.0-26-generic
MPI library: mpiexec (OpenRTE) 1.10.2
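For reference, these came from something like:
Code:

uname -r             # kernel release
mpiexec --version    # reports the MPI runtime (OpenRTE / Open MPI here)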


Edit: I did three new tests now with 32 cores and N=400, which yielded the following results:
154.4, 157.4 and 138.3 MSU. It does seem to vary quite a lot.

flotus1 January 22, 2018 11:53

I'm slowly running out of ideas. Before agreeing that your hardware is cursed, we could try an even simpler stress test by running:
stress -c 32 -m 32
If CPU load or frequency goes down during this test, it might be a VRM throttling issue, though I doubt that this is the case.
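As a sketch of what I mean (stress is in the Ubuntu repositories; monitor with turbostat as before):
Code:

sudo apt-get install stress           # simple CPU/memory load generator
stress -c 32 -m 32 --timeout 300 &    # 32 CPU workers + 32 memory workers for 5 minutes
sudo turbostat --interval 5           # watch whether Busy% or Avg_MHz drop under load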

havref January 22, 2018 12:23

2 Attachment(s)
Thank you for helping out :)
The turbostat output from the stress test is attached. Unless I am reading it incorrectly, the average frequency looks to be approximately 3100 MHz. Does that mean there is probably something wrong with the memory setup?

Edit: Added Memoryinformation.txt and the following text:
The memory setup of the server is attached. Originally the server was purchased with only 6 RAM sticks (I know), so two more with identical specs, but from a different vendor, were installed. Could this be the fault? Or is there any additional setup in the BIOS that needs to be done for all installed memory to work properly?

The server was (very roughly) tested both before and after the additional RAM upgrade, and showed quite a bit of performance increase. However, if you suspect the RAM configuration is incorrect, I'll be happy to run the same Palabos tests with only 6x 16GB RAM to compare properly.

flotus1 January 22, 2018 15:26

The memory seems to be populated correctly.
Regarding the turbostat output you attached: did you stop the stress test somewhere in between? If so, there is a pretty high idle load on your machine. Any idea where it is coming from?

havref January 23, 2018 09:31

I checked this morning and both Xorg and compiz were using a surprising amount of CPU resources, each around 70% of one CPU core while idle.
First I deactivated most of the unnecessary animations and such for compiz. Secondly, and probably more importantly, I reinstalled the latest microcode for the CPUs, this time not using the Dell-provided version but the processor microcode firmware for Intel CPUs from the intel-microcode package (open source) that I found in the update manager.
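The command-line equivalent should be something like the following; I used the graphical update manager, so take the exact package handling as an approximation:
Code:

sudo apt-get update
sudo apt-get install intel-microcode   # early-load microcode updates for Intel CPUs
sudo reboot
# afterwards, check which microcode revision is actually loaded:
dmesg | grep microcode
grep -m1 microcode /proc/cpuinfo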

New results using the same cavity3d benchmark from Palabos:
Code:

#Cores        msu_100        msu_400
1        11.05        12.83
2        18.40        23.98
4        32.21        43.16
8        58.01        83.04
16        82.62        133.23
24        90.67        153.89
32        105.66        175.41

So, finally some results similar to those of your Intel setup. This is a great improvement over yesterday's results and I guess it is closer to what is expected? Slightly better single-core performance than AMD, but slower when more cores are used, due to the different number of memory channels?

flotus1 January 23, 2018 09:50

That is a good find and a pretty solid improvement.
Yes, AMD Epyc is so much faster with many cores thanks to its 8 memory channels per CPU, and slightly slower single-core due to the low clock speed of 2.7 GHz.

havref January 23, 2018 11:19

Thank you so much for your help, flotus1!

And thanks to both you and ErikAdr for your advice regarding our next server as well. I'm looking into a few Supermicro boards with Epyc processors, and I now think we'll end up with one of the following instead of the Intel CPUs:
Code:

Processor        Euro per processor    Euro total build
7281 (2.1 GHz)                  685                6235
7301 (2.2 GHz)                  870                6600
7351 (2.4 GHz)                 1140                7140

Based on these off-the-website prices and the benchmarks here (https://www.servethehome.com/amd-epy...ks-and-review/), I'm having a hard time deciding whether the upgrade to the 7351 is worth it.

flotus1 January 23, 2018 11:36

I would avoid the 7281 because it only has half the L3 cache of the larger models.
The 7351 gets you 0.2 GHz more clock speed compared to the 7301 (2.9 GHz vs 2.7 GHz all-core turbo; these CPUs always run at maximum turbo speed in CFD workloads). Looking at the prices for the processors alone, this is not worth it at all for 7.4% more clock speed. However, looking at the total system cost, one might be tempted to do this upgrade. Personally, I don't think I would; the performance increase might be less than 5%. I can't make this decision for you ;)

ErikAdr January 24, 2018 04:26

For my own in-house CFD program I would not worry about the smaller L3 cache of the 7281. Per core it has 2 MB, which is the same as for the top-range 32-core EPYC. Intel Gold has 1.375 MB per core. In the SPEC CPU2017 FP suite, the 7281 and 7301 perform equally in 10 of the 13 test cases, but the 7301 is up to 16% faster in the remaining three. Try to compare: https://www.spec.org/cpu2017/results...128-01266.html and https://www.spec.org/cpu2017/results...128-01292.html
The price difference between the 7281 and 7301 is small, so if you believe OpenFOAM benefits from a larger cache, it is a small cost to choose the 7301 instead of the 7281. I don't know if that is the case. I think the 7351 is not worth the extra cost.

