Poor scaling of dual-Xeon E5-2697A-V4
2 Attachment(s)
After getting advice on this forum, I decided to run some scaling tests on our Dell R630 server, equipped with 2x Xeon E5-2697A-V4, 8x16GB 2400MHz RAM and a 480GB SSD. In addition, to benchmark against some larger cases, I tested the cavity case previously run on an HPC system at NTNU ( OpenFOAM Performance on Vilje ).
A .pdf of the results and a .txt file of the memory setup are attached. In the first part (scaling tests), all cases are confidential hull designs, so I can't share much more information about those meshes. The cache was cleared between each of the benchmarks against Vilje using the command:
Code:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
As you can see in the .pdf, the performance with only 16 cores is much higher than with 32 cores, even with as many as ~27 million cells. Any idea why this is happening? I was expecting 32 cores to outperform 16 by far, even in the interDyMFoam test with 9 million cells. Is this assumption wrong? Because of these results, I suspect there is something wrong with our setup, but I have no idea where to start looking. Any comments or recommendations are greatly appreciated.
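A quick way to quantify scaling results like the attached ones is parallel speedup and efficiency. The sketch below uses hypothetical timings (not the attached measurements) just to show the bookkeeping:

```python
# Hypothetical wall-clock times in seconds for one fixed case;
# illustrative numbers only, NOT the attached benchmark results.
times = {1: 1000.0, 16: 80.0, 32: 95.0}

t_serial = times[1]
for n_cores, t in sorted(times.items()):
    speedup = t_serial / t          # how much faster than 1 core
    efficiency = speedup / n_cores  # 1.0 would be perfect scaling
    print(f"{n_cores:2d} cores: speedup {speedup:6.2f}, efficiency {efficiency:6.1%}")
```

If efficiency drops sharply when going from 16 to 32 cores (as in the made-up numbers above), that is the signature of a shared-resource bottleneck rather than a decomposition problem.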
Re: Poor scaling of dual-Xeon E5-2697A-V4
My guess is that the performance of your test case is limited by memory bandwidth rather than CPU power. You have two strong CPUs, but only 8 memory channels. For CFD, two cores are usually enough to saturate one memory channel, and the best performance in your case is achieved at that ratio of cores to memory channels. It doesn't surprise me that the performance drops when using 32 cores, although I don't believe the drop is due solely to too many cores competing for the limited memory bandwidth. For a larger test case 32 cores could do better, but I think 16 cores are optimal.
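That reasoning can be put into a back-of-the-envelope sketch. The numbers below come from the public spec sheet for the E5-2697A v4 (4 DDR4-2400 channels per socket); the "two cores per channel" figure is the rule of thumb from the post above:

```python
# Dual E5-2697A v4: 2 sockets x 4 DDR4-2400 channels.
channels = 2 * 4                 # 8 memory channels total
transfer_rate = 2400e6           # 2400 MT/s per channel
bytes_per_transfer = 8           # 64-bit DDR4 data bus

peak_bw = channels * transfer_rate * bytes_per_transfer / 1e9  # GB/s
cores_to_saturate = channels * 2  # rule of thumb: ~2 cores per channel

print(f"theoretical peak bandwidth ~ {peak_bw:.1f} GB/s")
print(f"~{cores_to_saturate} cores are enough to saturate it")
```

Beyond those ~16 cores, additional cores mostly add contention instead of throughput for a bandwidth-bound solver.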
Thank you for the response. It makes sense when you put it like that. I'll do some tests using 24 cores etc. as well, to see where the peak performance is for these cases.
Re: Poor scaling of dual-Xeon E5-2697A-V4
I have no experience with OpenFOAM, so I don't feel qualified to give a definitive answer on this, but I would not choose fewer than 12 cores, and I would feel safer with 16. For my own in-house CFD code I would go for an EPYC 7281 or EPYC 7301, which both have 8 memory channels and cost just a fraction of Intel's Gold processors. If I had to select an Intel processor, I would go for the cheaper models like the Silver 4116 or Gold 6130, and then buy some more systems to 'fill the budget'. Perhaps you are able to get the high-end processors at a better price than I can. Look at https://www.spec.org/cpu2017/results to see benchmarks for various systems. I usually look at the results for bwaves_r and wrf_r, where the former is the most memory-intensive. Please also look at the other threads on this forum on the topic, e.g. 'Epyc vs Xeon Skylake SP'.
I have no experience with OpenFOAM, but such a drastic decrease of performance seems rather unusual.
To check whether there is something wrong with your setup, you could run the benchmark here; I have numbers for the same platform to compare against: http://www.palabos.org/software/download Find the benchmark in "examples/benchmarks/cavity3d". I would recommend running problem sizes 100 and 400 with 1, 2, 4, 8... cores. Concerning Skylake-SP, I would strongly advise against it for non-commercial CFD software. Since you don't have to pay a per-core license, AMD Epyc would give you much better performance/$.
Thank you Alex, I'll install and run through that benchmark shortly.
For the second server we were considering a quad-socket setup with one of the Skylake-SP processors. However, if two servers with dual Epyc 7351 (or similar) give better performance/$, we will definitely consider that. Even a single server with dual Epyc would probably be sufficient if it performs much better than the server we have now.
I just realized that I did not run a full set of benchmarks on our newest Intel system, but it should be sufficient for a quick comparison:
2x AMD Epyc 7301
Code:
#threads msu_100 msu_400 msu_1000
Code:
#threads msu_100 msu_400
Yours should slightly outperform my Intel setup with 32 and 16 cores active. Testing your setup with 24 cores might give worse results; this benchmark does not run well when the number of active cores neither divides the total core count evenly nor is a power of 2.
Here are the results. I ran each benchmark 3 times; average values are given in the table on the left, with the exact values on the right. As you can see, there is a huge decrease in MSU when increasing to 32 cores. It should also be noted that there is much larger run-to-run variation with 32 cores than with the rest. The operating system is Ubuntu 16.04 LTS.
Code:
2x Intel Xeon E5-2697A V4
I have never seen such a drastic decrease of performance with this benchmark on any system when using all cores. In fact, I have never observed lower performance with all cores in use than with fewer cores.
You might want to check system temperatures (sensors) and frequencies (sudo turbostat) while running the benchmark. I assume hyperthreading is turned off and the memory is populated correctly, i.e. 4 DIMMs per socket in the right slots?
5 Attachment(s)
I'll double-check the memory configuration to be sure, but I'm quite confident it is correct. Hyper-threading is turned off in the BIOS, and only 32 cores are visible from the Ubuntu terminal.
The turbostat file and temperature monitoring are attached. I watched the temperature over time, and after a short while it stabilized at the values seen in the .txt file. I did two runs and attached the temperature and turbostat output from each.
Weird. If that turbostat output was taken during a run with 32 cores, all cores should be near 100% load (despite the memory bottleneck) and running at 2900MHz (edit: 3100MHz). Something is holding them back, and it is not CPU temperature or power draw.
Which Linux kernel are you using, and which MPI library?
I agree. I attached another set of files in the post above. These are in the middle of the test.
Linux kernel: 4.13.0-26-generic
MPI library: mpiexec (OpenRTE) 1.10.2
Edit: I did three new tests with 32 cores and N=400, which yielded the following results: 154.4, 157.4 and 138.3 msu. It does seem to vary quite a lot.
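The run-to-run scatter in those three 32-core results can be quantified with a quick sketch (the msu values are taken from the post above):

```python
# Three reported 32-core, N=400 runs in msu (from the post above).
runs = [154.4, 157.4, 138.3]

mean = sum(runs) / len(runs)
spread = (max(runs) - min(runs)) / mean  # relative max-min spread

print(f"mean = {mean:.1f} msu, spread = {spread:.1%}")
```

A spread above 10% between identical runs is far more than normal measurement noise for this benchmark and points at some external interference.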
I'm slowly running out of ideas. Before agreeing that your hardware is cursed, we could try an even simpler stress test:
Code:
stress -c 32 -m 32
If CPU load or frequency drops during this test, it might be a VRM throttling issue, though I doubt that is the case.
2 Attachment(s)
Thank you for helping out:)
Turbostat output from the stress test is attached. Unless I am reading it incorrectly, the average frequency looks to be approximately 3100MHz. Does that mean there is probably something wrong with the memory setup?
Edit: Added Memoryinformation.txt and the following text: The memory setup of the server is attached. Originally the server was purchased with only 6 RAM sticks (I know), so two more with identical specs, but from a different vendor, were installed. Could this be the fault? Or is there any additional setup needed in the BIOS for all installed memory to work properly? The server was (very roughly) tested both before and after the RAM upgrade and showed quite a performance increase. However, if you suspect the RAM configuration is incorrect, I'll be happy to run the same Palabos tests with only 6x16GB RAM to compare properly.
Memory seems to be populated correctly.
Regarding the turbostat output you attached: did you stop the stress test somewhere in between? If so, there is a pretty high idle load on your machine. Any idea where it is coming from?
I checked this morning, and both Xorg and compiz were using a surprising amount of CPU resources, each around 70% of one CPU core when idle.
First I deactivated most of the unnecessary animations and such for compiz. Secondly, and probably more importantly, I reinstalled the latest microcode for the CPUs, this time not using the provided Dell drivers but the 'Processor microcode firmware for Intel CPUs' from the open-source intel-microcode package I found in the update manager. New results using the same cavity3d benchmark from Palabos: Code:
#Cores msu_100 msu_400
That is a good find and a pretty solid improvement.
Yes, AMD Epyc is so much faster with many cores thanks to 8 memory channels per CPU, and slightly slower single-core due to the lower clock speed of 2.7GHz.
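The bandwidth gap behind that statement can be sketched with public spec-sheet numbers (Naples Epyc: 8 DDR4-2666 channels per socket; Broadwell-EP: 4 DDR4-2400 channels per socket):

```python
def peak_bw_gb_s(n_channels, mt_per_s):
    """Theoretical peak DDR4 bandwidth: channels x MT/s x 8 bytes."""
    return n_channels * mt_per_s * 8 / 1000.0

broadwell = peak_bw_gb_s(2 * 4, 2400)  # dual E5-2697A v4
epyc      = peak_bw_gb_s(2 * 8, 2666)  # dual Epyc 7301/7351

print(f"dual Broadwell: {broadwell:.1f} GB/s, dual Epyc: {epyc:.1f} GB/s")
```

Roughly 2.2x the theoretical bandwidth is why a dual Epyc system pulls ahead on memory-bound CFD despite the lower clocks.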
Thank you so much for your help, flotus1!
And thank both you and ErikAdr for your advice regarding our next server as well. I'm looking into a few Supermicro boards with Epyc processors, and I now think we'll end up with one of the following instead of the Intel CPUs: Code:
Processor Euro per processor Euro total build
I would avoid the 7281 because it only has half the L3 cache of the larger models.
The 7351 gets you 0.2GHz more clock speed than the 7301 (2.9GHz vs 2.7GHz all-core turbo; these CPUs always run at maximum turbo speed for CFD workloads). Looking at the prices for the processors alone, this is not worth it at all for 7.4% more clock speed. However, looking at the total system cost, one might be tempted to do this upgrade. Personally, I don't think I would; the performance increase might be less than 5%. I can't make this decision for you ;)
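The 7.4% figure above follows directly from the clock ratio; a one-line check:

```python
clock_7301, clock_7351 = 2.7, 2.9  # all-core turbo, GHz (from the post above)

gain = clock_7351 / clock_7301 - 1
print(f"clock advantage of the 7351: {gain:.1%}")
# Memory-bound CFD rarely scales 1:1 with clock speed, so the real
# speedup is likely below this ~7.4% upper bound.
```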
For my own in-house CFD program I would not worry about the smaller L3 cache of the 7281. Per core it has 2 MB, which is the same as the top-range 32-core EPYC. Intel's Gold has 1.375 MB per core. In the spec2017fp results, out of 13 test cases the 7281 and 7301 perform equally in 10, but the 7301 is up to 16% faster in the remaining three. Try to compare: https://www.spec.org/cpu2017/results...128-01266.html and https://www.spec.org/cpu2017/results...128-01292.html
The price difference between the 7281 and 7301 is small, so if you believe OpenFOAM benefits from a larger cache, it is a small cost to choose the 7301 instead. I don't know whether that is the case. I think the 7351 is not worth the extra cost.
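The per-core cache comparison above can be verified with quick arithmetic. The L3 sizes below are from public spec sheets; the Epyc 7601 and Gold 6130 stand in for "top-range 32-core EPYC" and "Intel Gold" respectively:

```python
# (total L3 in MB, core count) from public spec sheets
cpus = {
    "Epyc 7281": (32.0, 16),
    "Epyc 7601": (64.0, 32),  # top-range 32-core Epyc
    "Gold 6130": (22.0, 16),  # one example of the Intel Gold parts
}

for name, (l3_mb, cores) in cpus.items():
    print(f"{name}: {l3_mb / cores:.3f} MB L3 per core")
```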