March 1, 2020, 04:50
Intel based workstation selection
#1
Member
Join Date: Oct 2019
Posts: 63
Rep Power: 7
Hi friends,
I am using Flow 3D, which supports only the Intel Fortran compiler for customization, on CentOS 7.7 (kernel 3.10). The solver did not show good performance on an Epyc 7302P, so for the second system I would prefer to try one of Intel's products. I really cannot decide between the following alternatives:
Xeon Silver 4214R + 6x 2666 MHz
i9-10900X + 4x 3200 MHz
Waiting for your suggestion, thanks.
March 1, 2020, 10:46
#2
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
You have been searching for a setup that meets your expectations for quite a while now.
With a few possible bottlenecks ruled out, I think you should take a step back and try to evaluate what is causing your code to perform worse than expected. I would probably start with a scaling analysis. Maybe your "customizations" are running single-threaded, or something else about the code or setup causes poor scaling. That would turn this into a software problem, not a hardware one.
March 1, 2020, 11:30
#3
Member
Join Date: Oct 2019
Posts: 63
Rep Power: 7
Thank you, dear Alex, you are somehow right. In my project, after much optimization of the code, I finally achieved 100 percent of the performance reported by the software. Even so, I am still using the EPYC, due to its low temperature and sufficient performance over long runs. But how much faster is it? About 15 percent faster than my personal 2920X. Note that the 2920X reaches 62°C on 10 cores, and I don't like to run it for, e.g., 20 hours continuously. Part of this difference comes from the 2666 MHz RAM (8x dual-rank) in the EPYC system, but I had to stay within the budget line for each unit. The main problem with the EPYC is that NPS0 mode strangely runs my project faster than NPS2 or NPS4. The situation gets worse because I am limited to CentOS 7 with its old kernel version. We cannot change all this code in a short time, but the new researcher has the budget to buy a new workstation. Maybe I am wrong, but the EPYC does not seem to be an ideal CPU for code built with Intel Fortran (2016), or for any software (or OS) that is not NUMA-aware. Please let me know your opinion.
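For reference, the NUMA layout the OS actually sees under each NPS setting can be checked like this (a quick sketch using standard Linux tools, assuming the numactl package is installed; these are not Flow 3D commands):
Code:
numactl --hardware     # NUMA node count, CPUs and memory per node
lscpu | grep -i numa   # short summary of the NUMA topology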
March 1, 2020, 15:34
#4
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
According to the info I found on the interwebs about this software, it used to be parallelized via OpenMP. In later versions (11.2 and up), they added hybrid MPI/OpenMP. With OpenMP on systems with more than one NUMA node (like the TR 2920X, or the Epyc 7302P in NPS2 and NPS4 mode), core binding is one crucial piece of getting decent performance, along with the solver being written with NUMA in mind, which you have no control over. And with hybrid MPI/OpenMP, core binding and placement become even more important. The software should have some documentation and hints on how to control these parameters. If it doesn't, I would get their support involved.
It's a total shot in the dark, because it might get overwritten internally by the software, but try using OMP_PROC_BIND=true and see if it does anything for performance. In the shell you start the software with, type
Code:
export OMP_PROC_BIND=true
Additionally, you should clear caches/buffers before starting a new run (root privileges required):
Code:
echo 3 > /proc/sys/vm/drop_caches
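As a quick sanity check, the standard OMP_DISPLAY_ENV variable makes the OpenMP runtime print its effective settings (including proc_bind) at startup; whether Flow 3D's bundled runtime surfaces that output is an assumption you would have to test:
Code:
export OMP_DISPLAY_ENV=true   # runtime prints its OpenMP settings at startup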
March 2, 2020, 00:39
#5
Member
Join Date: Oct 2019
Posts: 63
Rep Power: 7
Thank you for your time and helpful comments.
I don't know exactly how it works, but I can share some results that depend on both hardware and software differences. For example, with the 2970WX the performance number drops from 100 above 10 cores and shows 60 percent at 24 cores. From a software viewpoint, if the time step gradually falls below a specific value, the performance follows the same pattern and shows something like 70-80 percent. By the way, at the moment my EPYC shows between 98 and 100 percent at 16 cores (44°C).
The only sentence I found in the manual is: "The KMP_AFFINITY environment variable gives users access to Intel's Thread Affinity Interface, which controls how OpenMP threads are bound to physical processors. Setting this variable to scatter can sometimes improve the performance of the solver." I found the links below but need some time to check them:
https://software.intel.com/en-us/for...ead-allocation
https://www.nas.nasa.gov/hecc/suppor...nning_285.html
https://software.intel.com/en-us/cpp...ux-and-windows
March 2, 2020, 01:32
#6
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
I almost forgot that OMP_PROC_BIND only works partially on more recent versions of Intel Fortran. You can do the same thing with KMP_AFFINITY: https://software.intel.com/en-us/for...NMENT_VARIABLE
To start with, I would use
Code:
export KMP_AFFINITY=verbose,scatter
But don't forget about clearing caches. When the memory on a NUMA node is clogged up with cached data, most of the other efforts are in vain.
Didn't AMD get rid of that weird temperature offset a while ago through BIOS and microcode updates? While 90°C would indeed be a bit hot for a second-gen TR, pointing towards a suboptimal cooling solution, you can still run at these temperatures for years.
March 2, 2020, 09:25
#7
Member
Join Date: Oct 2019
Posts: 63
Rep Power: 7
I checked all these conditions in NPS0 to NPS4
(echo 3 > /proc/sys/vm/drop_caches before each test):
export KMP_AFFINITY=verbose,scatter ......... Flow 3D failed
export KMP_AFFINITY=verbose,compact ......... Flow 3D failed
export KMP_AFFINITY=scatter ................. ok
export KMP_AFFINITY=compact ................. ok
export KMP_AFFINITY=disable ................. ok
After 4 hours of testing, I did not find any benefit, and NPS0 still shows the best performance regardless of the affinity setup.
P.S. Dear Alex, please let me know how I can control the fan speed on this Supermicro motherboard; the RPM continuously cycles between 500 and 1000 and back, which is very annoying. Is there any solution? Thanks again.
Last edited by Habib-CFD; March 2, 2020 at 11:53.
March 2, 2020, 13:26
#8
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
The fan issue stems from Supermicro setting the lower RPM threshold way too high for silent fans. You can change these threshold values through IPMI: https://calvin.me/quick-how-to-decre...fan-threshold/
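If you prefer the command line over the guide's web interface, ipmitool can set the same thresholds; the sensor name FAN1 and the three RPM values below are placeholders, so list your board's sensors first:
Code:
ipmitool sensor                                 # list sensor names and current thresholds
ipmitool sensor thresh FAN1 lower 100 200 300   # lower non-recoverable/critical/non-critical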
We really need a scaling analysis to move forward. Use "compact" for affinity, and make sure to clear caches before each run. There should not be any message after invoking the clear-cache command; otherwise it probably didn't work. And check with htop during the runs to see which cores actually get loaded, and whether the threads still get moved around by the OS. With "compact", a simulation on a single thread should run on core 0 (reported as core 1 in htop); a simulation with 2 threads should run on cores 0+1, and so on. Plot the performance metric (iterations per second, run time, or whatever) for 1, 2, 4, 6, 8, 12, 16 threads.
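A minimal sketch of such a run series, assuming the case can be started from a shell script; run_case.sh is a placeholder for however you launch the Flow 3D solver in batch mode, and the cache-clearing line needs root:
Code:
#!/bin/bash
export KMP_AFFINITY=compact
for n in 1 2 4 6 8 12 16; do
    sync; echo 3 > /proc/sys/vm/drop_caches    # clear caches before each run
    export OMP_NUM_THREADS=$n
    start=$(date +%s)
    ./run_case.sh                              # placeholder: your solver launch command
    echo "$n threads: $(( $(date +%s) - start )) s" >> scaling.log
done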
March 3, 2020, 00:32
#9
Member
Join Date: Oct 2019
Posts: 63
Rep Power: 7
Thank you, I changed the fan thresholds successfully.
The core affinity was not changed by the software or the OS, and it follows the compact mode in htop. However, I was shocked by the results:
Cores....My model....Default sample of software
1........448s........575s
2........298s........354s
4........229s........269s
8........184s........183s
12.......169s........144s
16.......161s........140s
Up to 12 cores I need to optimize my code, but above 12 it seems Flow 3D itself has some problems.
March 3, 2020, 02:00
#10
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
I'd rather say Flow-3D has some issues right from the start. Here is what your results look like in terms of strong scaling:
scaling.png
While scaling is better with the default sample, it is still far from acceptable. Since you used compact binding, we can also rule out that this is a NUMA problem: already going from 1 to 2 cores on the same NUMA node, scaling drops way below ideal. I don't think buying any other CPU would help here. In general, when CPU cores are too slow, scaling gets better; but here scaling is the main issue. So either there is a significant serial portion in the code (which is even larger in your custom case), or the model does not fit into memory. How much memory do these 2 cases consume?
As a side note: we clearly see that the software reporting "100% parallel efficiency" is utter nonsense. Parallel efficiency is usually defined as actual speedup divided by ideal speedup, which is somewhere below 30% in these 2 cases.
The only other thing that comes to mind: the solver itself might be running fine and with proper parallel efficiency, but your run time measurements include another component, like file I/O happening outside of the solver, or run times for pre- and post-processing.
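To make that concrete, here is the arithmetic applied to the run times from your table (just illustrative; awk is only used for the floating-point math):
Code:
# parallel efficiency = (T1 / Tn) / n, here with n = 16
awk 'BEGIN { printf "custom:  %.0f%%\n", 100*448/(161*16) }'   # -> 17%
awk 'BEGIN { printf "default: %.0f%%\n", 100*575/(140*16) }'   # -> 26%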
March 3, 2020, 02:43
#11
Member
Join Date: Oct 2019
Posts: 63
Rep Power: 7
Nice analysis. I agree with you about the nature of the problem.
The total RAM usage is about 2 GB for both tests. I am using a 970 NVMe in this system. The preprocessing takes around 2 s, and I have already accounted for it in the results (subtracted). Unfortunately, I cannot check the scaling on my Intel-family PCs right now, but I found some recorded data for the default sample of the software using only a single thread on my i7-7700K, which shows 630 s of total run time (SMT on, 2666 MHz, Windows 10).
March 3, 2020, 05:20
#12
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
So the Intel CPU was slower even on a single core, despite its higher clock speed.
Just to be 100% sure: using an NVMe SSD is no guarantee against I/O bottlenecks. You can monitor file I/O on Linux using iotop (root privileges required). It will give you a summary of the total reads/writes to disk, as well as individual results for all processes.
After that, I think we are out of ideas. So it would be time to present these findings to whoever does support for this software. Even if you are no longer a paying customer, these poor results should be enough incentive for them to have a closer look.
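For example, to show only processes that are actually doing I/O and accumulate totals over the run:
Code:
sudo iotop -o -a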
March 3, 2020, 05:44
#13
Member
Join Date: Oct 2019
Posts: 63
Rep Power: 7
You convinced me: iotop shows just 0.04 percent I/O usage.
Thank you again for this helpful discussion.
March 3, 2020, 06:43
#14
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
I didn't set out to convince you of anything. At least I like to think so myself.
There are cases where one CPU is clearly better than another, and that holds true to this day for both Intel and AMD. But I think we worked out together that this is not a problem of unsuitable hardware, but of software.
March 11, 2020, 18:33
#15
Member
Join Date: Oct 2019
Posts: 63
Rep Power: 7
Finally, I had time to test my 2920X's performance with the default sample of Flow 3D.
The results show a very close competition between clock frequency and IPC until 8 cores, considering the memory difference:
Default sample of Flow 3D, both on CentOS 7.7
Cores....2920X (4x3200 MHz)....7302P (8x2666 MHz, DR)
1........483s..................575s
2........328s..................354s
4........235s..................269s
8........173s..................183s
12.......165s..................144s
16.......---...................140s
Last edited by Habib-CFD; March 11, 2020 at 22:48.