Intel based workstation selection

March 1, 2020, 03:50  |  #1
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
Hi friends,
I am using Flow-3D, which supports only the Intel Fortran compiler for customization, on CentOS 7.7 (kernel 3.10).
The solver did not perform well on an Epyc 7302P, so for the second system I would prefer to try an Intel product.

I really cannot decide between the following alternatives:
Xeon Silver 4214R + 6x2666MHz
i9-10900X + 4x3200MHz


Waiting for your suggestion,

Thanks.

March 1, 2020, 09:46  |  #2
flotus1 / Alex (Super Moderator | Join Date: Jun 2012 | Location: Germany | Posts: 3,400)
You have been searching for a setup that meets your expectations for quite a while now.
With a few possible bottlenecks ruled out, I think you should take a step back and try to evaluate what is causing your code to perform worse than expected.

I would probably start with a scaling analysis. Maybe your "customizations" are running single-threaded, or something else about the code or setup causes poor scaling, which would make this a software problem rather than a hardware one.
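For instance, a quick way to see whether the solver actually spawns and loads multiple threads during a run (a minimal sketch; "solver_name" is just a placeholder for whatever the Flow-3D solver process is called on your machine):
Code:
pid=$(pgrep -n solver_name)   # placeholder process name; picks the newest matching PID
top -H -p "$pid"              # -H lists the individual threads and their CPU usage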

March 1, 2020, 10:30  |  #3
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
Quote:
Originally Posted by flotus1 View Post
You have been searching for a setup that meets your expectations for quite a while now.
With a few possible bottlenecks ruled out, I think you should take a step back and try to evaluate what is causing your code to perform worse than expected.

I would probably start with a scaling analysis. Maybe your "customizations" are running single-threaded, or something else about the code or setup causes poor scaling, which would make this a software problem rather than a hardware one.

Thank you, dear Alex, you are partly right. In my project, after many optimizations of the code, I finally achieved 100 percent performance (as reported by the software). I am still using the EPYC because of its low temperatures and sufficient performance over long runs. But how much faster is it? About 15 percent faster than my personal 2920X. Note that the 2920X reaches 62 degrees on 10 cores, and I don't like running it for, e.g., 20 hours continuously. Part of the difference comes from the 2666 MHz RAM (8x dual-rank) in the EPYC system, but I had to stay within the budget limit for each unit.

The main problem with the EPYC is that NPS0 strangely runs my project faster than NPS2 or NPS4. The situation gets worse because I am limited to CentOS 7 with its old kernel. We cannot change all this code in a short time, but the new researcher has the budget to buy a new workstation. Maybe I am wrong, but the EPYC does not seem to be an ideal CPU for Intel Fortran (2016) code, or for any software (or OS) that is not truly NUMA-aware. Please let me know your opinion.

March 1, 2020, 14:34  |  #4
flotus1 / Alex (Super Moderator | Join Date: Jun 2012 | Location: Germany | Posts: 3,400)
Quote:
after many optimizations of the code, I finally achieved 100 percent performance (as reported by the software)
I have no clue what that means, but it sounds dubious to say the least. The CFD software reports that it is running at 100% performance? How is that defined? What is the reference point?

Quote:
Note that the 2920X reaches 62 degrees on 10 cores, and I don't like running it for, e.g., 20 hours continuously.
I don't see the problem here. CPUs can run this hot, some even near 100°C. 62°C is nothing to worry about, and not a reason to avoid long periods of high load.

Quote:
The main problem with the EPYC is that NPS0 strangely runs my project faster than NPS2 or NPS4
Now that's something to work with.
According to the info I found on the interwebs about this software, it used to be parallelized via OpenMP. In later (11.2 and up) versions, they added hybrid MPI/OpenMP.
With OpenMP on systems with more than one NUMA node (like the TR 2920X, or the Epyc 7302P in NPS2 or NPS4 mode), core binding is one crucial piece of getting decent performance, along with the solver being written with NUMA in mind, which you have no control over.
And with hybrid MPI/OpenMP, core binding and placement become even more important. The software should have some documentation or hints on how to control these parameters. If they don't, I would get their support involved.

It's a total shot in the dark, because it might get overwritten internally by the software. But try using OMP_PROC_BIND=true and see if it does anything for performance. In the shell you start the software with, type
Code:
export OMP_PROC_BIND=true
then start the software or batch run.
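For reference, OMP_PROC_BIND and its companion OMP_PLACES are standard OpenMP environment variables (OpenMP 4.0 and later); whether Flow-3D's solver honours them is an assumption on my part. A minimal sketch of the full set:
Code:
export OMP_NUM_THREADS=16     # number of OpenMP threads
export OMP_PLACES=cores       # one place per physical core
export OMP_PROC_BIND=true     # forbid the runtime/OS from migrating threads between places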
Additionally, you should clear caches/buffers before starting a new run (root privileges required)
Code:
echo 3 > /proc/sys/vm/drop_caches
These are the most basic settings for a scaling analysis, which I still highly recommend.

Quote:
Maybe I am wrong but the EPYC is not ideal CPU for the codes of Intel FORTRAN(2016) or any software (or OS) which is not Numa aware literally
I am able to get decent results with Fortran OpenMP code on Epyc CPUs, both with gfortran and the Intel Fortran compiler, as long as I write code that follows standard practice for scaling across several NUMA nodes. This is my responsibility as a programmer; the programming language is not to blame. The same applies to other programming languages.

March 1, 2020, 23:39  |  #5
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
Thank you for your time and helpful comments.

Quote:
Originally Posted by flotus1 View Post
I have no clue what that means, but it sounds dubious to say the least. The CFD software reports that it is running at 100% performance? How is that defined? What is the reference point?
"%PE = The parallel efficiency of the solver. This is a measure of how effective the parallelization scheme is for a particular problem."
I don't know exactly how it works, but I can share some results that depend on both hardware and software differences. For example, with the 2970WX the reported number drops from 100 when using more than 10 cores, showing about 60 percent at 24 cores. From the software side, if the time step gradually falls below a certain value, the reported performance follows the same pattern and shows something like 70-80 percent. By the way, right now my EPYC shows between 98 and 100 percent on 16 cores (at 44°C).

Quote:
Originally Posted by flotus1 View Post
I don't see the problem here. CPUs can run this hot, some even near 100°C. 62°C is nothing to worry about, and not a reason to avoid long periods of high load.
As you may know, the maximum temperature of the 2000-series Threadripper CPUs is limited to 68 degrees (per AMD). At full load, AMD Ryzen Master shows 62 degrees under Windows. The same load under Linux reads around 90 degrees, because of the +27°C offset between the reported tCtl temperature and the actual Tj temperature.

Quote:
Originally Posted by flotus1 View Post
Now that's something to work with.
According to the info I found on the interwebs about this software, it used to be parallelized via OpenMP. In later (11.2 and up) versions, they added hybrid MPI/OpenMP..... It's a total shot in the dark, because it might get overwritten internally by the software. But try using OMP_PROC_BIND=true and see if it does anything for performance.
Very nice comments. I followed your suggestions carefully under different conditions (NPS0 to NPS4). Unfortunately, it did not make a difference. You have encouraged me to think more deeply about this issue.
The only relevant sentence I found in the manual is:
"The KMP_AFFINITY environment variable gives users access to Intel’s Thread Affinity Interface, which controls how OpenMP threads are bound to physical processors. Setting this variable to scatter can sometimes improve the performance of the solver."

I found the links below but need some time to check them:
https://software.intel.com/en-us/for...ead-allocation
https://www.nas.nasa.gov/hecc/suppor...nning_285.html
https://software.intel.com/en-us/cpp...ux-and-windows

March 2, 2020, 00:32  |  #6
flotus1 / Alex (Super Moderator | Join Date: Jun 2012 | Location: Germany | Posts: 3,400)
I almost forgot that OMP_PROC_BIND only works partially with more recent versions of Intel Fortran. You can do the same thing with KMP_AFFINITY: https://software.intel.com/en-us/for...NMENT_VARIABLE
To start with, I would use
Code:
export KMP_AFFINITY=verbose,scatter
For a scaling analysis, maybe replace "scatter" with "compact".

But don't forget about clearing caches. When the memory on a NUMA node is clogged up by cached memory, most of the other efforts are in vain.
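If you want to check for that, the numactl and numastat tools (if installed) show per-node memory usage, including how much is taken up by the page cache:
Code:
numactl --hardware   # lists each NUMA node with its total and currently free memory
numastat -m          # per-node memory breakdown, including FilePages (page cache)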

Didn't AMD get rid of that weird temperature offset a while ago through BIOS and microcode updates? While 90°C would in fact be a bit hot for a second-gen Threadripper, pointing towards a suboptimal cooling solution, you can still run at these temperatures for years.

March 2, 2020, 08:25  |  #7
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
I checked all of these settings under NPS0 through NPS4
(running echo 3 > /proc/sys/vm/drop_caches before each test):


export KMP_AFFINITY=verbose,scatter.........Flow 3D failed
export KMP_AFFINITY=verbose,compact...... Flow 3D failed
export KMP_AFFINITY=scatter ......ok
export KMP_AFFINITY=compact.......ok
export KMP_AFFINITY=disable.......ok


After 4 hours of testing, I did not see any benefit; NPS0 still gives the best performance regardless of the affinity setup.


P.S. Dear Alex, could you let me know how I can control the fan speeds on this Supermicro motherboard? The RPM continuously cycles from 500 to 1000 and back again, which is very annoying. Is there any solution? Thanks again.

Last edited by Habib-CFD; March 2, 2020 at 10:53.

March 2, 2020, 12:26  |  #8
flotus1 / Alex (Super Moderator | Join Date: Jun 2012 | Location: Germany | Posts: 3,400)
The fan issue stems from Supermicro setting the lower RPM threshold way too high for silent fans. You can change these threshold values through IPMI: https://calvin.me/quick-how-to-decre...fan-threshold/
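In case that link ever goes stale, the gist is changing the lower fan thresholds with ipmitool (assuming it is installed; the sensor name "FAN1" and the RPM values below are only examples and are board- and fan-specific):
Code:
ipmitool sensor list                            # find the exact fan sensor names and current thresholds
ipmitool sensor thresh FAN1 lower 100 200 300   # lower non-recoverable / critical / non-critical RPM thresholds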

We really need a scaling analysis to move forward. Use "compact" for affinity, and make sure to clear caches before each run. There should not be any message after invoking the clear-cache command, otherwise it probably didn't work. And check with htop during the runs to see which cores actually get loaded, and whether the threads still get switched around by the OS.
With "compact", a simulation on a single thread should run on core 0 (reported as core 1 in htop). A simulation with 2 threads should run on cores 0+1 and so on.
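If you want a second opinion besides htop, you can also list the solver's threads and the CPU each one is currently running on ("solver_name" is again just a placeholder for the actual process name):
Code:
ps -L -o tid,psr,pcpu,comm -p $(pgrep -n solver_name)   # one line per thread: thread id, current CPU, %CPU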

Plot the performance metric (iterations per second, run time or whatever) for 1, 2, 4, 6, 8, 12, 16 threads.
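A sketch of how such a sweep could be scripted, run as root because of drop_caches. "solver_launch_command" is only a placeholder for the actual Flow-3D batch invocation, and I am assuming the thread count can be set via OMP_NUM_THREADS; otherwise set it in the solver input instead.
Code:
#!/bin/bash
export KMP_AFFINITY=compact
for n in 1 2 4 6 8 12 16; do
    sync && echo 3 > /proc/sys/vm/drop_caches                    # clear the page cache before each run
    export OMP_NUM_THREADS=$n
    /usr/bin/time -o time_${n}cores.txt solver_launch_command    # GNU time writes the wall-clock time per core count to a file
done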

March 2, 2020, 23:32  |  #9
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
Thank you, I changed the fan thresholds successfully.

The core affinity was not changed by the software or the OS; it follows the compact mode in htop. However, I was shocked by the results.

Cores.........My model.......Default sample of software
1...................448s..................575s
2...................298s..................354s
4...................229s..................269s
8...................184s..................183s
12.................169s..................144s
16.................161s..................140s

Up to 12 cores, I need to optimize my own code, but above 12 cores it seems Flow-3D itself has some problems.

March 3, 2020, 01:00  |  #10
flotus1 / Alex (Super Moderator | Join Date: Jun 2012 | Location: Germany | Posts: 3,400)
I'd rather say Flow-3D has some issues right from the start. Here is what your results look like in terms of strong scaling:
[attachment: scaling.png]
While scaling is better with the default sample, it is still far from acceptable.
Since you used compact binding, we can also rule out that this is a problem with NUMA. Already going from 1 to 2 cores on the same NUMA node, scaling drops way below ideal.
I don't think buying any other CPU could help here. In general, when CPU cores are too slow, scaling gets better. But here scaling is the main issue.

So either there is a significant serial portion in the code (which is even larger with your custom case), or the model does not fit into memory. How much memory do these 2 cases consume?
As a side-note: we clearly see that the software reporting "100% parallel efficiency" is utter nonsense. Parallel efficiency is usually defined as actual speedup divided by ideal speedup. Which is somewhere below 30% in these 2 cases.
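As a quick check with the numbers from your table (default sample): speedup at 16 cores = 575 s / 140 s ≈ 4.1, so parallel efficiency ≈ 4.1 / 16 ≈ 26%.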
The only other thing that comes to mind: the solver itself might be running fine and with proper parallel efficiency, but your run time measurements include another component. Like file I/O, happening outside of the solver. Or run times for pre- and post-processing.

March 3, 2020, 01:43  |  #11
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
Nice analysis. I agree with you about the nature of the problem.
The total RAM usage is about 2 GB for both tests. I am using a 970 NVMe SSD in this system. Preprocessing takes around 2 s, and I have already subtracted it from the results. Unfortunately, I cannot check the scaling on my Intel machines right now, but I found some recorded data for the default sample run on a single thread of my i7-7700K, which shows a total run time of 630 s (SMT on, 2666 MHz, Windows 10).

March 3, 2020, 04:20  |  #12
flotus1 / Alex (Super Moderator | Join Date: Jun 2012 | Location: Germany | Posts: 3,400)
So the Intel CPU was slower even on a single core, despite higher clock speed.
Just to be 100% sure: using an NVMe SSD is no guarantee for avoiding I/O bottlenecks. You can monitor file-I/O on Linux using iotop (root privileges required). It will give you a summary of the total reads/writes to disk, as well as individual results for all processes.
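For example:
Code:
iotop -o -a   # -o: only show processes/threads that are actually doing I/O, -a: accumulated totals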

After that, I think we are out of ideas. So it would be time to present these findings to whoever does support for this software. Even if you are no longer a paying customer, these poor results should be enough incentive for them to take a closer look.

March 3, 2020, 04:44  |  #13
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
You convinced me. iotop shows only 0.04 percent I/O usage.
Thank you again for this helpful discussion.

March 3, 2020, 05:43  |  #14
flotus1 / Alex (Super Moderator | Join Date: Jun 2012 | Location: Germany | Posts: 3,400)
I didn't set out to convince you of anything. At least I like to think so myself.
There are cases where one CPU is clearly better than another, which holds true to this day for both Intel and AMD.
But I think we worked out together that this is not a problem with unsuitable hardware, but software instead.

March 11, 2020, 17:33  |  #15
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
Finally, I had time to test my 2920x performance with the default sample of Flow 3D.

The results show a close competition between clock frequency and IPC up to 8 cores, keeping the memory configuration difference in mind.

Default sample of Flow-3D, both on CentOS 7.7:

Cores........2920X (4x3200MHz)......7302P (8x2666MHz-DR)
1.......................483s.......................575s
2.......................328s.......................354s
4.......................235s.......................269s
8.......................173s.......................183s
12......................165s.......................144s
16......................n/a........................140s

Last edited by Habib-CFD; March 11, 2020 at 21:48.
