Intel based workstation selection

March 1, 2020, 03:50  |  #1
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
Hi friends,
I am using Flow-3D, which supports only the Intel Fortran compiler for customization, on CentOS 7.7 (kernel 3.10).
The solver did not perform well on an Epyc 7302P, so for the second system I would prefer to try an Intel product.

I really cannot decide between the following alternatives:
Xeon Silver 4214R + 6x2666MHz
i9-10900X + 4x3200MHz


Waiting for your suggestion,

Thanks.

March 1, 2020, 09:46  |  #2
flotus1 / Alex (Super Moderator | Join Date: Jun 2012 | Location: Germany | Posts: 3,400)
You have been searching for a setup that meets your expectations for quite a while now.
With a few possible bottlenecks ruled out, I think you should take a step back and try to evaluate what is causing your code to perform worse than expected.

I would probably start with a scaling analysis. Maybe your "customizations" are running single-threaded, or something else about the code or setup causes poor scaling, which would make this a software problem rather than a hardware one.
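For instance, a quick way to see whether the solver actually spawns and loads multiple threads during a run (a minimal sketch; "solver_name" is just a placeholder for whatever the Flow-3D solver process is called on your machine):
Code:
pid=$(pgrep -n solver_name)   # placeholder process name; picks the newest matching PID
top -H -p "$pid"              # -H lists the individual threads and their CPU usage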

March 1, 2020, 10:30  |  #3
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
Quote:
Originally Posted by flotus1 View Post
You have been searching for a setup that meets your expectations for quite a while now.
With a few possible bottlenecks ruled out, I think you should take a step back and try to evaluate what is causing your code to perform worse than expected.

I would probably start with a scaling analysis. Maybe your "customizations" are running single-threaded, or something else about the code or setup causes poor scaling, which would make this a software problem rather than a hardware one.

Thank you, dear Alex, you are partly right. In my project, after many optimizations of the code, I finally achieved 100 percent performance (as reported by the software). I am still using the EPYC because of its low temperatures and sufficient performance over long runs. But how much faster is it? About 15 percent faster than my personal 2920X. Note that the 2920X reaches 62 degrees on 10 cores, and I don't like running it for, e.g., 20 hours continuously. Part of the difference comes from the 2666 MHz RAM (8x dual-rank) in the EPYC system, but I had to stay within the budget limit for each unit.

The main problem with the EPYC is that NPS0 strangely runs my project faster than NPS2 or NPS4. The situation gets worse because I am limited to CentOS 7 with its old kernel. We cannot change all this code in a short time, but the new researcher has the budget to buy a new workstation. Maybe I am wrong, but the EPYC does not seem to be an ideal CPU for Intel Fortran (2016) code, or for any software (or OS) that is not truly NUMA-aware. Please let me know your opinion.

March 1, 2020, 14:34  |  #4
flotus1 / Alex (Super Moderator | Join Date: Jun 2012 | Location: Germany | Posts: 3,400)
Quote:
after many optimizations of the code, I finally achieved 100 percent performance (as reported by the software)
I have no clue what that means, but it sounds dubious to say the least. The CFD software reports that it is running at 100% performance? How is that defined? What is the reference point?

Quote:
Note that the 2920X reaches 62 degrees on 10 cores, and I don't like running it for, e.g., 20 hours continuously.
I don't see the problem here. CPUs can run this hot, some even near 100°C. 62°C is nothing to worry about, and not a reason to avoid long periods of high load.

Quote:
The main problem with the EPYC is that NPS0 strangely runs my project faster than NPS2 or NPS4
Now that's something to work with.
According to the info I found on the interwebs about this software, it used to be parallelized via OpenMP. In later (11.2 and up) versions, they added hybrid MPI/OpenMP.
With OpenMP on systems with more than one NUMA node (like the TR 2920X, or the Epyc 7302P in NPS2 or NPS4 mode), core binding is one crucial piece of getting decent performance, along with the solver being written with NUMA in mind, which you have no control over.
And with hybrid MPI/OpenMP, core binding and placement become even more important. The software should have some documentation or hints on how to control these parameters. If they don't, I would get their support involved.

It's a total shot in the dark, because it might get overwritten internally by the software. But try using OMP_PROC_BIND=true and see if it does anything for performance. In the shell you start the software with, type
Code:
export OMP_PROC_BIND=true
then start the software or batch run.
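For reference, OMP_PROC_BIND and its companion OMP_PLACES are standard OpenMP environment variables (OpenMP 4.0 and later); whether Flow-3D's solver honours them is an assumption on my part. A minimal sketch of the full set:
Code:
export OMP_NUM_THREADS=16     # number of OpenMP threads
export OMP_PLACES=cores       # one place per physical core
export OMP_PROC_BIND=true     # forbid the runtime/OS from migrating threads between places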
Additionally, you should clear caches/buffers before starting a new run (root privileges required)
Code:
echo 3 > /proc/sys/vm/drop_caches
These are the most basic settings for a scaling analysis, which I still highly recommend.

Quote:
Maybe I am wrong but the EPYC is not ideal CPU for the codes of Intel FORTRAN(2016) or any software (or OS) which is not Numa aware literally
I am able to get decent results with Fortran OpenMP code on Epyc CPUs, both with gfortran and the Intel Fortran compiler, as long as I write code that follows standard practice for scaling across several NUMA nodes. This is my responsibility as a programmer; the programming language is not to blame. The same applies to other programming languages.

March 1, 2020, 23:39  |  #5
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
Thank you for your time and helpful comments.

Quote:
Originally Posted by flotus1 View Post
I have no clue what that means, but it sounds dubious to say the least. The CFD software reports that it is running at 100% performance? How is that defined? What is the reference point?
"%PE = The parallel efficiency of the solver. This is a measure of how effective the parallelization scheme is for a particular problem."
I don't know exactly how it works, but I can share some results that depend on both hardware and software differences. For example, with the 2970WX the reported number drops from 100 when using more than 10 cores, showing about 60 percent at 24 cores. From the software side, if the time step gradually falls below a certain value, the reported performance follows the same pattern and shows something like 70-80 percent. By the way, right now my EPYC shows between 98 and 100 percent on 16 cores (at 44°C).

Quote:
Originally Posted by flotus1 View Post
I don't see the problem here. CPUs can run this hot, some even near 100°C. 62°C is nothing to worry about, and not a reason to avoid long periods of high load.
As you may know, the maximum temperature of the 2000-series Threadripper CPUs is limited to 68 degrees (per AMD). At full load, AMD Ryzen Master shows 62 degrees under Windows. The same load under Linux reads around 90 degrees, because of the +27°C offset between the reported tCtl temperature and the actual Tj temperature.

Quote:
Originally Posted by flotus1 View Post
Now that's something to work with.
According to the info I found on the interwebs about this software, it used to be parallelized via OpenMP. In later (11.2 and up) versions, they added hybrid MPI/OpenMP..... It's a total shot in the dark, because it might get overwritten internally by the software. But try using OMP_PROC_BIND=true and see if it does anything for performance.
Very nice comments. I followed your suggestions carefully under different conditions (NPS0 to NPS4). Unfortunately, it did not make a difference. You have encouraged me to think more deeply about this issue.
The only relevant sentence I found in the manual is:
"The KMP_AFFINITY environment variable gives users access to Intel’s Thread Affinity Interface, which controls how OpenMP threads are bound to physical processors. Setting this variable to scatter can sometimes improve the performance of the solver."

I found the links below but need some time to check them:
https://software.intel.com/en-us/for...ead-allocation
https://www.nas.nasa.gov/hecc/suppor...nning_285.html
https://software.intel.com/en-us/cpp...ux-and-windows

March 2, 2020, 00:32  |  #6
flotus1 / Alex (Super Moderator | Join Date: Jun 2012 | Location: Germany | Posts: 3,400)
I almost forgot that OMP_PROC_BIND only works partially with more recent versions of Intel Fortran. You can do the same thing with KMP_AFFINITY: https://software.intel.com/en-us/for...NMENT_VARIABLE
To start with, I would use
Code:
export KMP_AFFINITY=verbose,scatter
For a scaling analysis, maybe replace "scatter" with "compact".

But don't forget about clearing caches. When the memory on a NUMA node is clogged up by cached memory, most of the other efforts are in vain.
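If you want to check for that, the numactl and numastat tools (if installed) show per-node memory usage, including how much is taken up by the page cache:
Code:
numactl --hardware   # lists each NUMA node with its total and currently free memory
numastat -m          # per-node memory breakdown, including FilePages (page cache)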

Didn't AMD get rid of that weird temperature offset a while ago through BIOS and microcode updates? While 90°C would in fact be a bit hot for a second-gen Threadripper, pointing towards a suboptimal cooling solution, you can still run at these temperatures for years.

March 2, 2020, 08:25  |  #7
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
I checked all of these settings under NPS0 through NPS4
(running echo 3 > /proc/sys/vm/drop_caches before each test):


export KMP_AFFINITY=verbose,scatter.........Flow 3D failed
export KMP_AFFINITY=verbose,compact...... Flow 3D failed
export KMP_AFFINITY=scatter ......ok
export KMP_AFFINITY=compact.......ok
export KMP_AFFINITY=disable.......ok


After 4 hours of testing, I did not see any benefit; NPS0 still gives the best performance regardless of the affinity setup.


P.S. Dear Alex, could you let me know how I can control the fan speeds on this Supermicro motherboard? The RPM continuously cycles from 500 to 1000 and back again, which is very annoying. Is there any solution? Thanks again.

Last edited by Habib-CFD; March 2, 2020 at 10:53.

March 2, 2020, 12:26  |  #8
flotus1 / Alex (Super Moderator | Join Date: Jun 2012 | Location: Germany | Posts: 3,400)
The fan issue stems from Supermicro setting the lower RPM threshold way too high for silent fans. You can change these threshold values through IPMI: https://calvin.me/quick-how-to-decre...fan-threshold/
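In case that link ever goes stale, the gist is changing the lower fan thresholds with ipmitool (assuming it is installed; the sensor name "FAN1" and the RPM values below are only examples and are board- and fan-specific):
Code:
ipmitool sensor list                            # find the exact fan sensor names and current thresholds
ipmitool sensor thresh FAN1 lower 100 200 300   # lower non-recoverable / critical / non-critical RPM thresholds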

We really need a scaling analysis to move forward. Use "compact" for affinity, and make sure to clear caches before each run. There should not be any message after invoking the clear-cache command, otherwise it probably didn't work. And check with htop during the runs to see which cores actually get loaded, and whether the threads still get switched around by the OS.
With "compact", a simulation on a single thread should run on core 0 (reported as core 1 in htop). A simulation with 2 threads should run on cores 0+1 and so on.
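If you want a second opinion besides htop, you can also list the solver's threads and the CPU each one is currently running on ("solver_name" is again just a placeholder for the actual process name):
Code:
ps -L -o tid,psr,pcpu,comm -p $(pgrep -n solver_name)   # one line per thread: thread id, current CPU, %CPU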

Plot the performance metric (iterations per second, run time or whatever) for 1, 2, 4, 6, 8, 12, 16 threads.
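A sketch of how such a sweep could be scripted, run as root because of drop_caches. "solver_launch_command" is only a placeholder for the actual Flow-3D batch invocation, and I am assuming the thread count can be set via OMP_NUM_THREADS; otherwise set it in the solver input instead.
Code:
#!/bin/bash
export KMP_AFFINITY=compact
for n in 1 2 4 6 8 12 16; do
    sync && echo 3 > /proc/sys/vm/drop_caches                    # clear the page cache before each run
    export OMP_NUM_THREADS=$n
    /usr/bin/time -o time_${n}cores.txt solver_launch_command    # GNU time writes the wall-clock time per core count to a file
done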

March 2, 2020, 23:32  |  #9
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
Thank you, I changed the fan thresholds successfully.

The core affinity was not changed by the software or the OS; it follows the compact mode in htop. However, I was shocked by the results.

Cores.........My model.......Default sample of software
1...................448s..................575s
2...................298s..................354s
4...................229s..................269s
8...................184s..................183s
12.................169s..................144s
16.................161s..................140s

Up to 12 cores, I need to optimize my own code, but above 12 cores it seems Flow-3D itself has some problems.

March 3, 2020, 01:00  |  #10
flotus1 / Alex (Super Moderator | Join Date: Jun 2012 | Location: Germany | Posts: 3,400)
I'd rather say Flow-3D has some issues right from the start. Here is what your results look like in terms of strong scaling:
[attachment: scaling.png]
While scaling is better with the default sample, it is still far from acceptable.
Since you used compact binding, we can also rule out that this is a problem with NUMA. Already going from 1 to 2 cores on the same NUMA node, scaling drops way below ideal.
I don't think buying any other CPU could help here. In general, when CPU cores are too slow, scaling gets better. But here scaling is the main issue.

So either there is a significant serial portion in the code (which is even larger with your custom case), or the model does not fit into memory. How much memory do these 2 cases consume?
As a side-note: we clearly see that the software reporting "100% parallel efficiency" is utter nonsense. Parallel efficiency is usually defined as actual speedup divided by ideal speedup. Which is somewhere below 30% in these 2 cases.
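As a quick check with the numbers from your table (default sample): speedup at 16 cores = 575 s / 140 s ≈ 4.1, so parallel efficiency ≈ 4.1 / 16 ≈ 26%.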
The only other thing that comes to mind: the solver itself might be running fine and with proper parallel efficiency, but your run time measurements include another component. Like file I/O, happening outside of the solver. Or run times for pre- and post-processing.

March 3, 2020, 01:43  |  #11
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
Nice analysis. I agree with you about the nature of the problem.
The total RAM usage is about 2 GB for both tests. I am using a 970 NVMe SSD in this system. Preprocessing takes around 2 s, and I have already subtracted it from the results. Unfortunately, I cannot check the scaling on my Intel machines right now, but I found some recorded data for the default sample run on a single thread of my i7-7700K, which shows a total run time of 630 s (SMT on, 2666 MHz, Windows 10).

March 3, 2020, 04:20  |  #12
flotus1 / Alex (Super Moderator | Join Date: Jun 2012 | Location: Germany | Posts: 3,400)
So the Intel CPU was slower even on a single core, despite higher clock speed.
Just to be 100% sure: using an NVMe SSD is no guarantee for avoiding I/O bottlenecks. You can monitor file-I/O on Linux using iotop (root privileges required). It will give you a summary of the total reads/writes to disk, as well as individual results for all processes.
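For example:
Code:
iotop -o -a   # -o: only show processes/threads that are actually doing I/O, -a: accumulated totals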

After that, I think we are out of ideas. So it would be time to present these findings to whoever does support for this software. Even if you are no longer a paying customer, these poor results should be enough incentive for them to take a closer look.

March 3, 2020, 04:44  |  #13
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
You convinced me. iotop shows only 0.04 percent I/O usage.
Thank you again for this helpful discussion.

March 3, 2020, 05:43  |  #14
flotus1 / Alex (Super Moderator | Join Date: Jun 2012 | Location: Germany | Posts: 3,400)
I didn't set out to convince you of anything. At least I like to think so myself.
There are cases where one CPU is clearly better than another, which holds true to this day for both Intel and AMD.
But I think we worked out together that this is not a problem with unsuitable hardware, but software instead.

March 11, 2020, 17:33  |  #15
Habib-CFD (Member | Join Date: Oct 2019 | Posts: 63)
Finally, I had time to test my 2920x performance with the default sample of Flow 3D.

The results show a close competition between clock frequency and IPC up to 8 cores, keeping the memory configuration difference in mind.

Default sample of Flow-3D, both on CentOS 7.7:

Cores........2920X (4x3200MHz)......7302P (8x2666MHz-DR)
1.......................483s.......................575s
2.......................328s.......................354s
4.......................235s.......................269s
8.......................173s.......................183s
12......................165s.......................144s
16......................n/a........................140s

Last edited by Habib-CFD; March 11, 2020 at 21:48.
