CFD Online forums (www.cfd-online.com) - Hardware

OpenFOAM benchmarks on various hardware

#41 | March 22, 2018, 11:08
Joern Beilke (Senior Member, Dresden)
You are using just ONE processor, so you only have half of the memory bandwidth.

#42 | March 22, 2018, 11:14
Johan Roenby (Member, Denmark)
Quote:
Originally Posted by JBeilke View Post
You are using just ONE processor, so you only have half of the memory bandwidth.
But when flotus1 runs on 16 of his 32 cores, I thought he was effectively using just one of his CPUs, which in my understanding only communicates with the 8 RAM slots associated with that CPU. Did I misunderstand this?

#43 | March 22, 2018, 11:33
flotus1 (Alex, Super Moderator, Germany)
Running on 16 of 32 cores, mpirun with default settings spreads the active cores as evenly as possible across all NUMA nodes. I confirmed this by checking in htop which cores are actually doing work. So my results with 16 cores will definitely be better than 16 cores running on a single CPU. A better estimate for 16 cores on a single CPU would be my 32-core result multiplied by 2.
If you want, I can do a few runs pinning all threads to one CPU so you can compare against your results.
Which Linux kernel version are you running? If it is the default kernel of Ubuntu 16.04, it might be too old to use the full potential of your CPU; you would need the HWE kernel to get better results. Is SMT turned off already?
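For anyone wanting to reproduce the pinned runs: with Open MPI, rank placement can be controlled from the mpirun command line, or via numactl for whole-socket pinning. A sketch under stated assumptions: Open MPI's mapping flags, a case already decomposed for 16 ranks, and `simpleFoam` standing in for whatever solver the benchmark uses.

```shell
# Spread 16 ranks evenly, 8 per socket, and print the bindings to verify:
mpirun -np 16 --map-by ppr:8:socket --bind-to core --report-bindings simpleFoam -parallel

# Pin all 16 ranks and their memory to the first socket instead.
# On a dual EPYC, socket 0 spans NUMA nodes 0-3:
mpirun -np 16 numactl --cpunodebind=0-3 --membind=0-3 simpleFoam -parallel
```

`--report-bindings` prints one line per rank showing the cores it was bound to, which is an easier check than watching htop.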

#44 | March 22, 2018, 11:42
Simbelmynė (Senior Member)
Quote:
Originally Posted by flotus1 View Post
Which Linux kernel version are you running? If it is the default kernel of Ubuntu 16.04, it might be too old to use the full potential of your CPU; you would need the HWE kernel to get better results. Is SMT turned off already?
I am not sure this is a general Linux problem. I think it is a bug in the Palabos benchmark (I have not been able to confirm any kernel dependence in my OpenFOAM benchmarks).

#45 | March 22, 2018, 11:57
wyldckat (Bruno Santos, Retired Super Moderator, Lisbon, Portugal)
Quote:
Originally Posted by roenby View Post
This is quite disappointing compared to flotus1's Epyc 7301 results reported above (66 s on 16 cores compared to my 110 s).

Any educated guesses as to what might be the cause of this?
Quick note: flotus1 used 2x AMD Epyc 7301, which means the 16 cores ran as 8 cores on each socket, not 16 on a single socket. All 16 memory channels were therefore in use, while in your case 2 cores share each memory channel.
The difference in maximum clock speed likely accounts for it not being strictly 2x faster on the 2x 7301.


As for capacity per module: the more RAM there is per module, the higher the expected latency, if I remember correctly, so the 8 GB modules should be an itty-bitty-tiny-bit faster than 16 GB ones.

edit: I didn't notice that others had already answered

#46 | March 22, 2018, 14:14
flotus1 (Alex, Super Moderator, Germany)
Reran the benchmark with 16 cores pinned to a single CPU: 84.4 s execution time.
Surprisingly close to my prediction of two times the 32-core result.

#47 | March 22, 2018, 15:26
Johan Roenby (Member, Denmark)
Quote:
Originally Posted by flotus1 View Post
Reran the benchmark with 16 cores pinned to a single CPU: 84.4 s execution time.
Surprisingly close to my prediction of two times the 32-core result.
I have now disabled multithreading in the BIOS and reran the benchmark:

# cores   Wall time (s)
-----------------------
      1        1008.95
      2         582.33
      4         273.67
      6         174.61
      8         126.35
     12         123.35
     16          85.05

So on 16 cores, I am now confident that things are OK (I reran the benchmark 20 times and all runs were in the range 83-88 s).

It is interesting that the runs with idle cores available are not really affected by multithreading. I guess whatever else is running alongside my simulation just finds one of the idle cores to work on.
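As a quick sanity check, the wall times above translate into the following speedup and parallel efficiency figures (a throwaway awk calculation on the numbers copied from the table, not part of the benchmark itself):

```shell
# Speedup = t(1 core) / t(n cores); efficiency = speedup / n
awk 'BEGIN {
    t1 = 1008.95                                # single-core wall time
    split("2 4 6 8 12 16", n, " ")
    split("582.33 273.67 174.61 126.35 123.35 85.05", t, " ")
    for (i = 1; i <= 6; i++)
        printf "%2d cores: speedup %5.2f, efficiency %5.1f%%\n",
               n[i], t1 / t[i], 100 * t1 / (t[i] * n[i])
}'
```

The 16-core run comes out at roughly 11.9x speedup (about 74% efficiency), which looks healthy for a memory-bandwidth-bound solver on a single socket with 8 channels.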

#48 | March 22, 2018, 16:36
flotus1 (Alex, Super Moderator, Germany)
roenby, could you post the output of lscpu please?

#49 | March 22, 2018, 16:41
Johan Roenby (Member, Denmark)
Quote:
Originally Posted by flotus1 View Post
roenby, could you post the output of lscpu please?
Code:
roenby@aref:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC 7351 16-Core Processor
Stepping:              2
CPU MHz:               2400.000
CPU max MHz:           2400,0000
CPU min MHz:           1200,0000
BogoMIPS:              4799.73
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             64K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-3
NUMA node1 CPU(s):     4-7
NUMA node2 CPU(s):     8-11
NUMA node3 CPU(s):     12-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx cpb hw_pstate retpoline retpoline_amd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
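The four NUMA nodes in that output reflect the EPYC package layout: four dies per socket, each with its own memory channels, which is exactly why spreading vs. packing ranks matters so much on this CPU. A small sketch for extracting per-node core counts from lscpu-style text (the here-doc just replays the lines above; on a live system pipe `lscpu` in instead):

```shell
# Count how many cores each NUMA node owns from "NUMA nodeN CPU(s): a-b" lines
awk -F: '/^NUMA node[0-9]+ CPU/ {
    split($2, r, "-")                     # "0-3" -> r[1]=0, r[2]=3
    printf "%s: %d cores\n", $1, r[2] - r[1] + 1
}' <<'EOF'
NUMA node0 CPU(s):     0-3
NUMA node1 CPU(s):     4-7
NUMA node2 CPU(s):     8-11
NUMA node3 CPU(s):     12-15
EOF
```

(Note that lscpu can also print comma-separated CPU lists when SMT is on, which this simple range parser does not handle.)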

#50 | March 23, 2018, 08:20
giovanni.medici (Giovanni Medici, Member)
First of all, I want to thank everyone for the help.
As flotus1 suggested, our machine was scaling quite badly. I tried switching off Hyper-Threading and things got a little better. I also checked the position of the two 32 GB RAM DIMMs, which was OK (A1, B1).

We have now installed 8x 8 GB DDR4 DIMMs, slightly faster (2400 MHz), populating slots A1 A2 A3 A4 and B1 B2 B3 B4. All 4 memory channels of each socket are therefore populated.

The results reported here were obtained with 2x E5-2630 v3 2.4 GHz, Hyper-Threading ON (i.e. 32 threads), and 32 GB allocated to the Oracle VM:

Code:
# cores   Wall time (s)   Speedup
---------------------------------
      1        1032.88      1.00
      2         577.14      1.79
      4         328.22      3.15
      6         262.23      3.94
      8         258.98      3.99
     12         247.23      4.18
     16         236.92      4.36
     24         281.56      3.67
     30         342.56      3.02
     32         391.90      2.64
Results improved considerably, but I think there is still room for improvement. Besides running further tests with Hyper-Threading OFF, I am wondering how to increase performance more.

I am running under Windows Server 2012 R2 with OpenFOAM v1712 (ESI distribution), so I rely on Oracle VM. The VM does not allow me to allocate all the available RAM (otherwise, I think, the host OS could collapse), so I am not quite sure every thread/core is accessing the RAM in the fastest way.
  1. Does anybody have suggestions for VM settings that could be beneficial? (Currently I am running with the default settings, apart from the number of cores and the RAM.)
  2. Do you think VM performance could benefit from an update of the Windows MPI? Does anybody know a way to give the VM priority over other processes (like the Task Manager option)?
  3. I was considering installing an Ubuntu distribution on another disk and trying dual boot. Has anybody tried it, or is it a nightmare? (Physical access to the machine is quite limited, so I would consider any option relying on the BIOS to change the boot order unfeasible.)
  4. An easier option could be blueCFD, but I am not quite sure about its compatibility with Windows Server 2012 R2. Has anybody tried it?

Thank you!

#51 | March 23, 2018, 09:57
flotus1 (Alex, Super Moderator, Germany)
I guess the memory allocation in Oracle VM is your problem: https://blogs.oracle.com/wim/underst...-oracle-vm-xen
This also explains why it seemed like the memory was misconfigured with 2 DIMMs. Unfortunately, I have no idea how to improve this. Maybe by asking Oracle support...

#52 | March 23, 2018, 20:35
wyldckat (Bruno Santos, Retired Super Moderator, Lisbon, Portugal)
(Somewhat) Quick answers:
Quote:
Originally Posted by giovanni.medici View Post
1. Does anybody have suggestions for VM settings that could be beneficial? (Currently I am running with the default settings, apart from the number of cores and the RAM.)
I have no clue about Oracle VM settings, but in VirtualBox you can state that you want 100% performance, i.e. as much CPU resources as available. It will still fight Windows for CPU resources, though.

Quote:
Originally Posted by giovanni.medici View Post
2. Do you think VM performance could benefit from an update of the Windows MPI?
Nope. If you can't link directly to MS-MPI from Linux, it's pointless to upgrade it. This would only make sense if you were building OpenFOAM from source directly on Windows.

Quote:
Originally Posted by giovanni.medici View Post
2.a. Does anybody know a way to give the VM priority over other processes (like the Task Manager option)?
My best guess would be to switch to Windows' Hyper-V virtualization thingamabob...
That, and/or use a Linux kernel compiled to connect to the Oracle VM on the host, so that it gains additional direct-metal-access capabilities. I have never used it myself, but I guess Oracle VM has something like that.


Quote:
Originally Posted by giovanni.medici View Post
3. I was considering installing an Ubuntu distribution on another disk and trying dual boot. Has anybody tried it, or is it a nightmare? (Physical access to the machine is quite limited, so I would consider any option relying on the BIOS to change the boot order unfeasible.)
If your machine has remote-control capabilities at the motherboard level (I believe it's called IPMI or BMC, although I'm not 100% certain of the nomenclature), that would give you full control remotely, even while the machine is booting. You would log in directly to the motherboard's remote-control interface, probably through a web browser.


Quote:
Originally Posted by giovanni.medici View Post
4. An easier option could be blueCFD, but I am not quite sure about its compatibility with Windows Server 2012 R2. Has anybody tried it?
Disclaimer: I'm the (main) developer of blueCFD-Core.
Right now, blueCFD-Core is mostly just a convenient replacement for other virtualization strategies, so that you don't need to leave Windows to use OpenFOAM. Performance-wise, it's not great.


But if you want to take full advantage of your hardware, you should install a Linux distribution natively, or at least use an extremely efficient virtualization software. Also build OpenFOAM from source, with dedicated flags for your CPU model; don't use pre-built packages.

#53 | March 24, 2018, 04:38
giovanni.medici (Giovanni Medici, Member)
Quote:
Originally Posted by wyldckat View Post
[full reply quoted; see post #52 above]

Thanks wyldckat for the fast and comprehensive answer. I will definitely investigate the IPMI capabilities of our motherboard (a Dell PowerEdge R430).
blueCFD looks to be a really interesting option for users who cannot, for whatever reason, switch completely to Linux.

#54 | March 26, 2018, 10:27
Mr.Turbulence (New Member)
Hi everyone,

I tried to run the OF benchmark on 2x AMD EPYC 7351, 16x 8 GB DDR4 2666 MHz, with OpenFOAM 5.0 on Ubuntu 16.04.

I ran the calculation binding the processes to core, to socket, and with no binding, on 16 and 32 cores.

The results are below:

Code:
# cores   core (s)   socket (s)   none (s)
------------------------------------------
      1         -           -     922
     16    153.34       55.70      65.78
     32     70.80       38.68      38.80
I don't understand why the calculations take so much longer when binding the processes to core, compared to the other results.

Do you think it could come from the fact that Hyper-Threading is on?

Thanks in advance

#55 | March 26, 2018, 10:33
flotus1 (Alex, Super Moderator, Germany)
I don't think it has to do with SMT. I had it turned off and tried a few binding options, but ended up with the same poor performance you observed. I still have no clue what causes it.

May I ask which exact memory type you are using?

#56 | March 26, 2018, 10:42
Mr.Turbulence (New Member)
Here is my config:

Code:
Handle 0x0053, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x001A
Error Information Handle: 0x0052
Total Width: 72 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: None
Locator: P2-DIMMH1
Bank Locator: P1_Node0_Channel7_Dimm0
Type: DDR4
Type Detail: Synchronous Registered (Buffered)
Speed: 2667 MHz
Manufacturer: Samsung
Serial Number: 030C18C6
Asset Tag: P2-DIMMH1_AssetTag (date:17/05)
Part Number: M393A1G40EB2-CTD
Rank: 1
Configured Clock Speed: 2667 MHz
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V

Which binding option do you usually use?

#57 | March 26, 2018, 10:49
flotus1 (Alex, Super Moderator, Germany)
For this benchmark I ended up using no binding option at all, which gave the best overall results. I don't use OpenFOAM for my work; a simple bind-to-core, to avoid messing up caches and memory access, is usually enough for the solvers I use.

#58 | March 26, 2018, 13:00
Mr.Turbulence (New Member)
Thanks a lot. With multithreading turned off I get the following results:
Code:
# cores   core (s)   socket (s)   none (s)
------------------------------------------
     16     81.23       60.52      61.91
     32     37.37       36.94      39.67
It is better, but I still get poor performance at 16 cores when binding the processes to core, which is strange. If you find any clue, I am really interested.

#59 | March 26, 2018, 17:13
wyldckat (Bruno Santos, Retired Super Moderator, Lisbon, Portugal)
Greetings to all!

Quote:
Originally Posted by Mr.Turbulence View Post
It is better, but I still get poor performance at 16 cores when binding the processes to core, which is strange. If you find any clue, I am really interested.
For the latest results, this is fairly simple to explain: binding by core likely assigns ranks in logical core order, which usually fills the first socket completely before moving to the second. This means that in the 16-process run, those 16 processes were fighting for the 8 memory channels of the first socket.

When binding per socket, the ranks were likely distributed in a balanced way, namely 8 on each socket.

This is clearer when compared with the 32-core results, which are nearly identical across modes.

Side note: if you are trying to pinpoint which mode is best, I strongly suggest doing several runs in each mode, because most of the results seem to be within a statistical margin of error, i.e. the latest 32-core results look mostly identical regardless of the assignment mode.

Best regards,
Bruno
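Bruno's explanation can be illustrated with a toy mapping: 16 ranks on a machine with 2 sockets of 16 cores each, comparing fill-the-first-socket ordering (what bind-by-core apparently produced) with a per-socket round-robin. This is just an illustration of the two orderings, not an actual mpirun trace:

```shell
awk 'BEGIN {
    ranks = 16; cores_per_socket = 16; sockets = 2
    for (r = 0; r < ranks; r++)
        printf "rank %2d -> packed: socket %d | balanced: socket %d\n",
               r, int(r / cores_per_socket), r % sockets
}'
```

In the packed ordering all 16 ranks land on socket 0 and contend for its 8 memory channels; the balanced ordering gives each socket 8 ranks and the full 16 channels.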

#60 | March 26, 2018, 18:49
flotus1 (Alex, Super Moderator, Germany)
What I tried, among other options, was explicitly binding threads to certain cores, making sure the distribution was optimal, at least in theory. The same method worked for other solvers, yet I still ended up with low performance for most thread counts in OpenFOAM.
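For the record, explicit per-core placement with Open MPI can be done with a rankfile; a fragment as I remember the syntax (the hostname `node1`, the file name, and the socket:core pairs are placeholders, and `simpleFoam` assumes the benchmark's solver):

```shell
# myrankfile -- pin rank N to socket S, core C via "slot=S:C",
# alternating sockets so both sets of memory channels stay busy:
#   rank 0=node1 slot=0:0
#   rank 1=node1 slot=1:0
#   rank 2=node1 slot=0:1
#   rank 3=node1 slot=1:1
#   ...

mpirun -np 16 --rankfile myrankfile simpleFoam -parallel
```

Adding `--report-bindings` confirms whether the file was honored.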
