OpenFOAM benchmarks on various hardware

JBeilke · March 22, 2018, 10:08

You are using just ONE processor. So you have only half of the memory bandwith.

roenby · March 22, 2018, 10:14

Quote:

Originally Posted by JBeilke

You are using just ONE processor. So you have only half of the memory bandwith.

But when flotus1 is running on 16 of his 32 cores, I thought he was effectively using just one of his CPU's which in my understanding only communicates with the 8 RAM slots associated with that CPU. Did I misunderstand this?

flotus1 · March 22, 2018, 10:33

Running on 16 of 32 cores, mpirun with default settings spreads out the active cores as evenly as possible across all NUMA-nodes. I confirmed this by looking at which cores are actually doing any work using htop. So my results with 16 cores will definitely be better than with a single CPU running 16 cores. A better estimate for 16 cores on a single CPU would be my result on 32 cores multiplied by 2.
If you want to I can do a few runs pinning all threads to one CPU so you can compare your results.
Which linux kernel version are you running? If it is the default kernel version of Ubuntu 16.04, it might be too old to use the full potential of your CPU. You will have to use HWE kernel to get better results. Is SMT turned off already?

Simbelmynë · March 22, 2018, 10:42

Quote:

Originally Posted by flotus1

Which linux kernel version are you running? If it is the default kernel version of Ubuntu 16.04, it might be too old to use the full potential of your CPU. You will have to use HWE kernel to get better results. Is SMT turned off already?

I am not sure this is general Linux problem. I think it is a bug in the Palabos benchmark (I have not been able to confirm any dependence of the kernel in my OpenFOAM benchmarks).

wyldckat · March 22, 2018, 10:57

Quote:

Originally Posted by roenby

This is quite disappointing compared to flotus1's Epyc 7301 results reported above by (66 s on 16 cores compared to my 110 s).

Any educated guesses as to what might be the cause of this?

Quick note: flotus1 used 2x AMD Epyc 7301, which means that the 16 cores were being used with 8 cores on each socket, not 16 on a single socket. This means that all 16 memory channels were being used, while in your case you have 2 cores per memory channel being used.
The maximum clock speed differences will likely account for it not being strictly 2x faster on the 2x7301.

As for capacity per module: the more RAM there is, the higher the latency is expected, if I remember correctly, so the 8GB should be an itty-bitty-tiny-bit faster than 16GB

edit: I didn't notice that others had already answered

flotus1 · March 22, 2018, 13:14

Reran the benchmark with 16 cores pinned to a single CPU: 84.4s execution time.
Surprisingly close to my prediction of two times the result for 32 cores

roenby · March 22, 2018, 14:26

Quote:

Originally Posted by flotus1

Reran the benchmark with 16 cores pinned to a single CPU: 84.4s execution time.
Surprisingly close to my prediction of two times the result for 32 cores

I now disabled multithreading in the BIOS and reran the benchmark:

# cores Wall time (s):
------------------------
1 1008.95
2 582.33
4 273.67
6 174.61
8 126.35
12 123.35
16 85.05

So on 16 cores, I am now comfortable that things are OK (I reran it 20 times and all were in the range 83-88 s).

It is interesting to see that the other runs with idle cores available are not really affected by the multithreading. I guess these threads with other stuff that is apparently running alongside my simulation just find one of the idle cores to work on.

flotus1 · March 22, 2018, 15:36

roenbi could you post the output of lscpu please?

roenby · March 22, 2018, 15:41

Quote:

Originally Posted by flotus1

roenbi could you post the output of lscpu please?

Code:

roenby@aref:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC 7351 16-Core Processor
Stepping:              2
CPU MHz:               2400.000
CPU max MHz:           2400,0000
CPU min MHz:           1200,0000
BogoMIPS:              4799.73
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             64K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-3
NUMA node1 CPU(s):     4-7
NUMA node2 CPU(s):     8-11
NUMA node3 CPU(s):     12-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx cpb hw_pstate retpoline retpoline_amd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

giovanni.medici · March 23, 2018, 07:20

First of all I want to thank everyone for the help.
As suggested flotus1 our machine was scaling quite bad. I tried to switch off hyperthreading, and things got a little bit better. Moreover, I checked and the position of the 2 32 Gb RAM dimms, where ok (A1 B1).

We now installed 8 x 8 Gb DDR4, slightly faster (2400MHz) RAM dimms, so to populate slots (A1 A2 A3 A4, B1 B2 B3 B4). Therefore all 4 memory channels of each socket are now populated.

The results which I report here where obtained with 2x E5 2630 v3 2.4 GHz hyperThreading ON (i.e. 32 threads), and 32Gb allocated to the OracleVM :

Code:

# cores   Wall time (s):		
------------------------		
1	1032.88	1.00
2	577.14	1.79
4	328.22	3.15
6	262.23	3.94
8	258.98	3.99
12	247.23	4.18
16	236.92	4.36
24	281.56	3.67
30	342.56	3.02
32	391.9	2.64

Results improved considerably, anyway I think there is yet room for improvement. Beside performing other tests with HyperThreading OFF, I'm wondering how to increase performance more.

I'm running under Windows Server 2012R2, with OF_1712 (ESI distribution), therefore I'm taking advantage of OracleVM. The VM does not allow me to allocate all the RAM available (otherwise, I think, the OS could collapse), therefore I'm not quite sure every thread/core is using in the fastest way the RAM.

Does anybody of you have some suggestion on VM settings that could be beneficial (currently I'm running, beside the number of cores and RAM, with the default settings).
Do you think VM performance could benefit of an update of the Windows mpi? Does any of you know if there is a way to give to the VM priority over other processes (sort of the TaskManager option)?
I was thinking on the opportunity to install an Ubuntu distribution on another disk and try dual boot. Does anybody of you tried it or is it a nightmare (physical access to the machine is quite limited, therefore I would consider unfeasible an option relying on the BIOS to change boot order).
An easier way could be to install/rely on blueCFD, but I'm not quite sure of the compatibility with windows server 2012R2. Did anybody tried it?

Thank you !!!!

flotus1 · March 23, 2018, 08:57

I guess the memory allocation in Oracle VM is your problem: https://blogs.oracle.com/wim/underst...-oracle-vm-xen
This also explains why it seemed like memory was mis-configured with 2 DIMMs. Unfortunately, I have no idea how to improve this

Maybe by asking oracle support...

wyldckat · March 23, 2018, 19:35

(Somewhat) Quick answers:

Quote: