CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   Hardware (https://www.cfd-online.com/Forums/hardware/)
-   -   Issues with poor performance in faster CPU (https://www.cfd-online.com/Forums/hardware/207063-issues-poor-performance-faster-cpu.html)

gian93 September 18, 2018 07:46

Issues with poor performance in faster CPU
 
Hi to everyone!
I'm currently working with two types of machines for an OpenFOAM simulation for my thesis work.
I'm sorry about my poor background in hardware, but I cannot figure out why one machine, which apparently outperforms the other on paper, is nevertheless much slower.



Here are the CPU characteristics of the two machines:


First and faster machine:


processor : 27
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
stepping : 1
cpu MHz : 2593.881
cache size : 35840 KB
physical id : 1
siblings : 14
core id : 14
cpu cores : 14
apicid : 60
initial apicid : 60
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
bogomips : 5187.60
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual


And here is my second machine, which looks better on paper but shows much worse computational times (far slower than the previous one):




processor : 95
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
stepping : 4
microcode : 0x2000018
cpu MHz : 3399.996
cache size : 33792 KB
physical id : 1
siblings : 48
core id : 29
cpu cores : 24
apicid : 123
initial apicid : 123
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
bugs : cpu_meltdown spectre_v1 spectre_v2
bogomips : 5388.93
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual



Could anyone patiently explain how I can improve the computational time of the second, slower machine? Is it an issue related to the CPU architecture, or does it also depend on other parameters?
Thanks!

flotus1 September 19, 2018 05:55

The first thing that comes to mind is, as always, memory.
Xeon v4 has 4 memory channels per CPU, Skylake-SP (Xeon Platinum) has 6. For optimal performance, all memory channels have to be populated with identical amounts of memory.
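A quick way to check how the slots are actually populated, assuming root access and the dmidecode tool is available:

# List every memory slot with its size and location; ideally each
# channel shows one DIMM and all populated DIMMs are identical.
sudo dmidecode -t memory | grep -E "Size|Locator"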

Other ideas:
How many CPUs do these machines have? Not cores, but physical CPUs.
Apparently, SMT/Hyperthreading is deactivated on the first machine (its /proc/cpuinfo shows siblings equal to cpu cores, 14 and 14), while the second shows 48 siblings for 24 cores, i.e. SMT is on. You should deactivate it on the second machine as well.
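To check the SMT state without digging through /proc/cpuinfo, something like this should work:

# "Thread(s) per core: 2" means SMT/Hyperthreading is enabled
lscpu | grep "Thread(s) per core"
# On recent kernels SMT can be toggled at runtime; otherwise use the BIOS
echo off | sudo tee /sys/devices/system/cpu/smt/control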

gian93 September 19, 2018 15:11

Thanks!! On the second machine only two slots are occupied!!!
Maybe that is the problem!




ProLiant-DL380-Gen10:~/OpenFOAM/innovation-2.2.x/run/1500sim$ sudo dmidecode -t memory | grep Size
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: 32 GB
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: 32 GB




Why should I deactivate hyperthreading?

flotus1 September 19, 2018 17:53

Quote:

Originally Posted by gian93 (Post 706843)
Thanks!! On the second machine only two slots are occupied!!!
Maybe that is the problem!

Why should I deactivate hyperthreading?

Only 2 DIMMs is definitely part of the problem. Memory is the MVP for CFD, since the workload is usually bandwidth-limited, especially with high-core-count CPUs. Throwing more money at "faster" CPUs usually does not help. You would need 5 additional DIMMs per CPU (based on the current memory population, I guess there are two CPUs installed?) in order to fix this.
SMT is known to cause a performance penalty in many cases involving CFD computations. We have seen many examples of this behavior in this thread alone. That's why it is often turned off, so nobody has to fiddle around with affinity settings.
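If SMT has to stay on for some reason, pinning the MPI ranks to physical cores avoids most of the penalty. A minimal sketch, assuming Open MPI as the MPI backend and using simpleFoam as a stand-in for whatever solver the case runs (option names differ for other MPI implementations):

# Bind each rank to its own physical core so no two ranks
# end up on SMT siblings of the same core
mpirun --bind-to core --map-by core -np 48 simpleFoam -parallel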

gian93 October 12, 2018 14:29

Hi! Thanks for the reply!
I've followed your advice and populated all the DIMM slots with 32 GB modules.



The performance increased a lot. Then, on the same machine that had the Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz installed, I decided to run another test with this setup:
> same number of cells (2.5 x 10^6)
> same DIMMs as before
> CPU changed to this one:



processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
stepping : 4
microcode : 0x2000018
cpu MHz : 3699.875
cache size : 25344 KB
physical id : 1
siblings : 8
core id : 26
cpu cores : 8
apicid : 116
initial apicid : 116
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags :
bugs : cpu_meltdown spectre_v1 spectre_v2
bogomips : 6386.72
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:


The speed now seems better, but unfortunately I noticed that there are very few cores. Would you suggest changing to another type of CPU for further improvement? I really need to reduce the computational time as much as possible (at the moment only one node is available)...

flotus1 October 13, 2018 08:28

Quote:

I've followed your advice and populated all the DIMM slots with 32 GB modules.
6 DIMMs per CPU really would have been enough.

I find it a bit difficult to follow
Quote:

The performance increased a lot. Then, on the same machine that had the Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz installed, I decided to run another test with this setup:
> same number of cells (2.5 x 10^6)
> same DIMMs as before
> CPU changed to this one: Xeon Gold 6134
Same DIMMs as before means what exactly? 2 DIMMs or all slots populated?
How do you test? Same number of threads for both CPUs? Maximum number of threads available? When using 8 cores per CPU, the two models you compared should perform roughly the same give or take 10%.
Are you comparing this new CPU with SMT disabled against the old CPU with SMT enabled?
It would be helpful to have some actual numbers to compare the performance differences. It might help to distinguish between different kinds of errors in the setup.
Maybe I am missing something, but I still don't know if you are using single- or dual-CPU.

There is not really a faster CPU you could buy in Intel's lineup. The Xeon Platinum 8168 should not be significantly slower than any other CPU. Maybe you tested it with SMT on? Or maybe your test case shows negative scaling for a very high number of cores? If that is the case, you can simply reduce the number of cores your simulation runs on and distribute them evenly across both CPUs. This should be the default behavior anyway.
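With Open MPI, distributing a reduced rank count evenly over both sockets could look like the following sketch (16 ranks is just an example, and simpleFoam again stands in for the actual solver):

# Place 8 ranks on each socket and bind them to physical cores
mpirun --map-by ppr:8:socket --bind-to core -np 16 simpleFoam -parallel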

gian93 October 15, 2018 10:48

Hi, thanks for your reply!

I have 12 DIMMs for 2 CPUs (there are 12 x 2 = 24 slots; I have occupied one of the two slots available per channel, with 32 GB per slot).

I've swapped the CPU: the Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz came out and the Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz went in. With this configuration I can only use 8 x 2 cores (the Platinum instead had 48), and I simply compare the time to complete a simulation case using the maximum number of cores available in both tests.

In both cases we tested dual-CPU.
Hyperthreading is disabled.

How can I verify my scaling?

flotus1 October 15, 2018 11:01

Quote:

How can I verify my scaling?
Run the same case with an increasing number of threads. 1, 2, 4, 8, 16 and so on. Perfect scaling would mean that the simulation time is proportional to 1/number of cores. You won't get that beyond 12-16 cores. With very high core counts, some simulations can even take longer than with lower core counts.
With 16 threads on dual Xeon 8168 you should get about the same performance as with dual Xeon 6134. Otherwise you will have to dig into stuff like thread pinning and sub-NUMA clustering (formerly cluster on die)...
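A minimal sketch of such a scaling test for an OpenFOAM case. It assumes the foamDictionary utility (newer OpenFOAM versions; on older ones, edit system/decomposeParDict by hand or with sed) and uses simpleFoam as a stand-in for the actual solver:

#!/bin/bash
# Run the single-core baseline serially first: simpleFoam > log.1
# Then run the same case at increasing core counts and time each run;
# with perfect scaling the runtime halves at every doubling.
for n in 2 4 8 16 32; do
    foamDictionary -entry numberOfSubdomains -set $n system/decomposeParDict
    decomposePar -force > /dev/null
    start=$(date +%s)
    mpirun -np $n simpleFoam -parallel > log.$n
    end=$(date +%s)
    echo "$n cores: $((end - start)) s"
done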

gian93 October 29, 2018 06:04

Hi! Thanks for your advice. I've run some tests, and the best number of cores per simulation is in fact 16-18.
Anyway, I noticed the following.

When I run a single simulation on a single machine (whatever the simulation, whatever the hardware), using for example 16 cores out of 48, the speed (visible even by eye from the terminal tail of the log) is much higher than when I run two simulations in parallel on the same machine (obviously, when I do this I am careful not to exceed the cores available on my node; for example, if 48 cores are available, I typically use 16 + 16 cores for the two simulations).
If possible, how can this problem be fixed?

Thanks!!

flotus1 October 29, 2018 13:34

This is usually not a problem that can be fixed, unless you are actually running out of memory with two simulations running simultaneously.
The reason for the slowdown is, again, the memory bandwidth limitation. An over-simplified example: let's say the machine you are using has a peak memory bandwidth of 100 GB/s. Running one simulation on 16 cores uses 80 GB/s of memory bandwidth. Adding a second simulation that would also require 80 GB/s when running on 16 cores will obviously max out the peak memory bandwidth of the machine, and both simulations will run slower than a single simulation would.
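To put an actual number on the machine's bandwidth, the STREAM benchmark is the usual tool. A rough sketch of building and running it (the array size is an assumption, chosen large enough to overflow the CPU caches):

# Download and build STREAM with OpenMP support
wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream
# Run with one thread per physical core; the Triad figure approximates
# the bandwidth sustained under load
OMP_NUM_THREADS=16 ./stream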

