Issues with poor performance in faster CPU

gian93 · September 18, 2018, 07:46

Hi everyone!
I'm currently working on two types of machine for an OpenFOAM simulation for my thesis work.
I'm sorry about my poor knowledge of hardware, but I cannot figure out why one machine, which on paper should perform better than the other, is actually much slower.

Here are the CPU characteristics of the two:

First and faster machine:

processor : 27
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
stepping : 1
cpu MHz : 2593.881
cache size : 35840 KB
physical id : 1
siblings : 14
core id : 14
cpu cores : 14
apicid : 60
initial apicid : 60
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
bogomips : 5187.60
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual

And then my second machine, which is apparently better but shows very bad performance in computational time (far slower than the previous one):

processor : 95
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
stepping : 4
microcode : 0x2000018
cpu MHz : 3399.996
cache size : 33792 KB
physical id : 1
siblings : 48
core id : 29
cpu cores : 24
apicid : 123
initial apicid : 123
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
bugs : cpu_meltdown spectre_v1 spectre_v2
bogomips : 5388.93
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual

Could anyone be patient enough to explain how I can improve the computational time of the second, slower one? Is it an issue related to the CPU architecture, or does it also depend on other parameters?
Thanks

flotus1 · September 19, 2018, 05:55

The first thing that would come to mind is -as always- memory.
Xeon V4 has 4 memory channels, Skylake-SP (Xeon Platinum) has 6 memory channels. For optimal performance, all memory channels have to be populated with identical amounts of memory.

Other ideas:
How many CPUs do these machines have? Not cores, but physical CPUs.
Apparently, SMT/Hyperthreading is deactivated on the first machine. You should do the same on the second machine.
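
The SMT point can be read straight off the /proc/cpuinfo dumps above: when `siblings` is larger than `cpu cores`, each physical core exposes more than one logical CPU. A minimal Python sketch, assuming the field layout shown in the first post; the function names `parse_cpuinfo_block` and `smt_enabled` are only illustrative:

```python
def parse_cpuinfo_block(text):
    """Turn one 'key : value' block from /proc/cpuinfo into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def smt_enabled(fields):
    """SMT is on when a package reports more logical CPUs than cores."""
    return int(fields["siblings"]) > int(fields["cpu cores"])


# Values copied from the two machines in the first post.
first_machine = parse_cpuinfo_block("""
model name : Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
siblings : 14
cpu cores : 14
""")
second_machine = parse_cpuinfo_block("""
model name : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
siblings : 48
cpu cores : 24
""")

for machine in (first_machine, second_machine):
    print(machine["model name"], "-> SMT enabled:", smt_enabled(machine))
```

On a live system the same check could be fed from the contents of /proc/cpuinfo, one processor block at a time.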

gian93 · September 19, 2018, 15:11

Thanks!! On the second machine I have only two slots occupied!!!
Maybe that is the problem!

ProLiant-DL380-Gen10:~/OpenFOAM/innovation-2.2.x/run/1500sim$ sudo dmidecode -t memory | grep Size
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: 32 GB
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: 32 GB

Why should I deactivate hyperthreading?
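
For machines with many slots, the `dmidecode -t memory | grep Size` listing above can be tallied automatically instead of counted by eye. A rough Python sketch, assuming the output format shown in this post ("Size: 32 GB" or "Size: No Module Installed"); handling of "MB" values is an extra assumption for dmidecode versions that report sizes that way, and `summarize_dimms` is an illustrative name:

```python
import re


def summarize_dimms(dmidecode_output):
    """Return (sizes of populated DIMMs in GB, number of empty slots)."""
    populated_gb = []
    empty = 0
    for line in dmidecode_output.splitlines():
        line = line.strip()
        if not line.startswith("Size:"):
            continue  # ignore anything that is not a Size line
        match = re.match(r"Size:\s*(\d+)\s*(MB|GB)", line)
        if match:
            value, unit = int(match.group(1)), match.group(2)
            populated_gb.append(value / 1024 if unit == "MB" else value)
        else:
            empty += 1  # e.g. "Size: No Module Installed"
    return populated_gb, empty


# Same pattern as the listing above: 11 empty, one 32 GB, 11 empty, one 32 GB.
sample = "\n".join(
    ["Size: No Module Installed"] * 11 + ["Size: 32 GB"]
    + ["Size: No Module Installed"] * 11 + ["Size: 32 GB"]
)

populated, empty = summarize_dimms(sample)
total_slots = len(populated) + empty
print(f"{len(populated)} of {total_slots} slots populated, {sum(populated):g} GB total")
```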

flotus1 · September 19, 2018, 17:53

Quote:

Originally Posted by gian93

Thanks!! On the second machine I have only two slots occupied!!!
Maybe that is the problem!

Why should I deactivate hyperthreading?

Only 2 DIMMs is definitely part of the problem. Memory is the MVP for CFD since it is usually bandwidth limited, especially with high core count CPUs. Throwing more money at "faster" CPUs usually does not help. You would need 5 additional DIMMs per CPU (based on the current memory population, I guess there are two CPUs installed?) in order to fix this.
SMT is known to cause a performance penalty in many cases involving CFD computations. We have seen many examples of this behavior in this thread alone. That's why it is often turned off, so nobody has to fiddle around with affinity settings.
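
To put rough numbers on the channel argument: theoretical peak bandwidth per CPU is populated channels times transfer rate times 8 bytes per transfer (64-bit channel). The sketch below is only a back-of-the-envelope estimate; the DDR4-2400 and DDR4-2666 speeds are assumptions (the highest speeds these CPU generations are specified for, not values reported in this thread), and "1 channel populated" follows the guess above that the two DIMMs are split across two CPUs.

```python
def peak_bandwidth_gbs(populated_channels, transfer_rate_mts, bus_bytes=8):
    """Theoretical peak in GB/s: channels x transfers per second x bytes per transfer."""
    return populated_channels * transfer_rate_mts * 1e6 * bus_bytes / 1e9


# (populated channels, assumed transfer rate in MT/s) per CPU
configs = {
    "Xeon E5-2690 v4, all 4 channels (DDR4-2400 assumed)": (4, 2400),
    "Xeon Platinum 8168, all 6 channels (DDR4-2666 assumed)": (6, 2666),
    "Xeon Platinum 8168, 1 channel only (DDR4-2666 assumed)": (1, 2666),
}

for label, (channels, rate) in configs.items():
    print(f"{label}: {peak_bandwidth_gbs(channels, rate):.1f} GB/s per CPU")
```

Sustained bandwidth in practice is lower than these theoretical figures, but the ratio between one populated channel and six is what matters here.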

gian93 · October 12, 2018, 14:29

Hi! Thanks for the reply!
I've followed your advice and populated all the DIMMs with 32 GB modules.

The performance increased a lot, but on the same machine that had the Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz installed, I decided to run another test with this setup:
>same number of cells (2.5 * 10^6)

>same DIMMs as before

>change the CPU to this one:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
stepping : 4
microcode : 0x2000018
cpu MHz : 3699.875
cache size : 25344 KB
physical id : 1
siblings : 8
core id : 26
cpu cores : 8
apicid : 116
initial apicid : 116
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags :
bugs : cpu_meltdown spectre_v1 spectre_v2
bogomips : 6386.72
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

Speed now seems better, but unfortunately I noticed that there are very few cores. Would you suggest changing to another type of CPU for further improvement? I really need to reduce computational time as much as possible (at the moment only one node is available)...

flotus1 · October 13, 2018, 08:28

Quote:

I've followed your advice and populated all the DIMMs with 32 GB modules.

6 DIMMs per CPU really would have been enough.

I find it a bit difficult to follow

Quote:

The performance increased a lot, but on the same machine that had the Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz installed, I decided to run another test with this setup:
>same number of cells (2.5 * 10^6)

>same DIMMs as before
>change the CPU to this one: Xeon Gold 6134

Same DIMMs as before means what exactly? 2 DIMMs or all slots populated?
How do you test? Same number of threads for both CPUs? Maximum number of threads available? When using 8 cores per CPU, the two models you compared should perform roughly the same give or take 10%.
Are you comparing this new CPU with SMT disabled against the old CPU with SMT enabled?
It would be helpful to have some actual numbers to compare the performance differences. It might help to distinguish between different kinds of errors in the setup.
Maybe I am missing something, but I still don't know if you are using single- or dual-CPU.

There is not really a faster CPU you could buy in Intel's lineup. The Xeon Platinum 8168 should not be significantly slower than any other CPU. Maybe you tested it with SMT on? Or maybe your test case shows negative scaling for a very high number of cores? If that is the case, you can simply reduce the number of cores your simulation runs on and distribute them evenly across both? CPUs. This should be the default behavior anyway.

gian93 · October 15, 2018, 10:48

Hi, thanks for your reply.

I have 12 DIMMs for 2 CPUs (there are 12 x 2 = 24 slots; I have occupied one of the two available slots per channel, with 32 GB per slot).

I've removed the CPU (Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz) and mounted the new one (Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz). With this configuration I can only use 8 x 2 processors (the Intel Platinum instead had 48 processors), and I simply compare the time to complete a simulation case with the maximum number of processors available in both tests.

In both cases we tested dual-CPU.
Hyperthreading is disabled.

How can I verify my scaling?

flotus1 · October 15, 2018, 11:01

Quote:

How can I verify my scaling?

Run the same case with an increasing number of threads. 1, 2, 4, 8, 16 and so on. Perfect scaling would mean that the simulation time is proportional to 1/number of cores. You won't get that beyond 12-16 cores. With very high core counts, some simulations can even take longer than with lower core counts.
With 16 threads on dual Xeon 8168 you should get about the same performance as with dual Xeon 6134. Otherwise you will have to dig into stuff like thread pinning and sub-NUMA clustering (formerly cluster on die)...
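
The scaling test described above boils down to two numbers per run: speedup relative to the smallest run, and parallel efficiency (speedup divided by the ideal speedup). A small Python sketch of that bookkeeping; the wall-clock times are invented placeholders, not measurements from this thread, and `scaling_table` is an illustrative name:

```python
def scaling_table(wall_times):
    """wall_times maps thread count -> wall-clock seconds for the same case.

    Returns (threads, speedup, efficiency) rows relative to the smallest run.
    Perfect scaling gives efficiency 1.0 in every row.
    """
    base_threads = min(wall_times)
    base_time = wall_times[base_threads]
    rows = []
    for threads in sorted(wall_times):
        speedup = base_time / wall_times[threads]
        efficiency = speedup * base_threads / threads
        rows.append((threads, speedup, efficiency))
    return rows


# PLACEHOLDER timings in seconds, made up purely to show the calculation.
example_times = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 160.0, 16: 110.0, 32: 120.0}

print(f"{'threads':>7} {'speedup':>8} {'efficiency':>10}")
for threads, speedup, efficiency in scaling_table(example_times):
    print(f"{threads:>7} {speedup:>8.2f} {efficiency:>10.2f}")
```

The thread count where efficiency drops off sharply, or where speedup starts to fall, is the point beyond which adding cores to a single run stops paying off.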

gian93 · October 29, 2018, 06:04

Hi! Thanks for your advice. I've run some tests, and the best number of cores per simulation is in fact 16-18.
Anyway, I noticed something.

When I run a single simulation on a single machine (whatever the simulation is, whatever the hardware is) using, for example, 16 processors out of 48, the speed (which I can see by eye when tailing the log in the terminal) is much higher than when I run two simulations in parallel on the same machine (obviously when I do this I'm careful not to exceed the cores available on my node; for example, if 48 cores are available, I usually use 16 + 16 cores for the two simulations).
If possible, how can this problem be fixed?

Thanks!!

flotus1 · October 29, 2018, 13:34

This is usually not a problem that can be fixed, unless you run out of memory with 2 simulations running simultaneously.
The reason for the slowdown is -again- memory bandwidth limitation. An over-simplified example: Let's say the machine you are using has a peak memory bandwidth of 100GB/s. Running one simulation on 16 cores uses 80GB/s of memory bandwidth. Adding a second simulation that would also require 80GB/s of memory bandwidth when running on 16 cores will obviously max out the peak memory bandwidth of the machine, and both simulations will run slower than a single simulation.
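
The over-simplified example above can be turned into a few lines of Python. This is a toy model only: it assumes the available bandwidth is shared in proportion to demand and that run time scales inversely with the bandwidth a job actually receives. Neither assumption comes from a measurement, and `slowdown_factors` is an illustrative name.

```python
def slowdown_factors(peak_gbs, demands_gbs):
    """Estimate how much slower each job runs when sharing memory bandwidth.

    Toy model: if total demand fits under the peak, nobody is slowed down.
    Otherwise every job gets the same fraction of what it asked for, and its
    run time grows by the inverse of that fraction.
    """
    total_demand = sum(demands_gbs)
    if total_demand <= peak_gbs:
        return [1.0 for _ in demands_gbs]
    granted_fraction = peak_gbs / total_demand
    return [1.0 / granted_fraction for _ in demands_gbs]


PEAK = 100.0  # GB/s, the figure used in the example above

print("one simulation: ", slowdown_factors(PEAK, [80.0]))
print("two simulations:", slowdown_factors(PEAK, [80.0, 80.0]))
```

In this model each of the two concurrent simulations gets 50GB/s instead of 80GB/s, so each takes longer than it would alone, while the pair may still finish sooner than running them one after the other. Whether that holds on a real machine has to be measured.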
