CFD Online Discussion Forums


Simbelmynė November 23, 2017 13:03

Kernel for new CPUs
 
Hey,

I tried a 7940X setup today with a fresh CentOS installation. As a benchmark I used Palabos (cavity3d).

The performance was abysmal to say the least.

At N=100 the result for 1 thread was about 4.5 msu (mega site updates per second).

With the 4.14 kernel I managed to increase the value to 9 msu.

4 threads: 29 msu
8 threads: 43 msu

Compared to the older CPUs in the reference list below, I feel that I am doing something wrong. The numbers should be higher, right?
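
As a rough check of how the figures above scale, here is a minimal Python sketch that turns the quoted throughputs into speedup and parallel efficiency. It assumes the 1, 4 and 8 thread numbers were all measured on the 4.14 kernel at N=100; the function name `scaling` is only illustrative.

```python
# Throughput figures quoted in the post (msu), assumed to share kernel and N.
throughput_msu = {1: 9.0, 4: 29.0, 8: 43.0}

def scaling(throughput):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread run."""
    base = throughput[1]
    return {n: (v / base, v / base / n) for n, v in sorted(throughput.items())}

for threads, (speedup, eff) in scaling(throughput_msu).items():
    print(f"{threads} threads: speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

An efficiency well below 100% at 8 threads would be consistent with something other than core count limiting the run, but it does not say what.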

http://wiki.palabos.org/plb_wiki:benchmark:cavity_n100

Installing OpenFOAM now to get some regular CFD benchmark data.

flotus1 November 23, 2017 14:25

Seems like there is something wrong. My laptop processor (I5-4210U) is getting 7.3 MLUPS on one core.

Did you recompile after the kernel update?
Did you clear caches before running the benchmark?
Code:

# free && sync && echo 3 > /proc/sys/vm/drop_caches && free
Did you make sure the CPU is running at maximum frequency while solving?
Which memory do you have?
Did you try to bind the process to a core when solving on one core?

Edit: You might want to try a few ordinary benchmarks first. Being able to compare your results with known results for the same hardware helps in finding possible causes of bad performance. Maybe even use a test installation of Windows; the tools available there are easier to use, in my opinion.

Simbelmynė November 24, 2017 07:35

Quote:

Originally Posted by flotus1 (Post 672664)
Seems like there is something wrong. My laptop processor (I5-4210U) is getting 7.3 MLUPS on one core.

Seems reasonable!

Quote:

Originally Posted by flotus1 (Post 672664)
Did you recompile after the kernel update?

Yes, did a "make clean" and "make". At least it was a 2x speed improvement with the new kernel.


Quote:

Originally Posted by flotus1 (Post 672664)
Did you clear caches before running the benchmark?

No, but after testing this suggestion it is still around 9.5-9.9 msu.

Quote:

Originally Posted by flotus1 (Post 672664)
Did you make sure the CPU is running at maximum frequency while solving?

This is interesting. I don't see any use of Turbo Boost: the frequency just stays at 3.1 GHz on all cores. When testing the same benchmark on my 7600k (Ubuntu 16.04) I get the same behavior. Not sure why Turbo Boost is not kicking in.

Quote:

Originally Posted by flotus1 (Post 672664)
Which memory do you have?

Corsair Vengeance LPX 3200 MHz, 4x8 GB. Not sure if XMP is on or not. I am SSHing into the computer, so I cannot check at the moment.

Quote:

Originally Posted by flotus1 (Post 672664)
Did you try to bind the process to a core when solving on one core?

No. How do I do that? I can see that the same core is being used though throughout the benchmark.

Quote:

Originally Posted by flotus1 (Post 672664)
Edit: You might want to try a few ordinary benchmarks first. Being able to compare your results with known results for the same hardware helps in finding possible causes of bad performance. Maybe even use a test installation of Windows; the tools available there are easier to use, in my opinion.

Yeah, perhaps a dual boot with Windows is good anyway. I will test it!


UPDATE: I have now also checked the benchmark on my 7600k and it gives 5.4 msu, under Ubuntu 16.04. The 7600k should be able to out-perform I5-4210U, unless you have it extremely overclocked (seems unlikely in a laptop though), right?

I compile it using the make-file (no changes).
I run it with "./cavity3d 100"
or "mpirun -np 1 ./cavity3d 100"

Am I doing anything differently here?

flotus1 November 24, 2017 08:08

Quote:

Originally Posted by Simbelmynė (Post 672737)
This is interesting. I don't see any use of Turbo Boost: the frequency just stays at 3.1 GHz on all cores. When testing the same benchmark on my 7600k (Ubuntu 16.04) I get the same behavior. Not sure why Turbo Boost is not kicking in.

Another reason to test a different OS. With openSUSE and Mint (and Windows, of course) I never had any issues with turbo not being used.
Maybe an energy-saving option in your OS? Maybe it is deactivated in the BIOS?
Anyway, if your CPU is running at 3.1 GHz, that would explain the mediocre performance. You could try locking the clock speed to a higher value once you have access to the BIOS again and see if this changes anything.

Quote:

Originally Posted by Simbelmynė (Post 672737)
No. How do I do that? I can see that the same core is being used though throughout the benchmark.

You can try using taskset for any program.
For MPI, there must be some environment variables or runtime options controlling thread affinity. I am not an MPI expert yet. But if the process ran on one specific core all along, this is not the cause of your problem.
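
For the single-process case, taskset is typically invoked as `taskset -c 0 ./cavity3d 100`, which restricts the program to CPU 0. The same idea can be sketched from Python on Linux; this is only an illustration, and the helper name `pin_to_first_allowed_cpu` is made up for the example.

```python
import os

def pin_to_first_allowed_cpu():
    """Pin the current process to one CPU; return its id, or None if unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return None                      # platform without the Linux affinity calls
    cpu = min(os.sched_getaffinity(0))   # lowest CPU id this process may run on
    os.sched_setaffinity(0, {cpu})       # pid 0 means "the calling process"
    return cpu

if __name__ == "__main__":
    print("pinned to CPU:", pin_to_first_allowed_cpu())
```

On Linux the affinity mask is inherited by child processes, so a benchmark launched from such a script afterwards would stay on the same core.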

Quote:

Originally Posted by Simbelmynė (Post 672737)
UPDATE: I have now also checked the benchmark on my 7600k and it gives 5.4 msu, under Ubuntu 16.04. The 7600k should be able to out-perform I5-4210U, unless you have it extremely overclocked (seems unlikely in a laptop though), right?

I probably would if I could :D
But no, it is running at 2.7GHz single-core turbo with dual-channel DDR3-1600. Linux Mint 18.2

Quote:

Originally Posted by Simbelmynė (Post 672737)
I compile it using the make-file (no changes).
I run it with "./cavity3d 100"
or "mpirun -np 1 ./cavity3d 100"

Am I doing anything differently here?

Same here. I tried versions 1.5 and 2.0 of the program, with no significant differences.

But I have to say that is a neat and handy CFD Benchmark...

Simbelmynė November 27, 2017 09:32

I have done some testing with OpenFOAM and it seems that Ubuntu correctly applies Turbo Boost to increase the frequency of the 7600K CPU. However, CentOS does not appear to apply Turbo Boost at all to the 7940X CPU.

Not sure if this is a monitoring issue; I have yet to install i7z on CentOS. So far I have only used

Code:

$ lscpu | grep "MHz"
which I believe is not giving correct readings. However, it gives no indication whatsoever that the frequency changes, which is strange.

My main suspicion is that I have to modify the BIOS and turn off some power saving options. Will try it when I have the possibility.

UPDATE: I can now verify that lscpu is not showing the frequency correctly under the 4.14 kernel (it is OK under the 4.10 kernel). As a test I upgraded the Ubuntu installation to 4.14 and got the same problem as in CentOS. Furthermore, I have now also tested i7z under CentOS and it shows the correct turbo frequency.

(I still get 9.86 msu on the 7940X. Even though that feels a bit low, the CPU seems to work reasonably well when compared with the OpenFOAM results.)

flotus1 December 2, 2017 06:02

As I just found out myself while setting up my new workstation: sudo turbostat shows the actual CPU frequency on Linux. While "lscpu" and "cat /proc/cpuinfo | grep MHz" always worked fine for me on Intel systems, including turbo frequencies, they seem to show only base frequencies under certain circumstances.
Btw: AMD Epyc 7301 with DDR4-2133 memory and 2.7GHz clock speed hits around 9.2MLUPS single-core in this benchmark.
You might want to overclock the uncore/cache/ring/mesh/whateveritscallednow on your CPU. It is known to be quite low and the cause for some mediocre benchmark results where Skylake-X gets beaten by its predecessors.
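
To log what /proc/cpuinfo reports while a benchmark runs, and compare it against turbostat or i7z, something like the following Python sketch could be used. It only parses the "cpu MHz" lines, so it inherits the limitation described above and may show base frequencies only. The sample text is a hypothetical excerpt in the /proc/cpuinfo layout, not data from this thread.

```python
import re

def parse_cpu_mhz(cpuinfo_text):
    """Extract the per-core 'cpu MHz' values from /proc/cpuinfo-style text."""
    pattern = r"^cpu MHz\s*:\s*([\d.]+)"
    return [float(v) for v in re.findall(pattern, cpuinfo_text, re.MULTILINE)]

def read_cpu_mhz(path="/proc/cpuinfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_cpu_mhz(f.read())

# Hypothetical sample, for illustration only.
sample = "processor\t: 0\ncpu MHz\t\t: 3100.000\nprocessor\t: 1\ncpu MHz\t\t: 4300.125\n"
print(parse_cpu_mhz(sample))
```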

Simbelmynė December 2, 2017 08:43

Thank you for the suggestions! I do not run calculations on my 7600k, but it is annoying anyway, so I will try to figure out how to solve it ;)


It seems that the Epyc 7301 and 7940X are quite equally matched in single core in this benchmark. The 7940X is turbo boosting to about 4.3 GHz, but it only has 4 memory channels. I expect the difference to be much larger at higher thread count.

I have also tested an 8700k and it yields approx. 13 msu in the same benchmark, so very powerful in single threaded simulations. In the OF motorbike benchmark it hits a wall at 4 threads, with very minor improvement after that.

flotus1 December 2, 2017 11:37

Quote:

Originally Posted by Simbelmynė (Post 673671)
I expect the difference to be much larger at higher thread count.

Indeed :D
Running the 100 benchmark size on all 32 cores I get 162 MLUPS.

Simbelmynė December 13, 2017 08:41

I have an interesting observation.

Running the Palabos benchmark in VirtualBox using Ubuntu 17.10 I get 11.6 msu with the 7600k CPU. This is a rather massive improvement over the 5.4 msu I get when booting into Ubuntu 16.04 on the same machine. Not sure if the difference is due to the VirtualBox guest being Ubuntu 17.10 (as opposed to the 16.04 I have installed) or if it has something to do with VirtualBox itself.

Btw, changing "uncore/cache/ring/mesh/whateveritscallednow" had no effect.

flotus1 December 13, 2017 11:04

Weird. I suspect that something is still bottlenecking your i9 CPU. Did you already try a clean install with a more recent Linux version? I am currently running openSUSE Tumbleweed on my AMD workstation, which works quite well.

Simbelmynė December 13, 2017 17:11

OK, so I have done some more testing.

Still the cavity3d test case from Palabos at N=100, single thread.

7600K
Linux Mint 18.3 - approx. 6 msu (kernel 4.10.0)
Ubuntu 16.04 - approx. 6 msu (kernel 4.10.0)
Ubuntu 16.04 (Virtualbox) - 5.9 msu (kernel 4.10.0)
Ubuntu 17.10 (Virtualbox) - 11.6 msu (kernel 4.13.0)

8700k
CentOS 7.4.1708 - 2.8 msu (kernel 3.10.0) (yes I double checked this one!)
CentOS 7.4.1708 - 12.8 msu (kernel 4.14.2)

Threadripper 1950X
CentOS 7.4.1708 - 8.1 msu (kernel 3.10.0)
CentOS 7.4.1708 - 7.6 msu (kernel 4.14.2)

I9 7940X
CentOS 7.4.1708 - 4.5 msu (kernel 3.10.0)
CentOS 7.4.1708 - 10.7 msu (kernel 4.14.1-1) *updated*

Epyc 7301 (from flotus1)
CentOS 7 - 9.2 msu (kernel 4.14-3-1)
OpenSUSE Tumbleweed - 9.37 msu (kernel 4.14-3-1)


The top result for each CPU is more or less in line with the frequency and IPC of each model, with one big exception - the EPYC - which performs much better for some reason. Perhaps the 4.14.2 kernel is not good enough and still limits the 7940X and the Threadripper 1950X?

It is very clear that the kernel has a dramatic impact on the performance of this benchmark. However, it seems that the latest is not always the greatest. This is true for the Threadripper case only though. I will keep a close eye on the kernel releases, now that there is once again competition in the high-end segment of computing, leading to more frequent releases of new models.

flotus1 December 14, 2017 04:08

IIRC the 9.2 MLUPS single-core result was on CentOS 7 with kernel version 4.14-3-1.
Now with Tumbleweed (same kernel) this slightly improved to 9.37.

Simbelmynė December 14, 2017 06:52

OK, thanks! I will try out the 4.14-3-1 kernel and see if that helps the Threadripper.

How about the results? Do you think this benchmark is bandwidth limited for 1 thread? It doesn't seem to be when looking at the Intel line-up. But your EPYC clearly says otherwise. :confused:

flotus1 December 14, 2017 07:43

I don't think so; a memory bandwidth limit on a single core would be rather unusual. Apart from that, your i9 CPU has more memory bandwidth available to a single core: AMD Epyc only has 2 memory channels per die, while your CPU has 4 running at an even higher frequency.
Which MPI library did you use and how did you install it? I downloaded Open MPI 3.0 and compiled it from source: https://www.open-mpi.org/software/ompi/v3.0/
And which compiler version are you using?

Simbelmynė January 3, 2018 09:26

Ok, so I have realized that the test case is most likely flawed. After testing all currently stable kernels I got very inconclusive results.

The compiler itself seems important (but not always, on CentOS the kernel version seemed more important :confused:).

Using Linux Mint 18.3 with g++ 5.3.1 and the 4.13.0 kernel, my 8700k managed 6.9 msu. However, with g++ 7.2 under Ubuntu 17.10 (also the 4.13.0 kernel) I got 14 msu.

Installing g++ 7.1 on the Linux Mint installation resulted in a benchmark value of 623 msu !! :D:D

So now I have tested the cavity3d example under examples/showCases/ instead. It gives the same results regardless of compiler and kernel.

Would be interesting to hear what results you get when running that case with 1 thread.

I changed the code to reduce console output and disk writes, e.g.:


Code:

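const T logT    = (T)1 / (T)1;   // log statistics every 1 time unit
const T imSave  = (T)10 / (T)1;  // write a GIF image every 10 time units
const T vtkSave = (T)10 / (T)1;  // write a VTK file every 10 time units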

With 1 thread I got 10m20s, and with all 6 threads this decreased to 3m10s (real time), measured with:

Code:

time mpirun -np 6 ./cavity3d
Intel 7940X (14 threads): 2m45s
Threadripper 1950X (16 threads): 2m01s
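
For reference, the 1-thread and 6-thread wall-clock times quoted above (presumably from the 8700k, given its six cores) can be turned into a speedup and a parallel efficiency with a short Python sketch. The helper names are illustrative only.

```python
import re

def to_seconds(text):
    """Convert a duration such as '10m20s' or '0m59.133s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", text.strip())
    if m is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

def speedup_and_efficiency(t_serial, t_parallel, n_procs):
    """Speedup of the parallel run and speedup per process."""
    s = to_seconds(t_serial) / to_seconds(t_parallel)
    return s, s / n_procs

s, e = speedup_and_efficiency("10m20s", "3m10s", 6)
print(f"speedup {s:.2f}x, efficiency {e:.0%}")
```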

flotus1 January 3, 2018 14:00

I applied the changes to the source code as you suggested. Output for dual AMD Epyc 7301:
Code:

as01449@localhost:~/benchmarks/palabos-v2.0r0/examples/showCases/cavity3d> time mpirun -np 32 cavity3d
omega= 1.53846
Writing Gif ...
step 0; t=0; av energy=6.12745098e-07; av rho=1
Time spent during previous iteration: 0
step 5000; t=1; av energy=1.906470281e-06; av rho=0.9999633641
Time spent during previous iteration: 0.001128533
step 10000; t=2; av energy=1.907210258e-06; av rho=0.9999266731
Time spent during previous iteration: 0.001125073
step 15000; t=3; av energy=1.90721169e-06; av rho=0.9998899833
Time spent during previous iteration: 0.001120143
step 20000; t=4; av energy=1.907211692e-06; av rho=0.9998532949
Time spent during previous iteration: 0.001123033
step 25000; t=5; av energy=1.907211692e-06; av rho=0.9998166078
Time spent during previous iteration: 0.001139313
step 30000; t=6; av energy=1.907211692e-06; av rho=0.999779922
Time spent during previous iteration: 0.001128202
step 35000; t=7; av energy=1.907211692e-06; av rho=0.9997432376
Time spent during previous iteration: 0.001129653
step 40000; t=8; av energy=1.907211692e-06; av rho=0.9997065546
Time spent during previous iteration: 0.001125473
step 45000; t=9; av energy=1.907211692e-06; av rho=0.9996698728
Time spent during previous iteration: 0.001126403
Writing Gif ...
Saving VTK file ...
step 50000; t=10; av energy=1.907211692e-06; av rho=0.9996331925
Time spent during previous iteration: 0.001134143

real    0m59.133s
user    29m53.143s
sys    1m29.464s
as01449@localhost:~/benchmarks/palabos-v2.0r0/examples/showCases/cavity3d> time mpirun -np 1 cavity3d
omega= 1.53846
Writing Gif ...
step 0; t=0; av energy=6.12745098e-07; av rho=1
Time spent during previous iteration: 0
step 5000; t=1; av energy=1.906470281e-06; av rho=0.9999633641
Time spent during previous iteration: 0.021283929
step 10000; t=2; av energy=1.907210258e-06; av rho=0.9999266731
Time spent during previous iteration: 0.021312529
step 15000; t=3; av energy=1.90721169e-06; av rho=0.9998899833
Time spent during previous iteration: 0.021273699
step 20000; t=4; av energy=1.907211692e-06; av rho=0.9998532949
Time spent during previous iteration: 0.021283039
step 25000; t=5; av energy=1.907211692e-06; av rho=0.9998166078
Time spent during previous iteration: 0.021346609
step 30000; t=6; av energy=1.907211692e-06; av rho=0.999779922
Time spent during previous iteration: 0.021269809
step 35000; t=7; av energy=1.907211692e-06; av rho=0.9997432376
Time spent during previous iteration: 0.021288709
step 40000; t=8; av energy=1.907211692e-06; av rho=0.9997065546
Time spent during previous iteration: 0.021296029
step 45000; t=9; av energy=1.907211692e-06; av rho=0.9996698728
Time spent during previous iteration: 0.021267211
Writing Gif ...
Saving VTK file ...
step 50000; t=10; av energy=1.907211692e-06; av rho=0.9996331925
Time spent during previous iteration: 0.02127318

real    17m56.751s
user    17m54.621s
sys    0m0.237s
as01449@localhost:~/benchmarks/palabos-v2.0r0/examples/showCases/cavity3d> gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/7/lto-wrapper
OFFLOAD_TARGET_NAMES=hsa:nvptx-none
Target: x86_64-suse-linux
Configured with: ../configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,ada,go --enable-offload-targets=hsa,nvptx-none=/usr/nvptx-none, --without-cuda-driver --enable-checking=release --disable-werror --with-gxx-include-dir=/usr/include/c++/7 --enable-ssp --disable-libssp --disable-libvtv --disable-libcc1 --enable-plugin --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux' --with-slibdir=/lib64 --with-system-zlib --enable-__cxa_atexit --enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-version-specific-runtime-libs --with-gcc-major-version-only --enable-linker-build-id --enable-linux-futex --enable-gnu-indirect-function --program-suffix=-7 --without-system-libunwind --enable-multilib --with-arch-32=x86-64 --with-tune=generic --build=x86_64-suse-linux --host=x86_64-suse-linux
Thread model: posix
gcc version 7.2.1 20171020 [gcc-7-branch revision 253932] (SUSE Linux)
as01449@localhost:~/benchmarks/palabos-v2.0r0/examples/showCases/cavity3d>
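
The two `time` summaries in the output above can be compared directly. The following Python sketch parses the real/user/sys lines and derives the 32-process speedup, the parallel efficiency, and the fraction of the available core time that was spent busy. It is a rough reading of the logged numbers, not a profiling tool, and the variable names are illustrative.

```python
import re

def parse_time_output(text):
    """Parse the real/user/sys lines printed by the shell's `time` into seconds."""
    out = {}
    pattern = r"^(real|user|sys)\s+(\d+)m([\d.]+)s"
    for key, minutes, seconds in re.findall(pattern, text, re.MULTILINE):
        out[key] = int(minutes) * 60 + float(seconds)
    return out

# Values copied from the output above.
run_32 = parse_time_output("real    0m59.133s\nuser    29m53.143s\nsys    1m29.464s\n")
run_1 = parse_time_output("real    17m56.751s\nuser    17m54.621s\nsys    0m0.237s\n")

speedup = run_1["real"] / run_32["real"]
efficiency = speedup / 32
busy_fraction = (run_32["user"] + run_32["sys"]) / (32 * run_32["real"])
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}, cores busy {busy_fraction:.1%}")
```

If the cores are busy almost all of the time while the efficiency is much lower, the processes are spending their time on something other than useful lattice updates, for example waiting on memory or communication; the log alone cannot tell which.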


Simbelmynė January 3, 2018 14:35

Fantastic, thank you!

So the dual Epyc is twice as fast as the Threadripper in this benchmark! It is interesting to note that the 1950X is actually (slightly) faster than the 7940X even at 14 threads. I did not expect that. The 8700k is superior in single threaded performance (as it should be, the previous results were really confusing).

Do you think the system size is large enough to utilize the dual Epyc bandwidth, or will larger sizes yield even bigger differences compared to the 1950X and 7940X? Of course twice the speed is huge in its own right, I'm just asking from a pure price/performance viewpoint ;)

flotus1 January 4, 2018 07:08

The system size here is 50x50x50? Then I would expect the gap to become larger with increased problem size.
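
If the lattice really is 50x50x50, the per-iteration times in the log above can be converted to a throughput in MLUPS. The Python sketch below assumes exactly 50^3 sites (the solver may in fact use a slightly different count, such as N+1 nodes per edge) and assumes that "Time spent during previous iteration" is the time for one step, which is consistent with 50000 steps taking roughly the reported wall-clock time.

```python
def mlups(n_sites, seconds_per_iteration):
    """Million lattice site updates per second for one time step."""
    return n_sites / seconds_per_iteration / 1e6

# ASSUMPTION: 50 x 50 x 50 lattice sites; the true count may differ.
n_sites = 50 ** 3

single = mlups(n_sites, 0.021283929)    # -np 1, value logged at step 5000
parallel = mlups(n_sites, 0.001128533)  # -np 32, value logged at step 5000
print(f"1 process: {single:.1f} MLUPS, 32 processes: {parallel:.1f} MLUPS")
```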

Simbelmynė January 4, 2018 09:09

Running at N=100 (one million cells):

1950X: 29m16s
7940X: 37m39s

The AMD system is 29% faster. At N=50 (above) the AMD system is 36% faster, so the gap becomes smaller, but it is still visibly in favor of the Threadripper.
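
The percentages follow from the ratio of the wall-clock times. A small Python sketch of that arithmetic, using the N=100 times above and the earlier N=50 times (7940X with 14 threads, 1950X with 16 threads); whether the N=100 runs used the same thread counts is not stated, so that part is an assumption.

```python
def percent_faster(slow_seconds, fast_seconds):
    """How much faster the quicker run is, as a percentage gain in throughput."""
    return (slow_seconds / fast_seconds - 1.0) * 100.0

# Wall-clock times from the thread, converted to seconds.
n100 = percent_faster(37 * 60 + 39, 29 * 60 + 16)  # 7940X vs 1950X at N=100
n50 = percent_faster(2 * 60 + 45, 2 * 60 + 1)      # 7940X vs 1950X at N=50
print(f"N=100: {n100:.0f}% faster, N=50: {n50:.0f}% faster")
```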

flotus1 January 4, 2018 09:51

My bad, I thought you were referring to parallel execution times compared to the Epyc setup.


All times are GMT -4.