[Other] howto optimize OpenFOAM for Core i7 CPU using extended instruction set

cutter · January 23, 2015, 04:49

Hi,

I tried to compile and optimize OpenFOAM for some new Core i7 CPUs with AVX2 and FMA. As far as I understand, the default settings use the generic x86_64 instruction set. I forced the compiler to optimize for the extended instruction set by adding the -march=corei7 flag in /wmake/rules/linux64Gcc/c++Opt and /wmake/rules/linux64Gcc/cOpt. The compiler picked up the settings, but my first benchmarks did not show any noticeable effect. I've been using a single thread for my cases in order to rule out MPI wait times and measure the raw CPU performance.
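
A minimal sketch of that kind of change, for illustration only. It assumes the rules file contains a line of the form `c++OPT      = -O3` (or `cOPT        = -O3`), which should be checked against the local copy first; the helper name `add_march_flag` is made up here. Note that in GCC's naming `corei7` refers to the first Core i7 generation (SSE4.2 at most), so AVX2 and FMA are only enabled by values such as `core-avx2` or `native`.

```shell
# Append an architecture flag to the c++OPT / cOPT line of a wmake rules file.
# Assumption: the file has a line like "c++OPT      = -O3".
add_march_flag() {
    rules_file=$1      # e.g. wmake/rules/linux64Gcc/c++Opt inside the OpenFOAM tree
    march_flag=$2      # e.g. -march=native or -march=core-avx2

    # Do nothing if the flag is already present.
    if grep -qF -- "$march_flag" "$rules_file"; then
        return 0
    fi

    cp "$rules_file" "$rules_file.orig"      # keep a backup of the original
    sed "s|^\(c[+]*OPT[[:space:]]*=.*\)\$|\1 $march_flag|" \
        "$rules_file.orig" > "$rules_file"
}

# Hypothetical usage:
#   add_march_flag wmake/rules/linux64Gcc/c++Opt -march=native
#   add_march_flag wmake/rules/linux64Gcc/cOpt   -march=native
```

The new flag only takes effect for code that is recompiled afterwards, so a full rebuild of OpenFOAM is needed before benchmarking.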

I've got two questions regarding this issue:

1. Is this the best or correct way to set the compiler flags?

2. What performance gain can be expected from optimized binaries?

Many Thanks
Cutter

wyldckat · January 24, 2015, 10:16

Greetings Cutter,

In theory, AVX should increase performance in mathematical operations, for any application, after compiling with the necessary options. But I'm not sure whether, or how much, OpenFOAM takes advantage of this, although the compiler usually handles this kind of optimization on its own.

In addition, it also depends on the GCC version you're using. It's possible that you're using a GCC version that is new enough to already apply this optimization by default, which would explain why you don't notice any performance difference with and without the option.

Therefore, please provide the following details:

* CPU model you're using.
* GCC version you're using.
* Linux Distribution you're using.

I ask so that it's easier to diagnose why this is either already working or not working at all.

Best regards,
Bruno

fra76 · January 25, 2015, 02:26

Hi all,

I have the same experience as Cutter. Over time I have tried many OpenFOAM versions, GCC versions, CPUs and operating systems, without getting any measurable improvement from machine-specific optimisation.
My last test was a few weeks ago, with gcc 4.9.2 on very recent hardware with two different CPUs. The -march option was correctly applied in both cases and the compilation itself took about 3 times longer, but the running time of the motorBike tutorial was almost exactly the same, both for meshing and for the solution.

It would be interesting to know if anyone has a different experience and could point out the compiler options used.

Best regards,
Francesco

cutter · January 30, 2015, 09:19

Hi,

thanks to both of you for the initial feedback!

I'm currently targeting the following two CPU models (obtained via cat /proc/cpuinfo and g++ --version):
* Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz, g++ (GCC) 4.8.3 20140911 (Red Hat 4.8.3-7)
* Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, g++ (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16)

I'm currently doing the research on the first of the two machines, which is running Fedora release 19 (Schrödinger’s Cat) with a KDE desktop installation:

Code:

$ uname -a
Linux hostname 3.14.23-100.fc19.x86_64 #1 SMP Thu Oct 30 18:36:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

As far as I understand, this version of g++ doesn't target the complete instruction set by default. This can be checked with the following command:

Code:

g++ -dM -E -x c /dev/null | grep -i -e avx -e fma
<<no output here>>

When specifying the concrete architecture via the -march option, the compiler enables the extended instruction sets and defines the corresponding preprocessor macros:

Code:

g++ -march=core-avx2 -dM -E -x c /dev/null | grep -i -e avx -e fma
#define __core_avx2__ 1
#define __AVX__ 1
#define __FP_FAST_FMAF 1
#define __FMA__ 1
#define __AVX2__ 1
#define __tune_core_avx2__ 1
#define __core_avx2 1
#define __FP_FAST_FMA 1

The same thing happens when I let the compiler choose the instruction set using the -march=native option:

Code:

$ g++ -march=native -dM -E -x c /dev/null | grep -i -e avx -e fma
#define __core_avx2__ 1
#define __AVX__ 1
#define __FP_FAST_FMAF 1
#define __FMA__ 1
#define __AVX2__ 1
#define __tune_core_avx2__ 1
#define __core_avx2 1
#define __FP_FAST_FMA 1
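
The two checks above can be combined into a small helper that prints only the macros a given flag adds on top of the compiler defaults. This is a sketch: the function names `simd_macros` and `gained_macros` are made up for illustration, `CXX` is assumed to point at a GCC-compatible compiler (default `g++`), and `sse` is added to the grep pattern so the baseline is not empty.

```shell
# Print the SIMD-related predefined macros for the given compiler flags.
simd_macros() {
    # "$@" are optional extra flags such as -march=native
    "${CXX:-g++}" "$@" -dM -E -x c /dev/null | grep -i -e avx -e fma -e sse | sort
}

# Print only the macros that appear when the given flags are added.
gained_macros() {
    tmp_default=$(mktemp)
    tmp_flagged=$(mktemp)
    simd_macros      > "$tmp_default"
    simd_macros "$@" > "$tmp_flagged"
    comm -13 "$tmp_default" "$tmp_flagged"   # lines unique to the second file
    rm -f "$tmp_default" "$tmp_flagged"
}

# Hypothetical usage:
#   gained_macros -march=native
#   gained_macros -march=core-avx2
```

An empty result would mean the flag changes nothing relative to the defaults, which is the situation Bruno asked about.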

Hope that will help to shed some light on the issue.

Best Regards
Cutter

fra76 · February 2, 2015, 00:24

Hi Cutter,

Nice checks!
Now we know the compiler is doing its job, or at least is enabling the set of instructions specific to the CPUs, as I think we all expected.

Now the questions are: is it able to use them when compiling OpenFOAM? Does this make any difference to the execution time?
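
One possible way to approach the first question is to disassemble a compiled library and count the lines that touch the 256-bit `ymm` registers, which only AVX code uses. The sketch below is an illustration, not a definitive test: the helper name `count_ymm_lines` is made up, it assumes objdump's default AT&T syntax (registers written as `%ymm0` and so on), and the library path in the usage comment is an assumption about a typical build.

```shell
# Count disassembly lines that reference 256-bit AVX registers (%ymm0..%ymm15).
# Reads objdump output on stdin, so it works for any object file or library.
# Note: grep -c prints 0 and returns a non-zero status when nothing matches.
count_ymm_lines() {
    grep -c '%ymm[0-9]'
}

# Hypothetical usage (adjust the path to your own build):
#   objdump -d "$FOAM_LIBBIN/libfiniteVolume.so" | count_ymm_lines
```

Comparing the count between a default build and a `-march` build gives a rough idea of how much vector code the compiler actually generated; it says nothing about whether those instructions sit in hot loops.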

Francesco

zhulianhua · August 29, 2015, 10:27

Hi, Francesco,
Recently, I compared the performance of OF with icc and gcc.
The two configurations are:
#1. icc 15.0.0, OpenFOAM-2.4.0, running on an E5-2680 v3 @ 2.5 GHz, compiled with the -xHost -O3 flags; OS: CentOS 6.5 x64; RAM: DDR4
#2. gcc 4.8.1, OpenFOAM-2.3.0, running on an E5-2697 v2 @ 2.7 GHz, compiled with the default -m64 flag; OS: CentOS 7.0 x64; RAM: DDR3

NOTE a): "-xHost will cause icc/icpc or icl to check the cpu information and find the highest level of extended instructions support to use."
NOTE b): The E5-2680 v3 supports AVX2 instructions while the E5-2697 v2 doesn't.

I ran the cavity flow case in $FOAM_TUT/incompressible/icoFoam/cavity without modifying any of its files, using only one process.

Results:
The Icc configuration (#1) takes 0.16s
The Gcc configuration (#2) takes 0.15s

You see, almost the same!

Hope this testing helps,

--
Lianhua


wyldckat · December 28, 2015, 20:19

Greetings to all!

I've had this thread on my to-do list and I haven't reached a solution yet. Nonetheless, I've done some basic tests that can at least give us a feeling for the speed-up we can hope for. The repository is available here: https://github.com/wyldckat/avxtest

The source code does not depend on OpenFOAM and needs only GCC (4.7 or newer) to build. The summary results were as follows (using an AMD A10-7850K):

float (single precision):
- x86 FPU: 44478.285 ms
- x86 AVX: 6253.096 ms
double (double precision):
- x86 FPU: 44543.217 ms
- x86 AVX: 13095.627 ms

Which makes for an interesting result: for the people that think that single precision calculations with 64-bit processors will be faster than using double precision... well, on the plain x86 FPU they are actually wasting electricity by not investing in a more accurate result.
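
To put the timings above into ratios, here is a small sketch (the helper name `speedup` is made up for illustration). It divides a baseline time by an optimized time, both in the same units.

```shell
# Print baseline/optimized as a speed-up factor with two decimals.
speedup() {
    # $1 = baseline time, $2 = optimized time (same units)
    LC_ALL=C awk -v base="$1" -v opt="$2" 'BEGIN { printf "%.2f\n", base / opt }'
}

speedup 44478.285 6253.096     # float:  x86 FPU vs AVX    -> 7.11
speedup 44543.217 13095.627    # double: x86 FPU vs AVX    -> 3.40
speedup 13095.627 6253.096     # AVX:    double vs float   -> 2.09
```

These factors are in line with what the register width suggests: a 256-bit AVX register holds 8 floats or 4 doubles, so single precision has roughly twice the throughput once the code is vectorized, while on the scalar FPU path the two precisions take about the same time.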

As for OpenFOAM, I still need to look into this in more detail. The compiler should be able to vectorize things on its own, but it seems that the code must be prepared in a way that the compiler can understand "oh, this I can vectorize like so and so".
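
As a rough illustration of what "prepared in a way that the compiler can understand" can look like, the sketch below compiles a tiny stand-alone loop over contiguous, non-aliasing arrays and asks GCC to report whether it vectorized it. The assumptions: `CXX` is GCC 4.9 or newer (as far as I know that is where `-fopt-info-vec-optimized` is available), and the helper name `check_vectorized` and the `axpy` test loop are made up for this example; none of this is OpenFOAM code.

```shell
# Compile a small test loop and count GCC's "vectorized" report lines.
# A result of 1 or more means GCC reported vectorizing the loop.
check_vectorized() {
    # "$@" are optional extra flags such as -march=native
    workdir=$(mktemp -d)
    cat > "$workdir/axpy.cpp" <<'EOF'
// y[i] += a * x[i] over contiguous arrays; __restrict__ promises no aliasing
void axpy(double a, const double* __restrict__ x, double* __restrict__ y, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
EOF
    "${CXX:-g++}" -O3 "$@" -fopt-info-vec-optimized \
        -c "$workdir/axpy.cpp" -o "$workdir/axpy.o" 2>&1 | grep -c 'vectorized'
    rm -rf "$workdir"
}

# Hypothetical usage:
#   check_vectorized                 # default instruction set
#   check_vectorized -march=native   # instruction set of the build machine
```

Loops that go through indirect addressing or opaque function calls are much harder for the compiler to vectorize than this one, which may be part of why the flag alone shows so little effect on a large code base.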

Best regards,
Bruno
