CFD Online Discussion Forums - [Other] howto optimize OpenFOAM for Core i7 CPU using extended instruction set

howto optimize OpenFOAM for Core i7 CPU using extended instruction set

Hi,

I tried to compile and optimize OpenFOAM for some new Core i7 CPUs with AVX2 and FMA. As far as I understand, the default settings target the generic x86_64 instruction set. I forced the compiler to optimize for the extended instruction set by adding the -march=corei7 flag in /wmake/rules/linux64Gcc/c++Opt and /wmake/rules/linux64Gcc/cOpt. The compiler accepted the settings, but my first benchmarks did not show any noticeable effect. I ran my cases on a single thread in order to rule out MPI wait times and measure the raw CPU performance.
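For readers who want to reproduce the change, here is a minimal sketch of appending an -march flag to the optimisation line of a rules file. It assumes the file contains a line starting with `c++OPT` (the layout written by the printf below is an assumption), uses -march=native as a stand-in for whichever value is chosen, and works on a scratch copy rather than the real OpenFOAM tree:

```shell
# Scratch copy standing in for wmake/rules/linux64Gcc/c++Opt (assumed layout).
rules_file="$(mktemp)"
printf 'c++DBUG     =\nc++OPT      = -O3\n' > "$rules_file"

# Append the flag only if no -march option is present yet.
if ! grep -q -- '-march=' "$rules_file"; then
    # In basic regular expressions '+' is a literal character.
    sed -i.bak 's/^\(c++OPT[[:space:]]*=.*\)$/\1 -march=native/' "$rules_file"
fi

cat "$rules_file"
```

The same edit would apply to the `cOPT` line in cOpt. The libraries have to be rebuilt afterwards for the flag to have any effect.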

I've got two questions regarding this issue:

1. Is this the best or correct way to set the compiler flags?

2. What performance gain can be expected from optimized binaries?

Many Thanks
Cutter

Greetings Cutter,

In theory, AVX should speed up mathematical operations in any application, once it is compiled with the necessary options. I'm not sure whether, or how much, OpenFOAM takes advantage of this, although the compiler usually handles this kind of optimization on its own.

In addition, it depends on the GCC version you're using. It's possible that you're using a GCC version that is new enough to already apply this optimization by default, which would explain why you don't notice any performance difference with and without the option.

Therefore, please provide the following details:

CPU model you're using.
GCC version you're using.
Linux Distribution you're using.

I ask because these details make it easier to diagnose whether the optimization is already active by default or is not being applied at all.

Best regards,
Bruno

Hi all,

I have the same experience as Cutter. Over time I have tried many OpenFOAM versions, GCC versions, CPUs and operating systems, without getting any measurable improvement from machine-specific optimisation.
My last test was a few weeks ago, with gcc 4.9.2 on very recent hardware with two different CPUs. The -march option was correctly applied in both cases and the compilation itself took about 3 times longer, but the running time of the motorBike tutorial was almost exactly the same, both for meshing and for the solution.

It would be interesting to know if anyone has a different experience and could point out the compiler options used.

Best regards,
Francesco

Hi,

thanks to both of you for the initial feedback!

I'm currently targeting the following two CPU models (obtained via cat /proc/cpuinfo and g++ --version):
* Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz, g++ (GCC) 4.8.3 20140911 (Red Hat 4.8.3-7)
* Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, g++ (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16)

I'm currently doing the research on the first of the two machines, which runs Fedora release 19 (Schrödinger’s Cat) with a KDE desktop installation:

Code:

$ uname -a
Linux hostname 3.14.23-100.fc19.x86_64 #1 SMP Thu Oct 30 18:36:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

As far as I understand, this version of g++ does not enable the full instruction set by default. This can be checked with the following command:

Code:

$ g++ -dM -E -x c /dev/null | grep -i -e avx -e fma
<<no output here>>

When the target architecture is specified explicitly via the -march option, the compiler enables the extensions and defines the corresponding preprocessor macros:

Code:

$ g++ -march=core-avx2 -dM -E -x c /dev/null | grep -i -e avx -e fma
#define __core_avx2__ 1
#define __AVX__ 1
#define __FP_FAST_FMAF 1
#define __FMA__ 1
#define __AVX2__ 1
#define __tune_core_avx2__ 1
#define __core_avx2 1
#define __FP_FAST_FMA 1

The same thing happens when I let the compiler choose the instruction set using the -march=native option:

Code:

$ g++ -march=native -dM -E -x c /dev/null | grep -i -e avx -e fma
#define __core_avx2__ 1
#define __AVX__ 1
#define __FP_FAST_FMAF 1
#define __FMA__ 1
#define __AVX2__ 1
#define __tune_core_avx2__ 1
#define __core_avx2 1
#define __FP_FAST_FMA 1
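The macro lists above show what the compiler is allowed to emit. As a complementary check, the CPU's own capabilities can be read from the flags line of /proc/cpuinfo. A minimal sketch: the helper name `cpu_simd_flags` is made up for illustration, and the demo uses a synthetic file so the output is predictable:

```shell
# Print which of avx, avx2 and fma appear in the first "flags" line
# of a cpuinfo-style file (defaults to /proc/cpuinfo).
cpu_simd_flags() {
    grep -m1 '^flags' "${1:-/proc/cpuinfo}" \
        | tr ' ' '\n' \
        | grep -x -e avx -e avx2 -e fma \
        | sort \
        | paste -sd' ' -
}

# Synthetic stand-in for /proc/cpuinfo (not taken from a real machine).
demo="$(mktemp)"
printf 'model name\t: demo CPU\nflags\t\t: fpu sse2 avx fma avx2\n' > "$demo"
cpu_simd_flags "$demo"
```

On a real Linux machine the function can be called without an argument. If a flag is missing from the result, -march=native should not enable the matching macro either.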

Hope that will help to shed some light on the issue.

Best Regards
Cutter

Hi Cutter,

Nice checks!
Now we know the compiler is doing its job, or at least is enabling the set of instructions specific to the CPUs, as I think we all expected.

Now the questions are: is it able to use them when compiling OpenFOAM? Does this make any difference to the execution time?
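One rough way to approach the first question is to look for the 256-bit AVX register names (ymm) in the disassembly of a compiled library. This is a sketch under assumptions: objdump from binutils is installed, FOAM_LIBBIN is set by a sourced OpenFOAM environment, and `count_ymm_lines` is a name made up here.

```shell
# Count disassembly lines (read from stdin) that reference ymm registers.
count_ymm_lines() {
    grep -c 'ymm' || true   # grep exits non-zero when the count is 0
}

# Assumed target: the main OpenFOAM library of the sourced environment.
lib="${FOAM_LIBBIN:-}/libOpenFOAM.so"
if command -v objdump >/dev/null 2>&1 && [ -f "$lib" ]; then
    objdump -d "$lib" | count_ymm_lines
fi
```

A nonzero count only shows that such instructions exist somewhere in the binary. It says nothing about whether they sit in the loops that dominate the run time, so it cannot answer the second question.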

Francesco

Hi, Francesco,
Recently, I compared the performance of OF with icc and gcc.
The two configurations are:
#1. Icc 15.0.0, OpenFOAM-2.4.0, runs on E5-2680v3@2.5 GHz, compiled with -xHost -O3 flag, OS: CentOS 6.5 x64, RAM DDR4
#2. Gcc-4.8.1, OpenFOAM-2.3.0, runs on E5-2697v2@2.7 GHz, compiled with the default -m64 flag, OS: CentOS 7.0 x64, RAM DDR3

NOTE a): "-xHost will cause icc/icpc or icl to check the cpu information and find the highest level of extended instructions support to use."
NOTE b): E5-2680v3 supports AVX2 instructions, while E5-2697v2 does not.

I ran the cavity flow case in $FOAM_TUT/incompressible/icoFoam/cavity without modifying any files in it, using only one process.

Results:
The icc configuration (#1) took 0.16 s
The gcc configuration (#2) took 0.15 s

You see, almost the same!
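For anyone repeating such a comparison, here is a small sketch for pulling the final run time out of a solver log. It assumes the solver output was saved to a file and contains lines of the form `ExecutionTime = <seconds> s ...`. The helper name `last_execution_time` is made up, and the log below is synthetic:

```shell
# Print the last ExecutionTime value (in seconds) found in a solver log.
last_execution_time() {
    awk '$1 == "ExecutionTime" { t = $3 } END { if (t != "") print t }' "$1"
}

# Synthetic log used only to demonstrate the parsing.
log="$(mktemp)"
printf 'Time = 0.495\nExecutionTime = 0.15 s  ClockTime = 0 s\n\nTime = 0.5\nExecutionTime = 0.16 s  ClockTime = 0 s\n' > "$log"
last_execution_time "$log"
```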

Hope this testing helps,

--
Lianhua

Greetings to all!

I've had this thread on my to-do list and I haven't reached a solution yet. Nonetheless, I've done some basic tests that can at least give us a feeling for the speed-up we can hope for. The repository is available here: https://github.com/wyldckat/avxtest

The source code does not depend on OpenFOAM and needs only GCC (4.7 or newer) to build. The summary results were as follows (using an AMD A10-7850K):

float (single precision):
- x86 FPU: 44478.285 ms
- x86 AVX: 6253.096 ms
double (double precision):
- x86 FPU: 44543.217 ms
- x86 AVX: 13095.627 ms
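The ratios behind these timings can be worked out with a few lines of shell. The helper name `speedup` is made up for illustration. The numbers are the ones listed above:

```shell
# Print baseline/improved as a ratio with two decimals.
speedup() {
    awk -v base="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", base / fast }'
}

echo "float,  FPU vs AVX:   $(speedup 44478.285 6253.096)x"
echo "double, FPU vs AVX:   $(speedup 44543.217 13095.627)x"
echo "AVX, double vs float: $(speedup 13095.627 6253.096)x"
```

This gives roughly 7.1x for float, 3.4x for double, and a factor of about 2.1 between double and float on the AVX path.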

Which makes for an interesting result: on the plain FPU path, single precision is no faster than double precision (about 44.5 s in both cases), so people who choose single precision there hoping for speed are actually wasting electricity by not investing in a more accurate result ;) With AVX the picture changes: single precision takes roughly half the time of double precision (about 6.3 s versus 13.1 s).

As for OpenFOAM, I still need to look into this in more detail. The compiler should be able to vectorize loops on its own, but it seems that the code must be written in a way that lets the compiler recognize which loops it can safely vectorize.

Best regards,
Bruno