CFD Online Discussion Forums


tre95 June 16, 2019 07:36

OpenMPI error at the beginning of parallel OpenFOAM Simulation
 
Hello everyone,


I am currently using a deep learning tool (TensorFlow) to evaluate an artificial neural network during my OpenFOAM simulation. To do so, I used the TensorFlow C API and wrote my own code. I had to include some headers and link against some shared libraries, but everything worked fine, including parallel runs with OpenMPI.


However, I now wanted to speed up the TensorFlow calls, so I compiled TensorFlow from source and activated AVX support (which my CPU supports). This produced new headers and .so files. Since then, the following situation has occurred:


- Before the upgrade to AVX: Both single-core runs and parallel simulations using mpirun worked without problems.
- After the upgrade to AVX: Single-core runs work perfectly and are 60 % faster during the ANN usage. However, if I use mpirun on several cores, I get this error (it is repeated once for each core I want to use in parallel):
Code:

[node134:10568] *** Process received signal ***
[node134:10568] Signal: Segmentation fault (11)
[node134:10568] Signal code: Address not mapped (1)
[node134:10568] Failing at address: (nil)
[node134:10568] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7fac03c53f20]
[node134:10568] [ 1] /home/elias/OpenFOAM/elias-4.1/platforms/linux64GccDPInt32Opt/lib/libtensorflow_framework.so.1(hwloc_bitmap_and+0x14)[0x7fabe8f05534]
[node134:10568] [ 2] /usr/lib/x86_64-linux-gnu/libopen-pal.so.20(opal_hwloc_base_filter_cpus+0x380)[0x7fabcccbab80]
[node134:10568] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_ess_pmi.so(+0x2b4e)[0x7fabcbbe6b4e]
[node134:10568] [ 4] /usr/lib/x86_64-linux-gnu/libopen-rte.so.20(orte_init+0x22e)[0x7fabccf0e1de]
[node134:10568] [ 5] /usr/lib/x86_64-linux-gnu/libmpi.so.20(ompi_mpi_init+0x30e)[0x7fabe70a027e]
[node134:10568] [ 6] /usr/lib/x86_64-linux-gnu/libmpi.so.20(MPI_Init+0x6b)[0x7fabe70c12ab]
[node134:10568] [ 7] /opt/OpenFOAM/OpenFOAM-4.1/platforms/linux64GccDPInt32Opt/lib/openmpi-system/libPstream.so(_ZN4Foam8UPstream4initERiRPPc+0x1f)[0x7fac03a0c43f]
[node134:10568] [ 8] /opt/OpenFOAM/OpenFOAM-4.1/platforms/linux64GccDPInt32Opt/lib/libOpenFOAM.so(_ZN4Foam7argListC1ERiRPPcbbb+0x719)[0x7fac04e1aed9]
[node134:10568] [ 9] tabulatedCombustionFoam(+0x279b8)[0x559e1bd079b8]
[node134:10568] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fac03c36b97]
[node134:10568] [11] tabulatedCombustionFoam(+0x30a0a)[0x559e1bd10a0a]
[node134:10568] *** End of error message ***


- Strangely, if I decompose my domain into 1 subdomain and run mpirun without the -parallel flag, it works again.
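
One way to read the backtrace above: frame 1 shows `hwloc_bitmap_and` being resolved from `libtensorflow_framework.so.1`, although its caller in frame 2 is Open MPI's `libopen-pal`. The following is a minimal Python sketch that scans such a trace for `hwloc_` symbols coming from an unexpected library. The helper names `parse_frames` and `misplaced_symbols` are hypothetical, and the idea that hwloc functions should normally come from a library whose name contains `libhwloc` is an assumption of this sketch, not something confirmed in the thread.

```python
import re

# Matches Open MPI backtrace frames of the form:
#   [host:pid] [ N] /path/to/module(symbol+0xOFFSET)[0xADDRESS]
# The symbol part may be empty, e.g. "libc.so.6(+0x3ef20)".
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+(?P<module>[^\s(]+)"
    r"\((?P<symbol>[^+)]*)\+0x[0-9a-fA-F]+\)"
)

# Three frames copied from the trace in this post.
SAMPLE_TRACE = """\
[node134:10568] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7fac03c53f20]
[node134:10568] [ 1] /home/elias/OpenFOAM/elias-4.1/platforms/linux64GccDPInt32Opt/lib/libtensorflow_framework.so.1(hwloc_bitmap_and+0x14)[0x7fabe8f05534]
[node134:10568] [ 2] /usr/lib/x86_64-linux-gnu/libopen-pal.so.20(opal_hwloc_base_filter_cpus+0x380)[0x7fabcccbab80]
"""


def parse_frames(trace):
    """Return one (index, module, symbol) tuple per backtrace frame."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append(
                (int(match["index"]), match["module"], match["symbol"])
            )
    return frames


def misplaced_symbols(frames, prefix="hwloc_", expected=("libhwloc",)):
    """Return frames whose symbol starts with `prefix` but whose module
    name contains none of the `expected` substrings.

    Assumption: `expected` lists where such symbols should normally live.
    """
    return [
        frame
        for frame in frames
        if frame[2].startswith(prefix)
        and not any(name in frame[1] for name in expected)
    ]


if __name__ == "__main__":
    for index, module, symbol in misplaced_symbols(parse_frames(SAMPLE_TRACE)):
        print(f"frame {index}: {symbol} resolved from {module}")
```

A flagged frame only indicates that two libraries may be providing the same symbol name; it does not by itself show which library is at fault.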


Obviously this is an issue concerning mpirun. During the compilation of TensorFlow with AVX from source (using Google's bazel tool) I had to choose whether I wanted MPI support. Of course I said yes, and I entered the default MPI toolkit folder: /usr


I have now read in this post (https://www.cfd-online.com/Forums/op...-parallel.html) that there might be a conflict between OpenFOAM and TensorFlow trying to use different OpenMPI versions. Can you help me fix it? I am asking here because people who do not use OpenFOAM have not been able to help me with this issue.


Edit: I just noticed that if I run ./Allwmake in opt/OpenFOAM/ThirdParty, I get:
Build MPI libraries if required

+ cd openmpi
./Allwmake: 78: cd: can't cd to openmpi
+ exit 1

wyldckat June 16, 2019 14:27

Quick answer: In principle, you're using the same Open-MPI in the system... I'm assuming that MPICH2 is not installed at "/usr", given that mpirun gives you Open-MPI by default.

Building with another/custom Open-MPI version will not solve the issue.

Since you are using OpenFOAM 4.1, it looks like you tripped over this bug: https://bugs.openfoam.org/view.php?id=2815

This bug fix is included in OpenFOAM 5, but not in 4.x. You have two choices:
  1. Upgrade to OpenFOAM 5 or 6.
  2. Or manually apply these changes that fix the bug: https://github.com/OpenFOAM/OpenFOAM...e1546a51ecc090

tre95 June 17, 2019 06:36

Hello, thank you very much for your support!


Fortunately I did not have to make any changes to OpenFOAM (an upgrade would not have been possible, as 4.1 is the version used at the institute I work at), because the error was in TensorFlow. The issue is solved here:


https://github.com/tensorflow/tensorflow/issues/29838


The issue should no longer occur, as it has already been fixed in TensorFlow and the changes have been merged into TensorFlow's master branch.
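
To check whether a particular `libtensorflow_framework.so` build is affected, one option is to list its exported dynamic symbols and look for names starting with `hwloc_`, since the backtrace in the first post shows `hwloc_bitmap_and` being resolved from that library. Below is a minimal Python sketch, assuming GNU `nm` is installed. The helper names are hypothetical, the library path is passed on the command line, and the idea that the linked TensorFlow fix removes these exported symbols is an assumption rather than something stated in this thread.

```python
import subprocess
import sys


def dynamic_symbols(library_path):
    """Return the text output of `nm -D --defined-only` for a shared library."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", library_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def exported_symbols(nm_output, prefix):
    """Pick symbol names starting with `prefix` out of nm's output.

    Defined symbols are printed as three columns: address, type, name.
    """
    names = []
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2].startswith(prefix):
            names.append(parts[2])
    return names


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python check_symbols.py /path/to/libtensorflow_framework.so.1
    found = exported_symbols(dynamic_symbols(sys.argv[1]), "hwloc_")
    if found:
        print(f"{len(found)} hwloc_ symbols exported, e.g. {found[0]}")
    else:
        print("no hwloc_ symbols exported")
```

If the library exports no `hwloc_` symbols, the MPI start-up should resolve hwloc calls from the system libraries again; if it still does, rebuilding from a TensorFlow revision that contains the fix would be the next thing to try.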


All times are GMT -4.