CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   SU2 (https://www.cfd-online.com/Forums/su2/)
-   -   MPI error when trying to run su2 in parallel (https://www.cfd-online.com/Forums/su2/216923-mpi-error-when-trying-run-su2-parallel.html)

Gui_AP April 24, 2019 11:06

MPI error when trying to run su2 in parallel
 
Hello guys!



I need some help... I'm trying to run a case in parallel with the command below:


mpirun -n 3 SU2_CFD AhmedBody.cfg


But every time I do, I get this error:

--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node aerofleet-System-Product-Name exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I don't know why this error occurs... I have 16 GB of RAM and a 50 GB swap partition, and I'm using Open MPI. I can run this case in serial, but the calculation takes too long.



Can someone help me with this one? I'd appreciate it a lot.


Best wishes, thank you.

pcg May 4, 2019 15:39

Hi Guilherme,
Do you get any other output before that?
Assuming that this is the case from the TestCases, I can run it fine.
Cheers,
Pedro

Gui_AP May 9, 2019 10:44

Hello Pedro!

No, I didn't get any other output before.

pcg May 10, 2019 05:22

If the code does not start at all, I suspect a compilation issue; the typical one is using different MPI versions to compile and run the code.
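A quick way to check for that mismatch (a sketch, assuming Open MPI on Linux and that SU2_CFD is on the PATH; the grep pattern is just an example):

which mpirun                          # launcher the shell actually finds
which mpicxx                          # compiler wrapper the shell actually finds
mpirun --version                      # Open MPI version used to run
ldd $(which SU2_CFD) | grep -i mpi    # MPI library the binary is linked against

If the library paths reported by ldd do not match the mpirun you launch with, the code was compiled and run against different MPI installations.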

Gui_AP May 15, 2019 09:13

Hello Pedro!
I uninstalled and installed everything again and it seems to be working now. But now I'm facing another problem:
SU2 is running slower in parallel than in serial; I mean, the more cores I put in the command line (mpirun -n "x"), the slower it gets... Do you have any idea what it may be? Thank you ^^

pcg May 16, 2019 05:28

Hi Guilherme,

If the output is also scrambled, like iteration X being printed multiple times, that means you are launching multiple serial instances instead of a parallel run, which again happens when the MPI version used to run the code is not the same one used to compile it.

If that is not the case, please describe in detail the steps you are following to compile and run the code.
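A simple way to confirm whether mpirun is really starting one parallel job or several serial ones (a sketch, assuming mpi4py is installed against the same Open MPI; this only exercises the Python bindings, not SU2 itself):

mpirun -n 4 python -c "from mpi4py import MPI; c = MPI.COMM_WORLD; print(c.Get_rank(), 'of', c.Get_size())"

A correct parallel launch prints ranks 0 through 3 of 4; four copies of "0 of 1" means four independent serial processes, i.e. the launcher and the MPI library disagree.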

Gui_AP May 16, 2019 08:29

Hello Pedro, thank you for helping me.


I faced that problem before (iterations being printed multiple times), but it's not happening anymore.



I first installed mpi4py and then I installed Open MPI.



Then I compiled SU2 with the following command:



"./configure --prefix=/$HOME/SU2 CXXFLAGS="-O3" --enable-mpi --with-cc=/$HOME/OpenMpi/bin/mpicc --with-cxx=/$HOME/OpenMpi/bin/mpicxx"


Then I did: "sudo make -j 8 install"


I didn't get any error output and the installation seems to have been successful.


So I ran the quickstart case with "mpirun -n 4 SU2_CFD inv_NACA0012.cfg".


However, running in parallel seems slower than running in serial, and a curious fact I noticed is that, when the computation ends, it doesn't give me the output "the calculation finished in 'n' cores!" or something like that.


If needed, I can paste the outputs here. Thank you again! ^^

Sanghera May 16, 2019 09:02

Hi Pedro,

I am also getting an error message when I try to run in parallel saying:

'mpirun noticed that process rank 2 with PID 0 on node node-xxx exited on signal 11 (Segmentation fault).'

When I run in serial I get the error:

'SU2_CFD:xxxxxx terminated with signal 11 at PC=xxxxxx SP=xxxxxxx'

I get both of these errors only for a particular case. When I try to run the Quickstart simulation, it works both in serial and in parallel. This seems quite weird to me. I read on another thread that it may have to do with the RAM/swap memory?
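If memory really is the suspect, one simple check (a sketch; the config file name is a placeholder for your failing case) is to watch RAM and swap usage while the solver initializes:

free -h                                  # RAM and swap available before the run
mpirun -n 2 SU2_CFD your_case.cfg &      # start the failing case in the background
top                                      # watch the SU2_CFD processes' memory usage

If the processes crash long before memory fills up, the problem is more likely the case setup or the code than an out-of-memory condition.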

pcg May 20, 2019 06:17

Hi Guilherme,

Type "which mpirun" on a terminal and see if the executable that the OS finds is inside $HOME/OpenMpi/bin.
Alternatively try running $HOME/OpenMpi/bin/mpirun -n 4 SU2_CFD inv_NACA0012.cfg
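If "which mpirun" points to /usr instead, one option (a sketch, assuming bash and the $HOME/OpenMpi installation used at compile time) is to put that installation first on the PATH before launching:

export PATH=$HOME/OpenMpi/bin:$PATH
export LD_LIBRARY_PATH=$HOME/OpenMpi/lib:$LD_LIBRARY_PATH
which mpirun    # should now report $HOME/OpenMpi/bin/mpirun
mpirun -n 4 SU2_CFD inv_NACA0012.cfg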

Hi Bhupinder,

Signal 11 is a segmentation fault; it may be a problem with the mesh, the combination of settings you are trying to use not being valid, or the code itself. Try starting from something you know works (like the quickstart) and go from there.

Gui_AP May 31, 2019 07:46

Hello Pedro, sorry for my very late response... I was focused on another project...


I typed "which mpirun" and I got "/usr/bin/mpirun"


When I tried "$HOME/OpenMpi/bin/mpirun -n 4 SU2_CFD inv_NACA0012.cfg" I got many errors:


It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
[the same MPI_INIT failure block is printed three more times, once for each of the remaining ranks]
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[aerofleet-System-Product-Name:30145] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[aerofleet-System-Product-Name:30146] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[aerofleet-System-Product-Name:30147] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[aerofleet-System-Product-Name:30144] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[40496,1],3]
Exit code: 1
--------------------------------------------------------------------------

What does it mean?


Thank you again...

pcg May 31, 2019 19:36

You seem to have 2 MPI versions installed... One in a system location (/usr), the other in your home folder.
Try compiling with:
"./configure --prefix=/$HOME/SU2 CXXFLAGS="-O3" --enable-mpi --with-cc=/usr/bin/mpicc --with-cxx=/usr/bin/mpicxx"
And running with:
"mpirun -n ..."
If that does not work I have no idea.
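A minimal sketch of that rebuild, assuming the SU2 source tree and the system Open MPI in /usr/bin (the source directory name is just an example):

cd $HOME/SU2-source     # wherever the SU2 sources were unpacked
make clean              # drop objects built against the old MPI
./configure --prefix=/$HOME/SU2 CXXFLAGS="-O3" --enable-mpi --with-cc=/usr/bin/mpicc --with-cxx=/usr/bin/mpicxx
make -j 8 install
which mpirun            # should report /usr/bin/mpirun before launching in parallel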

nitish_anand June 3, 2019 03:38

Quote:

Originally Posted by Sanghera (Post 733817)
Hi Pedro,

I am also getting an error message when I try to run in parallel saying:

'mpirun noticed that process rank 2 with PID 0 on node node-xxx exited on signal 11 (Segmentation fault).'

When I run in serial I get the error:

'SU2_CFD:xxxxxx terminated with signal 11 at PC=xxxxxx SP=xxxxxxx'

I get both of these errors only for a particular case. When I try to run the Quickstart simulation, it works both in serial and parallel. This seems quite weird to me. I read on another thread that it maybe has to do with the RAM/swap memory?

Hey Bhupinder,

Have you made changes to the code? If you try running with gdb, you should get the exact function where the issue is.
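A minimal sketch of that, starting with the failing case in serial (the config file name is a placeholder):

gdb --args SU2_CFD your_case.cfg
(gdb) run
# ... wait for the segmentation fault ...
(gdb) backtrace

For a parallel run with Open MPI, one common option is a separate debugger per rank, e.g. "mpirun -n 2 xterm -e gdb --args SU2_CFD your_case.cfg", but the serial backtrace is usually enough to pinpoint the offending function.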

