CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   SU2 (https://www.cfd-online.com/Forums/su2/)
-   -   Error while calling MPI_Barrier (https://www.cfd-online.com/Forums/su2/133126-error-while-calling-mpi_barrier.html)

CrashLaker April 10, 2014 11:04

Error while calling MPI_Barrier
 
Hello guys!

I've managed to successfully run on 1 node using up to 2 processes.
Quote:

parallel_computation.py -f inv_ONERAM6.cfg -p 2
the command: mpirun -np 2 /opt/su2v2/bin/SU2_DDC config_DDC.cfg
But when I try to scale it to more than 1 node, it fails with an MPI_Barrier error. Can you help?

Quote:

parallel_computation.py -f inv_ONERAM6.cfg -p 12
the command: mpirun -np 12 -machinefile hosts /opt/su2v2/bin/SU2_DDC config_DDC.cfg
Error:
Code:

---------------------- Read grid file information -----------------------
Three dimensional problem.
582752 interior elements.
Traceback (most recent call last):
  File "/opt/su2v2/bin/parallel_computation.py", line 113, in <module>
    main()
  File "/opt/su2v2/bin/parallel_computation.py", line 58, in main
    options.divide_grid  )
  File "/opt/su2v2/bin/parallel_computation.py", line 81, in parallel_computation
    info = SU2.run.decompose(config)
  File "/opt/su2v2/bin/SU2/run/decompose.py", line 66, in decompose
    SU2_DDC(konfig)
  File "/opt/su2v2/bin/SU2/run/interface.py", line 73, in DDC
    run_command( the_Command )
  File "/opt/su2v2/bin/SU2/run/interface.py", line 279, in run_command
    raise Exception , message
Exception: Path = /root/oneram6v2/,
Command = mpirun -np 12 -machinefile hosts /opt/su2v2/bin/SU2_DDC config_DDC.cfg
SU2 process returned error '1'
Fatal error in PMPI_Barrier: A process has failed, error stack:
PMPI_Barrier(428)...............: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..........: Failure during collective
MPIR_Barrier_impl(328)..........:
MPIR_Barrier(292)...............:
MPIR_Barrier_intra(149).........:
barrier_smp_intra(94)...........:
MPIR_Barrier_impl(335)..........: Failure during collective
MPIR_Barrier_impl(328)..........:
MPIR_Barrier(292)...............:
MPIR_Barrier_intra(169).........:
MPIDI_CH3U_Recvq_FDU_or_AEP(630): Communication error with rank 0
barrier_smp_intra(109)..........:
MPIR_Bcast_impl(1458)...........:
MPIR_Bcast(1482)................:
MPIR_Bcast_intra(1291)..........:
MPIR_Bcast_binomial(309)........: Failure during collective
Fatal error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
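
For reference, the hosts file passed via -machinefile above is just a plain-text list of the nodes to run on. A minimal sketch only, using the node names that appear later in this thread (puma4, puma43) purely as placeholders and the slots= syntax of OpenMPI hostfiles; the real file may look different:
Code:

# hosts -- one node per line; slots= sets the number of ranks per node (OpenMPI hostfile syntax)
puma4 slots=6
puma43 slots=6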


economon April 10, 2014 15:55

Hi,

Can you please share the compiler/MPI type and versions that you are using?

Also, can you try to call SU2_DDC as a stand-alone module with something like: mpirun -np 8 SU2_DDC inv_ONERAM6.cfg to verify whether there is a problem on the Python side or on the C++ side?

T

CrashLaker April 10, 2014 16:02

Quote:

Originally Posted by economon (Post 485316)
Hi,

Can you please share the compiler/MPI type and versions that you are using?

Also, can you try to call SU2_DDC as a stand-alone module with something like: mpirun -np 8 SU2_DDC inv_ONERAM6.cfg to verify whether there is a problem on the Python side or on the C++ side?

T

Hello Economon!

I'm using Metis 5.0.2 along with OpenMPI 1.4.

I get the same error calling mpirun directly. So it's a C++ problem.
Quote:

------------------------ Divide the numerical grid ----------------------
Domain 1: 108396 points (0 ghost points). Comm buff: 21.98MB of 50.00MB.
[puma4:01831] *** Process received signal ***
[puma4:01831] Signal: Segmentation fault (11)
[puma4:01831] Signal code: Address not mapped (1)
[puma4:01831] Failing at address: 0x2ddb8b20
Domain 2: 0 points (0 ghost points). Comm buff: 0.00MB of 50.00MB.
[puma4:01831] [ 0] /lib64/libpthread.so.0 [0x345400eca0]
[puma4:01831] [ 1] /scratch/ramos/su2mpi/bin/SU2_DDC(_ZN15CDomainGeometryC1EP9CGeometryP7CConfig+0xb35) [0x50df95]
[puma4:01831] [ 2] /scratch/ramos/su2mpi/bin/SU2_DDC(main+0x2d4) [0x44ce14]
[puma4:01831] [ 3] /lib64/libc.so.6(__libc_start_main+0xf4) [0x345341d9c4]
[puma4:01831] [ 4] /scratch/ramos/su2mpi/bin/SU2_DDC(_ZNSt8ios_base4InitD1Ev+0x39) [0x44ca89]
[puma4:01831] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 1831 on node puma4 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
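
The mangled name in frame [ 1] of the backtrace above already points at the crash site; it can be decoded with c++filt. A quick sketch, assuming the symbol is copied exactly as printed:
Code:

$ c++filt _ZN15CDomainGeometryC1EP9CGeometryP7CConfig
CDomainGeometry::CDomainGeometry(CGeometry*, CConfig*)

So the segmentation fault occurs inside the CDomainGeometry constructor, i.e. during the "Divide the numerical grid" step.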

economon April 10, 2014 17:17

Quote:

Originally Posted by CrashLaker (Post 485317)
Hello Economon!

I'm using Metis 5.0.2 along with OpenMPI 1.4.

I get the same error calling mpirun directly. So it's a C++ problem.

Hmmm.. I have mostly been working with OpenMPI 1.6. Do you have the ability to upgrade your OpenMPI to a newer version and try again?
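
Before (or in addition to) upgrading, it may be worth verifying that every node in the hosts file resolves the same mpirun and MPI libraries, since a mixed MPI stack across nodes can also break collectives like MPI_Barrier. A rough sketch, assuming passwordless ssh to the nodes and using the hostnames from the logs as placeholders:
Code:

# check which MPI each node actually picks up (hostnames are examples)
for h in puma4 puma43; do
    echo "== $h =="
    ssh $h 'which mpirun; mpirun --version 2>&1 | head -1'
done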

CrashLaker April 10, 2014 21:32

Quote:

Originally Posted by economon (Post 485323)
Hmmm.. I have mostly been working with OpenMPI 1.6. Do you have the ability to upgrade your OpenMPI to a newer version and try again?

Hello Economon.

Same thing :(

Quote:

Command = /scratch/programas/intel/ompi-1.6-intel/bin/mpirun -np 4 -machinefile hosts /scratch/ramos/su2v4/bin/SU2_DDC config_DDC.cfg
SU2 process returned error '139'
[puma43:16153] *** Process received signal ***
[puma43:16153] Signal: Segmentation fault (11)
[puma43:16153] Signal code: Address not mapped (1)
[puma43:16153] Failing at address: 0x194158a0
[puma43:16153] [ 0] /lib64/libpthread.so.0 [0x32dce0eca0]
[puma43:16153] [ 1] /scratch/ramos/su2v4/bin/SU2_DDC(_ZN15CDomainGeometryC1EP9CGeometryP7CConfig+0xb35) [0x53b5c5]
[puma43:16153] [ 2] /scratch/ramos/su2v4/bin/SU2_DDC(main+0x2d4) [0x4798b4]
[puma43:16153] [ 3] /lib64/libc.so.6(__libc_start_main+0xf4) [0x32dc21d9c4]
[puma43:16153] [ 4] /scratch/ramos/su2v4/bin/SU2_DDC(_ZNSt8ios_base4InitD1Ev+0x41) [0x479529]
[puma43:16153] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 16153 on node puma43 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
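
Since the same segmentation fault shows up with OpenMPI 1.6, a full backtrace from a debugger would help narrow down which line of the CDomainGeometry constructor is failing. One possible approach, sketched with the paths quoted above (core dumps have to be enabled on the node where the crashing rank runs, and the core file name may differ):
Code:

# allow core dumps, rerun the failing command, then load the core into gdb
ulimit -c unlimited
/scratch/programas/intel/ompi-1.6-intel/bin/mpirun -np 4 -machinefile hosts \
    /scratch/ramos/su2v4/bin/SU2_DDC config_DDC.cfg
gdb /scratch/ramos/su2v4/bin/SU2_DDC core   # then type "bt" at the gdb prompt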

economon April 10, 2014 21:36

Hmm.. can you share the options that you sent to the configure process? And does this happen for every test case that you run in parallel?

CrashLaker April 10, 2014 21:38

Quote:

Originally Posted by economon (Post 485350)
Hmm.. can you share the options that you sent to the configure process? And does this happen for every test case that you run in parallel?

This is my configure command:
Quote:

./configure --prefix="/scratch/ramos/su2v4" --with-Metis-lib="/scratch/ramos/metis5.0.2/lib" --with-Metis-include="/scratch/ramos/metis5.0.2/include" --with-Metis-version=5 --with-MPI="/scratch/programas/intel/ompi1.6-intel/bin/mpicxx"
This is actually only the 2nd tutorial :( It's the first one that teaches how to run in parallel.
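
One thing worth double-checking in the configure line above is that the mpicxx used to build SU2 and the mpirun used to launch it come from the same OpenMPI installation, and that the code is fully rebuilt after switching MPI versions. A rough sketch of a clean rebuild, reusing the paths quoted in this thread (adjust to the actual install locations):
Code:

# put the OpenMPI 1.6 wrappers first in PATH, then rebuild from a clean tree
export PATH=/scratch/programas/intel/ompi-1.6-intel/bin:$PATH
make distclean          # or delete the build tree if distclean is not available
./configure --prefix="/scratch/ramos/su2v4" \
    --with-Metis-lib="/scratch/ramos/metis5.0.2/lib" \
    --with-Metis-include="/scratch/ramos/metis5.0.2/include" \
    --with-Metis-version=5 \
    --with-MPI="/scratch/programas/intel/ompi-1.6-intel/bin/mpicxx"
make && make install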

