mardar572 July 4, 2023 11:44

shape_optimization.py - Inconsistent MPI Errors on HPC Nodes
 
Hi everyone,

I'm currently trying to run shape_optimization.py on the 3D inviscid ONERA M6 tutorial with the discrete adjoint. On my local machine the optimization runs smoothly without any issues, but when I run it on the nodes of our HPC cluster I get occasional errors. The failure seems to occur randomly: sometimes during the DEFORM step, other times during the ADJOINT or DIRECT steps. I'm using SU2 v7.4.0 and Open MPI 4.1.4.

Since the error is intermittent, it looks MPI-related, and I'm hoping for some insight into what could cause this behavior on the HPC nodes. Has anyone experienced a similar issue, or any ideas about what could be causing it?
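
For what it's worth, one thing I can think of is a plain MPI smoke test with the same launcher and MCA flags as in my job file below, but with a trivial binary instead of SU2 (just a sketch; hostname stands in for any executable):

# Same mpirun and MCA flags as the SU2 job, trivial binary;
# if this also crashes across nodes, the MPI stack/fabric is
# the likely culprit rather than SU2 itself.
/mypath/apps/ompi414/bin/mpirun --mca mtl ^ofi \
    --mca btl_openib_allow_ib 1 --mca btl vader,self,openib \
    -n 30 hostname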

Thanks in advance for your help!

my job file:

#!/bin/bash
#$ -S /bin/bash   # SGE: run the job under bash
#$ -V             # SGE: export the submission environment to the job
#$ -cwd           # SGE: start in the submission directory
#$ -j y           # SGE: merge stderr into stdout
# Template used by the SU2 Python scripts to launch parallel runs;
# %d receives the partition count, %s the solver command.
export SU2_MPI_COMMAND="/mypath/apps/ompi414/bin/mpirun --mca mtl ^ofi --mca btl_openib_allow_ib 1 --mca btl vader,self,openib -n %d %s"
/mypath/apps/anaconda3/bin/python /mypath/apps/su740/bin/shape_optimization.py -n 30 -g DISCRETE_ADJOINT -f inv_ONERAM6_adv.cfg
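
To be explicit about how that template is used: as far as I can tell, SU2's run scripts (SU2/run/interface.py) substitute the partition count and the solver command into SU2_MPI_COMMAND with Python %-formatting. A bash printf with the same format string shows the expansion (illustration only; it matches the Command printed in the error below):

# Expand the template the way interface.py does, with 30 processes
# and the SU2_SOL command:
printf "$SU2_MPI_COMMAND\n" 30 "SU2_SOL config_SOL.cfg"
# -> .../mpirun --mca mtl ^ofi ... -n 30 SU2_SOL config_SOL.cfg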


ERROR:

File "/mypath/apps/su740/bin/SU2/run/interface.py", line 208, in SOL
run_command( the_Command )
File "/mypath/apps/su740/bin/SU2/run/interface.py", line 271, in run_command
raise exception(message)
RuntimeError: Path = /mypath/opt_try/try0/DESIGNS/DSN_001/ADJOINT_DRAG/,
Command = /mypath/apps/ompi414/bin/mpirun --mca mtl ^ofi --mca btl_openib_allow_ib 1 --mca btl vader,self,openib -n 30 /mypath/apps/su740/bin/SU2_SOL config_SOL.cfg
SU2 process returned error '139'
[compute-5-2:24097] *** Process received signal ***
[compute-5-2:24097] Signal: Segmentation fault (11)
[compute-5-2:24097] Signal code: Address not mapped (1)
[compute-5-2:24097] Failing at address: 0x2b6466853770
[compute-5-2:24118] *** Process received signal ***
[compute-5-2:24118] Signal: Segmentation fault (11)
[compute-5-2:24118] Signal code: Address not mapped (1)
[compute-5-2:24118] Failing at address: 0x2ad7a4e50770
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 13 with PID 24118 on node compute-5-2 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
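
In case it helps anyone dig in: error '139' is just 128 + 11, i.e. the segmentation fault shown above, and the RuntimeError prints the exact path and command, so the failure should be reproducible outside the optimization loop with something like:

# Rerun the exact failing command in the failing design directory:
cd /mypath/opt_try/try0/DESIGNS/DSN_001/ADJOINT_DRAG/
/mypath/apps/ompi414/bin/mpirun --mca mtl ^ofi \
    --mca btl_openib_allow_ib 1 --mca btl vader,self,openib \
    -n 30 /mypath/apps/su740/bin/SU2_SOL config_SOL.cfg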

