CFD Online Discussion Forums

CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   CFX (https://www.cfd-online.com/Forums/cfx/)
-   -   Error while running cfx in parallel configuration (https://www.cfd-online.com/Forums/cfx/161036-error-while-running-cfx-parallel-configuration.html)

stater October 17, 2015 12:45

Error while running cfx in parallel configuration
 
Hi everyone,

I am currently trying to run CFX (v16) in parallel configuration using the SLURM manager. I used the runCFX.sh script, which is as follows:

Quote:

#!/bin/bash
srun hostname -s > /tmp/hosts.$SLURM_JOB_ID
if [ "x$SLURM_NPROCS" = "x" ]; then
    if [ "x$SLURM_NTASKS_PER_NODE" = "x" ]; then
        SLURM_NTASKS_PER_NODE=1
    fi
    SLURM_NPROCS=`expr $SLURM_JOB_NUM_NODES \* $SLURM_NTASKS_PER_NODE`
fi
# use ssh instead of rsh
export CFX5RSH=ssh
# format the host list for cfx
cfxHosts=`tr '\n' ',' < /tmp/hosts.$SLURM_JOB_ID`
# run the partitioner and solver
/usr/ansys_inc/v160/CFX/bin/cfx5solve -par -par-dist "$cfxHosts" -def ./AADL2.def -part $SLURM_NPROCS -start-method "Platform MPI Distributed Parallel"
# cleanup
rm /tmp/hosts.$SLURM_JOB_ID
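[A side note on the host-list formatting step in the script above: `tr '\n' ','` leaves a trailing comma in the list handed to -par-dist. It is not what caused the error in this thread, but if it ever does cause trouble, `paste -sd,` builds the same list without it. The hostnames below are made up for illustration:]

```shell
# Stand-in for the "srun hostname -s" output: one short hostname per line.
printf 'node01\nnode02\n' > hosts.demo

# The script's formatting step: newlines become commas (note the trailing comma).
cfxHosts=$(tr '\n' ',' < hosts.demo)
echo "$cfxHosts"          # node01,node02,

# Alternative that joins lines with commas and no trailing separator.
cfxHostsClean=$(paste -sd, hosts.demo)
echo "$cfxHostsClean"     # node01,node02

rm hosts.demo
```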
I submitted the job with the following command line

Quote:

sbatch -n 10 -N 2 -p mypartition -t 10 ./runCFX.sh
I obtained the following error in the SLURM output file:

Quote:

<IBM Platform MPI>: : warning, dlopen of libhwloc.so failed (null)/lib/linux_amd64/libhwloc.so: cannot open shared object file: No such file or directory
An error has occurred in cfx5solve:

The ANSYS CFX partitioner was interrupted by signal SEGV (11)
Can anyone help me to solve this issue? Thank you

Lance October 19, 2015 01:59

We have had problems getting Platform MPI Distributed Parallel to run with SLURM, and got exactly the same error as you. Ansys support won't help since they don't support it...
If I remember correctly it was solved by either using Intel MPI Distributed Parallel and/or unsetting the SLURM_GTIDS environment variable.
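[Putting Lance's fix together with the script from the first post, the workaround amounts to dropping the variable before cfx5solve starts. A sketch based on stater's script above; the install path, .def file, and start method are taken from that post:]

```shell
#!/bin/bash
# Workaround from this thread: the CFX partitioner segfaults under Platform MPI
# when SLURM_GTIDS is set, so drop it before launching the solver.
unset SLURM_GTIDS

export CFX5RSH=ssh
srun hostname -s > /tmp/hosts.$SLURM_JOB_ID
cfxHosts=`tr '\n' ',' < /tmp/hosts.$SLURM_JOB_ID`
/usr/ansys_inc/v160/CFX/bin/cfx5solve -par -par-dist "$cfxHosts" \
    -def ./AADL2.def -part $SLURM_NPROCS \
    -start-method "Platform MPI Distributed Parallel"
rm /tmp/hosts.$SLURM_JOB_ID
```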

stater October 19, 2015 12:59

Hello sir,

I confirm: unsetting the SLURM_GTIDS variable solved the problem.
Thank you very much.

Светлана March 18, 2016 00:02

Thank you, Lance,

We added 'unset SLURM_GTIDS' to the job script and the job runs now.

Now instead of 2 we have 5 people on the planet who know of this workaround. :)

EvanOscarSmith March 30, 2016 22:17

Thanks so much!

I was having this same issue with CFX 17.0 and the unset SLURM_GTIDS command has fixed it.

hmp May 3, 2016 18:51

Thank you for this info.

In case someone is trying to run Abaqus in parallel on one of the XSEDE resources (e.g. SDSC Comet), adding 'unset SLURM_GTIDS' before the ABQ command will get rid of the errors about MPI.

Make sure you have parallel_mode=MPI in the ABQ command.

Best,

k.vafiadis June 20, 2016 11:53

unset SLURM_GTIDS worked for me too :D


Quote:

Originally Posted by Светлана (Post 590361)
Thank you, Lance,

We added 'unset SLURM_GTIDS' to the job script and the job runs now.

Now instead of 2 we have 5 people on the planet who know of this workaround. :)


Neys Schreiner February 27, 2017 07:08

Also worked here, added:
unset SLURM_GTIDS
to my job script below the usual module load commands, and voilà.

Thanks!

alaspina May 11, 2017 19:01

Anyone know how to do this for an interactive session?

Светлана May 11, 2017 23:56

Alex, do you mean a CFX session on a GNU/Linux computer? Add the "unset SLURM_GTIDS" line to your bashrc or bash profile.
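[For an interactive session the same unset works at the shell prompt, or from ~/.bashrc as suggested above. A minimal demonstration of what the workaround does; the value 0 is just a stand-in, since real jobs get a task-ID list from SLURM:]

```shell
# Simulate a shell that inherited the variable from SLURM (stand-in value).
export SLURM_GTIDS=0

# The workaround from this thread: remove it before launching cfx5solve.
unset SLURM_GTIDS

# Confirm the variable is really gone, not just empty:
# ${SLURM_GTIDS+set} expands to "set" only if the variable is still defined.
if [ -z "${SLURM_GTIDS+set}" ]; then
    echo "SLURM_GTIDS is unset"
fi
```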
