CFD Online Discussion Forums

CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   CFX (https://www.cfd-online.com/Forums/cfx/)
-   -   Error while running cfx in parallel configuration (https://www.cfd-online.com/Forums/cfx/161036-error-while-running-cfx-parallel-configuration.html)

stater October 17, 2015 12:45

Error while running cfx in parallel configuration
 
Hi everyone,

I am currently trying to run CFX (v16) in parallel configuration using the SLURM manager. I used the runCFX.sh script, which is as follows:

Quote:

#!/bin/bash
srun hostname -s > /tmp/hosts.$SLURM_JOB_ID
if [ "x$SLURM_NPROCS" = "x" ]; then
    if [ "x$SLURM_NTASKS_PER_NODE" = "x" ]; then
        SLURM_NTASKS_PER_NODE=1
    fi
    SLURM_NPROCS=`expr $SLURM_JOB_NUM_NODES \* $SLURM_NTASKS_PER_NODE`
fi
# use ssh instead of rsh
export CFX5RSH=ssh
# format the host list for cfx
cfxHosts=`tr '\n' ',' < /tmp/hosts.$SLURM_JOB_ID`
# run the partitioner and solver
/usr/ansys_inc/v160/CFX/bin/cfx5solve -par -par-dist "$cfxHosts" -def ./AADL2.def -part $SLURM_NPROCS -start-method "Platform MPI Distributed Parallel"
# cleanup
rm /tmp/hosts.$SLURM_JOB_ID
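[A side note on the host-list formatting step in the script above: `tr '\n' ','` leaves a trailing comma in the list handed to -par-dist. It is not what caused the error in this thread, but if it ever does cause trouble, `paste -sd,` builds the same list without it. The hostnames below are made up for illustration:]

```shell
# Stand-in for the "srun hostname -s" output: one short hostname per line.
printf 'node01\nnode02\n' > hosts.demo

# The script's formatting step: newlines become commas (note the trailing comma).
cfxHosts=$(tr '\n' ',' < hosts.demo)
echo "$cfxHosts"          # node01,node02,

# Alternative that joins lines with commas and no trailing separator.
cfxHostsClean=$(paste -sd, hosts.demo)
echo "$cfxHostsClean"     # node01,node02

rm hosts.demo
```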
I submitted the job with the following command line

Quote:

sbatch -n 10 -N 2 -p mypartition -t 10 ./runCFX.sh
I obtained the following error in the SLURM output file:

Quote:

<IBM Platform MPI>: : warning, dlopen of libhwloc.so failed (null)/lib/linux_amd64/libhwloc.so: cannot open shared object file: No such file or directory
An error has occurred in cfx5solve:

The ANSYS CFX partitioner was interrupted by signal SEGV (11)
Can anyone help me to solve this issue? Thank you

Lance October 19, 2015 01:59

We have had problems getting Platform MPI Distributed Parallel to run with SLURM, and got exactly the same error as you. Ansys support won't help since they don't support it...
If I remember correctly it was solved by either using Intel MPI Distributed Parallel and/or unsetting the SLURM_GTIDS environment variable.
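[Putting Lance's fix together with the script from the first post, the workaround amounts to dropping the variable before cfx5solve starts. A sketch based on stater's script above; the install path, .def file, and start method are taken from that post:]

```shell
#!/bin/bash
# Workaround from this thread: the CFX partitioner segfaults under Platform MPI
# when SLURM_GTIDS is set, so drop it before launching the solver.
unset SLURM_GTIDS

export CFX5RSH=ssh
srun hostname -s > /tmp/hosts.$SLURM_JOB_ID
cfxHosts=`tr '\n' ',' < /tmp/hosts.$SLURM_JOB_ID`
/usr/ansys_inc/v160/CFX/bin/cfx5solve -par -par-dist "$cfxHosts" \
    -def ./AADL2.def -part $SLURM_NPROCS \
    -start-method "Platform MPI Distributed Parallel"
rm /tmp/hosts.$SLURM_JOB_ID
```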

stater October 19, 2015 12:59

Hello sir,

I confirm: unsetting the SLURM_GTIDS variable solved the problem.
Thank you very much.

Светлана March 18, 2016 00:02

Thank you, Lance,

We added 'unset SLURM_GTIDS' to the job script and the job runs now.

Now instead of 2 we have 5 people on the planet who know of this workaround. :)

EvanOscarSmith March 30, 2016 22:17

Thanks so much!

I was having this same issue with CFX 17.0 and the unset SLURM_GTIDS command has fixed it.

hmp May 3, 2016 18:51

Thank you for this info.

In case someone is trying to run Abaqus in parallel on one of the XSEDE resources (e.g. SDSC Comet), adding 'unset SLURM_GTIDS' before the ABQ command will get rid of the errors about MPI.

Make sure you have parallel_mode=MPI in the ABQ command.

Best,

k.vafiadis June 20, 2016 11:53

unset SLURM_GTIDS worked for me too :D


Quote:

Originally Posted by Светлана (Post 590361)
Thank you, Lance,

We added 'unset SLURM_GTIDS' to the job script and the job runs now.

Now instead of 2 we have 5 people on the planet who know of this workaround. :)


Neys Schreiner February 27, 2017 07:08

Also worked here, added:
unset SLURM_GTIDS
to my job script below the usual module load commands, and voilà.

Thanks!

alaspina May 11, 2017 19:01

Anyone know how to do this for an interactive session?

Светлана May 11, 2017 23:56

Alex, do you mean a CFX session on a GNU/Linux computer? Add the "unset SLURM_GTIDS" line to your bashrc or bash profile.
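[For an interactive session the same unset works at the shell prompt, or from ~/.bashrc as suggested above. A minimal demonstration of what the workaround does; the value 0 is just a stand-in, since real jobs get a task-ID list from SLURM:]

```shell
# Simulate a shell that inherited the variable from SLURM (stand-in value).
export SLURM_GTIDS=0

# The workaround from this thread: remove it before launching cfx5solve.
unset SLURM_GTIDS

# Confirm the variable is really gone, not just empty:
# ${SLURM_GTIDS+set} expands to "set" only if the variable is still defined.
if [ -z "${SLURM_GTIDS+set}" ]; then
    echo "SLURM_GTIDS is unset"
fi
```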
