CFD Online Discussion Forums


man@hepia November 1, 2012 09:11

all processes end up in the same node when submitting parallel job by SGE
 
Dear all,
Has anyone seen this kind of problem?
Background: OpenFOAM-2.1.x compiled with "Allwmake", Grid Engine (GE) 6.2u3, Scientific Linux (SL) release 5.5,
on a cluster of 224 cores spread over 20-something nodes.

Run directly from the command line, the following distributes the processes nicely across many nodes:
mpirun -np 64 --machinefile machines simpleFoam -parallel
Slaves :
63
(
"node015.374"
"node016.2178"
..

But submitting the same task through SGE leads to a situation where _all_ the processes end up on a single node.

mpirun -np $NSLOTS --machinefile $TMPDIR/machines simpleFoam -parallel

The "machines" file generated by SGE lists different nodes (node015, node016, ...), but simpleFoam always starts all the processes on a single node.
Is there something I should check? mpirun is the one from the ThirdParty package.
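
One way to narrow this down is to compare the hosts SGE granted with the hosts the ranks actually start on. The following is only a minimal sketch: the helper name summarize_hosts is made up for illustration, and it assumes one hostname per input line, as in the "machines" file described above.

```shell
#!/bin/bash
# Hypothetical helper (not part of SGE or Open MPI):
# count how often each hostname occurs on stdin and print "host count".
summarize_hosts() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Possible use inside the SGE job script (assumes $TMPDIR/machines
# holds one hostname per line):
#   summarize_hosts < "$TMPDIR/machines"
#   mpirun -np "$NSLOTS" --machinefile "$TMPDIR/machines" hostname | summarize_hosts
```

If the first listing shows several nodes and the second shows only one, the allocation itself is fine and the problem is more likely in how mpirun launches the remote processes.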

axpl November 12, 2012 03:12

Add this line to your script:
Code:

unset SGE_ROOT
Sincerely,
Alex

man@hepia November 13, 2012 04:43

Hi,
Thanks for the reply. I am not sure what else I should change in the script. Here is what I have:

mpirun -np $NSLOTS -machinefile $TMPDIR/machines /opt/OpenFOAM/OpenFOAM-2.1.x/platforms/linux64GccDPOpt/bin/simpleFoam -parallel

This runs everything on just one node. With SGE_ROOT unset first:

unset SGE_ROOT
mpirun -np $NSLOTS -machinefile $TMPDIR/machines /opt/OpenFOAM/OpenFOAM-2.1.x/platforms/linux64GccDPOpt/bin/simpleFoam -parallel

This fails with the following error message:
ssh: Unsupported option - -x
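
Open MPI's rsh/ssh launcher adds -x to the ssh command line to disable X11 forwarding, so this message suggests that the program called ssh on the job's PATH may not be a regular OpenSSH client, for example a wrapper set up by the parallel environment. That is an assumption, not something confirmed in this thread. A minimal sketch for listing what the job would pick up (the helper name list_candidates is hypothetical):

```shell
#!/bin/bash
# Hypothetical helper: print every executable called $1 found on PATH,
# in lookup order, so that a wrapper shadowing the real binary is visible.
# The function body runs in a subshell, so changing IFS does not leak out.
list_candidates() (
    IFS=:
    for dir in $PATH; do
        if [ -x "$dir/$1" ]; then
            printf '%s\n' "$dir/$1"
        fi
    done
)

# Possible use inside the job script, before calling mpirun:
#   list_candidates ssh
#   list_candidates mpirun
```

The first line printed is the one that gets executed. Running this both in an interactive shell and inside the SGE job shows whether the two environments resolve ssh and mpirun differently.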

axpl November 15, 2012 13:44

The error may depend on Open MPI: which version are you using? Can you post your launch script?

Alex

linnemann November 16, 2012 01:07

Hi

If Open MPI was built using --with-sge, then you don't need "-machinefile $TMPDIR/machines".
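
Whether a given Open MPI build has Grid Engine support can be checked with ompi_info, which lists a gridengine component when it does. A minimal sketch, where the helper name has_gridengine is hypothetical and simply searches ompi_info output passed on stdin:

```shell
#!/bin/bash
# Hypothetical helper: succeed if the ompi_info output on stdin
# mentions the gridengine component.
has_gridengine() {
    grep -q 'gridengine'
}

# Possible use (run after sourcing the OpenFOAM bashrc, so that
# ompi_info comes from the same installation as mpirun):
#   if ompi_info | has_gridengine; then
#       echo "Open MPI was built with SGE support"
#   else
#       echo "no SGE support found in this Open MPI build"
#   fi
```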

On our cluster, "unset SGE_ROOT" puts the job on one node even though SGE is still reserving all the requested nodes.

Here is how I start an OF job on an SGE cluster: "qsub runScript", with runScript containing the lines below.

Code:

#!/bin/bash
#
#$ -cwd
#$ -o ./log.out
#$ -e ./log.err

#$ -pe orte 24
#
#$ -q all.q
#$ -S /bin/bash

# unset SGE_ROOT

echo Got $NSLOTS processors.

source /share/apps/OpenFOAM/OpenFOAM-2.1.x/etc/bashrc

mpi=`command -v mpirun`
solver=`command -v pimpleDyMFoam`
echo $mpi
echo $solver

# Quit if either mpirun or the solver could not be found on PATH
if [ -z "$mpi" ] || [ -z "$solver" ]
  then
      echo ">> mpirun or solver was not found, quitting!"
      exit 1
  else
      echo ">> mpirun and solver were found, will continue"
      "$mpi" -np "$NSLOTS" -x LD_LIBRARY_PATH -x PATH -x WM_PROJECT_DIR -x WM_PROJECT_INST_DIR -x WM_OPTIONS -x FOAM_LIBBIN -x FOAM_APPBIN -x FOAM_USER_APPBIN -x MPI_BUFFER_SIZE "$solver" -parallel > log
fi
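
If the Open MPI in use turns out to lack SGE support, one fallback is to build the hostfile explicitly from the allocation that SGE writes to $PE_HOSTFILE (one line per host: name, slot count, queue, processor range). This is only a sketch under that assumption; the helper name pe_to_hostfile and the output file name are illustrative and do not come from the script above.

```shell
#!/bin/bash
# Hypothetical helper: convert SGE's $PE_HOSTFILE format
#   "hostname nslots queue processor_range"
# into Open MPI hostfile lines of the form "hostname slots=N".
pe_to_hostfile() {
    awk 'NF >= 2 { print $1, "slots=" $2 }'
}

# Possible use inside the job script:
#   pe_to_hostfile < "$PE_HOSTFILE" > "$TMPDIR/hostfile"
#   mpirun -np "$NSLOTS" --hostfile "$TMPDIR/hostfile" "$solver" -parallel > log
```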


man@hepia November 16, 2012 09:42

Many thanks!
Unfortunately, the script that you posted behaves the same way as before: all processes end up on one node.
I need to set "machinefile" in the script, otherwise I get "ssh unsupported option -x" in stderr.
I compiled OpenFOAM 2.1.1 with just "./Allwmake". Is there a trick to force "--with-sge"?

