CFD Online Discussion Forums

OpenFOAM forum thread: running without rsh between nodes (http://www.cfd-online.com/Forums/openfoam/64568-running-without-rsh-between-nodes.html)

hattonps May 14, 2009 13:27

running without rsh between nodes
 
I've built OpenFOAM v1.5 on our 400-odd node cluster running Scientific Linux 5. It runs fine using all 4 cores on a node but fails when running across 2 nodes, 4 cores per node under torque with

bash: orted: command not found

from the command

mpirun --hostfile $PBS_NODEFILE -np 8 oodles -parallel

We don't allow rsh between nodes, but do allow ssh. From experience this orted error from mpirun is often due to the application expecting rsh between nodes.

The usual way of running parallel jobs on our cluster is mpiexec, not mpirun, but if I try this the OpenFOAM application (oodles in this case) thinks it's running on one core. Does anyone know:

- can OpenFOAM use mpiexec rather than mpirun?

- does multi-node OpenFOAM expect rsh access between nodes?

- if so, can it be told to use an alternative such as ssh or a script that mimics rsh (we use pbsdsh for this)? Is there a hook to hang an rsh replacement on?

Thanks

--
Paul Hatton
University of Birmingham

olesen May 15, 2009 03:02

Quote:

Originally Posted by hattonps (Post 216117)
bash: orted: command not found
...
We don't allow rsh between nodes, but do allow ssh. From experience this orted error from mpirun is often due to the application expecting rsh between nodes.

According to the openmpi FAQ, ssh appears to be used by default, not rsh

http://www.open-mpi.org/faq/?category=rsh

We use OpenMPI/SGE without any problems just by specifying
mpirun SomeFoamApplication -parallel

From the FAQ, it seems that Torque is similar.
http://www.open-mpi.org/faq/?category=tm
See point 5 on the FAQ about problems with the -host parameter. Maybe you are hitting that.
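
As a rough illustration of how to narrow down the "orted: command not found" message: when mpirun starts its daemons over ssh, the remote shell is non-interactive, so orted has to be on the PATH that such a shell sees. The sketch below checks this for each host in a hostfile. The helper name check_orted and the SSH_CMD override are assumptions made for this example, and the check does not apply if mpirun launches through Torque's own interface instead of ssh.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenFOAM or Open MPI): for each distinct
# host in a hostfile, ask a non-interactive remote shell whether it can
# find orted. SSH_CMD is an assumed override so the remote command can be
# swapped, e.g. for a site-specific wrapper.
check_orted() {
    hostfile=$1
    sort -u "$hostfile" | while read -r host; do
        # </dev/null stops the remote command from consuming the host list
        if ${SSH_CMD:-ssh} "$host" 'command -v orted' </dev/null >/dev/null 2>&1; then
            echo "$host: orted found"
        else
            echo "$host: orted NOT found"
        fi
    done
}

# Possible use inside a Torque job script:
#   check_orted "$PBS_NODEFILE"
```

If a host reports "NOT found", the usual suspects are the shell start-up files on that node not setting PATH and LD_LIBRARY_PATH for non-interactive logins.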

hattonps May 15, 2009 03:56

Thank you *very* much, Olesen. I was experimenting with the standard MPI 'hello world' program last night and also found that specifying a hostfile causes problems with shared libraries not being found under torque/OpenMPI.

I should probably start a new thread for this, but if I may pose one more question: it seems that the OpenFOAM binaries think that they are on just the master node if I specify only -np 8, for example, and give no hostfile or host list to mpirun in a torque job using 4 cores on each of 2 nodes. So it seems that OpenFOAM needs the hostfile to get the host list, but OpenMPI won't accept this under torque. Is this correct and, if so, is there any known way out of it?

Can I build it against mvapich2, to also give us InfiniBand support? I guess not if OpenFOAM uses OpenMPI constructs. If there is a section of the OpenFOAM documentation that I should be looking at, a pointer would be much appreciated. I look after the overall HPC service here and support many applications, so I must apologise for being new to OpenFOAM; one of our research groups has asked for it.

As an aside, the same problem arises with the Computational Chemistry program Molpro which has also caused me much grief recently ....

Thanks again for the info and the URL to the OpenMPI FAQ.

--
Paul Hatton
University of Birmingham
P.S.Hatton@bham.ac.uk

olesen May 15, 2009 05:06

Quote:

Originally Posted by hattonps (Post 216190)
... it seems that the OpenFOAM binaries think that they are on just the master node if I just specify -np 8, for example, and no hostfile or host list to mpirun in a torque job using 4 cores on each of 2 nodes.

The problem is not OpenFOAM (it doesn't know anything about cores, cpus, hosts), but a general openmpi/torque problem. Your queuing system decides how many process 'slots' should be used on which hosts and passes this information to the orte. The only extra information that OpenFOAM needs is the -parallel option. Specifying '-np ...' is probably messing things up.

hattonps May 15, 2009 05:20

Thanks again. If I qsub the following script:



#!/bin/bash
#PBS -j oe
#PBS -l "walltime=1:00,nodes=2:ppn=4"
#PBS -N FOAM-n2ppn4
#PBS -q bbadmin

cd "$PBS_O_WORKDIR"

module load apps/openfoam
. /apps/OpenFOAM/OpenFOAM-1.5/etc/bashrc
export WM_PROJECT_USER_DIR=$PWD

module load intel/fce/10.1.008

mpirun oodles -parallel


I get, in the stdout job output:

+--------------------------------------------------------------------------+
| Job starting at 2009-05-15 10:12:24 for hattonps on the BlueBEAR Cluster
| Job identity jobid 1195882 jobname FOAM-n2ppn4
| Job requests nodes=2:ppn=4,pmem=1996mb,walltime=00:01:00
| Job assigned to nodes u1n002 u1n001
+--------------------------------------------------------------------------+
bool Pstream::init(int& argc, char**& argv) : attempt to run parallel on 1 processor
#0 Foam::error::printStack(Foam::Ostream&) in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/libOpenFOAM.so"
#1 Foam::error::abort() in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/libOpenFOAM.so"
#2 Foam::Pstream::init(int&, char**&) in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/openmpi-1.2.6/libPstream.so"
#3 Foam::argList::argList(int&, char**&, bool, bool) in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/libOpenFOAM.so"
#4 __gxx_personality_v0 in "/apps/OpenFOAM/OpenFOAM-1.5/applications/bin/linux64GccDPOpt/oodles"
#5 __libc_start_main in "/lib64/libc.so.6"
#6 Foam::regIOobject::readIfModified() in "/apps/OpenFOAM/OpenFOAM-1.5/applications/bin/linux64GccDPOpt/oodles"

and so on.

The line

bool Pstream::init(int& argc, char**& argv) : attempt to run parallel on 1 processor

suggests that oodles doesn't think it's running on multiple cores in this case?

I can run across cores on a node by specifying -np to mpirun, and I need to do this to get oodles to run multi-core, but then it falls down when trying to run across nodes.


I'm missing something obvious here ....

--
Paul Hatton
University of Birmingham
P.S.Hatton@bham.ac.uk

olesen May 15, 2009 05:49

Unfortunately, I don't have any experience with Torque.

Quote:

#PBS -j oe
#PBS -l "walltime=1:00,nodes=2:ppn=4"
#PBS -N FOAM-n2ppn4
#PBS -q bbadmin

Is the '-l' request sufficient for Torque to know it is a particular type of parallel job? Speaking from a GridEngine perspective, I need to specify a parallel environment.

For example,
-pe threaded 2
-pe mpich 16
-pe openmpi 16

The 'threaded' (eg, used by abaqus) has a particular allocation_rule.
The 'mpich' (eg, used by STAR-CD) has some extra start/stop rules.
The 'openmpi' (eg, used by OpenFOAM) also uses a 'fill_up' allocation rule (like mpich), but without special start/stop procedures.

I can't see anything similar in your example. Is it really running in parallel at all?
There must be a job env variable something like $NSLOTS that you can echo out from your job script to check that the job script is indeed running as a parallel job. If it is, then you should check that a HelloWorld mpi job works too.
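
For Torque, the nearest equivalent to echoing $NSLOTS is probably to inspect the file named by $PBS_NODEFILE, which lists one line per allocated slot. A minimal sketch of such a check follows; count_slots is a hypothetical helper name, not a Torque command.

```shell
#!/bin/sh
# Sketch: summarise a Torque allocation from its node file.
# count_slots is a hypothetical helper written for this example.
count_slots() {
    nodefile=$1
    # grep -c '' counts lines even if the last one lacks a newline
    total=$(grep -c '' "$nodefile")
    echo "total slots: $total"
    # One "host: count" line per distinct host
    sort "$nodefile" | uniq -c | while read -r n host; do
        echo "$host: $n"
    done
}

# Inside a job script, Torque sets PBS_NODEFILE to the allocation list.
if [ -n "${PBS_NODEFILE:-}" ]; then
    count_slots "$PBS_NODEFILE"
fi
```

For a nodes=2:ppn=4 request, a healthy allocation should report eight slots spread over two hosts; anything less suggests the job is not being treated as a multi-node job at all.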

/mark

hattonps May 18, 2009 07:37

Thanks. I've been looking more closely at our OpenMPI setup and threads. At present the library is built in single-threaded mode, which may be a problem - I can't even get a standard 'hello world' program running correctly, although running with mpich2 is fine. I've raised this with our suppliers - Clustervision - for advice.

Torque knows that a parallel run is asked for by the

nodes=1:ppn=4

argument to the -l option, and there's no option to specify a particular parallel environment. I'll await advice from Clustervision and update this when I know more.

--
Paul Hatton
University of Birmingham
P.S.Hatton@bham.ac.uk

olesen May 18, 2009 07:58

Quote:

Originally Posted by hattonps (Post 216492)
I've been looking more closely at our OpenMPI setup and threads. At present the library is built in single-threaded mode which may be a problem - I can't even get a standard 'hello world' program running correctly

Good that you've localized the problem a bit. While waiting for ClusterVision to answer your call, you might check the openmpi config yourself. The 'ompi_info' command should provide some information.
On my system (openSUSE 11.1), the system installed version (1.2.8) is found under /usr/lib64/mpi/gcc/openmpi/bin/ompi_info and shows that very little has been configured. My OpenFOAM version (1.3.2) is found under $MPI_ARCH_PATH/bin/ompi_info and shows lots of things have been configured - including 'gridengine'.

Maybe you are getting the wrong version, or maybe it wasn't configured to handle torque. For new openmpi versions, the GridEngine must be configured as well (--with-sge) when configuring/compiling openmpi.
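
A quick way to test the "wasn't configured to handle torque" possibility is to look for components named tm in the ompi_info listing. The sketch below wraps that in a small filter; has_tm_support is a hypothetical name, and the pattern assumes the usual "MCA <framework>: <component>" layout of ompi_info output.

```shell
#!/bin/sh
# Hypothetical filter: succeeds if ompi_info output on stdin lists any
# MCA component called "tm" (the Torque/PBS launcher and allocator).
has_tm_support() {
    grep -Eq 'MCA +[a-z]+: +tm( |$)'
}

# Only runs if an ompi_info is on the PATH; make sure it is the one that
# belongs to the MPI the application was built against.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_tm_support; then
        echo "tm components present"
    else
        echo "no tm components: this Open MPI was probably built without Torque support"
    fi
fi
```

Running it once with the system ompi_info and once with $MPI_ARCH_PATH/bin/ompi_info shows whether the two installations differ.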

chiven March 19, 2010 01:39

Hi Paul, how did you solve this problem? I've just run into the same problem as you.

Best regards,
Chiven

hattonps March 22, 2010 16:02

I ended up rebuilding the MPI that came with OpenFOAM; I couldn't get it to link to an already-existing one. To do this:

tar xzf OpenFOAM-1.5.General.gtgz
tar xzf ThirdParty.General.gtgz
tar xzf ThirdParty.linux64Gcc.gtgz
rm -r ThirdParty/openmpi-1.2.6/platforms    # to force the OpenMPI build

Edit ThirdParty/Allwmake:

./configure \
--prefix=$MPI_ARCH_PATH \
--disable-mpirun-prefix-by-default \
--disable-orterun-prefix-by-default \
--enable-shared --disable-static \
--disable-mpi-f77 --disable-mpi-f90 --disable-mpi-cxx \
--disable-mpi-profile \
--with-openib=/cvos/shared/apps/ofed/1.3 \
--with-openib-libdir=/cvos/shared/apps/ofed/1.3/lib64 \
--with-tm=/cvos/shared/apps/torque/current
# These lines enable Infiniband support
# --with-openib=/usr/local/ofed \
# --with-openib-libdir=/usr/local/ofed/lib64

and then the usual build.

Deleting ThirdParty/openmpi-1.2.6/platforms tells OpenFOAM to build its own OpenMPI. The added lines

--with-openib=/cvos/shared/apps/ofed/1.3 \
--with-openib-libdir=/cvos/shared/apps/ofed/1.3/lib64 \
--with-tm=/cvos/shared/apps/torque/current

are the usual arguments to the OpenMPI build that tell it to pick up the OpenIB InfiniBand drivers and link with torque. It ran OK after this.

HTH


All times are GMT -4.