running without rsh between nodes

Old   May 14, 2009, 13:27
Default running without rsh between nodes
  #1
New Member
 
Paul Hatton
Join Date: May 2009
Posts: 6
Rep Power: 8
hattonps is on a distinguished road
I've built OpenFOAM v1.5 on our 400-odd node cluster running Scientific Linux 5. It runs fine using all 4 cores on a node but fails when running across 2 nodes, 4 cores per node under torque with

bash: orted: command not found

from the command

mpirun --hostfile $PBS_NODEFILE -np 8 oodles -parallel

We don't allow rsh between nodes, but do allow ssh. From experience this orted error from mpirun is often due to the application expecting rsh between nodes.

The usual way of running parallel jobs on our cluster is mpiexec, not mpirun, but if I try this the OpenFOAM application (oodles in this case) thinks it's running on one core. Does anyone know:

- can OpenFOAM use mpiexec rather than mpirun?

- does multi-node OpenFOAM expect rsh access between nodes?

- if so, can it be told to use an alternative such as ssh or a script that mimics rsh (we use pbsdsh for this) if there is a hook to hang an rsh replacement on?

Thanks

--
Paul Hatton
University of Birmingham

Old   May 15, 2009, 03:02
Default
  #2
Senior Member
 
Mark Olesen
Join Date: Mar 2009
Location: http://olesenm.github.io/
Posts: 777
Rep Power: 18
olesen will become famous soon enough
Quote:
Originally Posted by hattonps View Post
bash: orted: command not found
...
We don't allow rsh between nodes, but do allow ssh. From experience this orted error from mpirun is often due to the application expecting rsh between nodes.
According to the openmpi FAQ, ssh appears to be used by default, not rsh

http://www.open-mpi.org/faq/?category=rsh

We use OpenMPI/SGE without any problems just by specifying
mpirun SomeFoamApplication -parallel

From the FAQ, it seems that Torque is similar.
http://www.open-mpi.org/faq/?category=tm
See point 5 on the FAQ about problems with the -host parameter. Maybe you are hitting that.
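
A quick way to narrow down an `orted: command not found` error is to check whether the Open MPI daemon is on the PATH at all. With the ssh launcher, `orted` has to be findable by the non-interactive shell on each remote node. The sketch below is only illustrative: `check_cmd` is a hypothetical helper, and the host name in the comment is borrowed from the job output quoted further down the thread.

```shell
#!/bin/bash
# Report whether a command is visible on the current PATH.
# check_cmd is a hypothetical helper, not part of Open MPI or OpenFOAM.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

check_cmd mpirun
check_cmd orted

# The same test in a non-interactive shell on another node
# (assumes passwordless ssh to that node):
#   ssh u1n002 'command -v orted'
```

If `orted` is found locally but missing over ssh, the remote shell's startup files are probably not setting up the Open MPI environment.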

Old   May 15, 2009, 03:56
Default
  #3
New Member
 
Paul Hatton
Join Date: May 2009
Posts: 6
Rep Power: 8
hattonps is on a distinguished road
Thank you *very* much, Olesen. I was experimenting with the standard MPI 'hello world' program last night and also found that specifying a hostfile causes problems with shared libraries not being found under torque/OpenMPI.

I should probably start a new thread for this, but if I may pose one more question - it seems that the OpenFOAM binaries think that they are on just the master node if I just specify -np 8, for example, and no hostfile or host list to mpirun in a torque job using 4 cores on each of 2 nodes. So it seems that OpenFOAM needs the hostfile to get the host list but OpenMPI won't accept this under torque. Is this correct and, if so, is there any known way out of it? Can I build it against mvapich2, to also give us infiniband support - I guess not if OpenFOAM uses OpenMPI constructs? If there is a section of the OpenFOAM documentation that I should be looking at a pointer would be much appreciated. I look after the overall HPC service here and support many applications so I must apologise for being new to OpenFOAM - one of our research groups has asked for it.

As an aside, the same problem arises with the Computational Chemistry program Molpro which has also caused me much grief recently ....

Thanks again for the info and the URL to the OpenMPI FAQ.

--
Paul Hatton
University of Birmingham
P.S.Hatton@bham.ac.uk

Old   May 15, 2009, 05:06
Default
  #4
Senior Member
 
Mark Olesen
Join Date: Mar 2009
Location: http://olesenm.github.io/
Posts: 777
Rep Power: 18
olesen will become famous soon enough
Quote:
Originally Posted by hattonps View Post
... it seems that the OpenFOAM binaries think that they are on just the master node if I just specify -np 8, for example, and no hostfile or host list to mpirun in a torque job using 4 cores on each of 2 nodes.
The problem is not OpenFOAM (it doesn't know anything about cores, cpus, hosts), but a general openmpi/torque problem. Your queuing system decides how many process 'slots' should be used on which hosts and passes this information to the orte. The only extra information that OpenFOAM needs is the -parallel option. Specifying '-np ...' is probably messing things up.
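
To see what the queuing system has actually handed to the job, one option is to summarise the node file that Torque writes. This is a minimal sketch, assuming the usual Torque layout of `$PBS_NODEFILE` (one host name per line, repeated once per slot); `slots_per_host` is a hypothetical helper name.

```shell
#!/bin/bash
# Print "host slot_count" for each host listed in a Torque-style node file.
# Blank lines are ignored; output is sorted by host name.
slots_per_host() {
    awk 'NF { n[$1]++ } END { for (h in n) print h, n[h] }' "$1" | sort
}

if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    slots_per_host "$PBS_NODEFILE"
else
    echo "PBS_NODEFILE is not set; probably not inside a Torque job"
fi
```

For a request of 2 nodes with 4 cores each, this should list two hosts with 4 slots apiece. If it does and mpirun still starts a single process, the Open MPI build is probably not reading the allocation from Torque.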

Old   May 15, 2009, 05:20
Default
  #5
New Member
 
Paul Hatton
Join Date: May 2009
Posts: 6
Rep Power: 8
hattonps is on a distinguished road
Thanks again. If I qsub the following script:



#!/bin/bash
#PBS -j oe
#PBS -l "walltime=1:00,nodes=2:ppn=4"
#PBS -N FOAM-n2ppn4
#PBS -q bbadmin

cd "$PBS_O_WORKDIR"

module load apps/openfoam
. /apps/OpenFOAM/OpenFOAM-1.5/etc/bashrc
export WM_PROJECT_USER_DIR=$PWD

module load intel/fce/10.1.008

mpirun oodles -parallel


I get, in the stdout job output:

+--------------------------------------------------------------------------+
| Job starting at 2009-05-15 10:12:24 for hattonps on the BlueBEAR Cluster
| Job identity jobid 1195882 jobname FOAM-n2ppn4
| Job requests nodes=2:ppn=4,pmem=1996mb,walltime=00:01:00
| Job assigned to nodes u1n002 u1n001
+--------------------------------------------------------------------------+
bool Pstream::init(int& argc, char**& argv) : attempt to run parallel on 1 processor
#0 Foam::error::printStack(Foam::Ostream&) in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/libOpenFOAM.so"
#1 Foam::error::abort() in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/libOpenFOAM.so"
#2 Foam::Pstream::init(int&, char**&) in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/openmpi-1.2.6/libPstream.so"
#3 Foam::argList::argList(int&, char**&, bool, bool) in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/libOpenFOAM.so"
#4 __gxx_personality_v0 in "/apps/OpenFOAM/OpenFOAM-1.5/applications/bin/linux64GccDPOpt/oodles"
#5 __libc_start_main in "/lib64/libc.so.6"
#6 Foam::regIOobject::readIfModified() in "/apps/OpenFOAM/OpenFOAM-1.5/applications/bin/linux64GccDPOpt/oodles"

and so on.

The line

bool Pstream::init(int& argc, char**& argv) : attempt to run parallel on 1 processor

suggests that oodles doesn't think it's running on multiple cores in this case?

I can run across cores on a node by specifying -np to mpirun, and I need to do this to get oodles to run multi-core, but then fall down when trying to run across nodes.


I'm missing something obvious here ....

--
Paul Hatton
University of Birmingham
P.S.Hatton@bham.ac.uk

Old   May 15, 2009, 05:49
Default
  #6
Senior Member
 
Mark Olesen
Join Date: Mar 2009
Location: http://olesenm.github.io/
Posts: 777
Rep Power: 18
olesen will become famous soon enough
Unfortunately, I don't have any experience with Torque.

Quote:
#PBS -j oe
#PBS -l "walltime=1:00,nodes=2:ppn=4"
#PBS -N FOAM-n2ppn4
#PBS -q bbadmin
Is the '-l' request sufficient for Torque to know it is a particular type of parallel job? Speaking from a GridEngine perspective, I need to specify a parallel environment.

For example,
-pe threaded 2
-pe mpich 16
-pe openmpi 16

The 'threaded' (eg, used by abaqus) has a particular allocation_rule.
The 'mpich' (eg, used by STAR-CD) has some extra start/stop rules.
The 'openmpi' (eg, used by OpenFOAM) also uses a 'fill_up' allocation rule (like mpich), but without special start/stop procedures.

I can't see anything similar in your example. Is it really running in parallel at all?
There must be a job env variable something like $NSLOTS that you can echo out from your job script to check that the job script is indeed running as a parallel job. If it is, then you should check that a HelloWorld mpi job works too.

/mark
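
Torque has no `$NSLOTS`, but the number of lines in `$PBS_NODEFILE` serves the same purpose. The fragment below is a sketch of the check suggested above, meant to be pasted into a job script before the mpirun line; `count_slots` is a hypothetical helper, and the warning text is only a guess at what a one-slot job would lead to.

```shell
#!/bin/bash
# Count the non-blank lines (= process slots) in a Torque-style node file.
count_slots() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}

if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    slots=$(count_slots "$PBS_NODEFILE")
    echo "Torque allocated $slots slot(s)"
    if [ "$slots" -lt 2 ]; then
        echo "warning: only one slot; a -parallel run would likely abort" >&2
    fi
fi
```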

Old   May 18, 2009, 07:37
Default
  #7
New Member
 
Paul Hatton
Join Date: May 2009
Posts: 6
Rep Power: 8
hattonps is on a distinguished road
Thanks. I've been looking more closely at our OpenMPI setup and threads. At present the library is built in single-threaded mode which may be a problem - I can't even get a standard 'hello world' program running correctly, although running with mpich2 is fine. I've raised this with our suppliers - Clustervision - for advice.

Torque knows that a parallel run is asked for by the

nodes=1:ppn=4

argument to the -l option, and there's no option to specify a particular parallel environment. I'll await advice from Clustervision and update this when I know more.

--
Paul Hatton
University of Birmingham
P.S.Hatton@bham.ac.uk

Old   May 18, 2009, 07:49
Default
  #8
New Member
 
Paul Hatton
Join Date: May 2009
Posts: 6
Rep Power: 8
hattonps is on a distinguished road
With smilies turned off this time (I wish I could disable their use in my account for all posts but I can't see how to do this ...)

~~~~

Thanks. I've been looking more closely at our OpenMPI setup and threads. At present the library is built in single-threaded mode which may be a problem - I can't even get a standard 'hello world' program running correctly, although running with mpich2 is fine. I've raised this with our suppliers - Clustervision - for advice.

Torque knows that a parallel run is asked for by the

nodes=1:ppn=4

argument to the -l option, and there's no option to specify a particular parallel environment. I'll await advice from Clustervision and update this when I know more.

__________________
--
Paul Hatton
The University of Birmingham

Old   May 18, 2009, 07:58
Default
  #9
Senior Member
 
Mark Olesen
Join Date: Mar 2009
Location: http://olesenm.github.io/
Posts: 777
Rep Power: 18
olesen will become famous soon enough
Quote:
Originally Posted by hattonps View Post
I've been looking more closely at our OpenMPI setup and threads. At present the library is built in single-threaded mode which may be a problem - I can't even get a standard 'hello world' program running correctly
Good that you've localized the problem a bit. While waiting for ClusterVision to answer your call, you might check the openmpi config yourself. The 'ompi_info' command should provide some information.
On my system (openSUSE 11.1), the system installed version (1.2.8) is found under /usr/lib64/mpi/gcc/openmpi/bin/ompi_info and shows that very little has been configured. My OpenFOAM version (1.3.2) is found under $MPI_ARCH_PATH/bin/ompi_info and shows lots of things have been configured - including 'gridengine'.

Maybe you are getting the wrong version, or maybe it wasn't configured to handle torque. For new openmpi versions, the GridEngine must be configured as well (--with-sge) when configuring/compiling openmpi.
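
The ompi_info check can be scripted roughly as follows. This is a sketch that assumes the usual `MCA <framework>: <component> (...)` line layout of ompi_info output; `has_mca_component` is a hypothetical helper that reads that output on stdin.

```shell
#!/bin/bash
# Succeed if the ompi_info text on stdin lists the named MCA component,
# e.g. a line such as "MCA ras: tm (MCA v1.0, ...)".
has_mca_component() {
    grep -Eq "MCA [a-z]+: $1 "
}

if command -v ompi_info >/dev/null 2>&1; then
    echo "using: $(command -v ompi_info)"
    for comp in tm gridengine; do
        if ompi_info | has_mca_component "$comp"; then
            echo "$comp: built in"
        else
            echo "$comp: not found"
        fi
    done
else
    echo "ompi_info is not on the PATH"
fi
```

Running it once with the system Open MPI and once after sourcing the OpenFOAM bashrc shows which installation is picked up and whether it has Torque (tm) support.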

Old   March 19, 2010, 01:39
Default
  #10
Senior Member
 
J. Cai
Join Date: Apr 2009
Posts: 180
Rep Power: 8
chiven is on a distinguished road
Hi Paul, how did you solve this problem? I have just run into the same one.

Best regards,
Chiven

Old   March 22, 2010, 16:02
Default
  #11
New Member
 
Paul Hatton
Join Date: May 2009
Posts: 6
Rep Power: 8
hattonps is on a distinguished road
I ended up rebuilding the MPI that came with OpenFOAM; I couldn't get it to link to an already-existing one. To do this:

tar xzf OpenFOAM-1.5.General.gtgz
tar xzf ThirdParty.General.gtgz
tar xzf ThirdParty.linux64Gcc.gtgz
rm -r ThirdParty/openmpi-1.2.6/platforms
- to force OpenMPI build

Edit ThirdParty/Allwmake:

./configure \
--prefix=$MPI_ARCH_PATH \
--disable-mpirun-prefix-by-default \
--disable-orterun-prefix-by-default \
--enable-shared --disable-static \
--disable-mpi-f77 --disable-mpi-f90 --disable-mpi-cxx \
--disable-mpi-profile \
--with-openib=/cvos/shared/apps/ofed/1.3 \
--with-openib-libdir=/cvos/shared/apps/ofed/1.3/lib64 \
--with-tm=/cvos/shared/apps/torque/current
# These lines enable Infiniband support
# --with-openib=/usr/local/ofed \
# --with-openib-libdir=/usr/local/ofed/lib64

and then the usual build.

Deleting ThirdParty/openmpi-1.2.6/platforms tells OpenFOAM to build its own OpenMPI; the options

--with-openib=/cvos/shared/apps/ofed/1.3 \
--with-openib-libdir=/cvos/shared/apps/ofed/1.3/lib64 \
--with-tm=/cvos/shared/apps/torque/current

are the usual arguments to the OpenMPI build to tell it to pick up the OpenIB Infiniband drivers and link with torque. It ran OK after this.

HTH
__________________
--
Paul Hatton
The University of Birmingham
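
After a rebuild like this it is worth confirming that the job really picks up the new mpirun and not a system-wide one. The sketch below assumes the OpenFOAM bashrc has been sourced so that `$MPI_ARCH_PATH` is set; `is_under_prefix` is a hypothetical helper.

```shell
#!/bin/bash
# Succeed if the path in $1 lies under the install prefix in $2.
is_under_prefix() {
    case "$1" in
        "$2"/*) return 0 ;;
        *)      return 1 ;;
    esac
}

mpirun_path=$(command -v mpirun || true)

if [ -z "${MPI_ARCH_PATH:-}" ]; then
    echo "MPI_ARCH_PATH is not set; source the OpenFOAM bashrc first"
elif [ -z "$mpirun_path" ]; then
    echo "mpirun is not on the PATH"
elif is_under_prefix "$mpirun_path" "$MPI_ARCH_PATH"; then
    echo "mpirun comes from the OpenFOAM build: $mpirun_path"
else
    echo "mpirun comes from elsewhere: $mpirun_path"
fi
```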
