CFD Online forums (www.cfd-online.com): OpenFOAM Running, Solving & CFD

Job Scheduler for parallel processing

#1, kumar (kumar2), March 6, 2006, 17:22
Hello friends,

I saw in the manual (User's manual, U-63, damBreak case) that OpenFOAM parallel runs can only be executed from the command line. Does this mean that we cannot use a job scheduler?

Thanks in advance,

kumar

#2, kumar (kumar2), March 7, 2006, 03:04
Can anyone please give some input?

Thanks

kumar

#3, Francesco Del Citto (fra76), March 7, 2006, 04:37
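You can easily run OpenFOAM using a job scheduler, even for a parallel job. Here is a simple script I used with Torque (PBS).
I hope it is helpful!

#!/bin/bash

#PBS -N fiume
#PBS -j oe

# Start in the directory the job was submitted from.
cd "$PBS_O_WORKDIR"
# Have LAM reach the other nodes with ssh instead of rsh.
export LAMRSH=ssh
# Boot LAM on the nodes Torque allocated to this job.
lamboot "$PBS_NODEFILE"

# <rootcase> and <casedir> are placeholders for your own case root and case name.
mpiexec interFoam <rootcase> <casedir> -parallel < /dev/null > output.out 2>&1

# Shut LAM down once the solver has finished.
lamhalt -d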
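
Before booting LAM it can help to log what Torque actually allocated to the job. The sketch below assumes, as is usual for Torque, that the file named by $PBS_NODEFILE holds one hostname per allocated processor slot. The helper names count_slots and unique_hosts are made up for this illustration and are not part of Torque, LAM, or OpenFOAM.

```shell
#!/bin/bash
# Hypothetical helpers for inspecting a Torque node file.

# Number of non-empty lines in the node file, i.e. processor slots.
count_slots() {
    grep -c . "$1"
}

# Distinct hostnames in the node file, one per line, sorted.
unique_hosts() {
    grep . "$1" | sort -u
}

# Inside a job script, log both values before calling lamboot.
# The guard keeps this block harmless outside a Torque job.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "slots: $(count_slots "$PBS_NODEFILE")"
    echo "hosts: $(unique_hosts "$PBS_NODEFILE" | tr '\n' ' ')"
fi
```

If the slot count does not match the number of subdomains the case was decomposed into, the parallel run is unlikely to start correctly, so this is a cheap check to put near the top of the script.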

#6, Pierre Maruzewski (pierrot), March 7, 2006, 05:12
Hi Francesco,

I tried your script, but I get this error:

LAM attempted to execute a process on the remote node "node191",
but received some output on the standard error. This heuristic
assumes that any output on the standard error indicates a fatal error,
and therefore aborts. You can disable this behavior (i.e., have LAM
ignore output on standard error) in the rsh boot module by setting the
SSI parameter boot_rsh_ignore_stderr to 1.

LAM tried to use the remote agent command "ssh"
to invoke "echo $SHELL" on the remote node.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a (non-inclusive) list of items
that you should check on the remote node:

- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell

Try invoking the following command at the unix command line:

ssh node191 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

Lamnodes Failed!
Check if you had booted lam before calling mpiexec else use -machinefile to pass host file to mpiexec



So can you help me?

Pierre

#7, kumar (kumar2), March 7, 2006, 05:38
Hi Francesco,
Thanks a lot for the script. Let me give it a try.

Hi Pierre, I will get back after trying out Francesco's script.

Regards,

Kumar

#8, Francesco Del Citto (fra76), March 7, 2006, 05:53
This is what I get when I run the commands from the script in an interactive job:
-------------------------------------------------
[carlo@epsilon runFiume]$ qsub -I -l nodes=2
qsub: waiting for job 2270.epsilon to start
qsub: job 2270.epsilon ready

Executing: /server/carlo/OpenFOAM/OpenFOAM-1.2/.bashrc
Executing: /server/carlo/OpenFOAM/OpenFOAM-1.2/.OpenFOAM-1.2/apps/ensightFoam/bashrc
Executing: /server/carlo/OpenFOAM/OpenFOAM-1.2/.OpenFOAM-1.2/apps/paraview/bashrc
[carlo@node2 ~]$ cd $PBS_O_WORKDIR
[carlo@node2 runFiume]$ export LAMRSH=ssh
[carlo@node2 runFiume]$ lamboot $PBS_NODEFILE

LAM 7.1.1 - Indiana University

[carlo@node2 runFiume]$
-------------------------------------------------

I guess the problem is the ssh/rsh configuration on your cluster. I have configured my cluster so that a group of users can ssh from node to node without being asked for a password. This is easy to set up because the nodes share an NFS file system on the server and authentication is provided by a NIS server. The only thing I did was add the public key contained in ~/.ssh/id_dsa.pub to the .ssh/authorized_keys (or authorized_keys2) file.
This allows LAM to use ssh without being asked for a password. Another possible issue is an error when ssh tries to forward the X11 connection. This can be avoided with
export LAMRSH="ssh -x"
By default, the LAM distributed with OpenFOAM tries to connect from one node to another using rsh. If you can access a node with rsh, you can try deleting the export LAMRSH line from the script.
For debugging purposes, you can try this:
lamboot -v $PBS_NODEFILE
and post the result.
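
The key setup described above can be sketched as a small idempotent helper. This is only an illustration under the assumptions in the post (a home directory shared over NFS, so one authorized_keys file serves every node); the function name authorize_key is hypothetical, and the key and file paths in the usage comment are the ones mentioned in the post.

```shell
#!/bin/bash
# Hypothetical helper: append a public key to an authorized_keys file,
# but only if that exact line is not already there.
# $1 = public key file, $2 = authorized_keys file
authorize_key() {
    mkdir -p "$(dirname "$2")"
    touch "$2"
    # sshd normally refuses key files that other users can write to.
    chmod 600 "$2"
    # -f: read patterns from the key file, -F: fixed strings,
    # -x: whole-line match, -q: no output, just an exit status.
    if ! grep -qxFf "$1" "$2"; then
        cat "$1" >> "$2"
    fi
}

# Typical call on a cluster with a shared home directory:
#   authorize_key ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys
```

Running it twice leaves a single copy of the key, so it is safe to call from a setup script.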
Francesco

#9, Pierre Maruzewski (pierrot), March 8, 2006, 06:15
Dear Francesco,
I tried lamboot -v $PBS_NODEFILE
and this is the result:

lamboot -v $PBS_NODEFILE

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

n-1<5503> ssi:boot:base:linear: booting n0 (node223)
n-1<5503> ssi:boot:base:linear: booting n1 (node224)
ERROR: LAM/MPI unexpectedly received the following on stderr:

------------------------- /usr/local/Modules/versions --------------------------
3.1.6

--------------------- /usr/local/Modules/3.1.6/modulefiles ---------------------
dot module-cvs module-info modules null use.own

------------------------ /usr/local/Modules/modulefiles ------------------------
NAGWare_f95-amd64_glibc22/22 goto/0.96-2(default)
NAGWare_f95-amd64_glibc23/23 gsl/1.6(default)
acml/2.5.1(default) intel-cc/8.1.024
acml_generic_pgi/2.5.0 intel-fc/8.1.021
acml_pathscale/2.5.1 mpich-gm-1.2.6..14b/pathscale-2.3
acml_scalapack_generic_pgi/2.5.0 mpich-gm-1.2.6..14b/pgi-6.0(default)
ansys/10.0 mpich_pathscale/1.2.6..13b
ansys/10.0_SP1(default) mpich_pgi/1.2.6..13b
cernlib/2005(default) nag/21
fftw-2.1.5/pgi-6.0 pathscale/2.0(default)
fftw-3.0.1/pgi-6.0 pathscale/2.3
fluent/6.0.20 pgi/5.2(default)
fluent/6.2.16(default) pgi/6.0
gcc/4.0.2
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "node224",
but received some output on the standard error. This heuristic
assumes that any output on the standard error indicates a fatal error,
and therefore aborts. You can disable this behavior (i.e., have LAM
ignore output on standard error) in the rsh boot module by setting the
SSI parameter boot_rsh_ignore_stderr to 1.

LAM tried to use the remote agent command "ssh"
to invoke "echo $SHELL" on the remote node.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a (non-inclusive) list of items
that you should check on the remote node:

- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell

Try invoking the following command at the unix command line:

ssh -x node224 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<5503> ssi:boot:base:linear: Failed to boot n1 (node224)
n-1<5503> ssi:boot:base:linear: aborted!
n-1<5508> ssi:boot:base:linear: booting n0 (node223)
n-1<5508> ssi:boot:base:linear: booting n1 (node224)
ERROR: LAM/MPI unexpectedly received the following on stderr:

------------------------- /usr/local/Modules/versions --------------------------
3.1.6

--------------------- /usr/local/Modules/3.1.6/modulefiles ---------------------
dot module-cvs module-info modules null use.own

------------------------ /usr/local/Modules/modulefiles ------------------------
NAGWare_f95-amd64_glibc22/22 goto/0.96-2(default)
NAGWare_f95-amd64_glibc23/23 gsl/1.6(default)
acml/2.5.1(default) intel-cc/8.1.024
acml_generic_pgi/2.5.0 intel-fc/8.1.021
acml_pathscale/2.5.1 mpich-gm-1.2.6..14b/pathscale-2.3
acml_scalapack_generic_pgi/2.5.0 mpich-gm-1.2.6..14b/pgi-6.0(default)
ansys/10.0 mpich_pathscale/1.2.6..13b
ansys/10.0_SP1(default) mpich_pgi/1.2.6..13b
cernlib/2005(default) nag/21
fftw-2.1.5/pgi-6.0 pathscale/2.0(default)
fftw-3.0.1/pgi-6.0 pathscale/2.3
fluent/6.0.20 pgi/5.2(default)
fluent/6.2.16(default) pgi/6.0
gcc/4.0.2
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "node224",
but received some output on the standard error. This heuristic
assumes that any output on the standard error indicates a fatal error,
and therefore aborts. You can disable this behavior (i.e., have LAM
ignore output on standard error) in the rsh boot module by setting the
SSI parameter boot_rsh_ignore_stderr to 1.

LAM tried to use the remote agent command "ssh"
to invoke "echo $SHELL" on the remote node.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a (non-inclusive) list of items
that you should check on the remote node:

- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell

Try invoking the following command at the unix command line:

ssh -x node224 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<5508> ssi:boot:base:linear: Failed to boot n1 (node224)
n-1<5508> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully

#10, Francesco Del Citto (fra76), March 8, 2006, 06:42
Look at the message:

[...]
Try invoking the following command at the unix command line:

ssh -x node224 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
[...]

Have you tried that command?
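
The check that the LAM message asks for can be repeated for every node in one go. The sketch below is an assumption-laden illustration, not something from LAM itself: probe_node is a hypothetical name, and it simply runs the same command line as in the message and reports whether anything arrived on standard error.

```shell
#!/bin/bash
# Hypothetical helper: run the probe from the LAM error message against
# one node and report whether the remote side wrote to standard error.
# $1 = node name; any further arguments are the remote agent command
# (defaults to "ssh -x").
probe_node() {
    node=$1
    shift
    [ "$#" -gt 0 ] || set -- ssh -x
    # Keep stderr in the variable, throw stdout away.
    err=$("$@" "$node" -n 'echo $SHELL' 2>&1 >/dev/null)
    if [ -n "$err" ]; then
        echo "$node: output on stderr (LAM would abort)"
        return 1
    fi
    echo "$node: clean"
}

# Inside a Torque job one could loop over the allocated nodes:
#   for n in $(sort -u "$PBS_NODEFILE"); do probe_node "$n"; done
```

A node reported as not clean is one whose shell startup files (or ssh itself) print something before the command output, which is exactly the condition LAM treats as fatal.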

Another issue is the "module" command (it seems your configuration is almost the same as my cluster's...). Look at the result of these commands:

---------------------------------------------------------------
[francesco@epsilon ~]$ module av

---------------------------- /usr/local/modulefiles ----------------------------
dot module-cvs module-info modules null use.own

--------------------------- /opt/Modules/modulefiles ---------------------------
gnu gnu41 intel8 intel9 lam mpich openmpi
[francesco@epsilon ~]$ module av 2>/dev/null
[francesco@epsilon ~]$
---------------------------------------------------------------

This means that the command "module available" prints its output on the standard error stream. LAM returns an error if there is any output on standard error while it executes a remote shell command (ssh or rsh). That is also why they suggest using "ssh -x": it does not even try to open an X connection, which avoids error messages from X11 forwarding.
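
To see which stream a command really writes to, the two streams can be captured separately. The following is a minimal sketch: fake_module_avail is a hypothetical stand-in that behaves like "module available" in the transcript above (listing on standard error, nothing on standard output), and stderr_of and stdout_of are made-up helper names.

```shell
#!/bin/bash
# Stand-in for a command that writes its listing to standard error.
fake_module_avail() {
    echo "gnu intel9 lam openmpi" >&2
}

# Print only what the given command writes to stderr.
stderr_of() {
    "$@" 2>&1 >/dev/null
}

# Print only what the given command writes to stdout.
stdout_of() {
    "$@" 2>/dev/null
}

echo "stderr: [$(stderr_of fake_module_avail)]"
echo "stdout: [$(stdout_of fake_module_avail)]"
```

With the real command in place of the stand-in, a non-empty stderr line is the signal that LAM would refuse to boot that node.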

I guess you have the command "module available" in your ~/.bashrc or in the global /etc/bashrc (I'm assuming you're using bash). You have to remove that command, or replace it with "module available 2>&1", which redirects stderr to stdout so that it no longer confuses LAM.
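
A possible variation on this fix, not taken from the thread, is to keep the listing for interactive logins only, so that shells started by ssh or rsh to run a single command print nothing at all. The fragment below is a sketch of such a ~/.bashrc section; show_modules and is_interactive are hypothetical names, and the echo is a placeholder for the real "module available 2>&1" call.

```shell
#!/bin/bash
# Placeholder for the real command. On a cluster the body would be:
#   module available 2>&1
show_modules() {
    { echo "gnu intel9 lam openmpi" >&2; } 2>&1
}

# Succeeds only when the current shell is interactive
# (the special parameter $- then contains the letter "i").
is_interactive() {
    case $- in
        *i*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Show the listing at interactive logins; stay silent otherwise.
if is_interactive; then
    show_modules
fi
```

The redirect alone was enough in this thread; the interactive guard just goes one step further by removing the output from remote command runs entirely.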

Furthermore, I think you could always remove the LAM distributed with OpenFOAM and install it from source with Torque support enabled, so that neither rsh nor ssh is required. I have done this and it works fine with other MPI applications, but I have never tried it with OpenFOAM.

I hope this can help you.
Francesco

#11, Pierre Maruzewski (pierrot), March 8, 2006, 07:09
Hi Francesco,

It's running perfectly thanks to "module available 2>&1".

Pierre
