CFD Online Discussion Forums

CFD Online Discussion Forums (http://www.cfd-online.com/Forums/)
-   OpenFOAM Running, Solving & CFD (http://www.cfd-online.com/Forums/openfoam-solving/)
-   -   Sun Grid Engine (http://www.cfd-online.com/Forums/openfoam-solving/59044-sun-grid-engine.html)

grtabor January 28, 2008 10:19

Dear All,

I'm starting to use OpenFOAM on a new machine. Does anyone have any experience with using OpenFOAM with Sun Grid Engine? Comments on this would be useful; submission scripts would be _really_ useful.

Gavin

olesen January 28, 2008 10:40

Getting OpenFOAM working with OpenMPI and GridEngine is okay (much, much better than trying to get LAM working).

1. Check that OPAL_PREFIX is properly set by your Foam installation.

2. Assuming that you don't have the OpenFOAM settings sourced within your bashrc/cshrc, or you are using sh/ksh, the job script should include this sourcing itself.

I've attached a script snippet (qFoam-snippet) that should help get you going.
The snippet CANNOT be used as-is. I'd rather not send the entire script, since there are a number of interdependencies with our site-specific scripting and it would likely be too confusing anyhow.

Since I have it set up to run in the cwd, there is no need to pass the root/case information to the script, but you do need to tell it which application should run.

You will not only need site-specific changes, you will also notice funny-looking "%{STUFF}" constructs throughout. These placeholders are replaced with the appropriate environment variables to create the final job script. There are also some odd bits with an "etc/" directory: this is simply a link to the appropriate OpenFOAM-VERSION/.OpenFOAM-VERSION directory.
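
For reference, a minimal job script along those lines might look as follows. This is only a sketch, not our actual script: the installation path, the PE name "openmpi" and the solver/case names are placeholders that must be adapted to your site.

#!/bin/sh
# Minimal SGE job script sketch for OpenFOAM + OpenMPI
# (hypothetical paths and PE name -- adapt to your site).
# Submit with: qsub -pe openmpi 4 thisScript.sh
#$ -S /bin/sh
#$ -cwd -j y

# source the OpenFOAM settings if they are not in your ~/.bashrc
. $HOME/OpenFOAM/OpenFOAM-1.4.1/.OpenFOAM-1.4.1/bashrc

# make sure OPAL_PREFIX points at the Open MPI installation, so that
# the daemons started on the slave nodes find their files
export OPAL_PREFIX=$WM_PROJECT_DIR/src/openmpi-1.2.3/platforms/$WM_OPTIONS

# with the GridEngine integration, mpirun obtains the host list and
# slot count from SGE itself -- no -np or -machinefile needed
mpirun simpleFoam $HOME/myRoot myCase -parallel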

luca January 28, 2008 11:00

Hi Gavin,

I use the SGE job scheduler with OpenFOAM on our cluster. I wrote these rules:



# $1 is the SGE pe_hostfile, $2 is the master host.
# Convert the pe_hostfile into a LAM machinefile ("host cpu=N" lines).
PeHostfile2MachineFile()
{
    cat $1 | while read line; do
        # strip the slot count and the domain part of the hostname
        host=`echo $line | cut -f1 -d" " | cut -f1 -d"."`
        nslots=`echo $line | cut -f2 -d" "`
        # (a per-slot loop could go here to map regular hostnames
        # into ATM hostnames, if needed)
        echo $host cpu=$nslots
    done
}

touch OFmachines
PeHostfile2MachineFile $1 >> OFmachines

# record the short name of the master host for the job script
mhost=`echo $2 | cut -f1 -d"."`
echo $mhost >> mhost




together with this batch script that creates and submits the SGE job script:




#!/bin/bash
echo "Enter a casename:"
read casename
echo "Enter definition WDir:"
read Wdir
#echo "Enter Solver:"
#read Solver
echo "Number of processors:"
read cpunumb
#
if [ "$cpunumb" = "1" ]; then
    touch Foam-$casename.sh
    chmod +x Foam-$casename.sh
    echo '#!/bin/bash' >> Foam-$casename.sh
    echo '### SGE ###' >> Foam-$casename.sh
    echo '#$ -S /bin/sh -j y -cwd' >> Foam-$casename.sh
    echo 'read masthost < mhost' >> Foam-$casename.sh
    echo 'ssh $masthost "cd $PWD; 'SteadyCompFoam' '$Wdir' '$casename'"' >> Foam-$casename.sh
    echo 'rm -f OFmachines' >> Foam-$casename.sh
    echo 'rm -f mhost' >> Foam-$casename.sh
    echo 'rm -f Foam-'$casename'.sh' >> Foam-$casename.sh
    qsub -pe OFnet $cpunumb -masterq tom02.q,tom03.q,tom04.q,tom05.q,tom06.q,tom22.q,tom23.q,tom24.q,tom25.q Foam-$casename.sh
else
    touch Foam-$casename.sh
    chmod +x Foam-$casename.sh
    echo '#!/bin/bash' >> Foam-$casename.sh
    echo '### SGE ###' >> Foam-$casename.sh
    echo '#$ -S /bin/sh -j y -cwd' >> Foam-$casename.sh
    echo 'read masthost < mhost' >> Foam-$casename.sh
    echo 'ssh $masthost "export LAMRSH=ssh; cd $PWD; lamboot -v -s OFmachines"' >> Foam-$casename.sh
    echo 'ssh $masthost "cd $PWD; mpirun -np '$cpunumb' 'SteadyCompFoam' '$Wdir' '$casename' -parallel"' >> Foam-$casename.sh
    echo 'ssh $masthost "cd $PWD; lamhalt -d"' >> Foam-$casename.sh
    echo 'rm -f OFmachines' >> Foam-$casename.sh
    echo 'rm -f mhost' >> Foam-$casename.sh
    echo 'rm -f Foam-'$casename'.sh' >> Foam-$casename.sh
    qsub -pe OFnet $cpunumb -masterq tom02.q,tom03.q,tom04.q,tom05.q,tom06.q,tom22.q,tom23.q,tom24.q,tom25.q Foam-$casename.sh
fi


This works with the LAM MPI libraries. You can submit the job, but at the moment you have to stop your calculation via the controlDict and not via the qmon interface.
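
For reference, these rules get hooked into GridEngine through the parallel environment definition (the OFnet PE used in the qsub lines above). A sketch of what it looks like, with placeholder paths for wherever the start/stop scripts are installed:

pe_name           OFnet
slots             32
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/local/sge/OFnet/startPE.sh $pe_hostfile $host
stop_proc_args    /usr/local/sge/OFnet/stopPE.sh
allocation_rule   $fill_up
control_slaves    FALSE
job_is_first_task TRUE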

We can start from this to develop a better one.

Luca

grtabor January 30, 2008 06:30

Dear Luca, Mark,

Thanks for your scripts - I can kind of make sense of them!! I've managed to get things running now for single-processor jobs using a simplified version of what you suggest.

For the parallel running case: am I right that $nslots is a variable giving the number of processors allocated for the parallel run? How is this being set in SGE?

Gavin

olesen January 30, 2008 08:39

The $NSLOTS (all uppercase) is set by GridEngine. The qsub manpage is the best starting point for finding out more about which env variables are used.
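
Besides $NSLOTS, the scheduler exports a few other variables that are handy in job scripts, for example:

# a few of the environment variables GridEngine sets for each job
echo "job id:      $JOB_ID"
echo "slots:       $NSLOTS"
echo "pe hostfile: $PE_HOSTFILE"   # the allocated hosts and slots
echo "scratch dir: $TMPDIR"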

Based on personal experience, I would really try to avoid LAM with GridEngine and use OpenMPI instead.

BTW: killing the job via qdel (or qmon) works fine (it doesn't leave any half-dead processes behind), but obviously won't have OpenFOAM write results before exiting.

Using the '-notify' option for qsub would give you a chance to trap the signals. But apart from some OpenMPI issues in the past, it is not certain that a particular OpenFOAM solver could finish its iteration *and* write the results before the true kill signal gets sent. Increasing the notify period before pulling the plug may not be the correct answer either.

For the moment, I've modified a few solvers to recognize the presence of an 'ABORT' file and to write and quit if it exists. This is usually quite a bit easier than modifying the controlDict.
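
As a rough illustration of how the '-notify' idea and the ABORT file could be combined (assuming a solver patched to write and quit when it finds an ABORT file in the case directory):

#!/bin/sh
#$ -S /bin/sh -cwd -j y
# ask GridEngine for a warning signal (SIGUSR2) before the kill
#$ -notify

# start the solver in the background so the shell can handle signals
mpirun someFoamSolver . someCase -parallel &
solverPid=$!

# on the warning signal, create the ABORT file so the patched solver
# writes its fields and exits on its own
trap 'touch ABORT' USR2

wait $solverPid
wait $solverPid   # wait again in case the first wait was interrupted
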
I think there is another solution, but I still need to think about it a bit.

/mark

nico765 February 7, 2008 13:32

Hello,

I am also trying to use qsub to run in parallel using 4 CPUs.

My command is:

qsub -q queue_name.q sge_script.sh

in sge_script.sh:

source /net/c3m/opt/OpenFOAM/OpenFOAM-1.4.1/.OpenFOAM-1.4.1/bashrc
source /net/c3m/opt/OpenFOAM/OpenFOAM-1.4.1/.bashrc
mpirun -np 4 simpleFoam .. case -parallel


For some reason, this does not work.

I get this output:
error: executing task of job 25585 failed:
[c4n26:16009] ERROR: A daemon on node c4n26 failed to start as expected.
[c4n26:16009] ERROR: There may be more information available from
[c4n26:16009] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[c4n26:16009] ERROR: If the problem persists, please restart the
[c4n26:16009] ERROR: Grid Engine PE job
[c4n26:16009] ERROR: The daemon exited unexpectedly with status 1.

This occurs only if I use mpirun; if I use the same command in serial (simpleFoam .. case), it works OK.

Also, if I ssh into the node and start the script there it runs ok.

Nicolas

olesen February 8, 2008 04:10

Nishant,

I am re-directing your thread ( http://www.cfd-online.com/cgi-bin/Op...how.cgi?1/6598 ) to here, since this is where the relevant information is.

If you read the thread, you'll notice that my response (with the qFoam snippet) addressed running with OpenMPI, whereas the information from Luca was for LAM. If you are not using LAM, then you don't need any of that stuff and don't need to worry about it.

The qFoam snippet is a template run script. The '%{STUFF}' placeholders must be replaced with the relevant information before it can be submitted with the usual qsub -pe NAME slots.

How exactly you wish to use the template to create your job script is left to you. Some people might want an interactive solution (like Luca showed) others might want to wrap it with Perl, Python or Ruby. We generally use Perl to create the final shell script and feed it to qsub via stdin.

From your original question about using something like "mpirun -machinefile machine -np 4 case root etc": why do you want to generate a machine file and specify the number of processes? That is exactly what the OpenMPI and GridEngine integration does for you, and you are ignoring it.

As you can also see from the qFoam snippet, there is no need to use -machinefile or -np when using OpenMPI and GridEngine. All the bits are already done for you. Have you already consulted your site support people?
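
In other words, inside a job submitted with the usual "qsub -pe NAME slots", the parallel call can be as plain as this (solver and case names being placeholders):

# OpenMPI asks GridEngine for the allocation itself,
# so neither -np nor -machinefile is given
mpirun icoFoam . myCase -parallel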

nishant_hull February 9, 2008 08:40

Thanks Mark.
I hope this will help. I will update you soon in this regard.

Nishant

nishant_hull February 9, 2008 12:42

Hi Mark,

I edited my qfoam-snippet.sh file as below and ran it. The output reports an error at the line containing __DATA__ and at lines 25-26.
Please see my file and suggest the required edits.

17 rootName=interFoam
18 caseName=$PWD/dam-dumy
19 jobName=$caseName
20 # avoid generic names
21 case "$jobName" in
22 foam | OpenFOAM )
23 jobName=$(dirname $PWD)
24 jobName=$(basename $jobName)
25 ;;
26 ecas
27
28 # ----------------------------------------
29 # OpenFOAM (re)initialization
30 #
31 unset FOAM_SILENT
32 FOAM_INST_DIR=$HOME/$WM_PROJECT
33 FOAM_ETC=$WM_PROJECT_DIR/.OpenFOAM-1.4.1
34
35 # source based on parallel environment
36 for i in $FOAM_ETC/bashrc-$PE $FOAM_ETC/bashrc

Attachment: qfoam-snippet.txt

nishant_hull February 9, 2008 15:25

Sorry, but the error posted in my last message comes when I try running on a single processor, using qsub qfoam-snippet.sh.
When I try running it on 4 processors, using qsub -pe qfoam-snippet.sh 4 with the modified attached script, the job didn't get submitted on the cluster.
Please see the script.

Attachment: qfoam-snippet.txt

olesen February 11, 2008 04:07

The __DATA__ is a remnant from the original Perl wrapper and should be deleted.
The "ecas" is a typo from a last minute edit and should obviously be "esac".
The snippet was meant to give an idea of what to do, not to provide a finished solution.
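
For clarity, the corrected block from the top of your snippet would read:

rootName=interFoam
caseName=$PWD/dam-dumy
jobName=$caseName
# avoid generic names
case "$jobName" in
foam | OpenFOAM )
    jobName=$(dirname $PWD)
    jobName=$(basename $jobName)
    ;;
esac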

nishant_hull February 11, 2008 16:24

Now the error is changed to:

(EE) /.OpenFOAM-1.4.1/bashrc cannot be found

Actually, I sourced my code with .bashrc. How can I make this code run on the cluster now?

Nishant

nico765 February 14, 2008 05:15

Any ideas on my problem described above?

Thanks,

Nicolas

nico765 February 15, 2008 14:27

Hello,

I finally fixed my problem.

'mpirun' should be replaced by 'mpirun -prefix $OPENMPI_ARCH_PATH'
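
Applied to the sge_script.sh I posted above, that gives:

#!/bin/bash
source /net/c3m/opt/OpenFOAM/OpenFOAM-1.4.1/.OpenFOAM-1.4.1/bashrc
source /net/c3m/opt/OpenFOAM/OpenFOAM-1.4.1/.bashrc
# the -prefix makes the daemons on the remote nodes use the
# right Open MPI installation
mpirun -prefix $OPENMPI_ARCH_PATH -np 4 simpleFoam .. case -parallel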

Nicolas

nishant_hull February 20, 2008 13:15

Hi,

I am using the mpich PE for running the parallel damBreak problem on 4 processors. However I am getting this error:
Got 4 processors.
Machines:

mpirun --prefix ~/OpenFOAM/OpenFOAM-1.4.1/src/openmpi-1.2.3/platforms/linux64GccDPOpt/ -np 4 -machinefile /tmp/802.1.parallel.q/machines interFoam . dam-dumy -parallel
[comp20:32553] mca: base: component_find: unable to open paffinity linux: file not found (ignored)
[comp20:32553] mca: base: component_find: unable to open ns proxy: file not found (ignored)
[comp20:32553] mca: base: component_find: unable to open ns replica: file not found (ignored)
[comp20:32553] mca: base: component_find: unable to open errmgr hnp: file not found (ignored)
[comp20:32553] mca: base: component_find: unable to open errmgr orted: file not found (ignored)
[comp20:32553] mca: base: component_find: unable to open errmgr proxy: file not found (ignored)
[comp20:32553] mca: base: component_find: unable to open rml oob: file not found (ignored)
[comp20:32553] mca: base: component_find: unable to open gpr null: file not found (ignored)
[comp20:32553] mca: base: component_find: unable to open gpr proxy: file not found (ignored)
[comp20:32553] mca: base: component_find: unable to open gpr replica: file not found (ignored)
[comp20:32553] mca: base: component_find: unable to open sds env: file not found (ignored)
[comp20:32553] mca: base: component_find: unable to open sds pipe: file not found (ignored)
[comp20:32553] mca: base: component_find: unable to open sds seed: file not found (ignored)
[comp20:32553] mca: base: component_find: unable to open sds singleton: file not found (ignored)
[comp20:32553] mca: base: component_find: unable to open sds slurm: file not found (ignored)
[comp20:32553] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 214
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_sds_base_select failed
--> Returned value -13 instead of ORTE_SUCCESS

--------------------------------------------------------------------------
[comp20:32553] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
[comp20:32553] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
--------------------------------------------------------------------------
Open RTE was unable to initialize properly. The error occured while
attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.

Can anybody help?

nishant_hull February 21, 2008 08:52

After providing the path to the relevant OpenFOAM library files, I am now getting this error:

Got 4 processors.
Machines:
[comp30:06445] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 214
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
orte_init:startup:internal-failure
from the file:
help-orte-runtime
But I couldn't find any file matching that name. Sorry!
--------------------------------------------------------------------------
[comp30:06445] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
[comp30:06445] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
orterun:init-failure
from the file:
help-orterun.txt
But I couldn't find any file matching that name. Sorry!
--------------------------------------------------------------------------


Can anybody comment on it?

nishant

olesen February 21, 2008 09:02

If OPAL_PREFIX is set, the file should be found.
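
A sketch of what that could look like in your script, before the mpirun call, using the installation path from your own posting:

# point Open MPI at its installation so that its help files
# and components can be found on all nodes
export OPAL_PREFIX=$HOME/OpenFOAM/OpenFOAM-1.4.1/src/openmpi-1.2.3/platforms/linux64GccDPOpt
export PATH=$OPAL_PREFIX/bin:$PATH
export LD_LIBRARY_PATH=$OPAL_PREFIX/lib:$LD_LIBRARY_PATH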

nishant_hull February 21, 2008 10:08

My script file contains these lines:

#!/bin/sh
#
# Your job name
#$ -N OMPI_Dumy
#
# Use current working directory
#$ -cwd
#
# Join stdout and stderr
#$ -j y
#
# pe request for MPICH. Set your number of processors here.
# Make sure you use the "mpich" parallel environment.
#$ -pe mpich 4
#
# Run job through bash shell
#$ -S /bin/bash
#
# The following is for reporting only. It is not really needed
# to run the job. It will show up in your output file.
echo "Got $NSLOTS processors."
echo "Machines:"
#
# These exports needed for OpenMPI, when using the command line
# these are set with the modules command, but with SGE scripts
# we assume nothing!
export PATH=~/OpenFOAM/OpenFOAM-1.4.1/src/openmpi-1.2.3/platforms/linux64GccDPOpt/bin:$PATH
export LD_LIBRARY_PATH=~/OpenFOAM/OpenFOAM-1.4.1/src/openmpi-1.2.3/platforms/linux64GccDPOpt/lib:$LD_LIBRARY_PATH

# Use full pathname to make sure we are using the right mpirun
# Need to use prefix for nodes
~/OpenFOAM/OpenFOAM-1.4.1/src/openmpi-1.2.3/platforms/linux64GccDPOpt/bin/mpirun --prefix ~/OpenFOAM/OpenFOAM-1.4.1/src/openmpi-1.2.3/platforms/linux64GccDPOpt/ -np $NSLOTS -machinefile ~/.mpich/mpich_hosts.$JOB_ID interFoam . dam-dumy -parallel


Did I set the OPAL_PREFIX right?

Nishant

nishant_hull February 21, 2008 10:12

I think I am doing something wrong with the path of the machinefile/hostfile, aren't I? The current path is the MPICH hostfile path. Should it be OpenFOAM's Open MPI path? Can you tell me something about that?

Nishant

nishant_hull February 25, 2008 09:59

While I was trying to debug the problem on my SGE cluster, I used the ompi_info command to investigate it.

The error registered by ompi_info is:

ompi_info: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory

Can anybody tell me what's going wrong now?

Nishant

