CFD Online Discussion Forums (http://www.cfd-online.com/Forums/)
-   OpenFOAM Installation (http://www.cfd-online.com/Forums/openfoam-installation/)
-   -   Lamboot trouble (http://www.cfd-online.com/Forums/openfoam-installation/57708-lamboot-trouble.html)

r2d2 October 14, 2005 05:01

Hi,
Something changed in my system lately and I don't know what it is. I was able to boot LAM with no problems in the past, but now I get the following (rather long) message:

radu@nodo1-2:~$ lamboot -d ./machines_foam
n-1<3811> ssi:boot:open: opening
n-1<3811> ssi:boot:open: opening boot module globus
n-1<3811> ssi:boot:open: opened boot module globus
n-1<3811> ssi:boot:open: opening boot module rsh
n-1<3811> ssi:boot:open: opened boot module rsh
n-1<3811> ssi:boot:open: opening boot module slurm
n-1<3811> ssi:boot:open: opened boot module slurm
n-1<3811> ssi:boot:select: initializing boot module slurm
n-1<3811> ssi:boot:slurm: not running under SLURM
n-1<3811> ssi:boot:select: boot module not available: slurm
n-1<3811> ssi:boot:select: initializing boot module rsh
n-1<3811> ssi:boot:rsh: module initializing
n-1<3811> ssi:boot:rsh:agent: rsh
n-1<3811> ssi:boot:rsh:username: <same>
n-1<3811> ssi:boot:rsh:verbose: 1000
n-1<3811> ssi:boot:rsh:algorithm: linear
n-1<3811> ssi:boot:rsh:no_n: 0
n-1<3811> ssi:boot:rsh:no_profile: 0
n-1<3811> ssi:boot:rsh:fast: 0
n-1<3811> ssi:boot:rsh:ignore_stderr: 0
n-1<3811> ssi:boot:rsh:priority: 10
n-1<3811> ssi:boot:select: boot module available: rsh, priority: 10
n-1<3811> ssi:boot:select: initializing boot module globus
n-1<3811> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<3811> ssi:boot:select: boot module not available: globus
n-1<3811> ssi:boot:select: finalizing boot module slurm
n-1<3811> ssi:boot:slurm: finalizing
n-1<3811> ssi:boot:select: closing boot module slurm
n-1<3811> ssi:boot:select: finalizing boot module globus
n-1<3811> ssi:boot:globus: finalizing
n-1<3811> ssi:boot:select: closing boot module globus
n-1<3811> ssi:boot:select: selected boot module rsh

LAM 7.1.1 - Indiana University

n-1<3811> ssi:boot:base: looking for boot schema in following directories:
n-1<3811> ssi:boot:base: <current>
n-1<3811> ssi:boot:base: $TROLLIUSHOME/etc
n-1<3811> ssi:boot:base: $LAMHOME/etc
n-1<3811> ssi:boot:base: /home/dm2/henry/OpenFOAM/OpenFOAM-1.2/src/lam-7.1.1/platforms/linuxGcc4Opt/etc
n-1<3811> ssi:boot:base: looking for boot schema file:
n-1<3811> ssi:boot:base: ./machines_foam
n-1<3811> ssi:boot:base: found boot schema: ./machines_foam
n-1<3811> ssi:boot:rsh: found the following hosts:
n-1<3811> ssi:boot:rsh: n0 nodo1-2 (cpu=1)
n-1<3811> ssi:boot:rsh: resolved hosts:
n-1<3811> ssi:boot:rsh: n0 nodo1-2 --> 192.168.3.2 (origin)
n-1<3811> ssi:boot:rsh: starting RTE procs
n-1<3811> ssi:boot:base:linear: starting
n-1<3811> ssi:boot:base:server: opening server TCP socket
n-1<3811> ssi:boot:base:server: opened port 43936
n-1<3811> ssi:boot:base:linear: booting n0 (nodo1-2)
n-1<3811> ssi:boot:rsh: starting lamd on (nodo1-2)
n-1<3811> ssi:boot:rsh: starting on n0 (nodo1-2): hboot -t -c lam-conf.lamd -d -I -H 192.168.3.2 -P 43936 -n 0 -o 0
n-1<3811> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
mkdir: Permission denied
tkill: got killname back: /tmp/lam-radu@nodo1-2/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-radu@nodo1-2/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-radu@nodo1-2/lam-io-socket
tkill: f_kill = "/tmp/lam-radu@nodo1-2/lam-killfile"
tkill: nothing to kill: "/tmp/lam-radu@nodo1-2/lam-killfile"
hboot: booting...
hboot: fork /mnt/store1/radu/OpenFOAM/OpenFOAM-1.2/src/lam-7.1.1/platforms/linuxGcc4Opt/bin/ lamd
[1] 3814 lamd -H 192.168.3.2 -P 43936 -n 0 -o 0 -d
n-1<3811> ssi:boot:rsh: successfully launched on n0 (nodo1-2)
n-1<3811> ssi:boot:base:server: expecting connection from finite list
hboot: attempting to execute
mkdir: Permission denied
chdir failed!: No such file or directory
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:

- There are network filters between the lamboot agent host and
the remote host such that communication on random TCP ports
is blocked
- Network routing from the remote host to the local host isn't
properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
one is the IP address of where the lamboot agent was invoked, the
other is the port number that the lamboot agent is expecting the
newly-booted process to call back on (this will be a random
integer).

2. Manually login to the remote machine and try to telnet to the port
indicated on the hboot command line. For example,
telnet <ipnumber> <portnumber>
If all goes well, you should get a "Connection refused" error. If
you get any other kind of error, it could indicate either of the
two conditions above. Consult with your system/network
administrator.
-----------------------------------------------------------------------------
n-1<3811> ssi:boot:base:server: failed to connect to remote lamd!
n-1<3811> ssi:boot:base:server: closing server socket
n-1<3811> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully


I did what step 2 above says and got the expected result, i.e.

radu@nodo1-2:~$ telnet nodo1-2 43936
Trying 192.168.3.2...
telnet: Unable to connect to remote host: Connection refused

...so I don't really know what is going wrong.
Any ideas, please? I see that it fails in some mkdir, but I can mkdir on any of the nodes in the list.
Thank you in advance,
Radu
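
Editor's sketch: in output this long, the two lines that matter ("mkdir: Permission denied" and "chdir failed!") are easy to miss. Assuming the debug output has been saved to a file, a small filter like the one below can pull out the suspicious lines. The function name `lam_log_errors` and the pattern list are hypothetical choices for illustration, not part of LAM.

```shell
#!/bin/sh
# Hypothetical helper (not part of LAM): print, with line numbers, the lines
# of a saved "lamboot -d" log that look like failures.
# $1 = path to the saved log file
lam_log_errors() {
    grep -n -E 'Permission denied|No such file or directory|failed' "$1"
}
```

A possible way to use it: save the output with `lamboot -d ./machines_foam > lamboot.log 2>&1`, then run `lam_log_errors lamboot.log`.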

mattijs October 16, 2005 09:44

- try 'ssh' to the machines
- try 'ssh <machine> ls' on each machine
- can you write to all the files needed?
- can you do mkdir /tmp/lam-radu@nodo1-2?
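
Editor's sketch: the last check in the list above can be wrapped in a small function, so the same test can be repeated on every node. This is a minimal sketch; the function name `check_session_dir` is hypothetical, and the directory name is taken from the tkill lines in the log (`/tmp/lam-radu@nodo1-2`).

```shell
#!/bin/sh
# Hypothetical helper (not part of LAM): try to create a LAM-style session
# directory under a base directory and report whether it worked.
# $1 = base directory (the log above shows LAM using /tmp)
# $2 = session directory name, e.g. lam-radu@nodo1-2
check_session_dir() {
    base=$1
    name=$2
    if mkdir -p "$base/$name" 2>/dev/null && [ -w "$base/$name" ]; then
        echo "ok: $base/$name is writable"
    else
        echo "FAIL: cannot create or write $base/$name"
        return 1
    fi
}
```

For example, `check_session_dir /tmp "lam-$(id -un)@$(hostname)"` run on each node should print an "ok" line; a "FAIL" line would match the "mkdir: Permission denied" seen in the lamboot output.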

r2d2 October 17, 2005 04:10

Hi Mattijs,
1 and 2 work fine
3 -- I don't know what the "needed" files are
4 -- no, I cannot mkdir in /tmp on any of the nodes in the list. I will ask the admin; I guess that's the trouble.
Thank you,
Radu

niklas October 17, 2005 04:19

You can create a tmp in your home directory and
point the system to that location instead, using
setenv TMPHOME $HOME/tmp
or TMP_HOME or something like that...

N
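
Editor's sketch: the post above is unsure of the variable name. To the best of my knowledge, the LAM/MPI 7.x documentation names `LAM_MPI_SESSION_PREFIX` as the variable that moves the session directory, with `TMPDIR` as a fallback. Treat that as an assumption and check the lamboot man page for your version. The helper below, with the hypothetical name `first_writable_dir`, only picks a usable directory; it is written for sh, whereas the `setenv` line above is csh syntax.

```shell
#!/bin/sh
# Hypothetical helper (not part of LAM): print the first argument that is an
# existing, writable directory; return non-zero if none qualifies.
first_writable_dir() {
    for d in "$@"; do
        if [ -d "$d" ] && [ -w "$d" ]; then
            printf '%s\n' "$d"
            return 0
        fi
    done
    return 1
}

# Possible use (assumes LAM_MPI_SESSION_PREFIX is honoured by your LAM build):
#   mkdir -p "$HOME/tmp"
#   LAM_MPI_SESSION_PREFIX=$(first_writable_dir /tmp "$HOME/tmp")
#   export LAM_MPI_SESSION_PREFIX
```

Since lamboot starts the daemons through rsh, the variable would probably also need to be set in the shell start-up files on each node, not only in the interactive session.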

r2d2 October 17, 2005 04:27

Thanks Niklas,
The problem is solved now. I got write permission for /tmp on the nodes, so now it boots all right.
Radu


All times are GMT -4.