Lamboot trouble

October 14, 2005, 05:01   #1
Radu Mustata (r2d2), Zaragoza, Spain
Hi,
Something changed in my system lately and I don't know what it is. I was able to boot LAM with no problems in the past, but now I get the following (rather long) message:

radu@nodo1-2:~$ lamboot -d ./machines_foam
n-1<3811> ssi:boot:open: opening
n-1<3811> ssi:boot:open: opening boot module globus
n-1<3811> ssi:boot:open: opened boot module globus
n-1<3811> ssi:boot:open: opening boot module rsh
n-1<3811> ssi:boot:open: opened boot module rsh
n-1<3811> ssi:boot:open: opening boot module slurm
n-1<3811> ssi:boot:open: opened boot module slurm
n-1<3811> ssi:boot:select: initializing boot module slurm
n-1<3811> ssi:boot:slurm: not running under SLURM
n-1<3811> ssi:boot:select: boot module not available: slurm
n-1<3811> ssi:boot:select: initializing boot module rsh
n-1<3811> ssi:boot:rsh: module initializing
n-1<3811> ssi:boot:rsh:agent: rsh
n-1<3811> ssi:boot:rsh:username: <same>
n-1<3811> ssi:boot:rsh:verbose: 1000
n-1<3811> ssi:boot:rsh:algorithm: linear
n-1<3811> ssi:boot:rsh:no_n: 0
n-1<3811> ssi:boot:rsh:no_profile: 0
n-1<3811> ssi:boot:rsh:fast: 0
n-1<3811> ssi:boot:rsh:ignore_stderr: 0
n-1<3811> ssi:boot:rsh:priority: 10
n-1<3811> ssi:boot:select: boot module available: rsh, priority: 10
n-1<3811> ssi:boot:select: initializing boot module globus
n-1<3811> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<3811> ssi:boot:select: boot module not available: globus
n-1<3811> ssi:boot:select: finalizing boot module slurm
n-1<3811> ssi:boot:slurm: finalizing
n-1<3811> ssi:boot:select: closing boot module slurm
n-1<3811> ssi:boot:select: finalizing boot module globus
n-1<3811> ssi:boot:globus: finalizing
n-1<3811> ssi:boot:select: closing boot module globus
n-1<3811> ssi:boot:select: selected boot module rsh

LAM 7.1.1 - Indiana University

n-1<3811> ssi:boot:base: looking for boot schema in following directories:
n-1<3811> ssi:boot:base: <current>
n-1<3811> ssi:boot:base: $TROLLIUSHOME/etc
n-1<3811> ssi:boot:base: $LAMHOME/etc
n-1<3811> ssi:boot:base: /home/dm2/henry/OpenFOAM/OpenFOAM-1.2/src/lam-7.1.1/platforms/linuxGcc4Opt/etc
n-1<3811> ssi:boot:base: looking for boot schema file:
n-1<3811> ssi:boot:base: ./machines_foam
n-1<3811> ssi:boot:base: found boot schema: ./machines_foam
n-1<3811> ssi:boot:rsh: found the following hosts:
n-1<3811> ssi:boot:rsh: n0 nodo1-2 (cpu=1)
n-1<3811> ssi:boot:rsh: resolved hosts:
n-1<3811> ssi:boot:rsh: n0 nodo1-2 --> 192.168.3.2 (origin)
n-1<3811> ssi:boot:rsh: starting RTE procs
n-1<3811> ssi:boot:base:linear: starting
n-1<3811> ssi:boot:base:server: opening server TCP socket
n-1<3811> ssi:boot:base:server: opened port 43936
n-1<3811> ssi:boot:base:linear: booting n0 (nodo1-2)
n-1<3811> ssi:boot:rsh: starting lamd on (nodo1-2)
n-1<3811> ssi:boot:rsh: starting on n0 (nodo1-2): hboot -t -c lam-conf.lamd -d -I -H 192.168.3.2 -P 43936 -n 0 -o 0
n-1<3811> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
mkdir: Permission denied
tkill: got killname back: /tmp/lam-radu@nodo1-2/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-radu@nodo1-2/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-radu@nodo1-2/lam-io-socket
tkill: f_kill = "/tmp/lam-radu@nodo1-2/lam-killfile"
tkill: nothing to kill: "/tmp/lam-radu@nodo1-2/lam-killfile"
hboot: booting...
hboot: fork /mnt/store1/radu/OpenFOAM/OpenFOAM-1.2/src/lam-7.1.1/platforms/linuxGcc4Opt/bin/ lamd
[1] 3814 lamd -H 192.168.3.2 -P 43936 -n 0 -o 0 -d
n-1<3811> ssi:boot:rsh: successfully launched on n0 (nodo1-2)
n-1<3811> ssi:boot:base:server: expecting connection from finite list
hboot: attempting to execute
mkdir: Permission denied
chdir failed!: No such file or directory
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:

- There are network filters between the lamboot agent host and
the remote host such that communication on random TCP ports
is blocked
- Network routing from the remote host to the local host isn't
properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
one is the IP address of where the lamboot agent was invoked, the
other is the port number that the lamboot agent is expecting the
newly-booted process to call back on (this will be a random
integer).

2. Manually login to the remote machine and try to telnet to the port
indicated on the hboot command line. For example,
telnet <ipnumber> <portnumber>
If all goes well, you should get a "Connection refused" error. If
you get any other kind of error, it could indicate either of the
two conditions above. Consult with your system/network
administrator.
-----------------------------------------------------------------------------
n-1<3811> ssi:boot:base:server: failed to connect to remote lamd!
n-1<3811> ssi:boot:base:server: closing server socket
n-1<3811> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully


I did what step 2 above says and got the expected result, i.e.

radu@nodo1-2:~$ telnet nodo1-2 43936
Trying 192.168.3.2...
telnet: Unable to connect to remote host: Connection refused

...so I don't really know what is going on.
Any ideas, please? I see that it fails in some mkdir, but I can mkdir anywhere in the list of nodes.
Thank you in advance,
Radu
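
In a trace this long, the timeout banner tends to bury the first real error. A minimal sketch for filtering the debug output follows; the function name `scan_boot_log` and the pattern list are illustrative assumptions, not something LAM provides.

```shell
#!/bin/sh
# scan_boot_log: read a lamboot debug trace on stdin and print, with line
# numbers, only the lines that look like low-level failures.
# The pattern list is an assumption; extend it to suit your system.
scan_boot_log() {
    # "|| true" keeps the function from failing when nothing matches.
    grep -n -i -E 'permission denied|no such file|failed' || true
}

# Example use (requires LAM to be installed):
#   lamboot -d ./machines_foam 2>&1 | scan_boot_log
```

Run against the trace above, this would surface lines such as `mkdir: Permission denied` and `chdir failed!: No such file or directory`, which point at a directory problem rather than at the network.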

October 16, 2005, 09:44   #2
Mattijs Janssens (mattijs)
- try 'ssh <machine>' to each of the machines
- try 'ssh <machine> ls' to each machine
- can you write to all the files needed?
- can you do mkdir /tmp/lam-radu@nodo1-2?
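
The last check can be scripted. This is only a sketch: the helper name `can_mkdir_in` and the probe directory name are made up for illustration, and it tests the same kind of directory creation that fails in the trace, not LAM itself.

```shell
#!/bin/sh
# can_mkdir_in DIR: succeed if the current user can create (and then remove)
# a subdirectory inside DIR. The probe name is arbitrary.
can_mkdir_in() {
    probe="$1/lam-probe-$$"
    mkdir "$probe" 2>/dev/null && rmdir "$probe"
}

if can_mkdir_in /tmp; then
    echo "/tmp is writable"
else
    echo "cannot create directories in /tmp"
fi
```

If this reports that /tmp is not writable on the node where lamboot runs, the `mkdir: Permission denied` lines in the trace are most likely the real cause of the timeout.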

October 17, 2005, 04:10   #3
Radu Mustata (r2d2), Zaragoza, Spain
Hi Mattijs,
1&2 work fine
3 -- don't know what the "needed" files are
4 -- no, I cannot mkdir in /tmp on any of the nodes in the list... will ask the admin... I guess that's the trouble.
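
To repeat check 4 on every node in the boot schema rather than one at a time, something like the following could be used. It is a sketch under assumptions: `list_hosts` and `check_nodes` are hypothetical helper names, passwordless ssh to each node is assumed, and the boot schema is assumed to hold one host per line with optional `#` comments.

```shell
#!/bin/sh
# list_hosts FILE: print the host name (first field) of every entry in a
# LAM boot schema, skipping blank lines and '#' comments.
list_hosts() {
    sed -e 's/#.*//' "$1" | awk 'NF { print $1 }'
}

# check_nodes FILE: try to create and remove a probe directory in /tmp on
# each host over ssh. The probe name is arbitrary; $$ expands on the remote side.
check_nodes() {
    for host in $(list_hosts "$1"); do
        if ssh -n "$host" 'mkdir /tmp/lam-probe.$$ && rmdir /tmp/lam-probe.$$'; then
            echo "$host: /tmp OK"
        else
            echo "$host: cannot mkdir in /tmp"
        fi
    done
}

# Example use:
#   check_nodes ./machines_foam
```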
Thank you,
Radu

October 17, 2005, 04:19   #4
Niklas Nordin (niklas), Stockholm, Sweden
You can create a tmp in your home directory and
point the system to that location instead, using
setenv TMPHOME $HOME/tmp
or TMP_HOME or something like that...
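
For LAM 7.x the variable in question is, as far as I know, `LAM_MPI_SESSION_PREFIX` (the `setting prefix to (null)` line in the tkill output appears to refer to it), but treat that name as an assumption and confirm it in the documentation of your LAM installation. A sketch in sh syntax, with a hypothetical helper name `use_private_session_dir`:

```shell
#!/bin/sh
# use_private_session_dir DIR: create DIR if needed and point LAM at it.
# LAM_MPI_SESSION_PREFIX is assumed to be the variable LAM 7.x reads for the
# session directory location; verify it against your LAM documentation.
use_private_session_dir() {
    mkdir -p "$1" || return 1
    LAM_MPI_SESSION_PREFIX="$1"
    export LAM_MPI_SESSION_PREFIX
}

# Example use (csh/tcsh users would write: setenv LAM_MPI_SESSION_PREFIX $HOME/tmp):
#   use_private_session_dir "$HOME/tmp" && lamboot -d ./machines_foam
```

The directory would have to exist on every node, and the variable would probably need to be set in the shell startup file so that remotely started daemons see it too.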

N

October 17, 2005, 04:27   #5
Radu Mustata (r2d2), Zaragoza, Spain
Thanks Niklas,
The problem is solved now. I was given write permission in /tmp on the nodes, so now it boots all right.
Radu

All times are GMT -4.