OpenORTE/mpi problem

Old   July 31, 2009, 14:26
Default OpenORTE/mpi problem
  #1
Senior Member
 
Tomislav Maric
Join Date: Mar 2009
Location: Darmstadt, Germany
Posts: 260
Blog Entries: 5
Rep Power: 11
tomislav_maric is on a distinguished road
Hello,

I'm trying to run damBreak on a LAN. When I execute

mpirun --hostfile machines -np 4 interFoam -parallel

I am prompted for my password on two nodes (mario and marija), but then something goes wrong. The job is started on the host icarus, and lines starting with # are my comments:

tomislav@icarus:damBreak$ mpirun --hostfile machines -np 4 interFoam -parallel

# First of all, ssh works on both nodes: I can log on and do whatever I
# want. Why do the password prompts for two different nodes (LAN hosts)
# appear on the same line?

tomislav@mario's password: tomislav@marija's password:

# I've entered my pass above and then there's a pause before this:
bash: orted: command not found
# I've googled and found an answer here:
# http://www.open-mpi.org/community/li...07/08/3876.php
# but it didn't help at all

[icarus:15321] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[icarus:15321] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
[icarus:15321] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[icarus:15321] ERROR: A daemon on node mario failed to start as expected.
[icarus:15321] ERROR: There may be more information available from
[icarus:15321] ERROR: the remote shell (see above).
[icarus:15321] ERROR: The daemon exited unexpectedly with status 127.
[icarus:15321] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[icarus:15321] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

Has anyone here encountered anything similar, or can anyone point me to where to look for an answer? Both nodes are running Slax from a live DVD.

Thank you,

Tomislav
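
The "bash: orted: command not found" line and the exit status 127 both say that the remote shell started by ssh could not find orted in its PATH. A minimal sketch for checking this on every node, assuming one hostname per line in the hostfile; check_orted is a hypothetical helper name, not part of Open MPI:

```shell
#!/bin/sh
# Sketch: for each host in an mpirun hostfile, ask a non-interactive remote
# shell whether it can find orted. check_orted is a hypothetical helper name.
check_orted() {
    hostfile=$1
    failed=0
    while read -r host _rest; do
        # Skip empty lines and comment lines.
        case $host in ''|'#'*) continue ;; esac
        # -n keeps ssh from swallowing the rest of the hostfile on stdin.
        if path=$(ssh -n "$host" 'command -v orted'); then
            echo "$host: orted found at $path"
        else
            echo "$host: orted not on the non-interactive PATH"
            failed=1
        fi
    done < "$hostfile"
    return "$failed"
}
```

Calling check_orted machines from the case directory lists the nodes where the lookup fails. If a node fails, the Open MPI bin directory probably has to be added to PATH in a startup file that the remote shell also reads for non-interactive logins; mpirun's --prefix option may be an alternative.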

Old   August 1, 2009, 11:02
Default
  #2
bjj
New Member
 
Bjarne Jensen
Join Date: Mar 2009
Location: Denmark
Posts: 7
Rep Power: 8
bjj is on a distinguished road
The problem may be your ssh password. You need password-less ssh login on your system for Open MPI to work across several nodes.

Regards,
Bjarne
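
Whether a node already accepts password-less login can be tested without starting mpirun. A minimal sketch, assuming OpenSSH on all machines; passwordless_ok is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: succeed only if ssh can log in to the given host without asking
# for a password. passwordless_ok is a hypothetical helper name.
passwordless_ok() {
    # BatchMode=yes makes ssh fail instead of prompting for a password;
    # ConnectTimeout keeps an unreachable host from hanging the check.
    ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}
```

For example, passwordless_ok mario && echo "mario is ready" prints the message only when key-based login works for that node.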

Old   August 1, 2009, 11:34
Default
  #3
Senior Member
 
Tomislav Maric
Quote:
Originally Posted by bjj View Post
The problem may be your ssh password. You need password-less ssh login on your system for Open MPI to work across several nodes.
Thank you very much for your answer. I've been going through the instructions on the Open MPI site, but without success:

The command ssh-keygen -t dsa gave me a ~/.ssh directory with a public/private key pair; I didn't enter a passphrase. I copied the .ssh directory to the node (mario), ran ssh-add successfully, and tried ssh mario, but I got the error message: "ssh: connect to host mario port 22: Connection refused".

ping mario works fine.

Tomislav
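
"Connection refused" on port 22 means that nothing accepted the connection at the address the name mario resolved to, so that has to be sorted out first. Separately, copying the key pair is not enough for key-based login: sshd only accepts a public key that is listed in ~/.ssh/authorized_keys of the target account, and by default it ignores that file if its permissions are too open. A minimal sketch of that step, to be run on the node you want to log in to; install_pubkey is a hypothetical helper name, and the tool ssh-copy-id may do the same job where it is available:

```shell
#!/bin/sh
# Sketch: append a public key to authorized_keys with the permissions sshd
# expects. install_pubkey is a hypothetical helper name.
install_pubkey() {
    pubkey_file=$1
    ssh_dir=${2:-$HOME/.ssh}
    auth="$ssh_dir/authorized_keys"
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$auth"
    # Append only if this exact key line is not already present.
    if ! grep -qxF -f "$pubkey_file" "$auth"; then
        cat "$pubkey_file" >> "$auth"
    fi
    chmod 600 "$auth"
}
```

With the key generated above, install_pubkey ~/.ssh/id_dsa.pub would be the call on mario, assuming the public key file has already been copied there.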

Old   August 1, 2009, 13:14
Default
  #4
Senior Member
 
Tomislav Maric
OK, I've tried again and weird things happened. I once more followed the instructions on the Open MPI site linked in my first post. I'm trying to run interFoam on the host marija and use the host mario as a slave node. This is what happens:

slax@marija:~/damBreak$ mpirun --hostfile hosts -np 2 interFoam -parallel
ssh: Could not resolve hostname marija: Name or service not known
--------------------------------------------------------------------------
A daemon (pid 8759) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

slax@marija:~$ ssh -v 192.168.1.66
OpenSSH_5.1p1, OpenSSL 0.9.8i 15 Sep 2008
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Connecting to 192.168.1.66 [192.168.1.66] port 22.
debug1: Connection established.
debug1: identity file /home/slax/.ssh/identity type -1
debug1: identity file /home/slax/.ssh/id_rsa type -1
debug1: identity file /home/slax/.ssh/id_dsa type 2
debug1: Remote protocol version 2.0, remote software version OpenSSH_5.1
debug1: match: OpenSSH_5.1 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.1
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-cbc hmac-md5 none
debug1: kex: client->server aes128-cbc hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
The authenticity of host '192.168.1.66 (192.168.1.66)' can't be established.
RSA key fingerprint is 8a:94:0a:55:2f:df:b2:82:7a:bc:b2:f9:6a:b7:f6:dc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.1.66' (RSA) to the list of known hosts.
debug1: ssh_rsa_verify: signature correct
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password,keyboard-interactive
debug1: Next authentication method: publickey
debug1: Trying private key: /home/slax/.ssh/identity
debug1: Trying private key: /home/slax/.ssh/id_rsa
debug1: Offering public key: /home/slax/.ssh/id_dsa
debug1: Authentications that can continue: publickey,password,keyboard-interactive
debug1: Next authentication method: keyboard-interactive
debug1: Authentications that can continue: publickey,password,keyboard-interactive
debug1: Next authentication method: password
slax@192.168.1.66's password:
debug1: Authentication succeeded (password).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
Last login: Sat Aug 1 18:20:50 2009
Linux 2.6.27.8.

slax@marija:~$ ssh mario
Last login: Sat Aug 1 16:41:41 2009 from mario
Linux 2.6.27.8.
slax@marija:~$ exit
logout
Connection to mario closed.
slax@marija:~$
What's weird is that in the last lines I'm exiting from a connection to the host mario, but there was no visible trace of that connection in the first place. ls showed me the contents of the slax home directory on the host marija, and the prompt shows that I'm the user slax on the host marija. What do I need to do to make ssh work without a password, besides following the instructions in the link?

Thanks in advance,
Tomislav

Old   August 1, 2009, 15:11
Default solved
  #5
Senior Member
 
Tomislav Maric
I had managed to set the wrong IP address for the host.


it works now!!!
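
A wrong or missing host entry like this can be spotted before mpirun is started. A minimal sketch, assuming the node names are resolved through an /etc/hosts-style file rather than DNS; check_hostfile is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: print the address each name in an mpirun hostfile maps to in an
# /etc/hosts-style file, and flag names that have no entry at all.
# check_hostfile and its arguments are hypothetical names for illustration.
check_hostfile() {
    hostfile=$1
    hosts_db=${2:-/etc/hosts}
    missing=0
    # Use the first field of each non-empty, non-comment hostfile line.
    while read -r name _rest; do
        case $name in ''|'#'*) continue ;; esac
        # Look for the name among the names (fields 2..NF) of any entry.
        ip=$(awk -v h="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == h) { print $1; exit } }
        ' "$hosts_db")
        if [ -n "$ip" ]; then
            echo "$name -> $ip"
        else
            echo "$name -> NOT FOUND in $hosts_db"
            missing=1
        fi
    done < "$hostfile"
    return "$missing"
}
```

Running check_hostfile machines on each node shows at a glance whether a name is missing (the "Could not resolve hostname" case) or points at an address that is not the intended machine.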

Old   October 27, 2010, 12:03
Default starting parallel runs in OpenFOAM
  #6
New Member
 
ANON
Join Date: Oct 2010
Posts: 2
Rep Power: 0
FluidsExpert is on a distinguished road
Quote:
Originally Posted by tomislav_maric View Post
I had managed to set the wrong IP address for the host.


it works now!!!
Hello tomislav,

I have just recently started parallel runs using the damBreak tutorial and am having difficulties. I am using SGE on ROCKS and I can't get the parallel command to work. I hope you can help in this regard.

Thanks

Old   October 28, 2010, 04:52
Default
  #7
Senior Member
 
Tomislav Maric
Quote:
Originally Posted by FluidsExpert View Post
Hello tomislav,

I have just recently started parallel runs using the damBreak tutorial and am having difficulties. I am using SGE on ROCKS and I can't get the parallel command to work. I hope you can help in this regard.

Thanks
You have to make sure you've done these two things:

1) OpenFOAM must be installed in an NFS-exported directory. The best choice is /share/apps, which the ROCKS user manual suggests for applications that are not installed via rolls or .rpm packages.

2) The parallel run must be started with full pathnames so that orte picks up the proper paths on the nodes. For this purpose you can use backticks "`" (command substitution):

`which mpirun` -machinefile MACHINES -np N `which SOLVER` -parallel

where MACHINES is the full pathname of the machinefile for mpirun, N is the number of cores, and SOLVER is the solver you wish to run.

Hope this helps,

Tomislav

P.S. If it doesn't work via SGE, try the manual parallel run. Does this work?
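
The full-path command from point 2 can be assembled and inspected before it is run. A minimal sketch, assuming mpirun and the solver are on the PATH of the submitting shell; parallel_cmd is a hypothetical helper name, and it only prints the command:

```shell
#!/bin/sh
# Sketch: build the mpirun command line with absolute paths.
# parallel_cmd is a hypothetical helper; it prints the command, not runs it.
parallel_cmd() {
    machines=$1   # full pathname of the machinefile
    ncores=$2     # number of cores
    solver=$3     # solver name, e.g. interFoam
    mpirun_path=$(command -v mpirun) || {
        echo "mpirun not in PATH" >&2
        return 1
    }
    solver_path=$(command -v "$solver") || {
        echo "$solver not in PATH" >&2
        return 1
    }
    echo "$mpirun_path -machinefile $machines -np $ncores $solver_path -parallel"
}
```

Printing the line first makes it easy to see which mpirun and which solver binary would actually be used; $(command -v ...) plays the same role here as the backticks with which.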

Old   October 28, 2010, 10:03
Default Could not resolve hostname
  #8
New Member
 
rlobosco
Join Date: Nov 2009
Posts: 5
Rep Power: 7
rlobosco is on a distinguished road
I have just tried to run the damBreak tutorial in parallel, but I am having trouble with it. Maybe someone can help me.
I have two machines, alfa and beta. I can ssh between them without a password and have no problems.
But when I try the command:
mpirun --hostfile machines -np 2 interFoam -parallel > log
it gives me the following error message:

ssh: Could not resolve hostname beta: Name or service not known
--------------------------------------------------------------------------
A daemon (pid 5947) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
alfa - daemon did not report back when launched
beta - daemon did not report back when launched

My machines file is in the directory ~/.ssh and has just the following lines:
alfa
beta
Can someone give me some hints?
With best regards,
rlobosco
