CFD Online Discussion Forums

CFD Online Discussion Forums (http://www.cfd-online.com/Forums/)
-   OpenFOAM Running, Solving & CFD (http://www.cfd-online.com/Forums/openfoam-solving/)
-   -   MPI PROBLEMS (http://www.cfd-online.com/Forums/openfoam-solving/59399-mpi-problems.html)

gtg627e September 24, 2007 11:20

Hello Forum,

I am trying to run a case in parallel. I can lamboot all the machines in my host file by issuing lamboot -v <file> and I get:

[gtg627eOpenFOAM@ruzzene03 turbFoam]$ lamboot -v machineNetwork

LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

n-1<27397> ssi:boot:base:linear: booting n0 (ruzzene03)
n-1<27397> ssi:boot:base:linear: booting n1 (ruzzene01)
n-1<27397> ssi:boot:base:linear: booting n2 (ruzzene02)
n-1<27397> ssi:boot:base:linear: booting n3 (ruzzene04)
n-1<27397> ssi:boot:base:linear: booting n4 (ruzzene05)
n-1<27397> ssi:boot:base:linear: booting n5 (ruzzene06)
n-1<27397> ssi:boot:base:linear: finished

------------------------------------------------

Then I issue:

mpirun --hostfile machineNetwork -np 12 turbFoam . AlexMovingMesh -parallel > log &

and I get:

[gtg627eOpenFOAM@ruzzene03 turbFoam]$ mpirun --hostfile machineNetwork -np 12 turbFoam . AlexMovingMesh -parallel > log &
[1] 27507
[gtg627eOpenFOAM@ruzzene03 turbFoam]$ bash: orted: command not found
bash: orted: command not found
bash: orted: command not found
bash: orted: command not found
bash: orted: command not found
[ruzzene03:27507] ERROR: A daemon on node ruzzene01 failed to start as expected.
[ruzzene03:27507] ERROR: There may be more information available from
[ruzzene03:27507] ERROR: the remote shell (see above).
[ruzzene03:27507] ERROR: The daemon exited unexpectedly with status 127.
[ruzzene03:27507] ERROR: A daemon on node ruzzene05 failed to start as expected.
[ruzzene03:27507] ERROR: There may be more information available from
[ruzzene03:27507] ERROR: the remote shell (see above).
[ruzzene03:27507] ERROR: The daemon exited unexpectedly with status 127.
[ruzzene03:27507] ERROR: A daemon on node ruzzene04 failed to start as expected.
[ruzzene03:27507] ERROR: There may be more information available from
[ruzzene03:27507] ERROR: the remote shell (see above).
[ruzzene03:27507] ERROR: The daemon exited unexpectedly with status 127.
[ruzzene03:27507] ERROR: A daemon on node ruzzene02 failed to start as expected.
[ruzzene03:27507] ERROR: There may be more information available from
[ruzzene03:27507] ERROR: the remote shell (see above).
[ruzzene03:27507] ERROR: The daemon exited unexpectedly with status 127.
[ruzzene03:27507] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[ruzzene03:27507] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[ruzzene03:27507] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[ruzzene03:27507] ERROR: A daemon on node ruzzene06 failed to start as expected.
[ruzzene03:27507] ERROR: There may be more information available from
[ruzzene03:27507] ERROR: the remote shell (see above).
[ruzzene03:27507] ERROR: The daemon exited unexpectedly with status 127.
[ruzzene03:27507] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[ruzzene03:27507] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.

--------------------------------------------------------------------------

Does anybody know what might cause this error?
Is it possible that my bashrc file is not providing a correct path to OpenFOAM/OpenFOAM-1.4.1/src/openmpi-1.2.3/orte?

Thank you in advance,

Alessandro Spadoni

olesen September 24, 2007 12:24

Hi Alessandro,

You (thankfully) don't need lamboot etc. if you are using Open-MPI.

From the error messages, it looks like Open-MPI is not available on the NFS share on the remote machines, or that you can't rsh to the remote machines.

At our installation, orte uses GridEngine and the environment is inherited without needing any OpenFOAM settings in ~/.profile or ~/.bashrc.

Make sure that you can execute the usual mpi "hello world" program (see the open-mpi FAQs).

BTW: if the hostfile is correctly formatted, you don't need the -np option.
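For reference, an Open-MPI hostfile simply lists one node per line with its slot count (the hostnames below are illustrative placeholders, not nodes from this thread):

```
# one line per node; slots = number of processes to launch there
node1.cluster.edu slots=2
node2.cluster.edu slots=2
node3.cluster.edu slots=2
```

With such a file, mpirun without -np launches one process per slot by default.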

gtg627e September 24, 2007 14:20

Hello Mark Olesen,

First of all thank you very much for your quick response!

I realized I had lam on the client machines, and openmpi on the host machine only. I copied all openmpi files to my client machines. I can ssh into them without password.

Then I issue the following


[gtg627eOpenFOAM@ruzzene03 turbFoam]$ mpirun --hostfile machineNetwork turbFoam . AlexMovingMesh -parallel > log &
[1] 9537
[gtg627eOpenFOAM@ruzzene03 turbFoam]$ --------------------------------------------------------------------------
Failed to find the following executable:

Host: ruzzene01
Executable: turbFoam

Cannot continue.
--------------------------------------------------------------------------
mpirun noticed that job rank 0 with PID 9541 on node ruzzene03 exited on signal 15 (Terminated).


So, my question at this point is: do I need OpenFOAM on all client machines?

olesen September 24, 2007 15:53

Quote:

So, my question at this point is: do I need OpenFOAM on all client machines?
Oh yes. You definitely need the executables and libraries that will be run on the remote host.
An NFS share is certainly the easiest.

If you are curious about the dependencies, use 'type turbFoam' or 'which turbFoam' to find out where the executable lies.
You can use 'ldd -v' to see which libraries will be used/required.

gtg627e September 24, 2007 16:55

Thank you Mark,

I will migrate the libraries/solvers etc. to the client machines.
Thank you again for the prompt response.
I am testing a dynamicFvMesh library that will handle imposed elastic deformations. I will keep you updated.

Alessandro

gtg627e September 24, 2007 22:24

Hello Mark,

I am still having some problems.
I now have OpenFOAM on all the client machines on the network. From my host machine I do the following:

-->cd OpenFOAM/gtg627e-1.4.1/run/tutorials/turbFoam
-->mpirun --hostfile machineNetwork turbFoam . AlexMovingMesh -parallel > log &

This is the content of the log:

MPI Pstream initialized with:
[4] Date : Sep 24 2007
[4] Time : 22:14:56
[4] Host : ruzzene02.ae.gatech.edu
[4] PID : 6282
[5] Date : Sep 24 2007
[5] Time : 22:14:56
[5] Host : ruzzene02.ae.gatech.edu
[5] PID : 6283
[8] Date : Sep 24 2007
[8] Time : 22:14:56
[8] Host : ruzzene05
[8] PID : 25254
[2] Date : Sep 24 2007
[2] Time : 22:14:56
[2] Host : ruzzene01
[2] PID : 21776
[9] Date : Sep 24 2007
[9] Time : 22:14:56
[9] Host : ruzzene05
[9] PID : 25255
[3] Date : Sep 24 2007
[3] Time : 22:14:56
[3] Host : ruzzene01
[3] PID : 21777
[6] Date : Sep 24 2007
[7] Date : Sep 24 2007
[7] Time : 22:14:56
[7] Host : ruzzene04
[7] PID : 18245
[10] Date : Sep 24 2007
floatTransfer : 1
nProcsSimpleSum : 0
scheduledTransfer : 0

/*---------------------------------------------------------------------------*\
| ========= | |
| \ / F ield | OpenFOAM: The Open Source CFD Toolbox |
| \ / O peration | Version: 1.4.1 |
| \ / A nd | Web: http://www.openfoam.org |
| \/ M anipulation | |
\*---------------------------------------------------------------------------*/

Exec : turbFoam /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/ AlexMovingMesh -parallel
[0] Date : Sep 24 2007
[0] Time : 22:14:56
[0] Host : ruzzene03
[0] PID : 8827
[1] Date : Sep 24 2007
[1] Time : 22:14:56
[1] Host : ruzzene03
[1] PID : 8828
[6] Time : 22:14:56
[6] Host : ruzzene04
[6] PID : 18244
[1] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[1] Case : AlexMovingMesh
[1] Nprocs : 12
[2] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[2] Case : AlexMovingMesh
[2] Nprocs : 12
[4] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[4] Case : AlexMovingMesh
[4] Nprocs : 12
[3] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[3] Case : AlexMovingMesh
[3] Nprocs : 12
[6] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[6] Case : AlexMovingMesh
[6] Nprocs : 12
[7] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[7] Case : AlexMovingMesh
[7] Nprocs : 12
[8] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[8] Case : AlexMovingMesh
[8] Nprocs : 12
[5] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[5] Case : AlexMovingMesh
[5] Nprocs : 12
[9] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[9] Case : AlexMovingMesh
[9] Nprocs : 12
[0] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[0] Case : AlexMovingMesh
[0] Nprocs : 12
[0] Slaves :
[0] 11
[0] (
[0] ruzzene03.8828
[0] ruzzene01.21776
[0] ruzzene01.21777
[0] ruzzene02.ae.gatech.edu.6282
[0] ruzzene02.ae.gatech.edu.6283
[0] ruzzene04.18244
[0] ruzzene04.18245
[0] ruzzene05.25254
[0] ruzzene05.25255
[0] ruzzene06.3853
[0] ruzzene06.3854
[0] )
[0]
Create time

[10] Time : 22:14:56
[10] Host : ruzzene06
[10] PID : 3853
[10] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[10] Case : AlexMovingMesh
[10] Nprocs : 12
[11] Date : Sep 24 2007
[11] Time : 22:14:56
[11] Host : ruzzene06
[11] PID : 3854
[11] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[11] Case : AlexMovingMesh
[11] Nprocs : 12
1 additional process aborted (not shown)

-----------------------------------------------------

Then the output switches to the terminal as follows:

-------------------------------------------------

[gtg627eOpenFOAM@ruzzene03 turbFoam]$ mpirun --hostfile machineNetwork turbFoam $FOAM_RUN/tutorials/turbFoam/ AlexMovingMesh -parallel > log &
[1] 8818
[gtg627eOpenFOAM@ruzzene03 turbFoam]$ [2]
[2]
[2] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[2]
[2]
FOAM parallel run exiting
[2]
[ruzzene01:21776] MPI_ABORT invoked on rank 2 in communicator MPI_COMM_WORLD with errorcode 1
[4] [3]
[4]
[4] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[4]
[4]
FOAM parallel run exiting

[3]
[3] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[3]
[3]
FOAM parallel run exiting
[3]
[6] [8] [ruzzene01:21777] MPI_ABORT invoked on rank 3 in communicator MPI_COMM_WORLD with errorcode 1
[4]

[6]
[6] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[6]
[6]
FOAM parallel run exiting
[6]
[8]
[8] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[8]
[ruzzene02.ae.gatech.edu:06282] MPI_ABORT invoked on rank 4 in communicator MPI_COMM_WORLD with errorcode 1

[8]
FOAM parallel run exiting
[8]
[5] [ruzzene04:18244] MPI_ABORT invoked on rank 6 in communicator MPI_COMM_WORLD with errorcode 1

[5]
[5] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[5]
[5]
FOAM parallel run exiting
[5]
[7] [ruzzene02.ae.gatech.edu:06283] MPI_ABORT invoked on rank 5 in communicator MPI_COMM_WORLD with errorcode 1

[7]
[7] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[7]
[7]
FOAM parallel run exiting
[7]
[ruzzene05:25254] MPI_ABORT invoked on rank 8 in communicator MPI_COMM_WORLD with errorcode 1
[ruzzene04:18245] MPI_ABORT invoked on rank 7 in communicator MPI_COMM_WORLD with errorcode 1
[9]
[9]
[9] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[9]
[9]
FOAM parallel run exiting
[9]
[ruzzene05:25255] MPI_ABORT invoked on rank 9 in communicator MPI_COMM_WORLD with errorcode 1
[ruzzene03][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
[10]
[10]
[10] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[10]
[10]
FOAM parallel run exiting
[10]
[ruzzene06:03853] MPI_ABORT invoked on rank 10 in communicator MPI_COMM_WORLD with errorcode 1
[11]
[11]
[11] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[11]
[11]
FOAM parallel run exiting
[11]
[ruzzene06:03854] MPI_ABORT invoked on rank 11 in communicator MPI_COMM_WORLD with errorcode 1
mpirun noticed that job rank 0 with PID 8827 on node ruzzene03 exited on signal 15 (Terminated).

-----------------------------------------------------

Do I have to have my case AlexMovingMesh with all its subdirectories (processor0, processor1, ..., constant, system) on each client machine? Or do I need to make my case AlexMovingMesh a shared directory for all network machines?

Thank you again for your help,

Alessandro

thomas September 25, 2007 02:03

Hi Alessandro,

The easiest way is to have the whole OF installation, and also every file and directory of your case, on every cluster node. To run properly, all the environment variables have to be set on every node.
Also, the whole directory structure has to be identical.

The error you posted above looks like you do not have enough rights to write into the corresponding directory, or it is simply not present.

Thomas

olesen September 25, 2007 03:18

Hi Alessandro,

Yes, you do need to have the case and subdirs available on all nodes. Again, the easiest is if it is on an NFS share.

Also, as Thomas mentioned, the OpenFOAM env variables need to be set on the remote machines. But this should be taken care of by orte. From your error messages, it looks like you are okay there.

gtg627e September 25, 2007 10:22

Hi Thomas and Mark,

I copied all case files to each node, and I still don't get OpenFOAM to run. This is the log content:

--------------------------------------------------------

MPI Pstream initialized with:
floatTransfer : 1
nProcsSimpleSum : 0
scheduledTransfer : 0

/*---------------------------------------------------------------------------*\
| ========= | |
| \ / F ield | OpenFOAM: The Open Source CFD Toolbox |
| \ / O peration | Version: 1.4.1 |
| \ / A nd | Web: http://www.openfoam.org |
| \/ M anipulation | |
\*---------------------------------------------------------------------------*/

Exec : turbFoam /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam AlexMovingMesh -parallel
[0] Date : Sep 25 2007
[0] Time : 10:17:27
[0] Host : ruzzene03
[0] PID : 21772
[4] Date : Sep 25 2007
[4] Time : 10:17:27
[4] Host : ruzzene02.ae.gatech.edu
[4] PID : 7834
[3] Date : Sep 25 2007
[10] Date : Sep 25 2007
[10] Time : 10:17:27
[10] Host : ruzzene06
[10] PID : 5367
[1] Date : Sep 25 2007
[1] Time : 10:17:27
[1] Host : ruzzene03
[1] PID : 21773
[1] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[1] Case : AlexMovingMesh
[1] Nprocs : 12
[5] Date : Sep 25 2007
[5] Time : 10:17:27
[5] Host : ruzzene02.ae.gatech.edu
[5] PID : 7835
[7] Date : Sep 25 2007
[7] Time : 10:17:27
[7] Host : ruzzene04
[7] PID : 19915
[3] Time : 10:17:27
[3] Host : ruzzene01
[3] PID : 23397
[8] Date : Sep 25 2007
[8] Time : 10:17:27
[8] Host : ruzzene05
[8] PID : 26770
[11] Date : Sep 25 2007
[11] Time : 10:17:27
[11] Host : ruzzene06
[11] PID : 5368
[0] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[0] Case : AlexMovingMesh
[0] Nprocs : 12
[0] Slaves :
[0] 11
[0] (
[0] ruzzene03.21773
[0] ruzzene01.23396
[0] ruzzene01.23397
[0] ruzzene02.ae.gatech.edu.7834
[0] ruzzene02.ae.gatech.edu.7835
[0] ruzzene04.19914
[0] ruzzene04.19915
[0] ruzzene05.26770
[0] ruzzene05.26771
[0] ruzzene06.5367
[0] ruzzene06.5368
[0] )
[0]
Create time

[4] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[4] Case : AlexMovingMesh
[4] Nprocs : 12
[6] Date : Sep 25 2007
[6] Time : 10:17:28
[6] Host : ruzzene04
[6] PID : 19914
[2] Date : Sep 25 2007
[2] Time : 10:17:28
[2] Host : ruzzene01
[2] PID : 23396
[2] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[2] Case : AlexMovingMesh
[2] Nprocs : 12
[9] Date : Sep 25 2007
[9] Time : 10:17:27
[9] Host : ruzzene05
[9] PID : 26771
[10] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[10] Case : AlexMovingMesh
[10] Nprocs : 12
[5] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/t[6] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[6] Case : AlexMovingMesh
[6] Nprocs : 12
[3] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[3] Case : AlexMovingMesh
[3] Nprocs : 12
[8] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[8] Case : AlexMovingMesh
[8] Nprocs : 12
urbFoam
[5] Case : AlexMovingMesh
[5] Nprocs : 12
[7] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[7] Case : AlexMovingMesh
[7] Nprocs : 12
[11] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[11] Case : AlexMovingMesh
[11] Nprocs : 12
[9] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[9] Case : AlexMovingMesh
[9] Nprocs : 12

------------------------------------------------

and this is the output to the terminal:

---------------------------------------------------

[gtg627eOpenFOAM@ruzzene03 ~]$ mpirun --hostfile machineNetwork turbFoam $FOAM_RUN/tutorials/turbFoam AlexMovingMesh -parallel > log &
[3] 21762
[gtg627eOpenFOAM@ruzzene03 ~]$ [ruzzene06][0,1,10][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene01][0,1,3][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene05][0,1,8][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene02][0,1,5][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene04][0,1,7][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene06][0,1,11][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene05][0,1,9][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

--------------------------------------------------------------------------------

Do you happen to know what causes this?
Sorry about all these trivial questions, but this is the first time I have run MPI with any application.

Thank you again for your help,

Alessandro

gschaider September 25, 2007 12:16

Hi Alessandro!

If this is the first time you run MPI on your machines, maybe you should start less ambitiously. Get a mickey-mouse problem (damBreak, for instance) and try to run it on only one machine and two CPUs. If that works, progress to using two machines. That way it is easier to pinpoint what your problem might be (individual MPI installations or the machine interconnect).

Bernhard

olesen September 26, 2007 02:34

Hi Alessandro,

Did you get the usual mpi "hello world" program (see the open-mpi FAQs) to work? It would be good to ensure that all MPI issues are sorted out first.

/mark

gtg627e September 27, 2007 15:50

Hello Mark and Bernhard,

I am able to run both the hello world example and my OpenFOAM case on my master machine in mpi mode. I have 2 cpu's on my master machine.
So, I am trying to figure out what went wrong with the case I posted above.
So, I looked up this error:

[ruzzene06][0,1,10][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

Then, I issued the following command to see what the above error means:

--> perl -e 'die&!=113'

and I got:

Can't modify non-lvalue subroutine call at -e line 1.

Does this mean I don't have permission to write on files on the slave machines?
Has any of you ever had a similar error?

Thank you again,

Alessandro

gschaider September 27, 2007 17:41

Hi Alessandro!

I have avoided perl for a long time, but if I try your command with 114 (or 115, or any other number), I get the same error message. So I guess the lvalue message is a perl error.

But if I say this (using a decent language):
python -c "import os;print os.strerror(113)"
I get this:
No route to host
(which is the information you wanted to get, I guess)

The problem seems to be that MPI can't reach a host in the host-file. Check the host file (sorry, I guess you did that already twice). Try using full host names (with the complete domain). Try using IP-numbers. Tell us what happened.

Bernhard

PS: I am fully aware that my solution needs approximately twice as many characters and is therefore unacceptable to a perl-head (although it works).

gtg627e September 27, 2007 23:39

Hello Bernhard,

After successfully running hello world on the master node, and then on the master node plus one slave node, I am working on running my OF case. I logged in to the slave node, copied my case files with scp, and inexplicably everything now runs perfectly. My first attempt runs my OF case on the master node and 1 slave node (1 CPU per node). I got that running OK.

I read the open mpi FAQ's and found out that the hostfile is written as follows:

machinex.host slots=2 max-slots=2
machiney.host slots=2 max-slots=2

In the OF UserGuide.pdf, the host file uses cpu instead of the slots specification. I don't know if this makes a difference, but I think this solved some of my problems.

So far I have tried three mpi configurations:

Run my case divided in 4 processes on my master node only (it is a quad core)--> clock time 616 s.

Run my case divided in 2 processes (1 process on my master node and 1 process on 1 slave node)--> clock time 2666 s.

Run my case without mpi on my master node only --> clock time 656 s.

My 6 quad-core nodes are connected via Ethernet with a Belkin gigabit switch.

I understand the above clock numbers may not reflect the general increase in performance from dividing a case into many processes, and that my case may be small enough to actually suffer from being split up. Do you happen to know where to find information about cluster setup and the best way to connect different machines?
Do the above clock numbers make sense to you?
In your experience with OF, do you gain performance by splitting up even small cases?

Thank you,

Alessandro

mbeaudoin September 27, 2007 23:48

For the Unix head, another way to find the same answer (for example, on a CentOS 4.4 host):

[prompt]$ find /usr/include -iname \*errno\* -exec grep 113 {} \; -print
#define EHOSTUNREACH 113 /* No route to host */
/usr/include/asm-x86_64/errno.h
#define EHOSTUNREACH 113 /* No route to host */
/usr/include/asm-i386/errno.h

Or even simpler:

[prompt ]$ grep 113 /usr/include/*.h /usr/include/*/*.h | grep errno
/usr/include/asm-i386/errno.h:#define EHOSTUNREACH 113 /* No route to host */
/usr/include/asm-x86_64/errno.h:#define EHOSTUNREACH 113 /* No route to host */

Just replace 113 with your own errno number; you will get a good hint about where to start looking for solving your problem.

Martin
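For completeness, Python's standard errno module gives the same lookup without grepping headers (a small sketch; the mapping of 113 shown in the comments is the one for Linux):

```python
import errno
import os

# translate the raw errno from the MPI log into a symbolic name and message
code = 113
print(errno.errorcode[code])  # EHOSTUNREACH (on Linux)
print(os.strerror(code))      # No route to host
```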

mbeaudoin September 28, 2007 00:28

Hello Alessandro,

About the clock time value you are reporting with your 2 processes case:

Yes, the size of the mesh might be a performance limiting factor here if it is too small.

But another limiting factor might be the configuration of your gigabit network.

Make sure that:
  • Your Ethernet cabling to the Belkin switch uses CAT5E or CAT6 cables. Don't use standard CAT5 cables with gigabit switches.
  • Your Ethernet adapters are not limited to 10/100 Mbit/s, but also support 1 Gbit/s. This is trivial, but double-checking the basics is sometimes necessary.
  • All your Ethernet adapters are configured for full-duplex mode. Under Linux, the command ethtool is usually your friend here.
  • On many modern switches, there is a port setting called portfast (or something similar) that, once enabled, could substantially improve your Ethernet port throughput. Read your Belkin configuration manual carefully.
Martin

olesen September 28, 2007 03:43

Hi Alessandro,

Quote:

--> perl -e 'die&!=113'

and I got:

Can't modify non-lvalue subroutine call at -e line 1.
This means you tried to explicitly call the (non-existing) subroutine '!' and simultaneously assign it a value. This is what the error message is trying to tell you: do not do this.
You might have better success with one of the following:

--> perl -e '$!=113; die'
or
--> perl -e 'die $!=113'


Since Bernhard was trolling about decent languages, I can't resist. In this case he is right, though.
Writing the equivalent incorrect program in Python may take a few more characters, but looks prettier when it fails ;)

>>> def errno():
... nil
... errno() = 113
File "<stdin>", line 3
errno() = 113
^
SyntaxError: invalid syntax


BTW: for the really lazy people (like me), searching Google for 'errno', '113' and 'mpi' is even faster than all the other solutions, and reveals the OpenMPI archives at the top of the list. That is also probably the place where you should be looking when debugging your MPI problems. It doesn't look like your problems are being caused by OpenFOAM.

/mark

gschaider October 1, 2007 05:20

Hi Alessandro!

cpu=x in the hostfile is probably a leftover from the LAM days (but I don't know for sure).

If you were doing your runs on the damBreak case, your numbers do not surprise me at all. For such small cases the main issue is latency, and for that Ethernet is notoriously bad (it was designed for stability, not for speed). And in that case Amdahl's law (http://en.wikipedia.org/wiki/Amdahl%27s_law) kicks in (with a longer non-parallelizable part than you would have for a serial case).
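Amdahl's law is easy to evaluate numerically; here is a small sketch (the 95% parallel fraction below is an arbitrary illustration, not a measurement of Alessandro's case):

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Ideal speedup when only parallel_fraction of the work parallelizes."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_procs)

# even with 95% of the work parallelizable, 12 processes give under 8x;
# communication latency on Ethernet only makes the serial part bigger
print(round(amdahl_speedup(0.95, 12), 2))  # 7.74
```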

About cluster setup: if you find a good text that describes all that, let us know. ("Building Clustered Linux Systems" by Robert W. Lucke is a nice start)

Bernhard

PS: Sorry about starting a "language war", but you must admit: the opportunity was too good to let it pass. I'm just disappointed that nobody came up with a Haskell solution to the problem (wrong group, probably).

olwi October 1, 2007 10:11

Hi,

We run Fluent and OF on a Linux cluster with AMD64 processors (two CPUs per node), with ordinary gigabit Ethernet. As Bernhard points out, latency is the limiting factor. You need to give the CPUs enough work to do before they want to communicate with other CPUs, so that they spend more time iterating than waiting for the network...

To get decent speed-up, our rule of thumb is to have 100,000-200,000 cells per CPU. For example, we would typically run problems of 1e6 cells on two nodes (= 4 CPUs).
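That rule of thumb is easy to turn into a quick sanity check (a sketch; proc_range is a made-up helper name, and the 100k-200k band is just the figure quoted above):

```python
def proc_range(n_cells, cells_hi=200_000, cells_lo=100_000):
    """Suggested process-count range for roughly 100k-200k cells per CPU."""
    return max(1, n_cells // cells_hi), max(1, n_cells // cells_lo)

# a 1e6-cell case lands at roughly 5-10 processes on gigabit Ethernet
print(proc_range(1_000_000))  # (5, 10)
```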

/Ola

gtg627e October 2, 2007 20:34

Dear Forum,

Thank you all for your suggestions.
As a summary that can be useful for lost people like me:

1. Make sure all your nodes are accessible via ssh from your master node without needing to enter a password. This is done with the ssh-keygen command.
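Step 1 might look like the following on the master node (a sketch; the slave hostname is taken from this thread, ssh-copy-id assumes OpenSSH is installed, and the key is generated in a scratch directory purely for illustration, where normally you would use the default ~/.ssh/id_rsa):

```shell
# generate a passphrase-less RSA key pair (no prompts thanks to -N "" and -q)
keydir=$(mktemp -d)
ssh-keygen -t rsa -N "" -f "$keydir/id_rsa" -q
ls "$keydir"   # id_rsa  id_rsa.pub

# then install the public key on every slave and verify, e.g.:
#   ssh-copy-id gtg627eOpenFOAM@ruzzene01
#   ssh ruzzene01 hostname    # should print the hostname with no password prompt
```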

2. Create the file decomposeParDict in your case/system/ directory. You can copy this file from one of the tutorials. If you don't know which tutorial, do the following:

> cd
> find . -name 'decomposeParDict*'

This will tell you where you can find the file decomposeParDict.

The number of subdomains specified in decomposeParDict should be equal to the number of CPUs you want to use, master+slaves.
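As an illustration, a minimal decomposeParDict for 4 subdomains could look like this (the standard FoamFile header is omitted, and the simpleCoeffs values are placeholders to adapt to your geometry):

```
numberOfSubdomains 4;

method simple;

simpleCoeffs
{
    n       (2 2 1);   // subdivisions in x, y, z; product must equal numberOfSubdomains
    delta   0.001;
}
```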

3. Run decomposePar in your case directory, and copy all case files to all the slave nodes.

4. Create a hostfile with the master and slave DNS names, specifying the number of CPUs per node, for example:

master.host slots=2 max-slots=2
slave1.host slots=1 max-slots=1
slave2.host slots=4 max-slots=4

etc...

5. Run your parallel simulation with:

> mpirun --hostfile "yourHostFile" turbFoam $FOAM_RUN/tutorials/turbFoam MyCase -parallel > log &

I am starting to run a case with arbitrary motion of mesh boundaries (in my case from a structural modal analysis) with both turb and LES models. I will keep you updated.

Thank you again,

Alessandro Spadoni

