MPI PROBLEMS

Old   September 24, 2007, 11:20
#1
Alessandro Spadoni
Hello Forum,

I am trying to run a case in parallel. I can lamboot all the machines in my host file by issuing lamboot -v <file> and I get:

[gtg627eOpenFOAM@ruzzene03 turbFoam]$ lamboot -v machineNetwork

LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

n-1<27397> ssi:boot:base:linear: booting n0 (ruzzene03)
n-1<27397> ssi:boot:base:linear: booting n1 (ruzzene01)
n-1<27397> ssi:boot:base:linear: booting n2 (ruzzene02)
n-1<27397> ssi:boot:base:linear: booting n3 (ruzzene04)
n-1<27397> ssi:boot:base:linear: booting n4 (ruzzene05)
n-1<27397> ssi:boot:base:linear: booting n5 (ruzzene06)
n-1<27397> ssi:boot:base:linear: finished

------------------------------------------------

Then I issue:

mpirun --hostfile machineNetwork -np 12 turbFoam . AlexMovingMesh -parallel > log &

and I get:

[gtg627eOpenFOAM@ruzzene03 turbFoam]$ mpirun --hostfile machineNetwork -np 12 turbFoam . AlexMovingMesh -parallel > log &
[1] 27507
[gtg627eOpenFOAM@ruzzene03 turbFoam]$ bash: orted: command not found
bash: orted: command not found
bash: orted: command not found
bash: orted: command not found
bash: orted: command not found
[ruzzene03:27507] ERROR: A daemon on node ruzzene01 failed to start as expected.
[ruzzene03:27507] ERROR: There may be more information available from
[ruzzene03:27507] ERROR: the remote shell (see above).
[ruzzene03:27507] ERROR: The daemon exited unexpectedly with status 127.
[ruzzene03:27507] ERROR: A daemon on node ruzzene05 failed to start as expected.
[ruzzene03:27507] ERROR: There may be more information available from
[ruzzene03:27507] ERROR: the remote shell (see above).
[ruzzene03:27507] ERROR: The daemon exited unexpectedly with status 127.
[ruzzene03:27507] ERROR: A daemon on node ruzzene04 failed to start as expected.
[ruzzene03:27507] ERROR: There may be more information available from
[ruzzene03:27507] ERROR: the remote shell (see above).
[ruzzene03:27507] ERROR: The daemon exited unexpectedly with status 127.
[ruzzene03:27507] ERROR: A daemon on node ruzzene02 failed to start as expected.
[ruzzene03:27507] ERROR: There may be more information available from
[ruzzene03:27507] ERROR: the remote shell (see above).
[ruzzene03:27507] ERROR: The daemon exited unexpectedly with status 127.
[ruzzene03:27507] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[ruzzene03:27507] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[ruzzene03:27507] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[ruzzene03:27507] ERROR: A daemon on node ruzzene06 failed to start as expected.
[ruzzene03:27507] ERROR: There may be more information available from
[ruzzene03:27507] ERROR: the remote shell (see above).
[ruzzene03:27507] ERROR: The daemon exited unexpectedly with status 127.
[ruzzene03:27507] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[ruzzene03:27507] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.

--------------------------------------------------------------------------

Does anybody know what might cause this error?
Is it possible that my bashrc file is not providing a correct path to OpenFOAM/OpenFOAM-1.4.1/src/openmpi-1.2.3/orte ?

Thank you in advance,

Alessandro Spadoni

Old   September 24, 2007, 12:24
#2
Mark Olesen
Hi Alessandro,

You (thankfully) don't need lamboot etc. if you are using Open-MPI.

From the error messages, it looks as if either Open-MPI is not available on the NFS share on the remote machines, or you can't rsh to the remote machines.

At our installation, orte uses GridEngine and the environment is inherited without needing any OpenFOAM settings in ~/.profile or ~/.bashrc.

Make sure that you can execute the usual MPI "hello world" program (see the Open-MPI FAQ).

BTW: if the hostfile is correctly formatted, you don't need the -np option.
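
The "orted: command not found" lines with exit status 127 suggest that the non-interactive shell started on each remote node cannot find the Open-MPI daemon on its PATH. A minimal Python sketch for checking this from the master node, assuming passwordless ssh already works; the host names are taken from the lamboot output above and both helper functions are hypothetical, not part of Open-MPI:

```python
import subprocess

# Remote hosts from the lamboot output earlier in the thread.
HOSTS = ["ruzzene01", "ruzzene02", "ruzzene04", "ruzzene05", "ruzzene06"]


def remote_lookup_command(host, executable="orted"):
    """Build an ssh command that asks the remote non-interactive shell
    where `executable` lives (hypothetical helper)."""
    return ["ssh", host, "command -v " + executable]


def find_missing(hosts, executable="orted"):
    """Return the hosts on which the lookup did not succeed."""
    missing = []
    for host in hosts:
        result = subprocess.run(
            remote_lookup_command(host, executable),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        # Non-zero means the name was not found, or ssh itself failed.
        if result.returncode != 0:
            missing.append(host)
    return missing
```

Calling find_missing(HOSTS) should return an empty list once every node can see orted; any host it returns is worth logging in to by hand.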

Old   September 24, 2007, 14:20
#3
Alessandro Spadoni
Hello Mark Olesen,

First of all thank you very much for your quick response!

I realized I had LAM on the client machines and Open MPI on the host machine only. I copied all the Open MPI files to my client machines. I can ssh into them without a password.

Then I issue the following:


[gtg627eOpenFOAM@ruzzene03 turbFoam]$ mpirun --hostfile machineNetwork turbFoam . AlexMovingMesh -parallel > log &
[1] 9537
[gtg627eOpenFOAM@ruzzene03 turbFoam]$ --------------------------------------------------------------------------
Failed to find the following executable:

Host: ruzzene01
Executable: turbFoam

Cannot continue.
--------------------------------------------------------------------------
mpirun noticed that job rank 0 with PID 9541 on node ruzzene03 exited on signal 15 (Terminated).


So, my question at this point is: do I need OpenFOAM on all client machines?

Old   September 24, 2007, 15:53
#4
Mark Olesen
Quote:
So, my question at this point is: do I need OpenFOAM on all client machines?
Oh yes. You definitely need the executables and libraries that will be run on the remote host.
An NFS share is certainly the easiest.

If you are curious about the dependencies, use 'type turbFoam' or 'which turbFoam' to find out where the executable lies.
You can use 'ldd -v' to see which libraries will be used/required.
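
The same two checks can be scripted, which is handy when they have to be repeated on every node. A small Python sketch, assuming a Linux host where ldd is installed; the function names are made up for illustration:

```python
import shutil
import subprocess


def locate(executable):
    """Return the full path of `executable` on the current PATH, or None."""
    return shutil.which(executable)


def shared_libraries(path):
    """Return the raw `ldd` output for a binary (Linux only)."""
    result = subprocess.run(
        ["ldd", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout
```

If locate("turbFoam") returns None on a node, the solver is not on that node's PATH; otherwise the returned path can be passed to shared_libraries() to look for "not found" entries.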

Old   September 24, 2007, 16:55
#5
Alessandro Spadoni
Thank you Mark,

I will migrate the libraries, solvers etc. to the client machines.
Thank you again for the prompt response.
I am testing a dynamicFvMesh library that will handle imposed elastic deformations. I will keep you updated.

Alessandro

Old   September 24, 2007, 22:24
#6
Alessandro Spadoni
Hello Mark,

I am still having some problems.
I now have OpenFOAM on all the client machines on the network. From my host machine I do the following:

-->cd OpenFOAM/gtg627e-1.4.1/run/tutorials/turbFoam
-->mpirun --hostfile machineNetwork turbFoam . AlexMovingMesh -parallel > log &

This is the content of the log:

MPI Pstream initialized with:
[4] Date : Sep 24 2007
[4] Time : 22:14:56
[4] Host : ruzzene02.ae.gatech.edu
[4] PID : 6282
[5] Date : Sep 24 2007
[5] Time : 22:14:56
[5] Host : ruzzene02.ae.gatech.edu
[5] PID : 6283
[8] Date : Sep 24 2007
[8] Time : 22:14:56
[8] Host : ruzzene05
[8] PID : 25254
[2] Date : Sep 24 2007
[2] Time : 22:14:56
[2] Host : ruzzene01
[2] PID : 21776
[9] Date : Sep 24 2007
[9] Time : 22:14:56
[9] Host : ruzzene05
[9] PID : 25255
[3] Date : Sep 24 2007
[3] Time : 22:14:56
[3] Host : ruzzene01
[3] PID : 21777
[6] Date : Sep 24 2007
[7] Date : Sep 24 2007
[7] Time : 22:14:56
[7] Host : ruzzene04
[7] PID : 18245
[10] Date : Sep 24 2007
floatTransfer : 1
nProcsSimpleSum : 0
scheduledTransfer : 0

/*---------------------------------------------------------------------------*\
| ========= | |
| \ / F ield | OpenFOAM: The Open Source CFD Toolbox |
| \ / O peration | Version: 1.4.1 |
| \ / A nd | Web: http://www.openfoam.org |
| \/ M anipulation | |
\*---------------------------------------------------------------------------*/

Exec : turbFoam /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/ AlexMovingMesh -parallel
[0] Date : Sep 24 2007
[0] Time : 22:14:56
[0] Host : ruzzene03
[0] PID : 8827
[1] Date : Sep 24 2007
[1] Time : 22:14:56
[1] Host : ruzzene03
[1] PID : 8828
[6] Time : 22:14:56
[6] Host : ruzzene04
[6] PID : 18244
[1] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[1] Case : AlexMovingMesh
[1] Nprocs : 12
[2] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[2] Case : AlexMovingMesh
[2] Nprocs : 12
[4] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[4] Case : AlexMovingMesh
[4] Nprocs : 12
[3] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[3] Case : AlexMovingMesh
[3] Nprocs : 12
[6] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[6] Case : AlexMovingMesh
[6] Nprocs : 12
[7] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[7] Case : AlexMovingMesh
[7] Nprocs : 12
[8] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[8] Case : AlexMovingMesh
[8] Nprocs : 12
[5] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[5] Case : AlexMovingMesh
[5] Nprocs : 12
[9] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[9] Case : AlexMovingMesh
[9] Nprocs : 12
[0] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[0] Case : AlexMovingMesh
[0] Nprocs : 12
[0] Slaves :
[0] 11
[0] (
[0] ruzzene03.8828
[0] ruzzene01.21776
[0] ruzzene01.21777
[0] ruzzene02.ae.gatech.edu.6282
[0] ruzzene02.ae.gatech.edu.6283
[0] ruzzene04.18244
[0] ruzzene04.18245
[0] ruzzene05.25254
[0] ruzzene05.25255
[0] ruzzene06.3853
[0] ruzzene06.3854
[0] )
[0]
Create time

[10] Time : 22:14:56
[10] Host : ruzzene06
[10] PID : 3853
[10] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[10] Case : AlexMovingMesh
[10] Nprocs : 12
[11] Date : Sep 24 2007
[11] Time : 22:14:56
[11] Host : ruzzene06
[11] PID : 3854
[11] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/
[11] Case : AlexMovingMesh
[11] Nprocs : 12
1 additional process aborted (not shown)

-----------------------------------------------------

Then the output switches to the terminal as follows:

-------------------------------------------------

[gtg627eOpenFOAM@ruzzene03 turbFoam]$ mpirun --hostfile machineNetwork turbFoam $FOAM_RUN/tutorials/turbFoam/ AlexMovingMesh -parallel > log &
[1] 8818
[gtg627eOpenFOAM@ruzzene03 turbFoam]$ [2]
[2]
[2] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[2]
[2]
FOAM parallel run exiting
[2]
[ruzzene01:21776] MPI_ABORT invoked on rank 2 in communicator MPI_COMM_WORLD with errorcode 1
[4] [3]
[4]
[4] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[4]
[4]
FOAM parallel run exiting

[3]
[3] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[3]
[3]
FOAM parallel run exiting
[3]
[6] [8] [ruzzene01:21777] MPI_ABORT invoked on rank 3 in communicator MPI_COMM_WORLD with errorcode 1
[4]

[6]
[6] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[6]
[6]
FOAM parallel run exiting
[6]
[8]
[8] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[8]
[ruzzene02.ae.gatech.edu:06282] MPI_ABORT invoked on rank 4 in communicator MPI_COMM_WORLD with errorcode 1

[8]
FOAM parallel run exiting
[8]
[5] [ruzzene04:18244] MPI_ABORT invoked on rank 6 in communicator MPI_COMM_WORLD with errorcode 1

[5]
[5] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[5]
[5]
FOAM parallel run exiting
[5]
[7] [ruzzene02.ae.gatech.edu:06283] MPI_ABORT invoked on rank 5 in communicator MPI_COMM_WORLD with errorcode 1

[7]
[7] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[7]
[7]
FOAM parallel run exiting
[7]
[ruzzene05:25254] MPI_ABORT invoked on rank 8 in communicator MPI_COMM_WORLD with errorcode 1
[ruzzene04:18245] MPI_ABORT invoked on rank 7 in communicator MPI_COMM_WORLD with errorcode 1
[9]
[9]
[9] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[9]
[9]
FOAM parallel run exiting
[9]
[ruzzene05:25255] MPI_ABORT invoked on rank 9 in communicator MPI_COMM_WORLD with errorcode 1
[ruzzene03][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
[10]
[10]
[10] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[10]
[10]
FOAM parallel run exiting
[10]
[ruzzene06:03853] MPI_ABORT invoked on rank 10 in communicator MPI_COMM_WORLD with errorcode 1
[11]
[11]
[11] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[11]
[11]
FOAM parallel run exiting
[11]
[ruzzene06:03854] MPI_ABORT invoked on rank 11 in communicator MPI_COMM_WORLD with errorcode 1
mpirun noticed that job rank 0 with PID 8827 on node ruzzene03 exited on signal 15 (Terminated).

-----------------------------------------------------

Do I need to have my case AlexMovingMesh with all its subdirectories (processor0, processor1, ..., 0, constant, system) on each client machine? Or do I need to make my case AlexMovingMesh a shared directory for all network machines?

Thank you again for your help,

Alessandro

Old   September 25, 2007, 02:03
#7
Thomas Gallinger
Hi Alessandro,

the easiest way is to have the whole OF installation and also every file and directory of your case on every cluster node. To run properly, all the environment variables have to be set on every node.
Also, the whole directory structure has to be similar.

The error you posted above looks like you do not have enough rights to write into the corresponding directory, or it is simply not present.

Thomas

Old   September 25, 2007, 03:18
#8
Mark Olesen
Hi Alessandro,

Yes, you do need to have the case and subdirs available on all nodes. Again, the easiest is if it is on an NFS share.

Also, as Thomas mentioned, the OpenFOAM environment variables need to be set on the remote machines. But this should be taken care of by orte. From your error messages it looks like you are okay there.

Old   September 25, 2007, 10:22
#9
Alessandro Spadoni
Hi Thomas and Mark,

I copied all case files to each node, and I still don't get OpenFOAM to run. This is the log content:

--------------------------------------------------------

MPI Pstream initialized with:
floatTransfer : 1
nProcsSimpleSum : 0
scheduledTransfer : 0

/*---------------------------------------------------------------------------*\
| ========= | |
| \ / F ield | OpenFOAM: The Open Source CFD Toolbox |
| \ / O peration | Version: 1.4.1 |
| \ / A nd | Web: http://www.openfoam.org |
| \/ M anipulation | |
\*---------------------------------------------------------------------------*/

Exec : turbFoam /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam AlexMovingMesh -parallel
[0] Date : Sep 25 2007
[0] Time : 10:17:27
[0] Host : ruzzene03
[0] PID : 21772
[4] Date : Sep 25 2007
[4] Time : 10:17:27
[4] Host : ruzzene02.ae.gatech.edu
[4] PID : 7834
[3] Date : Sep 25 2007
[10] Date : Sep 25 2007
[10] Time : 10:17:27
[10] Host : ruzzene06
[10] PID : 5367
[1] Date : Sep 25 2007
[1] Time : 10:17:27
[1] Host : ruzzene03
[1] PID : 21773
[1] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[1] Case : AlexMovingMesh
[1] Nprocs : 12
[5] Date : Sep 25 2007
[5] Time : 10:17:27
[5] Host : ruzzene02.ae.gatech.edu
[5] PID : 7835
[7] Date : Sep 25 2007
[7] Time : 10:17:27
[7] Host : ruzzene04
[7] PID : 19915
[3] Time : 10:17:27
[3] Host : ruzzene01
[3] PID : 23397
[8] Date : Sep 25 2007
[8] Time : 10:17:27
[8] Host : ruzzene05
[8] PID : 26770
[11] Date : Sep 25 2007
[11] Time : 10:17:27
[11] Host : ruzzene06
[11] PID : 5368
[0] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[0] Case : AlexMovingMesh
[0] Nprocs : 12
[0] Slaves :
[0] 11
[0] (
[0] ruzzene03.21773
[0] ruzzene01.23396
[0] ruzzene01.23397
[0] ruzzene02.ae.gatech.edu.7834
[0] ruzzene02.ae.gatech.edu.7835
[0] ruzzene04.19914
[0] ruzzene04.19915
[0] ruzzene05.26770
[0] ruzzene05.26771
[0] ruzzene06.5367
[0] ruzzene06.5368
[0] )
[0]
Create time

[4] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[4] Case : AlexMovingMesh
[4] Nprocs : 12
[6] Date : Sep 25 2007
[6] Time : 10:17:28
[6] Host : ruzzene04
[6] PID : 19914
[2] Date : Sep 25 2007
[2] Time : 10:17:28
[2] Host : ruzzene01
[2] PID : 23396
[2] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[2] Case : AlexMovingMesh
[2] Nprocs : 12
[9] Date : Sep 25 2007
[9] Time : 10:17:27
[9] Host : ruzzene05
[9] PID : 26771
[10] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[10] Case : AlexMovingMesh
[10] Nprocs : 12
[5] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/t[6] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[6] Case : AlexMovingMesh
[6] Nprocs : 12
[3] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[3] Case : AlexMovingMesh
[3] Nprocs : 12
[8] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[8] Case : AlexMovingMesh
[8] Nprocs : 12
urbFoam
[5] Case : AlexMovingMesh
[5] Nprocs : 12
[7] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[7] Case : AlexMovingMesh
[7] Nprocs : 12
[11] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[11] Case : AlexMovingMesh
[11] Nprocs : 12
[9] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam
[9] Case : AlexMovingMesh
[9] Nprocs : 12

------------------------------------------------

and this is the output to the terminal:

---------------------------------------------------

[gtg627eOpenFOAM@ruzzene03 ~]$ mpirun --hostfile machineNetwork turbFoam $FOAM_RUN/tutorials/turbFoam AlexMovingMesh -parallel > log &
[3] 21762
[gtg627eOpenFOAM@ruzzene03 ~]$ [ruzzene06][0,1,10][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene01][0,1,3][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene05][0,1,8][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene02][0,1,5][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene04][0,1,7][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene06][0,1,11][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene05][0,1,9][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

---------------------------------------------------

Do you happen to know what causes this?
Sorry about all these trivial questions, but this is the first time I have run MPI with any application.

Thank you again for your help,

Alessandro

Old   September 25, 2007, 12:16
#10
Bernhard Gschaider
Hi Alessandro!

If this is the first time you run MPI on your machines, maybe you should start less ambitiously. Get a Mickey Mouse problem (damBreak for instance) and try to run it on only one machine and two CPUs. If that works, progress to using two machines. That way it is easier to pinpoint what your problem might be (individual MPI installations or the machine interconnect).

Bernhard

Old   September 26, 2007, 02:34
#11
Mark Olesen
Hi Alessandro,

Did you get the usual mpi "hello world" program (see the open-mpi FAQs) to work? It would be good to ensure that all MPI issues are sorted out first.

/mark

Old   September 27, 2007, 15:50
#12
Alessandro Spadoni
Hello Mark and Bernhard,

I am able to run both the hello world example and my OpenFOAM case on my master machine in mpi mode. I have 2 cpu's on my master machine.
So, I am trying to figure out what went wrong with the case I posted above.
So, I looked up this error:

[ruzzene06][0,1,10][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

Then, I issued the following command to see what the above error means:

--> perl -e 'die&!=113'

and I got:

Can't modify non-lvalue subroutine call at -e line 1.

Does this mean I don't have permission to write to files on the slave machines?
Have any of you ever had a similar error?

Thank you again,

Alessandro

Old   September 27, 2007, 17:41
#13
Bernhard Gschaider
Hi Alessandro!

Have avoided perl for a long time, but if I try your command with 114 (and 115 and any other number), I get the same error message. So I guess the lvalue message is a perl-error.

But if I say this (using a decent language):
python -c "import os; print(os.strerror(113))"
I get this:
No route to host
(which is the information you wanted to get, I guess)

The problem seems to be that MPI can't reach a host in the host file. Check the host file (sorry, I guess you did that already twice). Try using full host names (with the complete domain). Try using IP addresses. Tell us what happened.
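
One quick way to compare short names, full names and IP addresses is to ask the resolver what each entry in the host file maps to. A minimal Python sketch (the function name is hypothetical); note that successful resolution is only a first check and says nothing yet about routing or firewalls:

```python
import socket


def resolve(host):
    """Return the sorted IP addresses `host` resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    # Each entry ends with a sockaddr tuple whose first item is the address.
    return sorted({info[4][0] for info in infos})
```

Running resolve() on the master and on a slave for the same name (for example "ruzzene02" and "ruzzene02.ae.gatech.edu" from the logs above) should give the same addresses on both machines; an empty or different result points at /etc/hosts or DNS.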

Bernhard

PS: I am fully aware that my solution needs approximately twice as many characters and therefore is unacceptable for a perl-head (although it works).

Old   September 27, 2007, 23:39
#14
Alessandro Spadoni
Hello Bernhard,

After successfully running hello world on the master node, and then on the master node and one slave node, I am working on running my OF case. Inexplicably, I logged in to the slave node, copied my case files with scp and everything now runs perfectly. My first attempt runs my OF case on the master node and 1 slave node (1 CPU per node). I got that running OK.

I read the Open MPI FAQ and found out that the hostfile is written as follows:

machinex.host slots=2 max-slots=2
machiney.host slots=2 max-slots=2

In the OF UserGuide.pdf, the host file uses cpu instead of the slots specification. I don't know if this makes a difference, but I think this solved some of my problems.
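
A quick sanity check on such a hostfile is to add up the slots and compare the total with the number of subdomains the case was decomposed into. A small Python sketch; the function name is hypothetical, and counting a line without a slots entry as one slot is an assumption of this sketch rather than documented behaviour:

```python
import re


def total_slots(hostfile_text):
    """Sum the `slots=N` values in a hostfile of the form shown above.

    Blank lines and `#` comments are skipped. A host line without a
    slots entry is counted as one slot (assumption of this sketch).
    """
    total = 0
    for line in hostfile_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # The lookbehind keeps `max-slots=` from being mistaken for `slots=`.
        match = re.search(r"(?<![-\w])slots=(\d+)", line)
        total += int(match.group(1)) if match else 1
    return total


EXAMPLE = """\
machinex.host slots=2 max-slots=2
machiney.host slots=2 max-slots=2
"""
print(total_slots(EXAMPLE))
```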

So far I have tried three mpi configurations:

Run my case divided in 4 processes on my master node only (it is a quad core)--> clock time 616 s.

Run my case divided in 2 processes (1 process on my master node and 1 process on 1 slave node)--> clock time 2666 s.

Run my case without mpi on my master node only --> clock time 656 s.
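
To make the three timings easier to compare, they can be turned into speedup and parallel efficiency relative to the run without MPI. A short Python sketch using only the clock times quoted above; the function names are my own:

```python
def speedup(serial_seconds, parallel_seconds):
    """Ratio of serial to parallel wall-clock time."""
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, n_procs):
    """Speedup divided by the number of processes used."""
    return speedup(serial_seconds, parallel_seconds) / n_procs


SERIAL = 656.0  # clock time without MPI, from the post
RUNS = [
    ("4 processes on the master node", 616.0, 4),
    ("1 process on master + 1 on a slave", 2666.0, 2),
]

for label, seconds, n_procs in RUNS:
    print("{}: speedup {:.2f}, efficiency {:.2f}".format(
        label,
        speedup(SERIAL, seconds),
        efficiency(SERIAL, seconds, n_procs),
    ))
```

A speedup below 1, as in the two-node run, means the parallel run was slower than the serial one.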

My 6 quad-core nodes are connected via ethernet with a belkin gigabit switch box.

I understand the above clock numbers may not reflect the general increase in performance from dividing a case into many processes. I understand my case may be small enough to actually suffer from being split up into many processes. Do you happen to know where to find information about cluster setup and the best way to connect different machines?
Do the above clock numbers make sense to you?
In your experience with OF, do you gain in performance splitting up even small cases?

Thank you,

Alessandro

Old   September 27, 2007, 23:48
#15
Martin Beaudoin
For the Unix head, another way to find the same answer (for example on a CentOS 4.4 host):

[prompt]$ find /usr/include -iname \*errno\* -exec grep 113 {} \; -print
#define EHOSTUNREACH 113 /* No route to host */
/usr/include/asm-x86_64/errno.h
#define EHOSTUNREACH 113 /* No route to host */
/usr/include/asm-i386/errno.h

Or even simpler:

[prompt ]$ grep 113 /usr/include/*.h /usr/include/*/*.h | grep errno
/usr/include/asm-i386/errno.h:#define EHOSTUNREACH 113 /* No route to host */
/usr/include/asm-x86_64/errno.h:#define EHOSTUNREACH 113 /* No route to host */

Just replace 113 with your own errno number; you will get a good hint about where to start looking for solving your problem.
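
For completeness, the symbolic name and the message can also be looked up together from Python's standard errno and os modules, without searching the header files. A minimal sketch (the function name is hypothetical; errno numbers are platform specific, so 113 means EHOSTUNREACH on Linux but not necessarily elsewhere):

```python
import errno
import os


def describe(code):
    """Return 'NAME: message' for an errno value on the current platform."""
    name = errno.errorcode.get(code, "unknown")
    return "{}: {}".format(name, os.strerror(code))


# Look the constant up by name so the example does not depend on
# the Linux-specific value 113.
print(describe(errno.EHOSTUNREACH))
```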

Martin

Old   September 28, 2007, 00:28
#16
Martin Beaudoin
Hello Alessandro,

About the clock time value you are reporting with your 2 processes case:

Yes, the size of the mesh might be a performance limiting factor here if it is too small.

But another limiting factor might be the configuration of your gigabit network.

Make sure that:
  • Your Ethernet cabling to the Belkin switch uses CAT5E or CAT6 cables. Don't use standard CAT5 cables with Gigabit switches.
  • Your Ethernet adapters are not limited to 10/100 Mbit/s, but also support 1 Gbit/s. This is trivial, but double-checking the basics is sometimes necessary.
  • All your Ethernet adapters are configured for full duplex mode. Under Linux, the command ethtool is usually your friend here.
  • On many modern switches, there is usually a port setting called portfast (or something similar) that, once enabled, could substantially improve your Ethernet port throughput. Read your Belkin configuration manual carefully.
Martin

Old   September 28, 2007, 03:43
#17
Mark Olesen
Hi Alessandro,

Quote:
--> perl -e 'die&!=113'

and I got:

Can't modify non-lvalue subroutine call at -e line 1.
This means you tried to explicitly call the (non-existing) subroutine '!' and simultaneously assign it a value. This is what the error message is trying to tell you: do not do this.
You might have better success with one of the following:

--> perl -e '$!=113; die'
or
--> perl -e 'die $!=113'


Since Bernhard was trolling about decent languages, I can't resist. In this case he is right though.
Writing the equivalent incorrect program in Python may take a few more characters, but looks prettier when it fails ;)

>>> def errno():
... nil
... errno() = 113
File "<stdin>", line 3
errno() = 113
^
SyntaxError: invalid syntax


BTW: for the really lazy people (like me), searching Google for 'errno', '113' and 'mpi' is even faster than all the other solutions and reveals the OpenMPI archives at the top of the list. This is also probably the place where you should be looking for debugging your MPI problems. It doesn't look like your problems are being caused by OpenFOAM.

/mark

Old   October 1, 2007, 05:20
#18
Bernhard Gschaider
Hi Alessandro!

cpu=x in the hostfile is probably a leftover from the LAM days (but I don't know for sure).

If you were doing your runs on the damBreak case, your numbers do not surprise me at all. For such small cases the main issue is latency, and for that Ethernet is notoriously bad (it was designed for stability, not for speed). And in that case Amdahl's law (http://en.wikipedia.org/wiki/Amdahl%27s_law) kicks in (with a longer non-parallelizable part than you would have for a serial case).
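
For reference, Amdahl's law gives the best possible speedup on n processes as 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelized. A small Python sketch of the formula; the function name is my own, and since the formula ignores communication latency it is only an upper bound for runs over Ethernet:

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Upper bound on speedup according to Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)


# Illustrative values only: even with 90% parallel work, the bound
# grows slowly with the number of processes.
for n in (2, 4, 12):
    print(n, round(amdahl_speedup(0.9, n), 2))
```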

About cluster setup: if you find a good text that describes all that, let us know. ("Building Clustered Linux Systems" by Robert W. Lucke is a nice start)

Bernhard

PS: Sorry about starting a "language war", but you must admit: the opportunity was too good to let it pass. I'm just disappointed that nobody came up with a Haskell solution to the problem (wrong group probably).

Old   October 1, 2007, 10:11
#19
Ola Widlund
Hi,

We run Fluent and OF on a Linux cluster with AMD64 processors (two CPUs per node), with ordinary gigabit Ethernet. As Bernhard points out, latency is the limiting factor. You need to give the CPUs enough work to do before they want to communicate with other CPUs, so that they spend more time iterating than waiting for the network...

To get decent speed-up, our rule of thumb is to have 100,000-200,000 cells per CPU. For example, we would typically run problems of 1e6 cells on two nodes (= 4 CPUs).

/Ola

Old   October 2, 2007, 20:34
#20
Alessandro Spadoni
Dear Forum,

Thank you all for your suggestions.
As a summary that can be useful for lost people like me:

1. Make sure all your nodes are accessible from your master node via ssh without entering a password. This is set up with the ssh-keygen command.

2. Create the file decomposeParDict in your case/system/ directory. You can copy this file from another tutorial. If you don't know which tutorial has one, do the following:

> cd
> find . -name 'decomposeParDict*'

This will tell you where you can find the file decomposeParDict.

The number of subdomains specified in decomposeParDict should be equal to the number of CPUs you want to use, master + slaves.

3. Run decomposePar on your case directory, and copy all case files to all the slave nodes.

4. Create a hostfile with the DNS names of the master and slaves, specifying the number of CPUs per node, for example:

master.host slots=2 max-slots=2
slave1.host slots=1 max-slots=1
slave2.host slots=4 max-slots=4

etc...

5. Run your parallel simulation with:

> mpirun --hostfile "yourHostFile" turbFoam $FOAM_RUN/tutorials/turbFoam MyCase -parallel > log &
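
If the run is launched from a script, the command from step 5 can be assembled as a list so that the pieces stay easy to change. A minimal Python sketch using only the mpirun options that appear in this thread (--hostfile and -np); the function name and argument names are hypothetical:

```python
import shlex


def mpirun_command(hostfile, solver, root, case, n_procs=None):
    """Assemble the mpirun call used in this thread as an argument list."""
    cmd = ["mpirun", "--hostfile", hostfile]
    if n_procs is not None:
        # Optional: the thread notes -np is not needed when the
        # hostfile already lists the slots.
        cmd += ["-np", str(n_procs)]
    cmd += [solver, root, case, "-parallel"]
    return cmd


# Print the command line for the case discussed above.
example = mpirun_command("machineNetwork", "turbFoam", ".", "AlexMovingMesh")
print(" ".join(shlex.quote(arg) for arg in example))
```

The list can be handed to subprocess.run() with stdout redirected to a log file, which corresponds to the "> log" redirection in the shell version.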

I am starting to run a case with arbitrary motion of mesh boundaries (in my case from a structural modal analysis) with both turb and LES models. I will keep you updated.

Thank you again,

Alessandro Spadoni