|
September 24, 2007, 12:20 |
|
#1 |
Member
Alessandro Spadoni
Join Date: Mar 2009
Location: Atlanta, GA
Posts: 65
Rep Power: 17 |
Hello Forum,
I am trying to run a case in parallel. I can lamboot all the machines in my host file by issuing lamboot -v <file>, and I get:

[gtg627eOpenFOAM@ruzzene03 turbFoam]$ lamboot -v machineNetwork
LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
n-1<27397> ssi:boot:base:linear: booting n0 (ruzzene03)
n-1<27397> ssi:boot:base:linear: booting n1 (ruzzene01)
n-1<27397> ssi:boot:base:linear: booting n2 (ruzzene02)
n-1<27397> ssi:boot:base:linear: booting n3 (ruzzene04)
n-1<27397> ssi:boot:base:linear: booting n4 (ruzzene05)
n-1<27397> ssi:boot:base:linear: booting n5 (ruzzene06)
n-1<27397> ssi:boot:base:linear: finished
------------------------------------------------

Then I issue:

mpirun --hostfile machineNetwork -np 12 turbFoam . AlexMovingMesh -parallel > log &

and I get:

[gtg627eOpenFOAM@ruzzene03 turbFoam]$ mpirun --hostfile machineNetwork -np 12 turbFoam . AlexMovingMesh -parallel > log &
[1] 27507
[gtg627eOpenFOAM@ruzzene03 turbFoam]$ bash: orted: command not found
bash: orted: command not found
bash: orted: command not found
bash: orted: command not found
bash: orted: command not found
[ruzzene03:27507] ERROR: A daemon on node ruzzene01 failed to start as expected.
[ruzzene03:27507] ERROR: There may be more information available from
[ruzzene03:27507] ERROR: the remote shell (see above).
[ruzzene03:27507] ERROR: The daemon exited unexpectedly with status 127.
[ruzzene03:27507] ERROR: A daemon on node ruzzene05 failed to start as expected.
[ruzzene03:27507] ERROR: There may be more information available from
[ruzzene03:27507] ERROR: the remote shell (see above).
[ruzzene03:27507] ERROR: The daemon exited unexpectedly with status 127.
[ruzzene03:27507] ERROR: A daemon on node ruzzene04 failed to start as expected.
[ruzzene03:27507] ERROR: There may be more information available from
[ruzzene03:27507] ERROR: the remote shell (see above).
[ruzzene03:27507] ERROR: The daemon exited unexpectedly with status 127.
[ruzzene03:27507] ERROR: A daemon on node ruzzene02 failed to start as expected.
[ruzzene03:27507] ERROR: There may be more information available from
[ruzzene03:27507] ERROR: the remote shell (see above).
[ruzzene03:27507] ERROR: The daemon exited unexpectedly with status 127.
[ruzzene03:27507] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[ruzzene03:27507] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[ruzzene03:27507] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[ruzzene03:27507] ERROR: A daemon on node ruzzene06 failed to start as expected.
[ruzzene03:27507] ERROR: There may be more information available from
[ruzzene03:27507] ERROR: the remote shell (see above).
[ruzzene03:27507] ERROR: The daemon exited unexpectedly with status 127.
[ruzzene03:27507] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[ruzzene03:27507] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

Does anybody know what might cause this error? Is it possible that my bashrc file is not providing a correct path to OpenFOAM/OpenFOAM-1.4.1/src/openmpi-1.2.3/orte?

Thank you in advance,
Alessandro Spadoni
|
September 24, 2007, 13:24 |
|
#2 |
Senior Member
Mark Olesen
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,714
Rep Power: 40 |
Hi Alessandro,
You (thankfully) don't need lamboot etc. if you are using Open MPI. From the error messages, it looks as though Open MPI is not available on the NFS share seen by the remote machines, or you can't rsh to the remote machines. At our installation, the orte uses GridEngine and the environment is inherited without needing any OpenFOAM settings in ~/.profile or ~/.bashrc. Make sure that you can execute the usual MPI "hello world" program (see the Open MPI FAQs). BTW: if the hostfile is correctly formatted, you don't need the -np option.
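For reference, a plain Open MPI hostfile for the machines in this thread could look something like this (a sketch only; the slots counts are assumptions and should match the number of CPUs you actually want to use on each node):

ruzzene03 slots=2
ruzzene01 slots=2
ruzzene02 slots=2
ruzzene04 slots=2
ruzzene05 slots=2
ruzzene06 slots=2

With the slots recorded in the file, mpirun starts one process per listed slot by default, which is why the -np option becomes unnecessary.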
|
September 24, 2007, 15:20 |
|
#3 |
Member
Alessandro Spadoni
Join Date: Mar 2009
Location: Atlanta, GA
Posts: 65
Rep Power: 17 |
Hello Mark Olesen,
First of all, thank you very much for your quick response! I realized I had LAM on the client machines and Open MPI on the host machine only, so I copied all the Open MPI files to my client machines. I can ssh into them without a password. Then I issue the following:

[gtg627eOpenFOAM@ruzzene03 turbFoam]$ mpirun --hostfile machineNetwork turbFoam . AlexMovingMesh -parallel > log &
[1] 9537
[gtg627eOpenFOAM@ruzzene03 turbFoam]$
--------------------------------------------------------------------------
Failed to find the following executable:
Host: ruzzene01
Executable: turbFoam
Cannot continue.
--------------------------------------------------------------------------
mpirun noticed that job rank 0 with PID 9541 on node ruzzene03 exited on signal 15 (Terminated).

So, my question at this point is: do I need OpenFOAM on all client machines?
|
September 24, 2007, 16:53 |
|
#4 | |
Senior Member
Mark Olesen
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,714
Rep Power: 40 |
Quote: "So, my question at this point is: do I need OpenFOAM on all client machines?"
An NFS share is certainly the easiest. If you are curious about the dependencies, use 'type turbFoam' or 'which turbFoam' to find out where the executable lies. You can use 'ldd -v' to see which libraries will be used/required.
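Concretely, the checks could look something like this on the head node (a hedged sketch; the actual paths reported will depend on your installation):

[gtg627eOpenFOAM@ruzzene03 ~]$ which turbFoam
[gtg627eOpenFOAM@ruzzene03 ~]$ type turbFoam
[gtg627eOpenFOAM@ruzzene03 ~]$ ldd -v `which turbFoam`

If 'which turbFoam' points somewhere under your home directory rather than a shared filesystem, the remote nodes will not find the executable unless the same path exists on them as well.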
||
September 24, 2007, 17:55 |
|
#5 |
Member
Alessandro Spadoni
Join Date: Mar 2009
Location: Atlanta, GA
Posts: 65
Rep Power: 17 |
Thank you Mark,
I will migrate the libraries/solvers etc. to the client machines. Thank you again for the prompt response. I am testing a dynamicFvMesh library that will handle imposed elastic deformations. I will keep you updated.
Alessandro
|
September 24, 2007, 23:24 |
|
#6 |
Member
Alessandro Spadoni
Join Date: Mar 2009
Location: Atlanta, GA
Posts: 65
Rep Power: 17 |
Hello Mark,
I am still having some problems. I now have OpenFOAM on all the client machines on the network. From my host machine I do the following:

--> cd OpenFOAM/gtg627e-1.4.1/run/tutorials/turbFoam
--> mpirun --hostfile machineNetwork turbFoam . AlexMovingMesh -parallel > log &

This is the content of the log (start-up output from the 12 processes is interleaved):

MPI Pstream initialized with:
[4] Date : Sep 24 2007 [4] Time : 22:14:56 [4] Host : ruzzene02.ae.gatech.edu [4] PID : 6282
[5] Date : Sep 24 2007 [5] Time : 22:14:56 [5] Host : ruzzene02.ae.gatech.edu [5] PID : 6283
[8] Date : Sep 24 2007 [8] Time : 22:14:56 [8] Host : ruzzene05 [8] PID : 25254
[2] Date : Sep 24 2007 [2] Time : 22:14:56 [2] Host : ruzzene01 [2] PID : 21776
[9] Date : Sep 24 2007 [9] Time : 22:14:56 [9] Host : ruzzene05 [9] PID : 25255
[3] Date : Sep 24 2007 [3] Time : 22:14:56 [3] Host : ruzzene01 [3] PID : 21777
[6] Date : Sep 24 2007
[7] Date : Sep 24 2007 [7] Time : 22:14:56 [7] Host : ruzzene04 [7] PID : 18245
[10] Date : Sep 24 2007
floatTransfer : 1
nProcsSimpleSum : 0
scheduledTransfer : 0
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.4.1                                 |
|   \\  /    A nd           | Web:      http://www.openfoam.org               |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Exec : turbFoam /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/ AlexMovingMesh -parallel
[0] Date : Sep 24 2007 [0] Time : 22:14:56 [0] Host : ruzzene03 [0] PID : 8827
[1] Date : Sep 24 2007 [1] Time : 22:14:56 [1] Host : ruzzene03 [1] PID : 8828
[6] Time : 22:14:56 [6] Host : ruzzene04 [6] PID : 18244
[1] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/ [1] Case : AlexMovingMesh [1] Nprocs : 12
[2] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/ [2] Case : AlexMovingMesh [2] Nprocs : 12
[4] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/ [4] Case : AlexMovingMesh [4] Nprocs : 12
[3] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/ [3] Case : AlexMovingMesh [3] Nprocs : 12
[6] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/ [6] Case : AlexMovingMesh [6] Nprocs : 12
[7] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/ [7] Case : AlexMovingMesh [7] Nprocs : 12
[8] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/ [8] Case : AlexMovingMesh [8] Nprocs : 12
[5] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/ [5] Case : AlexMovingMesh [5] Nprocs : 12
[9] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/ [9] Case : AlexMovingMesh [9] Nprocs : 12
[0] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/ [0] Case : AlexMovingMesh [0] Nprocs : 12
[0] Slaves :
[0] 11
[0] (
[0] ruzzene03.8828
[0] ruzzene01.21776
[0] ruzzene01.21777
[0] ruzzene02.ae.gatech.edu.6282
[0] ruzzene02.ae.gatech.edu.6283
[0] ruzzene04.18244
[0] ruzzene04.18245
[0] ruzzene05.25254
[0] ruzzene05.25255
[0] ruzzene06.3853
[0] ruzzene06.3854
[0] )
[0] Create time
[10] Time : 22:14:56 [10] Host : ruzzene06 [10] PID : 3853
[10] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/ [10] Case : AlexMovingMesh [10] Nprocs : 12
[11] Date : Sep 24 2007 [11] Time : 22:14:56 [11] Host : ruzzene06 [11] PID : 3854
[11] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/ [11] Case : AlexMovingMesh [11] Nprocs : 12
1 additional process aborted (not shown)
-----------------------------------------------------

Then the output switches to the terminal as follows:

-------------------------------------------------
[gtg627eOpenFOAM@ruzzene03 turbFoam]$ mpirun --hostfile machineNetwork turbFoam $FOAM_RUN/tutorials/turbFoam/ AlexMovingMesh -parallel > log &
[1] 8818
[gtg627eOpenFOAM@ruzzene03 turbFoam]$
[2] [2] [2] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[2] [2] FOAM parallel run exiting [2]
[ruzzene01:21776] MPI_ABORT invoked on rank 2 in communicator MPI_COMM_WORLD with errorcode 1
[4] [3] [4] [4] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[4] [4] FOAM parallel run exiting
[3] [3] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[3] [3] FOAM parallel run exiting [3] [6] [8]
[ruzzene01:21777] MPI_ABORT invoked on rank 3 in communicator MPI_COMM_WORLD with errorcode 1
[4] [6] [6] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[6] [6] FOAM parallel run exiting [6]
[8] [8] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/" [8]
[ruzzene02.ae.gatech.edu:06282] MPI_ABORT invoked on rank 4 in communicator MPI_COMM_WORLD with errorcode 1
[8] FOAM parallel run exiting [8] [5]
[ruzzene04:18244] MPI_ABORT invoked on rank 6 in communicator MPI_COMM_WORLD with errorcode 1
[5] [5] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[5] [5] FOAM parallel run exiting [5] [7]
[ruzzene02.ae.gatech.edu:06283] MPI_ABORT invoked on rank 5 in communicator MPI_COMM_WORLD with errorcode 1
[7] [7] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[7] [7] FOAM parallel run exiting [7]
[ruzzene05:25254] MPI_ABORT invoked on rank 8 in communicator MPI_COMM_WORLD with errorcode 1
[ruzzene04:18245] MPI_ABORT invoked on rank 7 in communicator MPI_COMM_WORLD with errorcode 1
[9] [9] [9] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[9] [9] FOAM parallel run exiting [9]
[ruzzene05:25255] MPI_ABORT invoked on rank 9 in communicator MPI_COMM_WORLD with errorcode 1
[ruzzene03][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
[10] [10] [10] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[10] [10] FOAM parallel run exiting [10]
[ruzzene06:03853] MPI_ABORT invoked on rank 10 in communicator MPI_COMM_WORLD with errorcode 1
[11] [11] [11] --> FOAM FATAL ERROR : turbFoam: cannot open root directory "/home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam/"
[11] [11] FOAM parallel run exiting [11]
[ruzzene06:03854] MPI_ABORT invoked on rank 11 in communicator MPI_COMM_WORLD with errorcode 1
mpirun noticed that job rank 0 with PID 8827 on node ruzzene03 exited on signal 15 (Terminated).
-----------------------------------------------------

Do I have to have my case AlexMovingMesh, with all its subdirectories (processor0, processor1...0, constant, system), on each client machine? Or do I need to make my case AlexMovingMesh a shared directory for all network machines?

Thank you again for your help,
Alessandro
|
September 25, 2007, 03:03 |
|
#7 |
New Member
Thomas Gallinger
Join Date: Mar 2009
Posts: 28
Rep Power: 17 |
Hi Alessandro,
the easiest way is to have the whole OF installation, and also every file and directory of your case, on every cluster node. To run properly, all the environment variables have to be set on every node, and the whole directory structure has to be identical. The error you posted above looks like you either do not have sufficient permissions to write into the corresponding directory, or it is simply not present.
Thomas
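One way to check this from the head node (a hedged sketch; ruzzene01 and turbFoam are just the names used in this thread):

ssh ruzzene01 'echo $WM_PROJECT_DIR ; which turbFoam ; which orted'

If the variable comes back empty or the executables are not found, the OpenFOAM/Open MPI environment is not being set up for non-interactive shells on that node, and the remote daemons will fail in the way shown earlier.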
|
September 25, 2007, 04:18 |
|
#8 |
Senior Member
Mark Olesen
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,714
Rep Power: 40 |
Hi Alessandro,
Yes, you do need to have the case and its subdirectories available on all nodes. Again, the easiest is to put it on an NFS share. Also, as Thomas mentioned, the OpenFOAM environment variables will need to be set on the remote machines, but this should be taken care of by the orte. From your error messages it looks like you are okay there.
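As a sanity check before launching (a sketch; the host names and case path are the ones from this thread and may need adjusting), you can verify that every node sees the decomposed case:

for h in ruzzene01 ruzzene02 ruzzene04 ruzzene05 ruzzene06 ; do
    ssh $h ls -d $FOAM_RUN/tutorials/turbFoam/AlexMovingMesh/processor0 ;
done

Each node should list the processor0 directory (and likewise processor1, processor2, ...); a "No such file or directory" on any node means that node will abort with the "cannot open root directory" error.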
|
September 25, 2007, 11:22 |
|
#9 |
Member
Alessandro Spadoni
Join Date: Mar 2009
Location: Atlanta, GA
Posts: 65
Rep Power: 17 |
Hi Thomas and Mark,
I copied all case files to each node, and I still don't get OpenFOAM to run. This is the log content:

--------------------------------------------------------
MPI Pstream initialized with:
floatTransfer : 1
nProcsSimpleSum : 0
scheduledTransfer : 0
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.4.1                                 |
|   \\  /    A nd           | Web:      http://www.openfoam.org               |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Exec : turbFoam /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam AlexMovingMesh -parallel
[0] Date : Sep 25 2007 [0] Time : 10:17:27 [0] Host : ruzzene03 [0] PID : 21772
[4] Date : Sep 25 2007 [4] Time : 10:17:27 [4] Host : ruzzene02.ae.gatech.edu [4] PID : 7834
[3] Date : Sep 25 2007
[10] Date : Sep 25 2007 [10] Time : 10:17:27 [10] Host : ruzzene06 [10] PID : 5367
[1] Date : Sep 25 2007 [1] Time : 10:17:27 [1] Host : ruzzene03 [1] PID : 21773
[1] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam [1] Case : AlexMovingMesh [1] Nprocs : 12
[5] Date : Sep 25 2007 [5] Time : 10:17:27 [5] Host : ruzzene02.ae.gatech.edu [5] PID : 7835
[7] Date : Sep 25 2007 [7] Time : 10:17:27 [7] Host : ruzzene04 [7] PID : 19915
[3] Time : 10:17:27 [3] Host : ruzzene01 [3] PID : 23397
[8] Date : Sep 25 2007 [8] Time : 10:17:27 [8] Host : ruzzene05 [8] PID : 26770
[11] Date : Sep 25 2007 [11] Time : 10:17:27 [11] Host : ruzzene06 [11] PID : 5368
[0] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam [0] Case : AlexMovingMesh [0] Nprocs : 12
[0] Slaves :
[0] 11
[0] (
[0] ruzzene03.21773
[0] ruzzene01.23396
[0] ruzzene01.23397
[0] ruzzene02.ae.gatech.edu.7834
[0] ruzzene02.ae.gatech.edu.7835
[0] ruzzene04.19914
[0] ruzzene04.19915
[0] ruzzene05.26770
[0] ruzzene05.26771
[0] ruzzene06.5367
[0] ruzzene06.5368
[0] )
[0] Create time
[4] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam [4] Case : AlexMovingMesh [4] Nprocs : 12
[6] Date : Sep 25 2007 [6] Time : 10:17:28 [6] Host : ruzzene04 [6] PID : 19914
[2] Date : Sep 25 2007 [2] Time : 10:17:28 [2] Host : ruzzene01 [2] PID : 23396
[2] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam [2] Case : AlexMovingMesh [2] Nprocs : 12
[9] Date : Sep 25 2007 [9] Time : 10:17:27 [9] Host : ruzzene05 [9] PID : 26771
[10] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam [10] Case : AlexMovingMesh [10] Nprocs : 12
[5] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam [5] Case : AlexMovingMesh [5] Nprocs : 12
[6] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam [6] Case : AlexMovingMesh [6] Nprocs : 12
[3] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam [3] Case : AlexMovingMesh [3] Nprocs : 12
[8] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam [8] Case : AlexMovingMesh [8] Nprocs : 12
[7] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam [7] Case : AlexMovingMesh [7] Nprocs : 12
[11] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam [11] Case : AlexMovingMesh [11] Nprocs : 12
[9] Root : /home/gtg627eOpenFOAM/OpenFOAM/gtg627eOpenFOAM-1.4.1/run/tutorials/turbFoam [9] Case : AlexMovingMesh [9] Nprocs : 12
------------------------------------------------

and this is the output to the terminal:

---------------------------------------------------
[gtg627eOpenFOAM@ruzzene03 ~]$ mpirun --hostfile machineNetwork turbFoam $FOAM_RUN/tutorials/turbFoam AlexMovingMesh -parallel > log &
[3] 21762
[gtg627eOpenFOAM@ruzzene03 ~]$
[ruzzene06][0,1,10][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene01][0,1,3][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene05][0,1,8][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene02][0,1,5][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene04][0,1,7][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene06][0,1,11][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[ruzzene05][0,1,9][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
----------------------------------------------------------------------------------

Do you happen to know what causes this? Sorry about all these trivial questions, but this is the first time I have run MPI with any application.

Thank you again for your help,
Alessandro
|
September 25, 2007, 13:16 |
|
#10 |
Assistant Moderator
Bernhard Gschaider
Join Date: Mar 2009
Posts: 4,225
Rep Power: 51 |
Hi Alessandro!
If this is the first time you have run MPI on your machines, maybe you should start less ambitiously. Take a Mickey-Mouse problem (damBreak, for instance) and try to run it on only one machine with two CPUs. If that works, progress to using two machines. That way it is easier to pinpoint where your problem lies (the individual MPI installations or the machine interconnect).
Bernhard
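For the single-machine, two-CPU test, the sequence would look roughly like this (a hedged sketch for the 1.4.x root/case command syntax used elsewhere in this thread; damBreak uses the interFoam solver, is prepared with the usual blockMesh/setFields steps from the tutorial, and its decomposeParDict must request 2 subdomains):

cd $FOAM_RUN/tutorials/interFoam
decomposePar . damBreak
mpirun -np 2 interFoam . damBreak -parallel > log &

Only once this works is it worth adding a hostfile and a second machine to the mix.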
__________________
Note: I don't use "Friend"-feature on this forum out of principle. Ah. And by the way: I'm not on Facebook either. So don't be offended if I don't accept your invitation/friend request |
|
September 26, 2007, 03:34 |
|
#11 |
Senior Member
Mark Olesen
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,714
Rep Power: 40 |
Hi Alessandro,
Did you get the usual mpi "hello world" program (see the open-mpi FAQs) to work? It would be good to ensure that all MPI issues are sorted out first. /mark |
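If you use the examples that ship with the Open MPI source tree, the test could look something like this (a sketch only; the path to the examples directory is an assumption based on where openmpi-1.2.3 sits inside the OpenFOAM-1.4.1 sources):

cd ~/OpenFOAM/OpenFOAM-1.4.1/src/openmpi-1.2.3/examples
mpicc hello_c.c -o hello_c
mpirun --hostfile ~/machineNetwork hello_c

Each rank should print a "Hello, world" line with its rank number; if a node is missing from the output or the run hangs, the problem is in the MPI setup rather than in OpenFOAM.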
|
September 27, 2007, 16:50 |
|
#12 |
Member
Alessandro Spadoni
Join Date: Mar 2009
Location: Atlanta, GA
Posts: 65
Rep Power: 17 |
Hello Mark and Bernhard,
I am able to run both the hello world example and my OpenFOAM case on my master machine in MPI mode (the master machine has 2 CPUs). So I am trying to figure out what went wrong with the case I posted above, and I looked up this error:

[ruzzene06][0,1,10][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

Then I issued the following command to see what the above error means:

--> perl -e 'die&!=113'

and I got:

Can't modify non-lvalue subroutine call at -e line 1.

Does this mean I don't have permission to write files on the slave machines? Have any of you ever had a similar error?
Thank you again,
Alessandro
|
September 27, 2007, 18:41 |
|
#13 |
Assistant Moderator
Bernhard Gschaider
Join Date: Mar 2009
Posts: 4,225
Rep Power: 51 |
Hi Alessandro!
I have avoided perl for a long time, but if I try your command with 114 (and 115, and any other number) I get the same error message, so I guess the lvalue message is a perl error. But if I say this (using a decent language):

python -c "import os; print os.strerror(113)"

I get this:

No route to host

(which is the information you wanted, I guess). The problem seems to be that MPI can't reach a host in the host file. Check the host file (sorry, I guess you did that twice already). Try using full host names (with the complete domain). Try using IP numbers. Tell us what happened.
Bernhard
PS: I am fully aware that my solution needs approximately twice as many characters and is therefore unacceptable for a perl-head (although it works).
__________________
Note: I don't use "Friend"-feature on this forum out of principle. Ah. And by the way: I'm not on Facebook either. So don't be offended if I don't accept your invitation/friend request |
|
September 28, 2007, 00:39 |
|
#14 |
Member
Alessandro Spadoni
Join Date: Mar 2009
Location: Atlanta, GA
Posts: 65
Rep Power: 17 |
Hello Bernhard,
After successfully running hello world on the master node alone, and then on the master node plus one slave node, I am working on running my OF case. Inexplicably, after I logged in to the slave node and copied my case files over with scp, everything now runs perfectly. My first attempt runs my OF case on the master node and 1 slave node (1 CPU per node), and I got that running OK.

I read the Open MPI FAQs and found that the hostfile is written as follows:

machinex.host slots=2 max-slots=2
machiney.host slots=2 max-slots=2

In the OF UserGuide.pdf, the host file uses cpu instead of the slots specification. I don't know if this makes a difference, but I think this solved some of my problems.

So far I have tried three MPI configurations:
1. My case divided into 4 processes on my master node only (it is a quad core) --> clock time 616 s.
2. My case divided into 2 processes (1 process on my master node and 1 process on 1 slave node) --> clock time 2666 s.
3. My case without MPI on my master node only --> clock time 656 s.

My 6 quad-core nodes are connected via Ethernet through a Belkin gigabit switch box. I understand the above clock numbers may not reflect the general increase in performance from dividing a case into many processes, and that my case may be small enough to actually suffer from being split up. Do you happen to know where to find information about cluster setup and the best way to connect the different machines? Do the above clock numbers make sense to you? In your experience with OF, do you gain performance by splitting up even small cases?
Thank you,
Alessandro
|
September 28, 2007, 00:48 |
|
#15 |
Senior Member
Martin Beaudoin
Join Date: Mar 2009
Posts: 332
Rep Power: 22 |
For the Unix head, another way to find the same answer (for example on a Centos 4.4 host):
[prompt]$ find /usr/include -iname \*errno\* -exec grep 113 {} \; -print
#define EHOSTUNREACH 113 /* No route to host */
/usr/include/asm-x86_64/errno.h
#define EHOSTUNREACH 113 /* No route to host */
/usr/include/asm-i386/errno.h

Or even simpler:

[prompt]$ grep 113 /usr/include/*.h /usr/include/*/*.h | grep errno
/usr/include/asm-i386/errno.h:#define EHOSTUNREACH 113 /* No route to host */
/usr/include/asm-x86_64/errno.h:#define EHOSTUNREACH 113 /* No route to host */

Just replace 113 with your own errno number; you will get a good hint about where to start looking to solve your problem.
Martin
|
September 28, 2007, 01:28 |
|
#16 |
Senior Member
Martin Beaudoin
Join Date: Mar 2009
Posts: 332
Rep Power: 22 |
Hello Alessandro,
About the clock time you are reporting for your 2-process case: yes, the size of the mesh might be a performance-limiting factor here if it is too small. But another limiting factor might be the configuration of your gigabit network. Make sure that:
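One thing worth verifying on a gigabit setup (a hedged example, assuming Linux nodes with ethtool installed; eth0 is a placeholder interface name) is that each NIC has actually negotiated 1000 Mb/s full duplex:

ethtool eth0 | grep -E 'Speed|Duplex'

A link stuck at 100 Mb/s or half duplex will hurt a latency-bound parallel run far more than the raw bandwidth numbers suggest.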
|
|
September 28, 2007, 04:43 |
|
#17 | |
Senior Member
Mark Olesen
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,714
Rep Power: 40 |
Hi Alessandro,
Quote: "--> perl -e 'die&!=113'"
You might have better success with one of the following:

--> perl -e '$!=113; die'

or

--> perl -e 'die $!=113'

Since Bernhard was trolling about decent languages, I can't resist; in this case he is right, though. Writing the equivalent incorrect program in Python may take a few more characters, but it looks prettier when it fails ;)

>>> def errno():
...     nil
... errno() = 113
  File "<stdin>", line 3
    errno() = 113
    ^
SyntaxError: invalid syntax

BTW: for the really lazy people (like me), searching Google for 'errno', '113' and 'mpi' is even faster than all the other solutions and reveals the Open MPI archives at the top of the list. That is also probably the place where you should be looking when debugging your MPI problems; it doesn't look like your problems are being caused by OpenFOAM.
/mark
||
October 1, 2007, 06:20 |
|
#18 |
Assistant Moderator
Bernhard Gschaider
Join Date: Mar 2009
Posts: 4,225
Rep Power: 51 |
Hi Alessandro!
cpu=x in the hostfile is probably a leftover from the LAM days (but I don't know for sure). If you were doing your runs on the damBreak case, your numbers do not surprise me at all. For such small cases the main issue is latency, and for that Ethernet is notoriously bad (it was designed for stability, not for speed). And in that case Amdahl's law (http://en.wikipedia.org/wiki/Amdahl%27s_law) kicks in (with a longer non-parallelizable part than you would have for a serial case). About cluster setup: if you find a good text that describes all that, let us know ("Building Clustered Linux Systems" by Robert W. Lucke is a nice start).
Bernhard
PS: Sorry about starting a "language war", but you must admit the opportunity was too good to let pass. I'm just disappointed that nobody came up with a Haskell solution to the problem (wrong group, probably).
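To put a rough number on that (an illustrative calculation with made-up figures, not measurements from this case): Amdahl's law gives speedup = 1 / ((1 - p) + p/N) for a parallel fraction p on N processors. If latency and other overheads effectively leave only p = 0.5 of the work parallelizable in a tiny case, then even on N = 12 CPUs the speedup is 1 / (0.5 + 0.5/12), which is about 1.85, and it only gets worse as the per-iteration communication cost grows relative to the compute time.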
__________________
Note: I don't use "Friend"-feature on this forum out of principle. Ah. And by the way: I'm not on Facebook either. So don't be offended if I don't accept your invitation/friend request |
|
October 1, 2007, 11:11 |
|
#19 |
Member
Ola Widlund
Join Date: Mar 2009
Location: Sweden
Posts: 87
Rep Power: 17 |
Hi,
We run Fluent and OF on a Linux cluster with AMD64 processors (two CPUs per node) and ordinary gigabit Ethernet. As Bernhard points out, latency is the limiting factor. You need to give the CPUs enough work to do before they want to communicate with the other CPUs, so that they spend more time iterating than waiting for the network. To get decent speed-up, our rule of thumb is 100,000-200,000 cells per CPU. For example, we would typically run problems of 1e6 cells on two nodes (= 4 CPUs).
/Ola
|
October 2, 2007, 21:34 |
|
#20 |
Member
Alessandro Spadoni
Join Date: Mar 2009
Location: Atlanta, GA
Posts: 65
Rep Power: 17 |
Dear Forum,
Thank you all for your suggestions. As a summary that may be useful for lost people like me:

1. Make sure all your nodes are accessible via ssh from your master node without needing to enter a password. This is set up with the ssh-keygen command.

2. Create the file decomposeParDict in your case/system/ directory (a minimal example is sketched after this list). You can copy this file from another tutorial; if you don't know which one, do the following:

> cd
> find -name 'decomposeParDict*'

This will tell you where the file decomposeParDict can be found. The number of subdomains specified in decomposeParDict should be equal to the number of CPUs you want to use, master + slaves.

3. Run decomposePar on your case directory, and copy all case files to all the slave nodes.

4. Create a hostfile with the master and slave DNS names, specifying the number of CPUs per node, for example:

master.host slots=2 max-slots=2
slave1.host slots=1 max-slots=1
slave2.host slots=4 max-slots=4
etc...

5. Run your parallel simulation with:

> mpirun --hostfile "yourHostFile" turbFoam $FOAM_RUN/tutorials/turbFoam MyCase -parallel > log &

I am starting to run a case with arbitrary motion of the mesh boundaries (in my case from a structural modal analysis) with both turbulence and LES models. I will keep you updated.
Thank you again,
Alessandro Spadoni
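To make step 2 concrete, a minimal decomposeParDict for four subdomains might look roughly like this (a sketch only; the standard FoamFile header is omitted, the coefficient values are placeholders, and the simple method just splits the domain into a 2 x 2 x 1 block pattern):

numberOfSubdomains 4;

method simple;

simpleCoeffs
{
    n       (2 2 1);
    delta   0.001;
}

distributed no;

roots ();

When the case is not on a shared filesystem, the decomposed processor* directories then simply have to be copied to the same path on every node, as in step 3.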
|