Problems about distributed parallel runs
Hi all,
I'm trying to launch some distributed parallel runs on a CentOS-based cluster (the server and the nodes have the same OS version installed), but this is what I get when running the foamJob script from the server:

[krastev@epsilon morris60_SA_secondamesh]$ foamJob -s -p sonicAdaptiveFoam
Parallel processing using OPENMPI with 4 processors
Executing: mpirun -np 4 -hostfile machines /server/krastev/OpenFOAM/OpenFOAM-1.7.1/bin/foamExec sonicAdaptiveFoam -parallel | tee log
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.7.1                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build : 1.7.1-03e7e056c215
Exec : sonicAdaptiveFoam -parallel
Date : Oct 24 2011
Time : 17:46:36
Host : node64-1.sub.uniroma2.it
PID : 25824
[node64-1.sub.uniroma2.it][[39888,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.1.102 failed: No route to host (113)

Could someone please give me a hint about where the problem is? Thanks a lot
V.

PS: Some additional information:
1) I have installed OF-1.7.1 locally in the home folder of my account (I'm not the server administrator and I need my own version that I can compile freely).
2) The same run works perfectly on a single node (the nodes are quad-core), launched either with the foamJob command or directly with the mpirun -np 4 etc. etc. syntax. |
Hi Vesselin,
The problem is fairly simple: one or more of the IP addresses on each machine is not visible from every other machine in the cluster. A solution should be something like this: Quote:
Code:
/sbin/ifconfig
Best regards, Bruno |
Hi Bruno, and thanks a lot for your answer!
Following your suggestions, I first tried typing (from the server):

mpirun --mca btl_tcp_if_exclude lo -hostfile machines -np 4 foamExec sonicAdaptiveFoam -parallel

but this is the result:

bash: orted: command not found
--------------------------------------------------------------------------
A daemon (pid 20171) died unexpectedly with status 127 while attempting to launch so we are aborting. There may be more information reported by the environment (see above). This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
--------------------------------------------------------------------------
bash: orted: command not found
mpirun: clean termination accomplished

After that, I checked the IPs of each node with /sbin/ifconfig and all seems fine (each node reported an active eth0 connection with an assigned IP). Finally, I also checked the /etc/hosts files, and they look like the following:

10.1.1.102 node64-2.sub.uniroma2.it node64-2 # Added by NetworkManager
127.0.0.1 localhost.localdomain localhost
::1 node64-2.sub.uniroma2.it node64-2 localhost6.localdomain6 localhost6

with 10.1.1.102 being the same inet address reported by /sbin/ifconfig, except for node 1, where the hosts file looks like this:

10.1.1.101 node64-1.sub.uniroma2.it node64-1
127.0.0.1 localhost localhost.localdomain

In addition, I can tell you that each node "knows" about the others in terms of interactive ssh connections, because they share a common .ssh/known_hosts file containing all the proper IPs.
I'm probably missing something trivial, but the fact is that I would really like to run my simulations independently of the administrator (remember that I have no root access to the server, nor to the nodes)... Thanks once again V. |
Hi Vesselin,
Quote:
Code:
`which mpirun` --mca btl_tcp_if_exclude lo -hostfile machines -np 4 `which foamExec` sonicAdaptiveFoam -parallel
Remember that for this to work, both mpirun and OpenFOAM should be visible at the same paths on every node. I assume that your home folder is shared among all nodes. Quote:
Nonetheless, if the "machines" file for your case indicates only IPs, then it should work as intended. Best regards and good luck! Bruno |
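The "bash: orted: command not found" message reported earlier in the thread means that the shell started on the remote node could not find Open MPI's orted daemon on its PATH. As a minimal sketch for checking that, assuming a POSIX shell: the helper below searches a colon-separated PATH string for an executable. The name find_in_path is hypothetical and is not part of OpenFOAM or Open MPI.

```shell
# Print the first executable called NAME found on a colon-separated PATH string.
# Returns 1 (and prints nothing) when no directory in the string contains it.
# Usage: find_in_path NAME PATH_STRING
find_in_path() {
    name=$1
    path_string=$2
    old_ifs=$IFS
    IFS=:                                   # split the PATH string on colons
    for dir in $path_string; do
        candidate=${dir:-.}/$name           # an empty PATH entry means the current directory
        if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
            IFS=$old_ifs
            echo "$candidate"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}
```

For example, find_in_path orted "$(ssh node64-2 'echo $PATH')" would show whether a non-interactive ssh session on node64-2 has a PATH that contains orted, provided it is run on a machine that shares the same file system. The mpirun man page also states that starting mpirun through its absolute path is equivalent to passing --prefix, which is why the `which mpirun` form above can help with this error.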
Quote:
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.7.1                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build : 1.7.1-03e7e056c215
Exec : sonicAdaptiveFoam -parallel
Date : Oct 25 2011
Time : 11:48:36
Host : node64-1.sub.uniroma2.it
PID : 31167
[node64-1.sub.uniroma2.it][[62160,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.1.102 failed: No route to host (113)

And nothing changes if I put the IPs directly in the machines file (e.g. 10.1.1.101 instead of node64-1). The result is also the same if I use the server as the master node and any other node as the slave. Any further ideas? Thanks V. |
I think I had a similar problem recently: the nodes were not able to communicate because mpirun by default used an interface which was not connected. This interface had been installed by default for some application related to virtual machines (I do not remember the name, nor can I check it right now).
The solution was to kill this (unused in our case) interface. |
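Without root access, an alternative to removing an interface is to tell Open MPI which interface to use. The sketch below only helps with the first step, finding out which interfaces carry an address. It assumes the old net-tools ifconfig output format ("inet addr:..."), which is my assumption for these CentOS nodes, and list_inet_addrs is a hypothetical helper name.

```shell
# Read old-style (net-tools) ifconfig output on stdin and print one
# "interface address" pair per IPv4 address found.
# Newer ifconfig versions print "inet 10.0.0.1 netmask ..." and are not handled.
list_inet_addrs() {
    awk '
        # A line that starts in column 1 opens a new interface block.
        /^[^[:space:]]/ { iface = $1 }
        # IPv4 lines look like: "inet addr:10.1.1.101  Bcast:...  Mask:..."
        /inet addr:/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^addr:/) {
                    sub(/^addr:/, "", $i)
                    print iface, $i
                }
        }'
}
```

Running /sbin/ifconfig | list_inet_addrs on each node shows which interfaces have an address. If eth0 is the interface on the cluster network, adding --mca btl_tcp_if_include eth0 to the mpirun command restricts Open MPI's TCP traffic to it. Whether that resolves the error in this thread is untested.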
Hi Vesselin,
I still think the problem might be the incomplete "/etc/hosts" file, but I could be wrong. Several days ago I wrote a couple of posts about trying to isolate-and-conquer issues when running in parallel: "Segmentation fault in interFoam run through openMPI", posts #8 and #10. You might want to read the whole thread, just to understand better what's being talked about in those two posts :) I've also been collecting more information about running OpenFOAM in parallel in this blog post of mine: "Notes about running OpenFOAM in parallel". These might come in handy for you as well. Good luck! Bruno |
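The "/etc/hosts" concern can be checked mechanically. As a rough sketch, assuming a POSIX shell with awk and a machines file whose first field on each line is the host name: the hypothetical helper missing_hosts prints every host from the machines file that has no entry in a given hosts file. A missing entry is not necessarily an error, since names may also be resolved through DNS.

```shell
# Print every host listed in a machines file that has no entry in a hosts file.
# Usage: missing_hosts MACHINES_FILE HOSTS_FILE
missing_hosts() {
    machines_file=$1
    hosts_file=$2
    # Take the first field of each non-empty, non-comment line
    # (hostfile lines may carry extra fields such as "slots=4").
    awk '!/^[[:space:]]*(#|$)/ { print $1 }' "$machines_file" |
    while read -r host; do
        # Strip comments, then look for the name (or address) as a whole field.
        if ! awk -v h="$host" '
                { sub(/#.*/, "") }
                { for (i = 1; i <= NF; i++) if ($i == h) found = 1 }
                END { exit !found }' "$hosts_file"; then
            echo "$host"
        fi
    done
}
```

Run on each node as missing_hosts machines /etc/hosts. With the node 1 hosts file quoted earlier in this thread and a machines file naming node64-1 and node64-2, it would report node64-2.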
Quote:
Thanks for the answer, but I need more precise information about this application before I start blindly killing something on the cluster (remember also that I have no administrator rights). V. |
Quote:
Thanks once again V. |
I recently ran into the same problem and have not solved it yet. Additional information: I can use one other node from a given node; for example, I can use node14 from node16. But I cannot use two or more nodes. Does anyone have any other advice?
|
Greetings star shower,
Quote:
Best regards, Bruno |