CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   OpenFOAM Running, Solving & CFD (https://www.cfd-online.com/Forums/openfoam-solving/)
-   -   Problems about distributed parallel runs (https://www.cfd-online.com/Forums/openfoam-solving/93708-problems-about-distributed-parallel-runs.html)

vkrastev October 24, 2011 13:09

Problems about distributed parallel runs
 
Hi all,
I'm trying to launch some distributed parallel runs on a CentOS based cluster (server and nodes have the same OS version installed), but this is what I have obtained running the foamJob script from the server:

[krastev@epsilon morris60_SA_secondamesh]$ foamJob -s -p sonicAdaptiveFoam
Parallel processing using OPENMPI with 4 processors
Executing: mpirun -np 4 -hostfile machines /server/krastev/OpenFOAM/OpenFOAM-1.7.1/bin/foamExec sonicAdaptiveFoam -parallel | tee log
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.7.1                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build : 1.7.1-03e7e056c215
Exec : sonicAdaptiveFoam -parallel
Date : Oct 24 2011
Time : 17:46:36
Host : node64-1.sub.uniroma2.it
PID : 25824
[node64-1.sub.uniroma2.it][[39888,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.1.102 failed: No route to host (113)

Could please someone give me some hint to find where the problem is?

Thanks a lot

V.

PS-Some additional information:
1) I have installed OF-1.7.1 locally in the home folder of my account (I'm not the server administrator and I need to have my own free-compiling version)
2) the same run works perfectly on the single nodes (they are quad-core nodes) by launching either the foamJob command or directly the mpirun -np 4 etc. etc. syntax

wyldckat October 24, 2011 14:53

Hi Vesselin,

The problem is somewhat simple: one or more of the IPs in use on each machine aren't visible from the other machines in the cluster.
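
A quick way to confirm this is to check whether the node reporting the error can reach the failing address at all (a sketch; the IP below is the one from your log):
Code:

# run from node64-1, which reported the failed connect()
ping -c 3 10.1.1.102
ssh 10.1.1.102 hostname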

A solution should be something like this:
Quote:

Originally Posted by pkr (Post 292700)
When using MPI_reduce, Open MPI was trying to establish TCP through a different interface. The problem is solved if the following command is used:
mpirun --mca btl_tcp_if_exclude lo,virbr0 -hostfile machines -np 2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec interFoam -parallel

The above command stops MPI from using the listed interfaces (lo and virbr0 in this case).

You can also check the "/etc/hosts" files on each node to check which IPs each one knows about and also try running on each one:
Code:

/sbin/ifconfig
to figure out which cards are associated to which IPs.
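
If your home folder is shared, a small loop can gather this from every node at once (a sketch; the node names are assumed from the hostnames in your log, so adjust them to match your machines file):
Code:

# node names assumed for illustration; adjust to your cluster
for host in node64-1 node64-2; do
    echo "=== $host ==="
    ssh "$host" /sbin/ifconfig
    ssh "$host" cat /etc/hosts
done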

Best regards,
Bruno

vkrastev October 25, 2011 06:24

Hi Bruno, and thanks a lot for your answer!

Following your suggestions I've tried first to type (from the server):

mpirun --mca btl_tcp_if_exclude lo -hostfile machines -np 4 foamExec sonicAdaptiveFoam -parallel

but this is the result:

bash: orted: command not found
--------------------------------------------------------------------------
A daemon (pid 20171) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
bash: orted: command not found
mpirun: clean termination accomplished

After that, I checked the IPs of each node with /sbin/ifconfig and all seems fine (each node reports an active eth0 connection with an assigned IP). Finally, I've also checked the /etc/hosts files and they look like the following:

10.1.1.102 node64-2.sub.uniroma2.it node64-2 # Added by NetworkManager
127.0.0.1 localhost.localdomain localhost
::1 node64-2.sub.uniroma2.it node64-2 localhost6.localdomain6 localhost6

with 10.1.1.102 being the same inet address reported by /sbin/ifconfig on that node. The only exception is node 1, where the hosts file looks like this:

10.1.1.101 node64-1.sub.uniroma2.it node64-1
127.0.0.1 localhost localhost.localdomain

In addition, I can tell you that each node "knows" about the others in terms of interactive ssh connections, because they share a common .ssh/known_hosts file containing all the proper IPs.
Probably I'm missing something trivial, but the fact is that I would really like to run my simulations independently of the administrator (remember that I have no root access to the server, nor to the nodes)...

Thanks once again

V.

wyldckat October 25, 2011 06:37

Hi Vesselin,

Quote:

Originally Posted by vkrastev (Post 329320)
Following your suggestions I've tried first to type (from the server):

mpirun --mca btl_tcp_if_exclude lo -hostfile machines -np 4 foamExec sonicAdaptiveFoam -parallel

Ah, try this instead:
Code:

`which mpirun` --mca btl_tcp_if_exclude lo -hostfile machines -np 4 `which foamExec` sonicAdaptiveFoam -parallel
This way the full paths are provided at launch time.
And remember that with this approach, both mpirun and OpenFOAM must be visible at the same paths on every node. I assume that your home folder is shared among all of the nodes.
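
An alternative to passing the full paths is to make the environment available to the non-interactive shells that Open MPI opens on the remote nodes, for example by sourcing the OpenFOAM environment from "~/.bashrc" on the shared home folder (a sketch, assuming the OpenFOAM-1.7.1 installation location shown in your log; it has to run before any early exit for non-interactive shells):
Code:

# in ~/.bashrc on the shared home folder: non-interactive ssh shells
# will then find orted and the OpenFOAM libraries on all nodes
source $HOME/OpenFOAM/OpenFOAM-1.7.1/etc/bashrc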


Quote:

Originally Posted by vkrastev (Post 329320)
In addition, I can tell you that each node "knows" about the others in terms of interactive ssh connections, because they share a common .ssh/known_hosts file containing all the proper IPs.

"known_hosts" probably isn't used by Open-MPI during the normal operation, since it's "/etc/hosts" that should have the relevant data. Although it might be used in the initial connection, since it should use SSH by default...
Nonetheless, if the "machines" file for your case indicates only IPs, then it should work as intended.
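
For reference, a hostfile with IPs and slot counts might look like this (a sketch; the addresses and the 4 slots per quad-core node are assumed from this thread):
Code:

# machines: one line per node; "slots" tells Open MPI how many
# processes may run on each node
10.1.1.101 slots=4
10.1.1.102 slots=4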

Best regards and good luck!
Bruno

vkrastev October 25, 2011 07:01

Quote:

Originally Posted by wyldckat (Post 329323)
Hi Vesselin,


Ah, try this instead:
Code:

`which mpirun` --mca btl_tcp_if_exclude lo -hostfile machines -np 4 `which foamExec` sonicAdaptiveFoam -parallel
This way the full paths are provided at launch time.
And remember that with this approach, both mpirun and OpenFOAM must be visible at the same paths on every node. I assume that your home folder is shared among all of the nodes.

Yes, the home folder is shared by all the nodes, but after trying the above command line I've obtained the same error as with the foamJob script:

/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.7.1                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build : 1.7.1-03e7e056c215
Exec : sonicAdaptiveFoam -parallel
Date : Oct 25 2011
Time : 11:48:36
Host : node64-1.sub.uniroma2.it
PID : 31167
[node64-1.sub.uniroma2.it][[62160,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.1.102 failed: No route to host (113)

And nothing changes if I put the IPs directly inside the machines file (e.g. 10.1.1.101 instead of node64-1). It's also the same if I use the server as the master node and any other node as the slave. Any further ideas?

Thanks

V.

Phicau October 25, 2011 07:16

I think I had a similar problem recently: the nodes were not able to communicate because mpirun by default used an interface which was not connected. That interface had been installed by default by some application related to virtual machines (I don't remember its name, nor can I check it right now).

The solution was to kill this (unused in our case) interface.

wyldckat October 25, 2011 07:17

Hi Vesselin,

I still think the problem might be the incomplete "/etc/hosts" file, but I could be wrong.
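
For comparison, a complete "/etc/hosts" would list every node of the cluster on every machine, something like this (a sketch; the names and addresses are taken from your earlier post):
Code:

127.0.0.1    localhost localhost.localdomain
10.1.1.101   node64-1.sub.uniroma2.it node64-1
10.1.1.102   node64-2.sub.uniroma2.it node64-2
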
Several days ago I wrote a couple of posts about trying to isolate-and-conquer issues when running in parallel: Segmentation fault in interFoam run through openMPI posts #8 and #10. You might want to read the whole thread, just to understand better what's being talked about on those two posts :)


I've also been collecting more information about running OpenFOAM in parallel in this blog post of mine: Notes about running OpenFOAM in parallel - those notes might come in handy for you as well.

Good luck!
Bruno

vkrastev October 25, 2011 08:23

Quote:

Originally Posted by Phicau (Post 329330)
I think I had a similar problem recently, nodes were not able to communicate because mpirun by default used an interface which was not connected. This was installed by default for some application related with virtual machines (I do not remember the name, neither I can check it right now).

The solution was to kill this (unused in our case) interface.

Hi Pablo,
thanks for the answer, but I need more precise information about this application before I start blindly killing anything on the cluster (remember also that I have no administrator rights).

V.

vkrastev October 25, 2011 08:32

Quote:

Originally Posted by wyldckat (Post 329331)
Hi Vesselin,

I still think the problem might be the incomplete "/etc/hosts" file, but I could be wrong.
Several days ago I wrote a couple of posts about trying to isolate-and-conquer issues when running in parallel: Segmentation fault in interFoam run through openMPI posts #8 and #10. You might want to read the whole thread, just to understand better what's being talked about on those two posts :)


I've also been collecting more information about running OpenFOAM in parallel on this blog post of mine: Notes about running OpenFOAM in parallel - These might come in handy for you as well.

Good luck!
Bruno

Thank you very much Bruno. I'm starting with the first thread you've suggested, doing some of the tests I haven't tried yet. Though I still haven't read carefully through all the possibilities mentioned in your blog post, the further I go the more it seems that the /etc/hosts problem is the key one (which is bad news for me, as I'm not able to modify those files without contacting the administrator). Anyway, a curious (at least for me) thing also happened: if I change the machines file to contain only one node with 4 CPUs, the run starts without a problem from the server (whichever node is in the machines file)... But again there are problems if I try to do the same launching from one node to another!

Thanks once again

V.

star shower November 11, 2012 08:18

Recently I've met the same problem and it is still not solved. Additional information: I can use another node from one node; for example, I can use node14 from node16. But I cannot use two or more nodes together. Does anyone have any other advice?

wyldckat November 11, 2012 10:22

Greetings star shower,

Quote:

Originally Posted by star shower (Post 391474)
Recently I've met the same problem and it is still not solved. Additional information: I can use another node from one node; for example, I can use node14 from node16. But I cannot use two or more nodes together. Does anyone have any other advice?

This was discussed in a somewhat recent thread: http://www.cfd-online.com/Forums/ope...ple-nodes.html - there are several suggestions there, and I advise you to also check the threads that one links to!
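
One common culprit when runs work inside a single node but fail across nodes with "No route to host" is a firewall on the nodes blocking the arbitrary high TCP ports that Open MPI picks for its connections, even though SSH on port 22 works. A quick test with netcat, assuming it's installed (the exact flags vary between netcat variants):
Code:

# on node14: listen on an arbitrary unprivileged port
nc -l 40000          # some netcat variants need: nc -l -p 40000

# on node16: check whether that port on node14 is reachable
nc -zv node14 40000

If the second command fails the same way the solver does, the firewall rules on the nodes are the likely cause, and opening those ports is something the administrator will have to do.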

Best regards,
Bruno

