www.cfd-online.com
Home > Forums > OpenFOAM Running, Solving & CFD

Problems about distributed parallel runs


Old   October 24, 2011, 12:09
Default Problems about distributed parallel runs
  #1
Senior Member
 
Vesselin Krastev
Join Date: Jan 2010
Location: University of Tor Vergata, Rome
Posts: 361
Hi all,
I'm trying to launch some distributed parallel runs on a CentOS-based cluster (the server and the nodes have the same OS version installed), but this is what I get when running the foamJob script from the server:

[krastev@epsilon morris60_SA_secondamesh]$ foamJob -s -p sonicAdaptiveFoam
Parallel processing using OPENMPI with 4 processors
Executing: mpirun -np 4 -hostfile machines /server/krastev/OpenFOAM/OpenFOAM-1.7.1/bin/foamExec sonicAdaptiveFoam -parallel | tee log
/*---------------------------------------------------------------------------*\
| ========= | |
| \\ / F ield | OpenFOAM: The Open Source CFD Toolbox |
| \\ / O peration | Version: 1.7.1 |
| \\ / A nd | Web: www.OpenFOAM.com |
| \\/ M anipulation | |
\*---------------------------------------------------------------------------*/
Build : 1.7.1-03e7e056c215
Exec : sonicAdaptiveFoam -parallel
Date : Oct 24 2011
Time : 17:46:36
Host : node64-1.sub.uniroma2.it
PID : 25824
[node64-1.sub.uniroma2.it][[39888,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.1.102 failed: No route to host (113)

Could someone please give me a hint as to where the problem is?

Thanks a lot

V.

PS - Some additional information:
1) I have installed OF-1.7.1 locally in the home folder of my account (I'm not the server administrator and I need my own version that I can compile freely)
2) the same run works perfectly on single nodes (they are quad-core nodes), launched either with the foamJob command or directly with the mpirun -np 4 etc. etc. syntax
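A quick sanity check that may help here (an illustrative sketch, not from the thread: `check_hosts` is a hypothetical helper name, and `machines` is the hostfile used in the foamJob command above) is to verify, on the server and on every node, that each hostfile entry actually resolves:

```shell
# Sketch: report whether each host listed in a hostfile resolves
# locally (through /etc/hosts or DNS). A name that only some of the
# machines can resolve is a common source of MPI start-up trouble.
check_hosts() {
    while read -r host _; do
        if getent hosts "$host" >/dev/null; then
            echo "$host: resolvable"
        else
            echo "$host: NOT resolvable"
        fi
    done < "$1"
}
```

Run as, e.g., `check_hosts machines` from the case directory on each machine in turn.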

Old   October 24, 2011, 13:53
Default
  #2
Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 8,253
Blog Entries: 34
Hi Vesselin,

The problem is fairly simple: one or more of the IPs involved aren't visible from the other machines in the cluster.

A solution should be something like this:
Quote:
Originally Posted by pkr View Post
When using MPI_Reduce, Open MPI was trying to establish the TCP connection through a different interface. The problem is solved if the following command is used:
mpirun --mca btl_tcp_if_exclude lo,virbr0 -hostfile machines -np 2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec interFoam -parallel

The above command prevents MPI from using the listed interfaces (lo and virbr0 in this case).
You can also look at the "/etc/hosts" file on each node to see which IPs each one knows about, and also try running on each one:
Code:
/sbin/ifconfig
to figure out which cards are associated with which IPs.
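If the ifconfig output is long, a more compact view of the same information (a sketch, assuming the `ip` tool from the iproute2 package is available on the nodes) is:

```shell
# Sketch: print one "interface address" pair per line, to quickly see
# which networks Open MPI could try to use for its TCP connections.
ip -o -4 addr show | awk '{print $2, $4}'
```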

Best regards,
Bruno

Old   October 25, 2011, 05:24
Default
  #3
Senior Member
 
Vesselin Krastev
Join Date: Jan 2010
Location: University of Tor Vergata, Rome
Posts: 361
Hi Bruno, and thanks a lot for your answer!

Following your suggestions I've tried first to type (from the server):

mpirun --mca btl_tcp_if_exclude lo -hostfile machines -np 4 foamExec sonicAdaptiveFoam -parallel

but this is the result:

bash: orted: command not found
--------------------------------------------------------------------------
A daemon (pid 20171) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
bash: orted: command not found
mpirun: clean termination accomplished

After that, I've checked the IPs of each node with /sbin/ifconfig and all seems fine (each node reports an active eth0 connection with an assigned IP). Finally, I've also checked the /etc/hosts files and they look like the following:

10.1.1.102 node64-2.sub.uniroma2.it node64-2 # Added by NetworkManager
127.0.0.1 localhost.localdomain localhost
::1 node64-2.sub.uniroma2.it node64-2 localhost6.localdomain6 localhost6

with 10.1.1.102 being the same inet address reported by /sbin/ifconfig, except for node 1, where the hosts file looks like this:

10.1.1.101 node64-1.sub.uniroma2.it node64-1
127.0.0.1 localhost localhost.localdomain

In addition, I can tell you that each node "knows" about the others in terms of interactive ssh connections, because they share a common .ssh/known_hosts file containing all the proper IPs.
Probably I'm missing something trivial, but the fact is that I would really like to run my simulations independently of the administrator (remember that I have no root access to the server or to the nodes)...

Thanks once again

V.

Old   October 25, 2011, 05:37
Default
  #4
Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 8,253
Blog Entries: 34
Hi Vesselin,

Quote:
Originally Posted by vkrastev View Post
Following your suggestions I've tried first to type (from the server):

mpirun --mca btl_tcp_if_exclude lo -hostfile machines -np 4 foamExec sonicAdaptiveFoam -parallel
Ah, try this instead:
Code:
`which mpirun` --mca btl_tcp_if_exclude lo -hostfile machines -np 4 `which foamExec` sonicAdaptiveFoam -parallel
This way the full path is provided during launch.
And remember that, this way, both mpirun and OpenFOAM must be visible at the same paths on all machines. I assume that your home folder is shared among all of the nodes.
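To see what the backquotes actually buy you, here is a tiny local demonstration (an illustrative sketch; `sh` merely stands in for mpirun/foamExec):

```shell
# Sketch: command substitution is expanded by the *local* shell before
# mpirun ever starts, so the remote nodes are handed absolute paths
# and do not need a fully set-up PATH of their own.
resolved=`which sh`
echo "the nodes would be handed: $resolved"
```

If "orted: command not found" persists even with full paths, it may also be worth checking what a non-interactive remote shell sees, e.g. `ssh node64-2 'which orted'`, since the PATH set up in ~/.bashrc is sometimes skipped for non-interactive sessions.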


Quote:
Originally Posted by vkrastev View Post
In addition, I can tell you that each node "knows" about the others in therms of interactive ssh connections, because they share a common .ssh/known_hosts file containing all the proper IP's.
"known_hosts" probably isn't used by Open-MPI during normal operation, since it's "/etc/hosts" that should hold the relevant data. It might be used for the initial connection, though, since Open-MPI should use SSH by default...
Nonetheless, if the "machines" file for your case indicates only IPs, then it should work as intended.

Best regards and good luck!
Bruno

Old   October 25, 2011, 06:01
Default
  #5
Senior Member
 
Vesselin Krastev
Join Date: Jan 2010
Location: University of Tor Vergata, Rome
Posts: 361
Quote:
Originally Posted by wyldckat View Post
Hi Vesselin,


Ah, try this instead:
Code:
`which mpirun` --mca btl_tcp_if_exclude lo -hostfile machines -np 4 `which foamExec` sonicAdaptiveFoam -parallel
This way the full path is provided during launch.
And remember that this way, both mpirun and OpenFOAM should be visible on the same path. I assume that your home folder is shared among all nodes.
Yes, the home folder is shared by all the nodes, but after trying the above command line I've obtained the same error as with the foamJob script:

/*---------------------------------------------------------------------------*\
| ========= | |
| \\ / F ield | OpenFOAM: The Open Source CFD Toolbox |
| \\ / O peration | Version: 1.7.1 |
| \\ / A nd | Web: www.OpenFOAM.com |
| \\/ M anipulation | |
\*---------------------------------------------------------------------------*/
Build : 1.7.1-03e7e056c215
Exec : sonicAdaptiveFoam -parallel
Date : Oct 25 2011
Time : 11:48:36
Host : node64-1.sub.uniroma2.it
PID : 31167
[node64-1.sub.uniroma2.it][[62160,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.1.102 failed: No route to host (113)

And nothing changes if I put the IPs directly inside the machines file (e.g. 10.1.1.101 instead of node64-1). It is also the same if I use the server as the master node and any other node as the slave. Any further ideas?

Thanks

V.

Old   October 25, 2011, 06:16
Default
  #6
Senior Member
 
Pablo Higuera
Join Date: Jan 2011
Posts: 233
I think I had a similar problem recently: the nodes were not able to communicate because mpirun by default used an interface which was not connected. That interface had been installed by default by some application related to virtual machines (I do not remember the name, nor can I check it right now).

The solution was to kill this (unused in our case) interface.
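For readers without root access: rather than killing the interface, Open MPI can simply be told which network to use, along the lines of the exclude command quoted earlier. A sketch (assuming here that the real cluster NIC is eth0; check the actual name with /sbin/ifconfig first):

```shell
# Sketch (no root required): restrict Open MPI's TCP traffic to the
# interface the cluster actually uses; "eth0" is an assumption.
`which mpirun` --mca btl_tcp_if_include eth0 -hostfile machines -np 4 \
    `which foamExec` sonicAdaptiveFoam -parallel
```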

Old   October 25, 2011, 06:17
Default
  #7
Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 8,253
Blog Entries: 34
Hi Vesselin,

I still think the problem might be the incomplete "/etc/hosts" file, but I could be wrong.
Several days ago I wrote a couple of posts about trying to isolate-and-conquer issues when running in parallel: Segmentation fault in interFoam run through openMPI, posts #8 and #10. You might want to read the whole thread, just to better understand what's being discussed in those two posts.


I've also been collecting more information about running OpenFOAM in parallel on this blog post of mine: Notes about running OpenFOAM in parallel - These might come in handy for you as well.

Good luck!
Bruno

Old   October 25, 2011, 07:23
Default
  #8
Senior Member
 
Vesselin Krastev
Join Date: Jan 2010
Location: University of Tor Vergata, Rome
Posts: 361
Quote:
Originally Posted by Phicau View Post
I think I had a similar problem recently: the nodes were not able to communicate because mpirun by default used an interface which was not connected. That interface had been installed by default by some application related to virtual machines (I do not remember the name, nor can I check it right now).

The solution was to kill this (unused in our case) interface.
Hi Pablo,
thanks for the answer, but I need more precise information about this application before blindly killing anything on the cluster (remember also that I have no administrator rights).

V.

Old   October 25, 2011, 07:32
Default
  #9
Senior Member
 
Vesselin Krastev
Join Date: Jan 2010
Location: University of Tor Vergata, Rome
Posts: 361
Quote:
Originally Posted by wyldckat View Post
Hi Vesselin,

I still think the problem might be the incomplete "/etc/hosts" file, but I could be wrong.
Several days ago I wrote a couple of posts about trying to isolate-and-conquer issues when running in parallel: Segmentation fault in interFoam run through openMPI, posts #8 and #10. You might want to read the whole thread, just to better understand what's being discussed in those two posts.


I've also been collecting more information about running OpenFOAM in parallel on this blog post of mine: Notes about running OpenFOAM in parallel - These might come in handy for you as well.

Good luck!
Bruno
Thank you very much Bruno. I'm starting with the first thread you suggested, doing some of the tests I haven't tried yet. Though I still haven't read carefully through all the possibilities mentioned in your blog post, the further I go the more it seems that the /etc/hosts problem is the key one (which is bad news for me, as I'm not able to modify those files without contacting the administrator). Anyway, a curious (at least for me) thing also happened: if I change the machines file to list only one node with 4 CPUs, the run starts without a problem from the server (whichever node is in the machines file)... But again there are problems if I try to do the same launching from one node to another!
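(A related check, sketched here: with correct IPs, "No route to host" often comes from a firewall on the destination node rather than from Open MPI; CentOS's default iptables rules reject unexpected incoming connections with icmp-host-prohibited, which the client reports as exactly this error 113. The `probe` helper name, IP and port below are illustrative.)

```shell
# Sketch: probe a TCP port on another machine. "unreachable" for an
# address that ping can reach would point at a firewall blocking the
# ephemeral ports Open MPI opens between nodes.
probe() {
    if timeout 2 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null; then
        echo "$1:$2 open"
    else
        echo "$1:$2 unreachable or closed"
    fi
}
```

Run as, e.g., `probe 10.1.1.102 1024` from node64-1.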

Thanks once again

V.


Old   November 11, 2012, 08:18
Default
  #10
New Member
 
charlse
Join Date: Mar 2011
Location: china
Posts: 6
Recently I have met the same problem and it is still not solved. Additional information: I can use one node from another node. For example, I can use node14 from node16. But I cannot use two or more nodes together. Does anyone have any other advice?

Old   November 11, 2012, 10:22
Default
  #11
Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 8,253
Blog Entries: 34
Greetings star shower,

Quote:
Originally Posted by star shower View Post
Recently I have met the same problem and it is still not solved. Additional information: I can use one node from another node. For example, I can use node14 from node16. But I cannot use two or more nodes together. Does anyone have any other advice?
This was discussed in a somewhat recent thread: MPI issue on multiple nodes - there are several suggestions there, and I advise you to also check the threads it links to!

Best regards,
Bruno



