CFD Online Forums > OpenFOAM Running, Solving & CFD

OpenMPI fails cluster run with an orphaned IP Address


January 28, 2014, 04:41
svg
New Member
 
Sven G
Join Date: Sep 2013
Posts: 3
Hello all,

I have a very strange problem running OpenFOAM across two computers.
The setup might be a bit special, but it should not be a problem: one instance is a VMware virtual machine (Ubuntu) and the second computer is a normal install. All Linux and OpenFOAM versions are identical. The two machines are connected over a Gbit switch. The VMware instance runs with a bridged interface, which gives it an outside, DHCP-assigned IP.
The network itself works fine: I can ssh to the other PC and vice versa. I can also start a run with e.g.

Quote:
mpirun -np 20 -hostfile machines snappyHexMesh -parallel
(The VMWare machine has 12 cores, the normal machine 8 cores)
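For reference, an Open MPI hostfile for a 12 + 8 core setup like this would typically look something like the following (hostnames/addresses are placeholders for my two machines; the slots entries tell mpirun how many processes to place on each host):

Quote:
# machines -- example hostfile: 12 slots on the VMware host, 8 on the other
tbd slots=12
10.132.33.101 slots=8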
mpirun starts up, but it strangely tries to connect to an unknown IP address:

The VMware machine is the host from which I launch the process; 10.132.33.101 is the other machine.

Quote:
xx@tbd:~/OpenFOAM/xx-2.2.2/run/tutorials/incompressible/simpleFoam/motorBike$ mpirun -np 20 --mca btl_tcp_if_include eth0 -hostfile machines snappyHexMesh -parallel
xx@10.132.33.101's password:

/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  2.2.2                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build : 2.2.2-9240f8b967db
Exec : snappyHexMesh -parallel
Date : Jan 28 2014
Time : 09:08:14
Host : "tbd"
PID : 12314
[tbd][[63191,1],0][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.0.0.2 failed: No route to host (113)

After this, the snappyHexMesh processes start on both computers (so something is going on), and all processes on both machines run at full CPU load, but each uses only ~5 MB of memory, even though the case should need much more. I get no output and no output files, even after waiting some time. If I run the same case locally on the VMware machine, everything works fine.
I started to play with the mpirun parameters, but they did not seem to help, e.g. enforcing an interface:
Quote:
mpirun -np 20 --mca btl_tcp_if_include eth0 -hostfile machines snappyHexMesh -parallel
So the strangest part is that the IP address 10.0.0.2 does not exist. Anywhere! If I run ifconfig on both machines, I only see the Ethernet interfaces with the DHCP addresses (both machines are in the 10.132.33.xx range). I can ping those official addresses and everything is fine. My only idea about where this 10.0.0.2 address could come from is another Ethernet adapter which WAS connected for some time and had this address set manually. However, it is no longer connected to the machine!

I checked the hosts file and the ifconfig output on both machines.
So I am a bit stuck and hope that somebody can figure out the routing problem. I really do not see where mpirun even gets the 10.0.0.2 address from.
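For the record, checks along these lines should reveal where a stale address can hide (a sketch; the exact output depends on the setup, and ompi_info ships with Open MPI):

Quote:
# does the kernel have a route to the phantom address?
ip route get 10.0.0.2
# is any interface still carrying it?
ip addr show | grep "10\.0\.0"
# any stale entries in the hosts file?
grep "10\.0\.0" /etc/hosts
# which TCP parameters/interfaces does Open MPI see?
ompi_info --param btl tcp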



Update:

I can get rid of the error message by excluding the IP address range:
Quote:
mpirun -np 20 --mca btl_tcp_if_exclude 10.0.0.0/24 -hostfile machines snappyHexMesh -parallel
However, the processes then start on both machines, but (1) with the wrong number of processes (before it was correct; now it is 10 processes on each machine) and (2) I still do not get any output, and the memory usage stays at ~5 MB per process...
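One thing that might be worth trying (an untested sketch): instead of excluding the stale subnet, explicitly include only the real one, and do it for both the TCP BTL and the out-of-band channel, which selects its interfaces separately:

Quote:
mpirun -np 20 --mca btl_tcp_if_include 10.132.33.0/24 \
       --mca oob_tcp_if_include 10.132.33.0/24 \
       -hostfile machines snappyHexMesh -parallel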


regards


svg

Last edited by svg; January 28, 2014 at 04:48. Reason: New findings
