CFD Online Discussion Forums


svg January 28, 2014 03:41

OpenMPI fails cluster run with an orphaned IP Address
 
Hello all,

I have a very strange problem running OpenFOAM across two computers.
The setup is a bit unusual, but that should not be a problem: one machine is a VMware virtual machine (Ubuntu) and the other is a native install. The Linux and OpenFOAM versions are the same on both. The two are connected through a gigabit switch. The VMware instance uses a bridged interface, which gives it an externally assigned DHCP IP address.
The network itself works fine. I can ssh from each machine to the other, and I can start a run with, e.g.:

Quote:

mpirun -np 20 -hostfile machines snappyHexMesh -parallel
(The VMware machine has 12 cores, the normal machine has 8 cores.)
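As a side note for anyone reproducing this setup, here is a minimal sketch of what a hostfile with explicit slot counts could look like for a 12 + 8 core pair. The contents of the actual `machines` file are not shown in this post, so the host names `vmhost` and `node2` and the file name `machines.example` are placeholders. The helper `total_slots` is also hypothetical; it only adds up the `slots=` entries so they can be compared against the `-np` value.

```shell
# Hypothetical hostfile for the two machines; host names are placeholders.
# "slots=N" is the Open MPI hostfile keyword for the process count per host.
cat > machines.example <<'EOF'
vmhost slots=12
node2 slots=8
EOF

# Sum all slots= entries in a hostfile so the total can be checked against -np.
total_slots() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i ~ /^slots=/) { split($i, kv, "="); n += kv[2] }
    } END { print n + 0 }' "$1"
}

total_slots machines.example   # prints 20
```

If the total differs from the `-np` value, the processes cannot be spread as intended across the two machines.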
mpirun starts up, but strangely it tries to connect to an unknown IP address:

The VMware machine is the host where I launch the process; 10.132.33.101 is the other machine.

Quote:

xx@tbd:~/OpenFOAM/xx-2.2.2/run/tutorials/incompressible/simpleFoam/motorBike$ mpirun -np 20 --mca btl_tcp_if_include eth0 -hostfile machines snappyHexMesh -parallel
xx@10.132.33.101's password:

/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  2.2.2                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build : 2.2.2-9240f8b967db
Exec : snappyHexMesh -parallel
Date : Jan 28 2014
Time : 09:08:14
Host : "tbd"
PID : 12314
[tbd][[63191,1],0][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.0.0.2 failed: No route to host (113)


After this, the snappyHexMesh processes start on both computers (so something is happening), and all processes on both machines run at full CPU load, but each uses only ~5 MB of memory, far less than this case should need. I get no output, and no output files appear even after some time. If I run the same case locally on the VMware machine, everything works fine.
I started experimenting with the mpirun parameters, e.g. forcing a specific interface, but that did not seem to help:
Quote:

mpirun -np 20 --mca btl_tcp_if_include eth0 -hostfile machines snappyHexMesh -parallel
So the main oddity is that the IP address 10.0.0.2 does not exist anywhere! If I run ifconfig on both machines, I only see the Ethernet interfaces with their DHCP addresses (both machines are in the 10.132.33.xx range). I can ping those addresses and everything is fine. My one idea about where 10.0.0.2 comes from is another Ethernet adapter that WAS connected for some time and had this address set manually. However, it is no longer connected to the machine!

I checked the hosts file and the ifconfig output on both machines.
So I am at a bit of a loss and hope that somebody can figure out this routing problem. I really do not see where it even gets the 10.0.0.2 address from.
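One way to hunt for a leftover address like this is to search the network configuration files on both machines for it. The sketch below is only a possible check, not a confirmed diagnosis. The helper name `find_ip_mentions` is hypothetical, and the file paths in the usage comments are typical Ubuntu locations that may not exist on every install. Note also that plain `ifconfig` lists only interfaces that are up, while `ifconfig -a` and `ip addr show` list configured interfaces that are down as well.

```shell
# Print the names of the given files that mention an address as a whole word.
# -F treats the address as a fixed string, so the dots are not regex wildcards;
# -w keeps 10.0.0.2 from matching 10.0.0.20.
find_ip_mentions() {
    ip_addr="$1"
    shift
    grep -l -F -w -- "$ip_addr" "$@" 2>/dev/null
}

# Possible usage on each machine (paths are assumptions, adjust as needed):
#   find_ip_mentions 10.0.0.2 /etc/hosts /etc/network/interfaces
#   ip -4 addr show     # also shows interfaces that are currently down
#   ifconfig -a         # same idea with the older tool
```

An address that turns up in one of those files, or on an interface that is down, would be a candidate for the stale 10.0.0.2 entry.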

:confused:

Update:

I can get rid of the error message by excluding that address range:
Quote:

mpirun -np 20 --mca btl_tcp_if_exclude 10.0.0.0/24 -hostfile machines snappyHexMesh -parallel
However, the processes then start on both machines, but (1) with the wrong number of processes per machine (before it was right; now there are 10 processes on each machine), and (2) I still do not get any output, and memory usage is ~5 MB per process...
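A variant that might be worth trying, sketched here as a dry run: restrict both the MPI traffic (`btl_tcp_if_include`) and Open MPI's out-of-band launch traffic (`oob_tcp_if_include`) to the same interface, instead of excluding a subnet. This is an untested suggestion and not a confirmed fix. It assumes the interface is called `eth0` on both machines, and the helper name `build_mpirun_cmd` is hypothetical. Two things to keep in mind: the include and exclude forms of these parameters are mutually exclusive, and setting `btl_tcp_if_exclude` by hand replaces its default value, so the loopback interface has to be listed explicitly if the exclude form is used.

```shell
# Dry-run helper: prints the mpirun command line instead of executing it.
# Arguments: interface name, process count, hostfile, then the solver command.
build_mpirun_cmd() {
    iface="$1"
    nprocs="$2"
    hostfile="$3"
    shift 3
    echo mpirun -np "$nprocs" \
        --mca btl_tcp_if_include "$iface" \
        --mca oob_tcp_if_include "$iface" \
        -hostfile "$hostfile" "$@"
}

# eth0, 20 and the "machines" hostfile are the values used in this post.
build_mpirun_cmd eth0 20 machines snappyHexMesh -parallel
```

Printing the command first makes it easy to compare against the earlier attempts before actually launching it.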


regards


svg

