CFD Online Discussion Forums

Running damBreak with OpenFOAM 15 and Open MPI on mixed up CPU Systems (https://www.cfd-online.com/Forums/openfoam-solving/58280-running-dambreak-openfoam-15-open-mpi-mixed-up-cpu-systems.html)

engys November 19, 2008 04:58

Hi everybody,
I am trying to run damBreakFine on four single-CPU systems plus one dual-CPU system via Open MPI.

The machines file looks like this:

node1
node2
node3
node4
node5

I start the processes with:

mpirun --hostfile damBreakFine/system/machines -np 5 interFoam -case damBreakFine -parallel >& damBreakFine/log &

If I mpirun on the single-CPU systems only, everything works fine. If I mpirun on the dual-CPU system only, it also works fine. But if I mix them, mpirun spawns a process for every CPU and drives them to 100% usage, and the run gets no further.

The log file shows only: create mesh for time = 0

I have to kill one interFoam process to bring the CPUs back to normal usage.

Any suggestions?

engys November 19, 2008 08:26

Hello again,

I also compiled a little MPI "Hello World" test, and it works fine on 6 CPUs:

----------------------

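#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int node;        /* rank of this process in MPI_COMM_WORLD */
    char buf[1024];  /* host name of the machine running this rank */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &node);

    gethostname(buf, sizeof(buf));
    buf[sizeof(buf) - 1] = '\0';  /* make sure the name is terminated */

    /* Pass the host name through "%s" instead of using it as the format string */
    printf("%s says hello world with nid %d \n", buf, node);

    MPI_Finalize();
    return 0;
}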

--------------------

> mpicc hello.c -o hello

This time the machines file looks like this:

node1
node2
node3
node4
node5 cpu=2

> mpirun --hostfile machines -np 10 ./hello

node1 says hello world with nid 0
node2 says hello world with nid 1
node3 says hello world with nid 2
node4 says hello world with nid 3
node5 says hello world with nid 4
node2 says hello world with nid 7
node3 says hello world with nid 8
node4 says hello world with nid 9
node5 says hello world with nid 5
node1 says hello world with nid 6

Could it be that parallel runs in OpenFOAM 1.5 are only supported on up to 4 processors?

dkingsley November 19, 2008 09:11

I am running turbFoam right now on 23 nodes with dual quad-core CPUs; that's 184 cores.

Did you run decomposePar for the new number of processes?

engys November 19, 2008 09:34

Ahh, good to hear!
Yes, I went through these steps:

Editing decomposeParDict:

-------------

numberOfSubdomains 6;

simpleCoeffs
{
    n (1 6 1);
}

metisCoeffs
{
    processorWeights
    (
        1
        1
        1
        1
        1
        1
    );
}

-------------

> blockMesh
> setFields
> decomposePar -force

> mpirun --hostfile system/machines -np 6 interFoam -parallel >& log &
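Before launching, the steps above can be cross-checked: the value passed to -np should equal numberOfSubdomains and the number of processor directories written by decomposePar. The following is a minimal sketch, assuming a POSIX shell and the standard case layout (system/decomposeParDict and processor0, processor1, ... in the case directory). The function names subdomains_in and check_decomposition are hypothetical helpers and not part of OpenFOAM.

```shell
# Hypothetical helper: print the numberOfSubdomains value found in the
# decomposeParDict whose path is given as $1.
subdomains_in() {
    sed -n 's/^[[:space:]]*numberOfSubdomains[[:space:]]\{1,\}\([0-9]\{1,\}\)[[:space:]]*;.*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: compare the intended -np value ($2) with the
# dictionary entry and the processor* directories of the case in $1.
check_decomposition() {
    case_dir=$1
    np=$2
    n_sub=$(subdomains_in "$case_dir/system/decomposeParDict")
    n_dirs=$(find "$case_dir" -maxdepth 1 -type d -name 'processor[0-9]*' | wc -l | tr -d ' ')
    if [ "$n_sub" = "$np" ] && [ "$n_dirs" = "$np" ]; then
        echo "ok: $np subdomains, $np processor directories"
    else
        echo "mismatch: -np $np, numberOfSubdomains ${n_sub:-?}, processor dirs $n_dirs"
        return 1
    fi
}

# Example call from the directory that contains the case:
#   check_decomposition damBreakFine 6
```

A matching count only rules out a stale decomposition. It says nothing about network problems between the nodes.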

dkingsley November 19, 2008 10:57

Did decomposePar give any messages about zero sized blocks or faces?

Most of the time, when I see OF1.5 hang on startup, it is due to decomposition errors or NIC failures.

What is the topology/layout of the network for your nodes? Are you using the MPI libraries from OF1.5?

engys November 20, 2008 06:00

Thanks for your attention.

decomposePar did not give any messages about zero sized blocks or faces.

Every network interface is working well. The dual system has two network interfaces in the same subnet.

The machines are connected in a star topology to a Gigabit switch, and the Open MPI libraries were compiled from the OpenFOAM 1.5 gtgz archives on Heron LTS.

I also tried the turbFoam solver, but the results are the same. If I use the single-processor machines, everything runs smoothly, but if I mix them with the dual system, the log file says "create mesh for time = 0" and the processes run at 100% CPU usage until I stop them.

dkingsley November 20, 2008 08:06

Is your dual machine actually two CPUs or two cores? If it is two cores, then cpu=2 may not work.

My machine file is auto-generated by the queue manager Torque.

However, I think you should have something like this in yours,

node1 //single processor
node2 //single processor
node3 //single processor
node4 //single processor
node5 //dual processor
node5 //dual processor

That gives six lines to match -np 6. Don't add the comments to the machine file.
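A machine file with one line per process can also be generated, so that the line count and the -np value cannot drift apart. This is a minimal sketch, assuming a POSIX shell. The helper name expand_hosts and the host:slots argument format are made up for illustration, and the mpirun command is only printed, not executed.

```shell
# Hypothetical helper: print one host line per process slot.
# Usage: expand_hosts host:slots [host:slots ...]
expand_hosts() {
    for entry in "$@"; do
        host=${entry%%:*}    # text before the colon
        slots=${entry##*:}   # text after the colon
        i=0
        while [ "$i" -lt "$slots" ]; do
            printf '%s\n' "$host"
            i=$((i + 1))
        done
    done
}

# Host names and slot counts as in the thread: four single-CPU nodes
# plus the dual-CPU node5. The file is written to a temporary path.
machines=$(mktemp)
expand_hosts node1:1 node2:1 node3:1 node4:1 node5:2 > "$machines"

# One process per line, so the line count is the value for -np.
np=$(wc -l < "$machines" | tr -d ' ')
echo "mpirun --hostfile $machines -np $np interFoam -parallel"
```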

engys November 21, 2008 15:38

After a phone call with Bernhard Gschaider, I put the dual system aside and added a quad-core machine with one network interface to the single-CPU systems. This time every node worked well and I could take some measurements.

Thanks for your help and have a nice weekend!

engys November 26, 2008 04:57

Hi,
I solved the problem with the dual-CPU system. After stopping the second network interface with

ifdown eth1

everything works fine.
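If the second interface has to stay up, Open MPI can instead be told which interface to use for its TCP traffic through the btl_tcp_if_include MCA parameter. The following is a sketch only, assuming that eth0 is the interface connected to the switch on every node (the thread names only eth1) and reusing the case layout from the earlier posts. Whether it cures the hang on this particular cluster was not tested in the thread. The MPI_IF variable is a made-up convenience, and the command is printed rather than run.

```shell
# Assumption: eth0 is the interface that should carry MPI traffic on
# every node. Override by setting MPI_IF before running this snippet.
MPI_IF=${MPI_IF:-eth0}

# btl_tcp_if_include restricts Open MPI's TCP transport to the listed
# interfaces. The command is only printed here, not executed.
cmd="mpirun --mca btl_tcp_if_include $MPI_IF --hostfile system/machines -np 6 interFoam -parallel"
echo "$cmd"
```

The complementary parameter btl_tcp_if_exclude lists interfaces to avoid instead. Open MPI expects only one of the two to be set.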


All times are GMT -4.