wese May 13, 2008 04:44

I'm trying to run a simulation with about 3.3 million cells.
Because this would take quite a lot of time on a single computer, I used decomposePar to split the case into 15 pieces (on 9 PCs), but when I tried to start the job, the entire network crashed and our NFS server had to be rebooted.

At first I thought that it was not such a good idea to keep all 15 pieces on the NFS file system.
So I distributed them to the individual computers (checking every step against the OpenFOAM user's manual) and changed the decomposition method from METIS to hierarchical, but when I tried to restart the job, the entire network hung again.
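
A side note on the hierarchical method: the per-direction split (the `n` entry in `hierarchicalCoeffs`) has to multiply out to the number of subdomains. The following is a minimal Python sketch that lists the splits that would be valid for 15 pieces. The helper name `hierarchical_splits` is made up for illustration, and the split actually used for this case is not stated in the post.

```python
def hierarchical_splits(n_subdomains):
    """Return all (nx, ny, nz) triples whose product equals n_subdomains."""
    splits = []
    for nx in range(1, n_subdomains + 1):
        if n_subdomains % nx:
            continue  # nx must divide the subdomain count
        rest = n_subdomains // nx
        for ny in range(1, rest + 1):
            if rest % ny == 0:
                splits.append((nx, ny, rest // ny))
    return splits


if __name__ == "__main__":
    # Print each candidate in the form used for the "n" entry.
    for nx, ny, nz in hierarchical_splits(15):
        print(f"n ({nx} {ny} {nz});")
```

Because 15 only factors as 3 x 5, every valid split leaves at least one direction undivided.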

The funny thing is that everything works fine (no matter which decomposition method I use, or whether I distribute the data across several disks or not) when I use fewer cores for my simulation.

For example, if I use 12 cores on 6 or 7 machines, everything works well.

Last weekend I tried to run this job on 8 nodes (4 computers), but after several hours the job was killed automatically.

What bothers me is that I don't get any logs or other information, so all I can do is restart everything and hope that it works the next time.
Do you have any idea what is causing my problems?
Is there a known problem with Open MPI and an odd number of nodes? And is it possible to get more information about what went wrong when a simulation fails?
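
One generic way to at least keep whatever the job prints before it dies is to send its standard output and standard error to a file. Below is a minimal Python sketch of that idea. The function name `run_and_log` and the log file layout are assumptions for illustration only; nothing in it is specific to OpenFOAM or Open MPI, and it only captures what the launched command itself writes.

```python
import subprocess
import sys
import time


def run_and_log(cmd, log_path):
    """Run cmd, append its combined stdout/stderr to log_path, return the exit code."""
    # Append mode keeps the output of earlier attempts when the job is restarted.
    with open(log_path, "a") as log:
        log.write(f"# started {time.ctime()}: {' '.join(cmd)}\n")
        log.flush()  # make sure the header is written before the child starts
        # stderr is merged into stdout so error messages end up in the same file.
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
        log.write(f"# finished {time.ctime()} with exit code {result.returncode}\n")
    return result.returncode


if __name__ == "__main__":
    # Usage: python run_and_log.py <logfile> <command> [args...]
    sys.exit(run_and_log(sys.argv[2:], sys.argv[1]))
```

For the parallel run, the command passed in would be the usual mpirun command line; a non-zero exit code in the last line of the log is then at least a first hint that the job did not end normally.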

To complete my post:
We work on a 100 Mbit (partly 1000 Mbit) network, with an NFS server hosting the OpenFOAM 1.4.1 installation, under SuSE Linux 10.2. All machines are 64-bit Intel Core 2 Duo computers of nearly the same performance.

