CFD Online Discussion Forums


daveatstyacht August 27, 2010 08:07

Issue with running in parallel on multiple nodes
 
Hey all,
I have been struggling for weeks to get my network to run in parallel using the Open MPI that ships with OF. I have successfully run this case in parallel on a single node. The issue arises when I go to run on multiple machines: MPI runs, but then OF cannot find "controlDict". I get the following error for node "prius" (slave node to node "insight"):
Process 2506 Unable to locate the parameter file "/home/dave/Openfoam/dave-1.7.0/run/kukaseries/unsteadyCoGparallel" in the following search path:
/home/dave/OpenFOAM/dave-1.7.0/run/kukaseries/unsteadyCoGparallel:/usr/share/openmpi/amca-param-sets:/home/dave/OpenFOAM/dave-1.7.0/run/kukaseries/unsteadyCoGparallel
--------------------------------------------------------------------------
[insight:03276] [[17400,1],0] node[0].name insight daemon 0 arch ffca0200
[insight:03276] [[17400,1],0] node[1].name prius daemon 1 arch ffca0200
[prius:02505] procdir: /tmp/openmpi-sessions-dave@prius_0/17400/1/1
[prius:02506] procdir: /tmp/openmpi-sessions-dave@prius_0/17400/1/2
[prius:02506] jobdir: /tmp/openmpi-sessions-dave@prius_0/17400/1
[prius:02506] top: openmpi-sessions-dave@prius_0
[prius:02506] tmp: /tmp
[prius:02506] [[17400,1],2] node[0].name insight daemon 0 arch ffca0200
[prius:02505] jobdir: /tmp/openmpi-sessions-dave@prius_0/17400/1
[prius:02505] top: openmpi-sessions-dave@prius_0
[prius:02505] tmp: /tmp
[prius:02506] [[17400,1],2] node[1].name prius daemon 1 arch ffca0200
[prius:02505] [[17400,1],1] node[0].name insight daemon 0 arch ffca0200
[prius:02505] [[17400,1],1] node[1].name prius daemon 1 arch ffca0200
[insight:03277] procdir: /tmp/openmpi-sessions-dave@insight_0/17400/1/3
[insight:03277] jobdir: /tmp/openmpi-sessions-dave@insight_0/17400/1
[insight:03277] top: openmpi-sessions-dave@insight_0
[insight:03277] tmp: /tmp
[insight:03277] [[17400,1],3] node[0].name insight daemon 0 arch ffca0200
[insight:03277] [[17400,1],3] node[1].name prius daemon 1 arch ffca0200
[insight:03276] mca_param_files=/home/dave/.openmpi/mca-params.conf:/etc/openmpi/openmpi-mca-params.conf (default value)
[insight:03276] mca_base_param_file_prefix=/home/dave/Openfoam/dave-1.7.0/run/kukaseries/unsteadyCoGparallel (file:/home/dave/.openmpi/mca-params.conf)
[insight:03276] mca_base_param_file_path=/usr/share/openmpi/amca-param-sets:/home/dave/OpenFOAM/dave-1.7.0/run/kukaseries/unsteadyCoGparallel (default value)

[1] --> FOAM FATAL IO ERROR:
[1] cannot open file
[2]
[2]
[2] --> FOAM FATAL IO ERROR:
[2] cannot open file
[2]
[2] file: /home/dave/processor2/system/controlDict at line 0.
[2]
[2] From function regIOobject::readStream()
[2] in file db/regIOobject/regIOobjectRead.C at line 61.
[2]
FOAM parallel run exiting
[2]
[1]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[1] file: /home/dave/processor1/system/controlDict at line 0.
[1]
[prius:02506] sess_dir_finalize: proc session dir not empty - leaving
[1] From function regIOobject::readStream()
[1] in file db/regIOobject/regIOobjectRead.C at line 61.
[1]
FOAM parallel run exiting
[1]
[prius:02505] sess_dir_finalize: proc session dir not empty - leaving
[prius:02337] sess_dir_finalize: proc session dir not empty - leaving


The problem is that this directory does not exist on prius, nor do I believe it should, unless it is a temporary directory created when the MPI session is set up. I have tried numerous ways around this, such as the --preload-file, --preload-file-dir and -wdir options, but none has solved the problem. Any help would be really appreciated, since I have been racking my brain for weeks over this.

-Dave

wyldckat August 27, 2010 09:29

Greetings Dave,

OK, here are a few questions, so we can better understand what's going on:
  1. How are you sharing your home folder between nodes? Are you using NFS, sshfs, samba/CIFS, or some other file sharing system? Because OpenMPI doesn't share the folders automagically! It only handles communications between processes (AFAIK). It will need both the OpenFOAM folders and your case folder to be shared among the nodes!
  2. How are you launching the solver? Are you using foamJob or mpirun?
  3. Have you tried running parallelTest? Here is a post on how to use it: post #19 of "OpenFOAM updates" - Note: ignore the changes needed for OpenFOAM 1.6.
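A quick way to answer the first question is to ask every host in the hostfile whether it can see the case's controlDict at the same path. The following is only a minimal sketch: the function name and the REMOTE_SH override are hypothetical (not part of OpenFOAM or Open MPI), and it assumes passwordless ssh between the nodes.

```shell
#!/bin/sh
# Check that every host in an Open MPI hostfile can see the case's controlDict.
# REMOTE_SH is a hypothetical override for the remote command (default: ssh),
# so the function can be tried without a real cluster.
check_case_on_hosts() {
    hostfile=$1
    case_dir=$2
    missing=0
    # The host name is the first field; skip blank lines and comment lines.
    for host in $(awk 'NF && $1 !~ /^#/ {print $1}' "$hostfile"); do
        if ${REMOTE_SH:-ssh} "$host" test -f "$case_dir/system/controlDict" </dev/null; then
            echo "$host: ok"
        else
            echo "$host: cannot see $case_dir/system/controlDict"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, from the master node:
#   check_case_on_hosts machines "$PWD"
```

If any host reports that it cannot see the file, the solver started there will fail with the same "cannot open file" error shown above.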
Best regards,
Bruno

daveatstyacht August 30, 2010 05:57

Bruno,
Thank you for the insight about MPI. I had the impression that MPI would handle file/data transfer as necessary and that a file sharing system was not needed. I will be setting up NFS today to see if I can resolve the issue of getting files onto the slave computers. I almost (see below) succeeded in running a tutorial in parallel (since the files were already on both computers) with the "mpirun --hostfile machines -np 4 <executable> -parallel > log" command.

I noticed a strange difference in behavior between running MPI on a single node and on 2 nodes. When I run on one node, the output is written to the "processorN" folders and I have to use reconstructPar to put the pieces together. When I ran on 2 nodes, it created the fully reconstructed time steps on BOTH computers, and the processor folders remained empty apart from the 0 and constant folders initially placed there. The issue appears to be that mpirun starts the processes on both computers but independently (basically I am running 2 instances of the case, one on each computer, using both cores on each). I noticed this when one computer finished running while the other was still going and was at a different time step. The other odd thing is that I still get the message about being unable to find the "parameter file", but it runs fine. The error I get is this (Process 3499 being on the slave computer):

Process 3499 Unable to locate the parameter file "/home/dave/Openfoam/dave-1.7.0/run/kukaseries/unsteadyCoGparallel" in the following search path:
/home/dave/OpenFOAM/dave-1.7.1/run/tutorials/multiphase/interDyMFoam/ras/sloshingTank3D3DoF:/usr/share/openmpi/amca-param-sets:/home/dave/OpenFOAM/dave-1.7.1/run/tutorials/multiphase/interDyMFoam/ras/sloshingTank3D3DoF

I think I am going to try the parallelTest if I can't get it to run in a few tries.
-Dave

daveatstyacht August 30, 2010 07:29

Whoops, I had typed the -parallel option as >parallel and somehow it did not spit out an error. After correcting it, the case runs in parallel correctly (although without NFS set up, the output for the slave computer is written to "processorN" on that computer instead of on the master computer). Goes to show: when something goes wrong, take a look at what you're typing early in the debug process.
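The reason no error appears is that the shell treats ">parallel" as an output redirection to a file called "parallel" and never passes it to the program, so each process starts as an ordinary serial run. The sketch below reproduces this with a stand-in function; fake_solver is hypothetical and only mimics a solver that looks for the -parallel argument.

```shell
#!/bin/sh
# Stand-in for an OpenFOAM solver (hypothetical): reports how it was started.
fake_solver() {
    for arg in "$@"; do
        if [ "$arg" = "-parallel" ]; then
            echo "parallel run"
            return 0
        fi
    done
    echo "serial run"
}

workdir=$(mktemp -d)

# Intended form: -parallel reaches the program as an argument.
intended=$(fake_solver -parallel)

# Mistyped form: the shell consumes ">parallel" as a redirection, so the
# program gets no arguments and its output lands in a file named "parallel".
( cd "$workdir" && fake_solver >parallel )
mistyped=$(cat "$workdir/parallel")

echo "intended: $intended"
echo "mistyped: $mistyped"
```

A stray file named "parallel" in the directory where mpirun was launched is a telltale sign of this typo.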

-Dave

daveatstyacht August 31, 2010 10:40

Bruno,
I managed to use a really ornate script to tar the case, send it, run it, and then selectively transfer the processor files from each slave computer back to the master computer, where the case is reconstructed. However, I have only done this between 32 bit machines, and have been unable to make a 64 bit machine run together with the 32 bit machines. This is unfortunate, since about half of our machines are 32 bit and the other half are 64 bit. Is there a way to force the 64 bit machines to read the 32 bit libraries?

-Dave

wyldckat August 31, 2010 11:47

Hi Dave,

Why aren't you using NFS for sharing files/folders? It's a lot simpler and more efficient (I think it is, but I'm not 100% sure), although I guess that selective temporary sharing can be more efficient in the long run. Anyway, if you have openSUSE installed, it's very easy to set up NFS!
I suggest NFS because it usually is the most efficient file sharing system among Linux machines, and this way you can also share the whole OpenFOAM folder!
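For orientation, a basic NFS share comes down to one line in /etc/exports on the machine holding the files and one mount command on each of the others. The sketch below only prints those two pieces instead of applying them; the helper names are hypothetical, and the subnet and the choice of insight as the server are assumptions to be adapted.

```shell
#!/bin/sh
# Print, rather than apply, the two pieces of a basic NFS setup.
# The subnet and paths used in the example calls are assumptions.

# Line for /etc/exports on the machine that holds the files.
nfs_export_line() {
    printf '%s %s(rw,sync,no_subtree_check)\n' "$1" "$2"
}

# Mount command to run as root on each of the other machines.
nfs_mount_command() {
    printf 'mount -t nfs %s:%s %s\n' "$1" "$2" "$2"
}

nfs_export_line /home/dave/OpenFOAM 192.168.1.0/24
nfs_mount_command insight /home/dave/OpenFOAM
```

Mounting the share at the same path on every node matters here, because each MPI process looks for the case under the path used on the master.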

As for 32-64bit compatibilities:
  1. You can have a single OpenFOAM installation that is 32bit, shared among all computers. I assume from your post that you already have the 32bit version built, now you can simply share (or copy) the whole OpenFOAM folder (the one that contains both OpenFOAM-1.7.0 and ThirdParty-1.7.0) with all of the machines.
  2. On your 64bit machines, you must install the 32bit multilib libraries through Linux's package manager; the package name varies between Linux distributions. For example, in Ubuntu it's named ia32-libs, if I'm not mistaken.
  3. Once you've got this set up, it's possible to run 32bit machines along with 64bit machines. The downside to this setup is that you won't be taking full advantage of the 64bit processors :(
Personally I've never tried making 32bit versions of OpenFOAM run in parallel with 64bit versions, but if both are built with double precision (or single precision), I can only assume that the problem in between would be OpenMPI itself. Or how the machines are setup to work...

How exactly have you set up each machine?

Best regards,
Bruno

daveatstyacht August 31, 2010 15:23

Bruno,
The shell script was a matter of expediency: I already had a script that did file transfer (by tarring and untarring the files), so I simply modified it to decompose, tar/copy the case to the slaves, and then selectively copy back the relevant processor files (and delete them from the slave to save space). I would like to set up NFS in the long term, but for now I went with what I knew should work, since NFS is still a subject I am rather unfamiliar with.
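For anyone wanting to try the same approach, the steps can be sketched roughly as follows. This is not the actual script: it is a dry run that only prints the commands, the helper names are hypothetical, and the case path, host name and the processor numbers that end up on each host are assumptions that have to match your own hostfile and decomposition.

```shell
#!/bin/sh
# Dry-run sketch of a copy-based workflow: prints the commands, runs nothing.
# Host names, the case path and the rank-to-host mapping are assumptions.

# Commands that stage a decomposed case on one remote host.
stage_case() {
    host=$1
    case_dir=$2
    parent=$(dirname "$case_dir")
    name=$(basename "$case_dir")
    echo "tar -czf /tmp/$name.tgz -C $parent $name"
    echo "scp /tmp/$name.tgz $host:/tmp/"
    echo "ssh $host tar -xzf /tmp/$name.tgz -C $parent"
}

# Commands that bring back only the processor directories written on that
# host, then delete them there to save space.
collect_results() {
    host=$1
    case_dir=$2
    shift 2
    for rank in "$@"; do
        echo "rsync -a $host:$case_dir/processor$rank/ $case_dir/processor$rank/"
        echo "ssh $host rm -r $case_dir/processor$rank"
    done
}

CASE=/home/dave/OpenFOAM/dave-1.7.0/run/kukaseries/unsteadyCoGparallel
echo "decomposePar -case $CASE"
stage_case prius "$CASE"
echo "mpirun --hostfile machines -np 4 interDyMFoam -parallel -case $CASE"
collect_results prius "$CASE" 1 2
echo "reconstructPar -case $CASE"
```

Printing the plan first makes it easy to check the paths before letting the script delete anything on a slave.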

As for the 64 vs 32 bit problem, I have 3 machines with 32 bit Ubuntu and one with 64 bit. One machine is incapable of 64 bit, while the other 3 have 64 bit processors, so I am torn. I can either (a) get the 64 bit machine to work with the 32 bit ones by installing 32 bit Ubuntu on it (I have a boot disk that is incredibly easy to use for this), (b) upgrade the two 64 bit capable machines that currently run 32 bit Ubuntu (I mistakenly installed them from the 32 bit machine's Ubuntu disk) and then modify the libraries as per above, or (c) just modify the 64 bit machine. I think in the long run I will upgrade the two machines to 64 bit Ubuntu and eventually retire the 32 bit machine when a new machine comes along to replace it (since the performance benefits of 64 bit merit a complete move to the 64 bit architecture if possible).

The issue appears to be in which version of the executable is referenced: the 64 bit machine references one library file while the 32 bit machines reference another (as determined by a "which interDyMFoam" on each). As a short term solution I have simply run a quarter of the cases on the 64 bit machine and the remainder on the 32 bit parallel machines (inelegant I know, but working so...). Thank you for the insights in all of this!
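Since "which" only shows the path, a small check of the binary's header can confirm whether the solver found on each node is a 32 bit or a 64 bit build. This is a minimal sketch: elf_class is a hypothetical helper, and it relies only on the ELF convention that the fifth byte of the file is 1 for 32 bit and 2 for 64 bit.

```shell
#!/bin/sh
# Report whether a file is a 32-bit or 64-bit ELF binary by reading its header.
# Byte 4 of an ELF file (EI_CLASS) is 1 for 32-bit and 2 for 64-bit.
elf_class() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not-elf"
        return 1
    fi
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
    case $class in
        1) echo 32 ;;
        2) echo 64 ;;
        *) echo unknown ;;
    esac
}

# Usage on each node (assumes the OpenFOAM environment is sourced there):
#   elf_class "$(which interDyMFoam)"
```

If the nodes report different values, they are running differently built executables.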

-Dave

wyldckat August 31, 2010 17:16

Greetings Dave,

Quote:

Originally Posted by daveatstyacht (Post 273526)
The issue appears to be in which version of the executable is referenced: the 64 bit machine references one library file while the 32 bit machines reference another (as determined by a "which interDyMFoam" on each). As a short term solution I have simply run a quarter of the cases on the 64 bit machine and the remainder on the 32 bit parallel machines (inelegant I know, but working so...).

Now I've got a somewhat clearer picture of your setup. Basically you built OpenFOAM on each machine independently: on the 64bit machine you built the 64bit version, and on the 32bit machines you built the 32bit version. If you were a bit more specific about which version of OpenFOAM you installed and/or how you installed it, I could talk you through the steps needed to make it work!

From what I can infer from the information you gave, you are probably running Ubuntu 10.04 (also known as Lucid) and installed OpenFOAM via the Debian packages provided by OpenCFD. If this is the case, then I believe that some minor "hacking" can get you running all OpenFOAM versions in 32bit, without a need to change Ubuntu versions! Or we could even get the 64bit version to "play ball" with the 32bit versions :D


Quote:

Originally Posted by daveatstyacht (Post 273526)
Thank you for the insights in all of this!

You're welcome :) Helping others is a way of sharing experience ;)

Best regards,
Bruno

