CFD Online Discussion Forums (http://www.cfd-online.com/Forums/)
-   OpenFOAM (http://www.cfd-online.com/Forums/openfoam/)
-   -   problems after decomposing for running (http://www.cfd-online.com/Forums/openfoam/87338-problems-after-decomposing-running.html)

alessio.nz April 18, 2011 05:47

problems after decomposing for running
 
Hello, I had a mesh with a decomposeParDict included and I could use this file for running in parallel without problems. The mesh split well and the run was perfect (this file is set up to split my domain across more than one node of the cluster; each node has 8 cores, so for example I can run on 4 nodes = 32 cores).
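
For reference, a decomposeParDict of the kind described above might look roughly like this (a sketch only; the method and the exact entries are illustrative assumptions, not the actual file from this case):

FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}

// split the domain into 32 pieces, one per core
numberOfSubdomains  32;

method              simple;

simpleCoeffs
{
    n               (4 4 2);    // 4 x 4 x 2 = 32 subdomains
    delta           0.001;
}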

I wanted to use the same file for another mesh, but after splitting the domain across the 32 processors, apparently without errors,

Number of processor faces = 50892
Max number of processor patches = 8
Max number of faces between processors = 9008

Processor 0: field transfer
Processor 1: field transfer
Processor 2: field transfer
Processor 3: field transfer
Processor 4: field transfer
Processor 5: field transfer
Processor 6: field transfer
Processor 7: field transfer
Processor 8: field transfer
Processor 9: field transfer
Processor 10: field transfer
Processor 11: field transfer
Processor 12: field transfer
Processor 13: field transfer
Processor 14: field transfer
Processor 15: field transfer
Processor 16: field transfer
Processor 17: field transfer
Processor 18: field transfer
Processor 19: field transfer
Processor 20: field transfer
Processor 21: field transfer
Processor 22: field transfer
Processor 23: field transfer
Processor 24: field transfer
Processor 25: field transfer
Processor 26: field transfer
Processor 27: field transfer
Processor 28: field transfer
Processor 29: field transfer
Processor 30: field transfer
Processor 31: field transfer

End.

I tried to run with foamJob -p simpleFoam and it gives the following error:


Executing: mpirun -np 32 -hostfile system/machines /cvos/shared/apps/OpenFOAM/OpenFOAM-1.7.1/bin/foamExec simpleFoam -parallel > log 2>&1
[user@cluster]$ tail -f log
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

Do you know what it could be? I have attached the file to the message.

stevenvanharen April 18, 2011 06:33

It seems the MPI call generated by the foamJob script is not correct (I don't see the file specifying the machines).

Read section 3.4 in the user guide and try to run MPI directly, without using the foamJob script.
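
Something along these lines (just a sketch; the hostfile location, process count and log redirection are assumptions based on your post):

mpirun --hostfile machines -np 32 simpleFoam -parallel > log 2>&1

Note that the OpenFOAM environment also has to be available on every node listed in the machines file, otherwise mpirun will complain about missing shared libraries, as in the log you posted.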

alessio.nz April 18, 2011 09:46

This is the command I ran:
mpirun --hostfile system/machines -np 32 SimpleFoam -parallel

and this is what I got:
--------------------------------------------------------------------------
Open RTE detected a parse error in the hostfile:
system/machines
It occured on line number 1 on token 1.
--------------------------------------------------------------------------
[elmo:11368] [[22308,0],0] ORTE_ERROR_LOG: Error in file base/ras_base_allocate.c at line 236
[elmo:11368] [[22308,0],0] ORTE_ERROR_LOG: Error in file base/plm_base_launch_support.c at line 72
[elmo:11368] [[22308,0],0] ORTE_ERROR_LOG: Error in file plm_rsh_module.c at line 990
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

stevenvanharen April 18, 2011 09:58

Quote:

Originally Posted by alessio.nz (Post 304063)
--------------------------------------------------------------------------
Open RTE detected a parse error in the hostfile:
system/machines
It occured on line number 1 on token 1.
--------------------------------------------------------------------------

Somehow it is not happy with your machines file. Are you sure you set the right names for the remote nodes in the "machines" file?
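
A typical machines file just lists the node names, one per line, optionally with the number of processes each node may take, for example (the hostnames here are made up; use the ones from your cluster):

node01 slots=8
node02 slots=8
node03 slots=8
node04 slots=8

The "slots=8" part is Open MPI syntax and can be left out (it then defaults to one process per line). Since your error complains about line 1, token 1, also make sure the file does not start with a blank line or any stray characters.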

alessio.nz April 18, 2011 10:11

Yes, I am sure. I was working with another mesh and it worked perfectly. The problem is that with this different one the splitting seems fine, but once I start running it crashes with the errors I mentioned.

alessio.nz April 20, 2011 08:44

Re:
 
Hello, finally it worked; maybe there was a problem with the cluster itself. Anyway, thanks for the help. Regards

alireza2475 December 23, 2015 15:27

Quote:

Originally Posted by stevenvanharen (Post 304065)
Somehow it is not happy with your machines file. Are you sure you set the right names for the remote nodes in the "machines" file?

Just in case anyone else faces this problem:

There is something wrong in the hostname file, as Steven mentioned.
Sometimes, even if you copy a working file for a new run, it will not work. I suggest creating another hostname file from scratch. I just had the same problem when running a setup that had worked perfectly before; I simply wrote the machine names again and it works now.
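
One thing worth checking before retyping the file (this is only a guess at a likely cause, e.g. invisible characters picked up when the file was copied):

cat -A system/machines

cat -A prints non-printing characters, so a stray ^M at the end of a line (a Windows line ending) would show up and could explain a parse error at line 1, token 1 even though the file looks identical to a working one.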

