MPI issue on multiple nodes

Post #21 (August 28, 2013, 15:43)
Bruno Santos (wyldckat), Retired Super Moderator
Hi Ehsan,

From the little information you've provided, and taking into account that you've checked most of the information in this thread, there are several possibilities:
  1. It might depend on the version of OpenFOAM you are using. That version might have a bug that your case is triggering.
  2. You might be using Open-MPI 1.5.x, which isn't a stable series.
  3. It could be that the file sharing system (NFS?) is not properly configured.
  4. It could be that with this solver and case, substantially more data is exchanged than with the other cases and solvers; or vice versa, there might be too few cells per core, leading to extremely frequent data communication.
  5. It could be that the file sizes are far larger than with other cases.
  6. It could be that the drivers controlling the Ethernet cards are not the correct ones.
  7. It could be a hardware failure in one of the machines.
I could think of several more possibilities, but I'm too tired right now. There are too many possibilities and not enough information.

Best regards,
Bruno

Post #22 (August 30, 2013, 07:04)
Title: Reply
Ehsan (ehsan), Senior Member
Hi Bruno

We run 4 or 5 machines in parallel. They work fine with pimpleFoam, but the connection fails with interPhaseChangeFoam. We use OpenFOAM 2.1, the decomposed parts are the same size because we used Scotch, and there is no sign of hardware failure.

We changed the solver settings, i.e., switched from GAMG to PCG, or increased nCellsInCoarsestLevel from the default value of 10 to 5000. These changes let the run go further, but it still crashed at a later time, for example:

nCellsInCoarsestLevel 10: connection stopped after 1000 s
nCellsInCoarsestLevel 5000: connection stopped after 6000 s
changing p solver to PCG: connection stopped after 3000 s

Could you please help with this?

Thanks

Post #23 (August 31, 2013, 11:42)
Bruno Santos (wyldckat), Retired Super Moderator
Hi Ehsan,

Here are several questions I asked the other day on a related thread:
Quote:
Originally Posted by wyldckat
  • How many cells does your mesh have?
  • What kinds of cells does your mesh have?
  • What does checkMesh output? More specifically:
    Code:
    checkMesh -allGeometry -allTopology
  • Are you using any kind of moving mesh, or AMI, MRF, cyclic patches, mapped boundary conditions, symmetry planes or wedges?
  • Are you using dynamic mesh refinement during the simulation?
  • Which decomposition method did you use?
  • What did the last time instance of the output of the solver look like?
  • Are you using any function objects?
  • Have you tried the more recent versions of OpenFOAM?
In addition, here are a couple more questions:
  • Are you able to reproduce the same problem with the tutorial "multiphase/interPhaseChangeFoam/cavitatingBullet" from OpenFOAM?
  • Have you checked the links in this blog post of mine: "Notes about running OpenFOAM in parallel"?
Best regards,
Bruno

Post #24 (August 31, 2013, 13:25)
Ehsan (ehsan), Senior Member
Hi Bruno

Thanks for your time and effort. Here are my answers:

How many cells does your mesh have?

R: 900,000 cells

What kinds of cells does your mesh have?

R: Structured. We create the mesh in Gambit and then import it into OpenFOAM.

What does checkMesh output? More specifically:
Code:
checkMesh -allGeometry -allTopology

R: We did not check this, but will try it. Meanwhile, I would like to say that the case runs correctly on 2 machines without any problems.


Are you using any kind of moving mesh, or AMI, MRF, cyclic patches, mapped boundary conditions, symmetry planes or wedges?

R: Yes, we use symmetry planes. We solve 3D cavitating flow behind a disk and we solve 1/4 of the geometry.

Are you using dynamic mesh refinement during the simulation?

R: No.

Which decomposition method did you use?

R: Scotch

What did the last time instance of the output of the solver look like?

R: It looks like the previous time steps, but it stops before writing the iterations of the p_rgh equation, i.e., the run hangs.

Are you using any function objects?

R: Yes, but the code stops while solving the p_rgh equation.


functions
(
    forces
    {
        type                forces;
        functionObjectLibs  ("libforces.so");  // Library to load
        patches             (disk);            // Change to your patch name
        rhoInf              998;               // Reference density for the fluid
        CofR                (0 0 0);           // Origin for moment calculations
        outputControl       timeStep;
        outputInterval      100;
    }

    forceCoeffs
    {
        type                forceCoeffs;
        functionObjectLibs  ("libforces.so");
        patches             (disk);            // Change to your patch name
        rhoInf              998;
        CofR                (0 0 0);
        liftDir             (0 1 0);
        dragDir             (1 0 0);
        pitchAxis           (0 0 0);
        magUInf             10;
        lRef                0.07;
        Aref                0.0049;
        outputControl       timeStep;
        outputInterval      100;
    }
);


Have you tried the more recent versions of OpenFOAM?

R: Not yet, we only tried this version.

I would be glad if you could help me with this problem.

Regards

Post #25 (August 31, 2013, 13:41)
Bruno Santos (wyldckat), Retired Super Moderator
Hi Ehsan,

OK, then there is still this question left unanswered:
  • Are you able to reproduce the same problem with the tutorial "multiphase/interPhaseChangeFoam/cavitatingBullet" from OpenFOAM?
In addition, a new question:
  • You have 900000 cells, but into how many sub-domains are you decomposing your case when using the 4 or 5 machines?


And this still has me worried:

Quote:
Originally Posted by ehsan
We detected that the problem is that one system goes out of connection from the network, i.e., once we ping it, it won't reply. It is odd that at the start, it goes fine but after some iterations it stop working in the network.
I've experienced this with some home-made clusters, and the problem was due to either of these:
  • NFS v3, which is bad for HPC. v4 is a lot more stable.
  • Bad drivers for an Ethernet card on Linux, which can cause a whole machine to either crash or drop out of the network.
    • In other words, the problem isn't the hardware itself, it's the drivers for that hardware.
    • And yes, it depended on the case being run in parallel, because of the amount of data that had to be exchanged, which led to the driver not being able to correlate the commands from the Ethernet card.
Best regards,
Bruno

Post #26 (August 31, 2013, 13:57)
Ehsan (ehsan), Senior Member
Hi Bruno

I did not understand the first question, do you mean stopping of the run if we run multiphase/interPhaseChangeFoam/cavitatingBullet?

We use 24 processors, each of them deal with the same size sub-domain (around 8Mg) as we use Scotch to make decompositions.

Question:

1- where could I check the version of NFS?
2- We use Ubuntu v. 11, where should I precisely check for network drivers?

Regards, and many thanks for your time.

Post #27 (August 31, 2013, 14:44)
Bruno Santos (wyldckat), Retired Super Moderator
Hi Ehsan,

Quote:
Originally Posted by ehsan
I did not understand the first question, do you mean stopping of the run if we run multiphase/interPhaseChangeFoam/cavitatingBullet?
The idea is to isolate whether the problem really comes from the solver, the case or the hardware. If you run this tutorial in parallel, using the same decomposition, and it does not block midway, then at least the problem does not come from the solver itself.

Quote:
Originally Posted by ehsan
We use 24 processors, each of them deal with the same size sub-domain (around 8Mg) as we use Scotch to make decompositions.
900000/24 = 37500 cells/processor
The usual rule of thumb is a minimum of 50000 cells per processor. If you use the same number of machines, but fewer processors per machine, does it still freeze?
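
The arithmetic above can be sketched as a small shell helper. This is only an illustration: the function names cells_per_proc and max_procs are made up here (they are not OpenFOAM tools), and the 50000 cells per processor figure is the rule of thumb quoted above, not a hard limit.

```shell
#!/bin/bash
# Hypothetical helpers (names are assumptions, not OpenFOAM tools).

# Integer number of cells per processor: total_cells / n_procs
cells_per_proc() {
    echo $(( $1 / $2 ))
}

# Largest processor count that still gives at least min_cells per processor:
# total_cells / min_cells (integer division rounds down)
max_procs() {
    echo $(( $1 / $2 ))
}

cells_per_proc 900000 24     # prints 37500
max_procs 900000 50000       # prints 18
```

Under this rule of thumb, a 900000 cell mesh would be split into at most 18 sub-domains, so trying 18 or fewer processes spread over the same machines would be one way to test it.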


Quote:
Originally Posted by ehsan
1- where could I check the version of NFS?
Run on each machine:
Code:
nfsstat
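
As a complementary check, and assuming a Linux client where the kernel lists its mounts in /proc/mounts, the NFS version in use normally shows up as a "vers=" mount option. The sketch below prints the mount point, file system type and version of each NFS mount; the function name nfs_versions is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list NFS mounts and the "vers=" option of each one.
# Reads a mounts table in /proc/mounts format (default: /proc/mounts).
nfs_versions() {
    awk '$3 == "nfs" || $3 == "nfs4" {
             v = "unknown"
             n = split($4, opt, ",")
             for (i = 1; i <= n; i++)
                 if (opt[i] ~ /^vers=/) v = substr(opt[i], 6)
             print $2, $3, v
         }' "${1:-/proc/mounts}"
}

# Only run against the live table when it is readable (Linux).
if [ -r /proc/mounts ]; then
    nfs_versions
fi
```

Running it on every machine should show whether all nodes mount the shared case directory with the same NFS version. Where nfs-utils is installed, "nfsstat -m" should report similar per-mount options.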

Quote:
Originally Posted by ehsan
2- We use Ubuntu v. 11, where should I precisely check for network drivers?
Which exact version of Ubuntu? You can confirm by running:
Code:
cat /etc/lsb-release
As for drivers, on each machine run:
  1. This shows which NIC chip is on your machine's Ethernet card:
    Code:
    lspci | grep Ethernet
    Example:
    Code:
    02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)
  2. This gives you a list of loaded modules (drivers):
    Code:
    lsmod
    In my case, the one that matters is this one:
    Code:
    r8169                  62190  0
This is a good example, because the "rev 06" of this same NIC chip cannot use the "r8169" driver in an HPC environment, but since I'm using "rev 01", I shouldn't have any problems.
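
To see directly which driver each network interface is bound to, one option on Linux is to follow the device/driver link under /sys/class/net. This is a sketch under that assumption; the function name nic_drivers is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print "<interface>: <driver>" for each network
# interface backed by a real device. Takes the sysfs net directory as an
# optional argument (default: /sys/class/net).
nic_drivers() {
    local root="${1:-/sys/class/net}"
    local dev
    for dev in "$root"/*; do
        # Virtual interfaces such as "lo" have no device/driver link.
        [ -e "$dev/device/driver" ] || continue
        printf '%s: %s\n' "$(basename "$dev")" \
            "$(basename "$(readlink -f "$dev/device/driver")")"
    done
}

nic_drivers
```

The driver name printed for the Ethernet interface can then be matched against the lsmod list. Where ethtool is installed, "ethtool -i <interface>" reports the driver name and version as well.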

Best regards,
Bruno

Post #28 (July 1, 2019, 19:57)
Hüseyin Can Önel (hconel), Member
Hi,
I am having a problem which I guess is related to this topic.
* If I do not use #calcEntry (#calc "..." function) in my dictionaries, I am able to run simulations on single or multiple nodes.
* If I use #calcEntry (#calc "..." function) in my dictionaries, I am able to run simulations on a single node, but when I try multiple nodes, I get the following error in log.pimpleFoam:

Code:
wmake libso /cfd/honel/OpenFOAM/honel-5.x/c.dual-isol.g-64/dynamicCode/_a17e4453985d3e4233a02ad03051231029c9bb42
    ln: ./lnInclude
    wmkdep: codeStreamTemplate.C
    Ctoo: codeStreamTemplate.C
    ld: /cfd/honel/OpenFOAM/honel-5.x/c.dual-isol.g-64/dynamicCode/_a17e4453985d3e4233a02ad03051231029c9bb42/../platforms/linux64Gcc48DPInt64Opt/lib/libcodeStream_a17e4453985d3e4233a02ad03051231029c9bb42.so
[86] 
[86] 
[86] --> FOAM FATAL IO ERROR: 
[86] Cannot read (NFS mounted) library 
"/cfd/honel/OpenFOAM/honel-5.x/c.dual-isol.g-64/dynamicCode/platforms/linux64Gcc48DPInt64Opt/lib/libcodeStream_a17e4453985d3e4233a02ad03051231029c9bb42.so"
on processor 86 detected size -1 whereas master size is 129190 bytes.
If your case is not NFS mounted (so distributed) set fileModificationSkew to 0
[86] 
[86] file: /cfd/honel/OpenFOAM/honel-5.x/c.dual-isol.g-64/processor86/0/p from line 25 to line 13.
[86] 
[86]     From function static void (* Foam::functionEntries::codeStream::getFunction(const Foam::dictionary&, const Foam::dictionary&))(Foam::Ostream&, const Foam::dictionary&)
[86]     in file db/dictionary/functionEntries/codeStream/codeStream.C at line 270.
[86] 
FOAM parallel run exiting
[86]
The thing is, all other pre-processing applications run without problems (snappyHexMesh, topoSet, decomposePar etc.)

How can I overcome this?
Thanks in advance.

Post #29 (July 9, 2019, 18:47)
Bruno Santos (wyldckat), Retired Super Moderator
Quick answer: Without more details on how you are launching the solver, it's harder to give direct instructions to test what's going on.

In principle, the problem is that the file "libcodeStream_a17e4453985d3e4233a02ad03051231029c9bb42.so" is not properly accessible to all parallel processes. This is either because NFS was not able to deliver the file in time for the load, or because it simply was not shared via NFS.

One possible test is to launch a command through mpirun with -np X (X being the number of cores), so that every process checks the file. For example, these commands will show what the processes see before the simulation is launched:
Code:
mpirun -np 84  find dynamicCode -name "*.so"
mpirun -np 84  ls -l dynamicCode/*/*/*/*
mpirun -np 84  md5sum dynamicCode/*/*/*/*
  • The first two commands will tell you whether each process launched in parallel is able to find the built library file, and what size it sees.
  • The last command will have md5sum read the file and give you a checksum of its content; the checksums should all be identical.
You may need to first run with a single core, to force the code to compile and then abort the run after the compilation is finished, so that the library file is available on all nodes.
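
With many processes, the md5sum output gets long. One way to condense it, sketched here with a hypothetical helper named summarize_md5, is to count how often each checksum and file pair appears. If every process sees the same file, each pair should appear once per process.

```shell
#!/bin/bash
# Hypothetical helper: read md5sum lines ("<checksum>  <file>") from stdin
# and print "<count> <checksum> <file>" for each distinct pair.
# Assumes file names without spaces.
summarize_md5() {
    sort | uniq -c | awk '{ print $1, $2, $3 }'
}

# Example with the md5sum command from the post (needs mpirun and the case
# directory, so it is left commented out here):
#   mpirun -np 84 md5sum dynamicCode/*/*/*/* | summarize_md5
```

A file that shows up with more than one checksum, or with a count lower than the number of processes, would point to a process that read different content or could not read the file at all.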

Post #30 (July 10, 2019, 04:42)
Hüseyin Can Önel (hconel), Member
Hi wyldckat,
Thanks for your reply, as always.
I could not quite get what you mean by how I am launching the solver, but I'm running pimpleFoam via runParallel command. Please ask any specific information if necessary.

I have run the find, ls and md5sum commands in parallel as you have said.

The pimpleFoam log of my problematic case is as follows (it runs on 2 nodes x 16 cores each = 32 cores in total):

http://s000.tinyupload.com/download....84241688376836

Here is the output of find, ls and md5sum commands:

http://s000.tinyupload.com/download....06754599415741

(I had to upload it on an external website because of the 1.5Mb size)

My understanding is that the file which is claimed not to be found is seen by all 32 processors.

Post #31 (July 23, 2019, 19:08)
Bruno Santos (wyldckat), Retired Super Moderator
Hi hconel,

Sorry for the late reply...
Quote:
Originally Posted by hconel
I could not quite get what you mean by how I am launching the solver, but I'm running pimpleFoam via runParallel command. Please ask any specific information if necessary.
OK, that answers my question.

Quote:
Originally Posted by hconel
My understanding is that the file which is claimed not to be found is seen by all 32 processors.
Did you run the md5sum command and the others before or after pimpleFoam? My question is whether the dynamic pieces of code were compiled during decomposePar, or only after pimpleFoam started to run.

I ask this because if the pieces of code were built during decomposePar, then you can try running the md5sum command line before running the solver, to try to force the files to be loaded into the cache on all cores.

Best regards,
Bruno

Post #32 (October 4, 2019, 07:12)
Title: Distributed parallel with interIsoFoam
Ndong-Mefane Stephane Boris (S_teph_2000), Member
Hello,

Does anyone have some experience with interIsoFoam in distributed parallel?
In my case the command just hangs, and the MPI process does not start (no output, no error message).
I've already checked that I can access both nodes (yes, I'm trying with two nodes) via ssh, so now I'm really lost as to why it does not work.

Kazu

Post #33 (November 2, 2022, 05:48)
Josh Williams (joshwilliams), Senior Member
I had a similar issue very recently running multi-node jobs on Oracle Cloud. The simulation would run fine for a few time steps, but then it would eventually hang indefinitely.

This was resolved by adding the additional MPI tags detailed in this blog. Maybe it is specific to Oracle, but hopefully it will help someone in the future.

FYI, our setup was bare-metal Optimized3.36 nodes running OpenFOAM 6 with OpenMPI. The OS was Oracle Linux 7.9.