MPI issue on multiple nodes
August 28, 2013, 15:43   #21
Bruno Santos (wyldckat) - Super Moderator
Hi Ehsan,

From the little information you've provided, and taking into account that you've checked most of the information on this thread, there are several possibilities:
  1. It might depend on the version of OpenFOAM you are using: you might be running a version with a bug that you are triggering.
  2. You might be using Open-MPI 1.5.x, which isn't a stable series (both this and the previous item can be checked quickly; see the sketch after this list).
  3. It could be that the file sharing system (NFS?) is not properly configured.
  4. It could be that with this solver and case the data exchanged is substantially larger than with the other cases and solvers; or, conversely, there might be too few cells per core, leading to extremely frequent communication.
  5. It could be that the file sizes are far larger than with other cases.
  6. It could be that the drivers for the Ethernet cards are not the correct ones.
  7. It could be a hardware failure in one of the machines.
I could think of several more possibilities, but I'm too tired right now. There are too many possibilities and not enough information.
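
If it helps, items 1 and 2 can be checked quickly on each machine. A minimal sketch (nothing here is specific to your setup; the value in the second comment is only illustrative):
Code:
echo $WM_PROJECT_VERSION    # OpenFOAM version in use
echo $FOAM_MPI              # MPI flavour OpenFOAM was built against, e.g. openmpi-1.5.3
mpirun --version            # version of the mpirun actually on the PATH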

Best regards,
Bruno

August 30, 2013, 07:04   #22
ehsan - Senior Member
Hi Bruno

We run 4 or 5 systems in parallel. They work fine if we try pimpleFoam, but the connection fails if we try interPhaseChangeFoam. We use OpenFOAM 2.1, the decomposed parts have the same size since we used Scotch, and there is no sign of hardware failure.

We changed the solver settings, e.g. from GAMG to PCG, and tried increasing nCellsInCoarsestLevel from the default value of 10 to 5000. These changes helped the run go further, but it still crashed at a later time, for example:

nCellsInCoarsestLevel 10: connection stopped after 1000 s
nCellsInCoarsestLevel 5000: connection stopped after 6000 s
changing p solver to PCG: connection stopped after 3000 s
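
For reference, these settings live in system/fvSolution; a rough sketch of what we are changing (the tolerances and smoother shown here are placeholders, not necessarily our exact values):
Code:
solvers
{
    p_rgh
    {
        solver          GAMG;       // switched to PCG in one of the tests
        tolerance       1e-07;
        relTol          0.05;
        smoother        GaussSeidel;
        cacheAgglomeration on;
        agglomerator    faceAreaPair;
        nCellsInCoarsestLevel 10;   // increased to 5000 in one of the tests
        mergeLevels     1;
    }
}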

Would you please help in this regard?

Thanks

August 31, 2013, 11:42   #23
Bruno Santos (wyldckat) - Super Moderator
Hi Ehsan,

Here are several questions I asked the other day on a related thread:
Quote:
Originally Posted by wyldckat View Post
  • How many cells does your mesh have?
  • What kinds of cells does your mesh have?
  • What does checkMesh output? More specifically:
    Code:
    checkMesh -allGeometry -allTopology
  • Are you using any kind of moving mesh, or AMI, MRF, cyclic patches, mapped boundary conditions, symmetry planes or wedges?
  • Are you using dynamic mesh refinement during the simulation?
  • Which decomposition method did you use?
  • What did the last time instance of the output of the solver look like?
  • Are you using any function objects?
  • Have you tried the more recent versions of OpenFOAM?
In addition, here are a couple more questions:
  • Are you able to reproduce the same problem with the tutorial "multiphase/interPhaseChangeFoam/cavitatingBullet" from OpenFOAM? (A possible way to run it is sketched after this list.)
  • Have you checked the links on this blog post of mine: Notes about running OpenFOAM in parallel
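
As for the cavitatingBullet test, this is roughly how it could be run (a sketch only; the hostfile name and number of processes are assumptions, and the tutorial's own meshing steps may differ between OpenFOAM versions):
Code:
run    # i.e. cd $FOAM_RUN
cp -r $FOAM_TUTORIALS/multiphase/interPhaseChangeFoam/cavitatingBullet .
cd cavitatingBullet
blockMesh                  # or whatever meshing steps the tutorial's Allrun script uses
decomposePar               # after setting numberOfSubdomains in system/decomposeParDict
mpirun -np 24 -hostfile machines interPhaseChangeFoam -parallel > log 2>&1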
Best regards,
Bruno

August 31, 2013, 13:25   #24
ehsan - Senior Member
Hi Bruno

Thanks for your time and efforts:

How many cells does your mesh have?

R: 900,000 cells

What kinds of cells does your mesh have?

R: Structured; we create the mesh with Gambit and then import it into OpenFOAM.

What does checkMesh output? More specifically:
Code:
checkMesh -allGeometry -allTopology

R: We have not checked this yet, but we will try it. Meanwhile, I should mention that the case runs correctly with 2 systems without any problems.


Are you using any kind of moving mesh, or AMI, MRF, cyclic patches, mapped boundary conditions, symmetry planes or wedges?

R: Yes, we use symmetry planes. We solve the 3D cavitating flow behind a disk and model 1/4 of the geometry.

Are you using dynamic mesh refinement during the simulation?

R: No.

Which decomposition method did you use?

R: Scotch

What did the last time instance of the output of the solver look like?

R: Like the previous times, but it stops before writing the iterations of the p_rgh equation, i.e. the run hangs.

Are you using any function objects?

R: Yes, but the code stops while solving the p_rgh equation.


functions
(
    forces
    {
        type            forces;
        functionObjectLibs ("libforces.so");    // library to load
        patches         (disk);                 // change to your patch name
        rhoInf          998;                    // reference density for the fluid
        CofR            (0 0 0);                // origin for moment calculations
        outputControl   timeStep;
        outputInterval  100;
    }

    forceCoeffs
    {
        type            forceCoeffs;
        functionObjectLibs ("libforces.so");
        patches         (disk);                 // change to your patch name
        rhoInf          998;
        CofR            (0 0 0);
        liftDir         (0 1 0);
        dragDir         (1 0 0);
        pitchAxis       (0 0 0);
        magUInf         10;
        lRef            0.07;
        Aref            0.0049;
        outputControl   timeStep;
        outputInterval  100;
    }
);


Have you tried the more recent versions of OpenFOAM?

R: Not yet, we only tried this version.

I would be glad if you could help me with this problem.

Regards

August 31, 2013, 13:41   #25
Bruno Santos (wyldckat) - Super Moderator
Hi Ehsan,

OK, then this question is still unanswered:
  • Are you able to reproduce the same problem with the tutorial "multiphase/interPhaseChangeFoam/cavitatingBullet" from OpenFOAM?
In addition, a new question:
  • You have 900,000 cells, but into how many sub-domains are you decomposing the case when using the 4 or 5 machines? (A minimal decomposeParDict sketch follows below.)
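
For reference, the number of sub-domains is set in system/decomposeParDict. A minimal sketch with Scotch (the value 24 below is only an example, not necessarily what you are using):
Code:
numberOfSubdomains  24;     // total number of MPI processes
method              scotch;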


And this still has me worried:

Quote:
Originally Posted by ehsan View Post
We detected that the problem is that one system drops off the network, i.e. once we ping it, it won't reply. It is odd that at the start it goes fine, but after some iterations it stops responding on the network.
I've experienced this with some home-made clusters, and the problem was due to one of the following:
  • Using NFS v3, which is bad for HPC; v4 is a lot more stable (a quick way to check which version is mounted is sketched after this list).
  • Bad drivers for an Ethernet card on Linux, which can lead a whole machine to either crash or drop out of the network.
    • In other words, the problem isn't the hardware itself, it's the drivers for that hardware.
    • And yes, it depended on the case being run in parallel: because of the amount of data that had to be exchanged, the driver could no longer cope with the traffic on the Ethernet card.
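
A minimal sketch of how to check which NFS version each machine is actually mounting (the server name and mount point below are placeholders, not your actual setup):
Code:
nfsstat -m
# to force NFSv4, the share can be mounted via an /etc/fstab entry along these lines:
# headnode:/home   /home   nfs4   defaults   0   0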
Best regards,
Bruno

August 31, 2013, 13:57   #26
ehsan - Senior Member
Hi Bruno

I did not understand the first question; do you mean whether the run also stops if we run multiphase/interPhaseChangeFoam/cavitatingBullet?

We use 24 processors, each of which deals with a sub-domain of the same size (around 8 MB), since we use Scotch for the decomposition.

Questions:

1. Where can I check the version of NFS?
2. We use Ubuntu v. 11; where exactly should I check for the network drivers?

Regards, and many thanks for your time.

August 31, 2013, 14:44   #27
Bruno Santos (wyldckat) - Super Moderator
Hi Ehsan,

Quote:
Originally Posted by ehsan View Post
I did not understand the first question; do you mean whether the run also stops if we run multiphase/interPhaseChangeFoam/cavitatingBullet?
The idea is to isolate whether the problem really comes from the solver, the case, or the hardware. If you run this tutorial in parallel with the same decomposition and it does not block midway, then at least the problem is not in the solver itself.

Quote:
Originally Posted by ehsan View Post
We use 24 processors, each of which deals with a sub-domain of the same size (around 8 MB), since we use Scotch for the decomposition.
900,000 / 24 = 37,500 cells per processor.
The usual rule of thumb is that the minimum should be around 50,000 cells per processor. If you use the same number of machines but fewer processors per machine, does it still freeze? (One way to do this with an Open MPI hostfile is sketched below.)
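
For example, with Open MPI the number of processes placed on each machine can be limited through the hostfile (the host names below are placeholders; remember that decomposePar must then be re-run with a matching numberOfSubdomains):
Code:
# "machines" hostfile - 4 machines, 4 processes each = 16 processes in total
node1 slots=4
node2 slots=4
node3 slots=4
node4 slots=4
Running it would then look something like:
Code:
mpirun -np 16 -hostfile machines interPhaseChangeFoam -parallel > log 2>&1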


Quote:
Originally Posted by ehsan View Post
1. Where can I check the version of NFS?
Run on each machine:
Code:
nfsstat

Quote:
Originally Posted by ehsan View Post
2. We use Ubuntu v. 11; where exactly should I check for the network drivers?
Which exact version of Ubuntu? You can confirm by running:
Code:
cat /etc/lsb-release
As for drivers, on each machine run:
  1. This tells you which NIC chip your Ethernet card has:
    Code:
    lspci | grep Ethernet
    Example:
    Code:
    02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)
  2. This gives you a list of loaded modules (drivers):
    Code:
    lsmod
    In my case, the one that matters is this one:
    Code:
    r8169                  62190  0
This is a good example, because the "rev 06" of this same NIC chip cannot use the "r8169" driver in an HPC environment, but since I'm using "rev 01", I shouldn't have any problems.
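
If you want to double-check which driver is actually bound to the card (the interface name eth0 below is an assumption; yours may differ), you can also run:
Code:
lspci -k | grep -A 3 Ethernet   # the "Kernel driver in use" line names the driver
ethtool -i eth0                 # driver name and version for a given interface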

Best regards,
Bruno
