foam-extend-3.2 Pstream: "MPI_ABORT was invoked"

Old   November 11, 2015, 16:33
Default foam-extend-3.2 Pstream: "MPI_ABORT was invoked"
  #1
New Member
 
Brent Craven
Join Date: Oct 2015
Posts: 7
Rep Power: 10
craven.brent is on a distinguished road
I am having a similar issue with foam-extend-3.2. It installed without problems and runs in parallel using the system Open MPI on a single node (up to 12 cores). But when I try to use more than one node, I get the following MPI_ABORT:

Code:
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 5 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
I tried changing the commsType by putting the following in my case's system/controlDict, but that did not resolve the issue.

Code:
OptimisationSwitches
{
    commsType       nonBlocking;
}
Has anyone else had this issue? I'm sure that it is not my MPI installation because I am using this same system Open MPI with OpenFOAM-2.4.x.

[ Moderator note: moved from http://www.cfd-online.com/Forums/ope...end-3-2-a.html ]

Last edited by wyldckat; November 16, 2015 at 12:04. Reason: see "Moderator note:"

Old   November 14, 2015, 22:30
Default foam-extend-3.2 Pstream: "MPI_ABORT was invoked"
  #2
New Member
 
Brent Craven
Join Date: Oct 2015
Posts: 7
Rep Power: 10
craven.brent is on a distinguished road
Hi All,

I am having major problems getting foam-extend-3.2 running across multiple nodes on a cluster (I have actually tried two different clusters, with the same result). The code installed just fine and runs in serial and in parallel on a single node with decent scaling (so MPI seems to be working fine on a single node). However, as soon as I try to span multiple nodes, I get the following MPI_ABORT error as soon as simpleFoam (or any other solver that I have tested) enters the time loop:

Code:
Starting time loop

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 21 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
I am using system installs of GCC (5.1.0) and Open MPI (1.8.5), which I also use for OpenFOAM-2.4.x (which runs across multiple nodes with no problem). I have tested installations on multiple clusters and confirmed that the above MPI_ABORT error occurs with foam-extend-3.2, while the same case runs across multiple nodes just fine with OpenFOAM-2.4.x and the same GCC/Open MPI. My next step is to install foam-extend-3.1 and see whether I get the same issue there. But has anyone else experienced this issue?

I noticed that the Pstream library changed locations from foam-extend-3.1 to foam-extend-3.2 and seems to have changed quite a bit. I wonder if that is part of the issue?

Old   November 16, 2015, 12:28
Default
  #3
Retired Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Rep Power: 128
Quick answer: Please try using the parallel testing utility that exists for OpenFOAM and foam-extend. Instructions for foam-extend are provided here: http://www.cfd-online.com/Forums/ope...tml#post560394 - post #12

The other possibility that comes to mind is that something in the shell start-up on the compute nodes is loading only part of the foam-extend environment, so that an incompatible version of simpleFoam gets picked up there.
One test I usually do for this is to launch mpirun with a shell script that simply dumps the current shell environment into a log file, so that I can examine what the environment looks like for each launched process. For example, a script containing this:
Code:
#!/bin/sh

export > log_env.$$
The file "log_env.PID" will then contain the environment of the process with process ID PID.
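As a sketch of how those logs might be compared afterwards: the helpers below write the same kind of per-process log (with the host name added) and then print only a few foam-related variables from each log. The function names `log_env` and `summarize_env`, the `HOSTNAME_LOGGED` marker, the directory argument, and the choice of variables (`PATH`, `LD_LIBRARY_PATH`, `WM_PROJECT_DIR`) are assumptions for illustration and are not part of the original script.

```shell
#!/bin/sh
# Hypothetical helpers; names and the selected variables are assumptions.

# Write the environment of the current process to <dir>/log_env.<PID>,
# preceded by the host name so logs from different nodes can be told apart.
log_env() {
    dir="${1:-.}"
    out="$dir/log_env.$$"
    {
        echo "HOSTNAME_LOGGED=$(uname -n)"
        env
    } > "$out"
}

# Print the host name, PATH, LD_LIBRARY_PATH and WM_PROJECT_DIR found in
# every log file in <dir>, each line prefixed by the log file name.
summarize_env() {
    dir="${1:-.}"
    for f in "$dir"/log_env.*; do
        [ -f "$f" ] || continue
        grep -E '^(HOSTNAME_LOGGED|PATH|LD_LIBRARY_PATH|WM_PROJECT_DIR)=' "$f" |
            sed "s|^|$(basename "$f"): |"
    done
}
```

A wrapper script launched through mpirun would call `log_env` once per process, and `summarize_env` would be run afterwards in the directory holding the logs. Note that on a shared file system two processes on different nodes can happen to have the same PID, in which case one log would overwrite the other.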

Old   November 16, 2015, 16:51
Default
  #4
New Member
 
Brent Craven
Join Date: Oct 2015
Posts: 7
Rep Power: 10
craven.brent is on a distinguished road
Hi Bruno,

Thanks for your recommendation. I used your script and looked at my shell environment, which looks fine to me (it shows the correct $PATH, including foam-extend-3.2, for all processes). So I don't think that's the problem.

I had to slightly modify the parallelTest utility to get it to compile in foam-extend-3.2, since it appears that Time.H no longer exists and I was getting "Time.H: No such file or directory." The source code is attached below.

I ran parallelTest using MPI across multiple nodes (2 nodes with 12 cores each) and here is the resultant stderr:

Code:
[13] slave sending to master 0
[13] slave receiving from master 0
[15] slave sending to master 0
[15] slave receiving from master 0
[23] slave sending to master 0
[23] slave receiving from master 0
[14] slave sending to master 0
[14] slave receiving from master 0
[22] slave sending to master 0
[22] slave receiving from master 0
[12] slave sending to master 0
[12] slave receiving from master 0
[20] slave sending to master [8] slave sending to master 0
[8] slave receiving from master 0
[9] 0
[20] slave receiving from master 0slave sending to master 0

[9] slave receiving from master 0
[0] master receiving from slave 1
[16] [1] slave sending to master 0
[1] slave receiving from master 0
[0] (0 1 2)
[0] master receiving from slave 2
slave sending to master 0[19] slave sending to master 0
[19] slave receiving from master 0

[21] slave sending to master 0
[21] slave receiving from master 0
[16] [11] slave sending to master 0
[11] slave receiving from master 0
slave receiving from master [3] slave sending to master 0
[3] slave receiving from master 0
[6] slave sending to master 0
[6] slave receiving from master 0[7] slave sending to master 0
[7] slave receiving from master 0

0
[2] slave sending to master 0
[0] (0 1 2)[2] slave receiving from master 0

[0] master receiving from slave 3
[0] (0 1 2)
[0] master receiving from slave 4
[10] slave sending to master 0
[10] slave receiving from master 0
[18] [5] slave sending to master 0
[18] slave receiving from master 0slave sending to master 0
[5] slave receiving from master 0

[4] slave sending to master 0
[0] (0 1 2)
[0] master receiving from slave 5[4] 
[0] (0 1 2)
[0] master receiving from slave 6
[0] (0 1 2)
[0] master receiving from slave 7
[0] (0 1 2)
[0] master receiving from slave 8
[0] (0 1 2)
[0] master receiving from slave 9
[0] (0 1 2)
[0] master receiving from slave 10
[0] (0 1 2)
[0] master receiving from slave 11
[0] (0 1 2)
[0] master receiving from slave 12
[0] (0 1 2)
[0] master receiving from slave 13
[0] (0 1 2)
[0] master receiving from slave 14
[0] (0 1 2)
[0] master receiving from slave 15
[0] (0 1 2)
[0] master receiving from slave 16
[0] (0 1 2)
[0] master receiving from slave 17
slave receiving from master 0
[0] [17] slave sending to master 0
[17] slave receiving from master 0
(0 1 2)
[0] master receiving from slave 18
[0] (0 1 2)
[0] master receiving from slave 19
[0] (0 1 2)
[0] master receiving from slave 20
[0] (0 1 2)
[0] master receiving from slave 21
[0] (0 1 2)
[0] master receiving from slave 22
[0] (0 1 2)
[0] master receiving from slave 23
[0] (0 1 2)
[0] master sending to slave 1
[0] [1] (0 1 2)
master sending to slave 2
[0] [2] (0 1 2)
master sending to slave 3
[0] master sending to slave 4
[0] master sending to slave 5
[0] master sending to slave 6
[0] master sending to slave 7
[0] master sending to slave 8
[0] master sending to slave 9
[0] master sending to slave 10
[0] master sending to slave 11
[0] master sending to slave 12
[5] (0 1 2)
[3] (0 1 2)
[4] (0 1 2)
[8] (0 1 2)
[6] (0 1 2)
[11] (0 1 2)
[0] master sending to slave 13
[0] master sending to slave 14
[0] master sending to slave 15
[0] master sending to slave 16
[0] master sending to slave 17
[0] master sending to slave 18
[0] master sending to slave 19
[0] master sending to slave 20
[0] master sending to slave 21
[0] master sending to slave 22
[0] master sending to slave 23
[10] (0 1 2)
[7] (0 1 2)
[9] (0 1 2)
[13] (0 1 2)
[12] [14] (0 1 2)
[17] (0 1 2)
[20] (0 1 2)
[18] (0 1 2)
[16] (0 1 2)
[21] (0 1 2)
[15] (0 1 2)
[19] (0 1 2)
[22] (0 1 2)
[23] (0 1 2)
(0 1 2)

There are no error messages. But, since the output is not synchronized, it's difficult to tell whether there is a problem or not. Does anything pop out at you?
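Because the lines are interleaved, one way to check the output is to search for the expected messages rather than read it in order. The following is only a sketch: it assumes the stderr was saved to a file, and the function name `check_parallel_test` and its arguments are hypothetical. It verifies that, for every slave rank, both the "master receiving from slave N" and "master sending to slave N" messages appear somewhere in the file.

```shell
#!/bin/sh
# Hypothetical helper: check_parallel_test <logfile> <number of processes>
# Returns 0 if the master reported receiving from and sending to every
# slave rank 1..nprocs-1; otherwise prints what is missing and returns 1.
check_parallel_test() {
    log="$1"
    nprocs="$2"
    status=0
    i=1
    while [ "$i" -lt "$nprocs" ]; do
        for msg in "master receiving from slave" "master sending to slave"; do
            # Require a non-digit or end of line after the rank so that
            # "slave 1" is not satisfied by "slave 10".
            if ! grep -Eq "$msg ${i}([^0-9]|\$)" "$log"; then
                echo "missing: $msg $i"
                status=1
            fi
        done
        i=$((i + 1))
    done
    return "$status"
}
```

For the run above this would be called with the saved stderr file and 24 as arguments. It only confirms that the messages were printed, not that the transferred data was correct.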


Thanks for your help. I really appreciate it.
Attached Files
File Type: zip parallelTest.zip (3.1 KB, 4 views)

Old   November 17, 2015, 17:17
Default
  #5
Retired Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Rep Power: 128
Hi Brent,

The output from parallelTest seems OK. Since it didn't crash, this means that at least the basic communication is working as intended with foam-extend's own Pstream mechanism.

I went back to see how you had tried to define the optimization flag and I then remembered that foam-extend does things a bit differently from OpenFOAM. Please check this post: http://www.cfd-online.com/Forums/ope...tml#post491522 - post #7
Oh, this is interesting... check this commit message as well: http://sourceforge.net/p/foam-extend...a0ca1f8ec3230/

If I understood it correctly, you can do the following:
Code:
mpirun ... simpleFoam -parallel -OptimisationSwitches commsType=nonBlocking
Best regards,
Bruno

Old   November 18, 2015, 07:55
Default
  #6
New Member
 
Brent Craven
Join Date: Oct 2015
Posts: 7
Rep Power: 10
craven.brent is on a distinguished road
Hi Bruno,

Setting commsType to 'nonBlocking' in this way seems to have solved my issue. Unfortunately, I had already wiped my previous test case, in which I was setting it in the case controlDict, so I cannot check why that did not work. Regardless, it is now working and I am happy!

Thanks for your help with this! I really appreciate it.

Thanks,
Brent
