CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   OpenFOAM Running, Solving & CFD (https://www.cfd-online.com/Forums/openfoam-solving/)
-   -   foam-extend-3.2 Pstream: "MPI_ABORT was invoked" (https://www.cfd-online.com/Forums/openfoam-solving/162650-foam-extend-3-2-pstream-mpi_abort-invoked.html)

craven.brent November 11, 2015 16:33

foam-extend-3.2 Pstream: "MPI_ABORT was invoked"
 
I am having a similar issue with foam-extend-3.2. It installed with no problems and runs in parallel using the system Open MPI on a single node (up to 12 cores). But when I try to use more than one node, I get the following MPI_ABORT:

Code:

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 5 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

I tried changing the commsType by adding the following to my case's system/controlDict, but that did not resolve the issue.

Code:

OptimisationSwitches
{
    commsType      nonBlocking;
}
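
(For reference, these switches are traditionally picked up from the installation's global controlDict rather than the case controlDict; the sketch below assumes the usual $WM_PROJECT_DIR/etc/controlDict location.)

Code:

// $WM_PROJECT_DIR/etc/controlDict  (global controlDict; adjust to your install)
OptimisationSwitches
{
    commsType       nonBlocking;   // other recognised values: blocking, scheduled
}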

Has anyone else had this issue? I'm sure that it is not my MPI installation because I am using this same system Open MPI with OpenFOAM-2.4.x.

[ Moderator note: moved from http://www.cfd-online.com/Forums/ope...end-3-2-a.html ]

craven.brent November 14, 2015 22:30

foam-extend-3.2 Pstream: "MPI_ABORT was invoked"
 
Hi All,

I am having major problems getting foam-extend-3.2 to run across multiple nodes on a cluster (in fact, I have tried two different clusters with the same result). The code installed just fine and runs in serial and in parallel on a single node with decent scaling, so MPI seems to be working fine on a single node. However, as soon as I try to bridge multiple nodes, I get the following MPI_ABORT error as soon as simpleFoam (or any other solver I have tested) enters the time loop:

Code:

Starting time loop

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 21 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

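(For context, the multi-node runs are launched in the usual decomposePar + mpirun way; the hostfile name and core count below are only illustrative, not the exact command used.)

Code:

# decompose the case for 24 processes (2 nodes x 12 cores), then launch
decomposePar
mpirun -np 24 --hostfile machines simpleFoam -parallel > log.simpleFoam 2>&1
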
I am using system installs of GCC (5.1.0) and Open MPI (1.8.5), which I also use for OpenFOAM-2.4.x (which runs across multiple nodes with no problem). I have tested installations on multiple clusters and confirmed that the above MPI_ABORT error occurs with foam-extend-3.2, while the same case runs across multiple nodes just fine with the same GCC/Open MPI and OpenFOAM-2.4.x. My next step is to install foam-extend-3.1 and see whether I get the same issue there. But has anyone else experienced this?

I noticed that the Pstream library moved to a different location between foam-extend-3.1 and foam-extend-3.2, and its code seems to have changed quite a bit. I wonder whether that is part of the issue.

wyldckat November 16, 2015 12:28

Quick answer: Please try using the parallel testing utility that exists for OpenFOAM and foam-extend. Instructions for foam-extend are provided here: http://www.cfd-online.com/Forums/ope...tml#post560394 - post #12

The other possibility that comes to mind is that the shell environment on the remote nodes is automatically loading a different (or partial) foam-extend environment, resulting in an incompatible version of simpleFoam being launched on those nodes.
One test I usually do for this is to launch mpirun with a shell script that simply dumps the current shell environment into a log file, so that I can examine what the environment looks like for each launched process. For example, a script containing this:
Code:

#!/bin/sh

export > log_env.$$

The file "log_env.PID" will then contain the environment for the process with ID number PID.
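
(To use it, point mpirun at the script instead of the solver; the script and hostfile names below are just placeholders.)

Code:

chmod +x dump_env.sh
mpirun -np 24 --hostfile machines ./dump_env.sh
# afterwards, inspect the log_env.* files written on each node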

craven.brent November 16, 2015 16:51

1 Attachment(s)
Hi Bruno,

Thanks for your recommendation. I used your script and looked at my shell environment, which looks fine to me (it shows the correct $PATH, including foam-extend-3.2, for all processes). So I don't think that's the problem.

I had to slightly modify the parallelTest utility to get it to compile in foam-extend-3.2, since Time.H apparently no longer exists there and I was getting "Time.H: No such file or directory". The source code is attached below.
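
(The modification is essentially just the header include; the sketch below shows the kind of change involved, assuming the header was simply renamed to foamTime.H in foam-extend.)

Code:

// in parallelTest.C
//#include "Time.H"      // old include; no longer present in foam-extend-3.2
#include "foamTime.H"    // assumed renamed header in foam-extend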

I ran parallelTest using MPI across multiple nodes (2 nodes with 12 cores each) and here is the resultant stderr:

Code:

[13] slave sending to master 0
[13] slave receiving from master 0
[15] slave sending to master 0
[15] slave receiving from master 0
[23] slave sending to master 0
[23] slave receiving from master 0
[14] slave sending to master 0
[14] slave receiving from master 0
[22] slave sending to master 0
[22] slave receiving from master 0
[12] slave sending to master 0
[12] slave receiving from master 0
[20] slave sending to master [8] slave sending to master 0
[8] slave receiving from master 0
[9] 0
[20] slave receiving from master 0slave sending to master 0

[9] slave receiving from master 0
[0] master receiving from slave 1
[16] [1] slave sending to master 0
[1] slave receiving from master 0
[0] (0 1 2)
[0] master receiving from slave 2
slave sending to master 0[19] slave sending to master 0
[19] slave receiving from master 0

[21] slave sending to master 0
[21] slave receiving from master 0
[16] [11] slave sending to master 0
[11] slave receiving from master 0
slave receiving from master [3] slave sending to master 0
[3] slave receiving from master 0
[6] slave sending to master 0
[6] slave receiving from master 0[7] slave sending to master 0
[7] slave receiving from master 0

0
[2] slave sending to master 0
[0] (0 1 2)[2] slave receiving from master 0

[0] master receiving from slave 3
[0] (0 1 2)
[0] master receiving from slave 4
[10] slave sending to master 0
[10] slave receiving from master 0
[18] [5] slave sending to master 0
[18] slave receiving from master 0slave sending to master 0
[5] slave receiving from master 0

[4] slave sending to master 0
[0] (0 1 2)
[0] master receiving from slave 5[4]
[0] (0 1 2)
[0] master receiving from slave 6
[0] (0 1 2)
[0] master receiving from slave 7
[0] (0 1 2)
[0] master receiving from slave 8
[0] (0 1 2)
[0] master receiving from slave 9
[0] (0 1 2)
[0] master receiving from slave 10
[0] (0 1 2)
[0] master receiving from slave 11
[0] (0 1 2)
[0] master receiving from slave 12
[0] (0 1 2)
[0] master receiving from slave 13
[0] (0 1 2)
[0] master receiving from slave 14
[0] (0 1 2)
[0] master receiving from slave 15
[0] (0 1 2)
[0] master receiving from slave 16
[0] (0 1 2)
[0] master receiving from slave 17
slave receiving from master 0
[0] [17] slave sending to master 0
[17] slave receiving from master 0
(0 1 2)
[0] master receiving from slave 18
[0] (0 1 2)
[0] master receiving from slave 19
[0] (0 1 2)
[0] master receiving from slave 20
[0] (0 1 2)
[0] master receiving from slave 21
[0] (0 1 2)
[0] master receiving from slave 22
[0] (0 1 2)
[0] master receiving from slave 23
[0] (0 1 2)
[0] master sending to slave 1
[0] [1] (0 1 2)
master sending to slave 2
[0] [2] (0 1 2)
master sending to slave 3
[0] master sending to slave 4
[0] master sending to slave 5
[0] master sending to slave 6
[0] master sending to slave 7
[0] master sending to slave 8
[0] master sending to slave 9
[0] master sending to slave 10
[0] master sending to slave 11
[0] master sending to slave 12
[5] (0 1 2)
[3] (0 1 2)
[4] (0 1 2)
[8] (0 1 2)
[6] (0 1 2)
[11] (0 1 2)
[0] master sending to slave 13
[0] master sending to slave 14
[0] master sending to slave 15
[0] master sending to slave 16
[0] master sending to slave 17
[0] master sending to slave 18
[0] master sending to slave 19
[0] master sending to slave 20
[0] master sending to slave 21
[0] master sending to slave 22
[0] master sending to slave 23
[10] (0 1 2)
[7] (0 1 2)
[9] (0 1 2)
[13] (0 1 2)
[12] [14] (0 1 2)
[17] (0 1 2)
[20] (0 1 2)
[18] (0 1 2)
[16] (0 1 2)
[21] (0 1 2)
[15] (0 1 2)
[19] (0 1 2)
[22] (0 1 2)
[23] (0 1 2)
(0 1 2)


There are no error messages. But, since the output is not synchronized, it's difficult to tell whether there is a problem or not. Does anything pop out at you?


Thanks for your help. I really appreciate it.

wyldckat November 17, 2015 17:17

Hi Brent,

The output from parallelTest seems OK. Since it didn't crash, this means that at least the basic communication is working as intended with foam-extend's own Pstream mechanism.

I went back to see how you had tried to define the optimization flag and I then remembered that foam-extend does things a bit differently from OpenFOAM. Please check this post: http://www.cfd-online.com/Forums/ope...tml#post491522 - post #7
Oh, this is interesting... check this commit message as well: http://sourceforge.net/p/foam-extend...a0ca1f8ec3230/

If I understood it correctly, you can do the following:
Code:

mpirun ... simpleFoam -parallel -OptimisationSwitches commsType=nonBlocking
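
For example, with illustrative mpirun arguments (process count and hostfile name are placeholders):

Code:

mpirun -np 24 --hostfile machines simpleFoam -parallel \
    -OptimisationSwitches commsType=nonBlocking > log.simpleFoam 2>&1
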
Best regards,
Bruno

craven.brent November 18, 2015 07:55

Hi Bruno,

Making sure commsType was set to nonBlocking in this way seems to have solved my issue. Unfortunately, I wiped the previous test case where I was trying to set it in the case controlDict, so I can't check why that approach didn't work. But regardless, it is now working and I am happy!

Thanks for your help with this! I really appreciate it.

Thanks,
Brent

