Problems with parallelization

September 18, 2015, 02:30   #1
haze_1986 (Senior Member)
Dear All,

I made some modifications to a main solver's PIMPLE loop to take an additional relative error into account, so that convergence is ensured before proceeding to the next time step:

Code:
while
(
    pimple.loop()
 || sdofrbms.motion().mState().aRelErr() > someTolerance
)
{
    // Original PIMPLE solver here
}
Everything works fine in serial, but when I run in parallel an error occurs further down:
Code:
[xxx:11318] *** An error occurred in MPI_Recv
[xxx:11318] *** on communicator MPI_COMM_WORLD
[xxx:11318] *** MPI_ERR_TRUNCATE: message truncated
[xxx:11318] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
I have narrowed the issue down to the while loop: adding the tolerance check causes the error, and it occurs only in parallel. Can anyone give me some advice on how to overcome this?

September 18, 2015, 04:59   #2
alexeym (Alexey Matveichev, Senior Member, Nancy, France)
Hi,

In general this happens if the method sdofrbms.motion().mState().aRelErr() does not collect the error over all processes (i.e. on some processes your while loop is still running, while on others it has already finished).

Do you do something like reduce(tolerance_variable, maxOp<scalar>()) (i.e. determine the maximum value over all processes) inside sdofrbms.motion().mState().aRelErr()?
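
For illustration, here is a minimal sketch of the kind of reduction being asked about; localRelativeError() is a hypothetical placeholder for wherever the per-process value actually comes from:

Code:
scalar relErr = localRelativeError();   // hypothetical per-process value
reduce(relErr, maxOp<scalar>());        // every process now holds the global maximum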

September 18, 2015, 05:04   #3
haze_1986 (Senior Member)
Quote:
Originally Posted by alexeym
Hi,

In general this happens if the method sdofrbms.motion().mState().aRelErr() does not collect the error over all processes (i.e. on some processes your while loop is still running, while on others it has already finished).

Do you do something like reduce(tolerance_variable, maxOp<scalar>()) (i.e. determine the maximum value over all processes) inside sdofrbms.motion().mState().aRelErr()?
The variable returned by sdofrbms.motion().mState().aRelErr() is set in a method that contains:
Code:
if (Pstream::master())
{
    // calculate aRelErr
}

Pstream::scatter(motionState_);
Would that be the cause of the problem?

September 18, 2015, 06:07   #4
alexeym (Alexey Matveichev, Senior Member, Nancy, France)
It depends. When do you recalculate the error? When do you scatter the data? Try to check your additional convergence criterion: print the result of sdofrbms.motion().mState().aRelErr() > someTolerance using the Pout stream, for example, to check whether your processes are somehow unsynchronized. And finally, is the error reproducible? Maybe it was just a fluctuation.
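
For illustration, a rough sketch of that diagnostic (Pout prefixes every line with the processor number, so unsynchronised processes are easy to spot; the names follow the thread and may differ in the actual code):

Code:
// Per-processor diagnostic: each rank prints its own value and verdict.
Pout<< "aRelErr = " << sdofrbms.motion().mState().aRelErr()
    << ", above tolerance: "
    << (sdofrbms.motion().mState().aRelErr() > someTolerance ? "yes" : "no")
    << endl;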

September 19, 2015, 10:16   #5
haze_1986 (Senior Member)
I have made some changes based on my intuition about how parallelization works; I am not very confident, but here goes.

I assume the problem lies with the convergence check, so I tried assigning that job to the master core:
Code:
while
(
    pimple.loop()
 || !converged
)
{
    // Original PIMPLE solver here

    if (Pstream::master())
    {
        converged = sdofrbms.motion().mState().aRelErr() > someTolerance;
    }
}
Now it always runs for 9 time steps; on the 10th it results in the following error, also MPI-related:
Code:
[1] 
[1] 
[1] --> FOAM FATAL IO ERROR: 
[1] error in IOstream "IOstream" for operation operator>>(Istream&, List<T>&) : reading first token
[1] 
[1] file: IOstream at line 0.
[1] 
[1]     From function IOstream::fatalCheck(const char*) const
[1]     in file db/IOstreams/IOstreams/IOstream.C at line 114.
[1] 
FOAM parallel run exiting
[1]

September 20, 2015, 05:27   #6
alexeym (Alexey Matveichev, Senior Member, Nancy, France)
Why do you check convergence only on the master node?

September 20, 2015, 05:38   #7
haze_1986 (Senior Member)
Quote:
Originally Posted by alexeym
Why do you check convergence only on the master node?
In the function where aRelErr() is updated (in sixDoFRigidBodyMotion.C), if (Pstream::master()) {} is used as well. My convergence check looks at only one value, which I feel does not need to be spread over the different processors.

However, I am currently trying to remove if (Pstream::master()) {} from both the top-level solver and the function where aRelErr() lives. It has been running for quite a while now, and there seems to be no difference from the results of the serial runs, which leaves me even more confused about why it is needed in the first place.

September 20, 2015, 06:26   #8
alexeym (Alexey Matveichev, Senior Member, Nancy, France)
As I said, the error message you posted usually indicates a synchronization problem between nodes.

Since you check convergence only on the master node, the while loop exits only on the master node; on the others it keeps going. Instead of

Code:
if (Pstream::master())
{
    converged = sdofrbms.motion().mState().aRelErr() > someTolerance;
}
you can use something like:

Code:
scalar aRelErr = sdofrbms.motion().mState().aRelErr();
reduce(aRelErr, maxOp<scalar>());
converged = aRelErr < someTolerance;  // usually convergence means error BELOW certain tolerance
The reduce operation will collect the relative error from all nodes and apply maxOp to the collected data, so converged becomes true only when the relative error drops below someTolerance on ALL nodes.
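
As a sketch of how this might slot back into the modified outer loop from post #1 (names as used in the thread; an illustration rather than tested code):

Code:
bool converged = false;

while (pimple.loop() || !converged)
{
    // Original PIMPLE solver here

    scalar aRelErr = sdofrbms.motion().mState().aRelErr();
    reduce(aRelErr, maxOp<scalar>());       // every rank sees the same global maximum
    converged = aRelErr < someTolerance;    // identical decision on every rank
}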

September 21, 2015, 07:36   #9
haze_1986 (Senior Member)
Quote:
Originally Posted by alexeym
As I said, the error message you posted usually indicates a synchronization problem between nodes.

Since you check convergence only on the master node, the while loop exits only on the master node; on the others it keeps going. Instead of

Code:
if (Pstream::master())
{
    converged = sdofrbms.motion().mState().aRelErr() > someTolerance;
}
you can use something like:

Code:
scalar aRelErr = sdofrbms.motion().mState().aRelErr();
reduce(aRelErr, maxOp<scalar>());
converged = aRelErr < someTolerance;  // usually convergence means error BELOW certain tolerance
The reduce operation will collect the relative error from all nodes and apply maxOp to the collected data, so converged becomes true only when the relative error drops below someTolerance on ALL nodes.
I have just extensively tested the above. Works very well, thank you.


October 14, 2015, 02:22   #10
haze_1986 (Senior Member)
Hi, it turns out that in one of the cases I am currently testing, the output from aRelErr is of the order of 3.52337924472e+246 at every iteration, which does not make sense and differs from the aRelErr value reported by the sixDoFRigidBodyMotion library itself. I do not understand why, since it is just a single value (not a field), or how some of the processors could return such a value.
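
One way to narrow this down, as a rough sketch that reuses the thread's names, is to print each rank's local value before the reduce so the processor holding the spurious number can be identified:

Code:
scalar aRelErr = sdofrbms.motion().mState().aRelErr();
Pout<< "local aRelErr = " << aRelErr << endl;          // one line per processor
reduce(aRelErr, maxOp<scalar>());
Info<< "global (max) aRelErr = " << aRelErr << endl;   // printed on the master only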

October 14, 2015, 03:03   #11
haze_1986 (Senior Member)
Also note that in the sixDoFRigidBodyMotion library, aRelErr is evaluated as below:
Code:
aRelErr() = mag(a() - aPrev())/mag(a());
So I am also not sure whether maxOp<scalar> is the correct way of implementing the reduce.
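
A possible alternative, offered purely as an assumption and not something confirmed in this thread: since the 6-DoF state (and hence aRelErr) appears to be computed only on the master, broadcasting the master's value instead of taking a per-rank maximum would avoid mixing in uninitialised values from the other processors:

Code:
scalar aRelErr = sdofrbms.motion().mState().aRelErr();
Pstream::scatter(aRelErr);                  // broadcast the master's value to all ranks
bool converged = aRelErr < someTolerance;   // same decision everywhere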
