Problems with parallelization

September 18, 2015, 02:30   #1
haze_1986 (Senior Member)
Dear All,

I made some modifications to a main solver's PIMPLE loop to take an additional relative error into account, so that convergence is ensured before proceeding to the next time step:

Code:
while
(
    pimple.loop()
 || sdofrbms.motion().mState().aRelErr() > someTolerance
)
{
    // Original PIMPLE solver here
}
Everything works fine in serial, but when I run in parallel an error occurs further down:
Code:
[xxx:11318] *** An error occurred in MPI_Recv
[xxx:11318] *** on communicator MPI_COMM_WORLD
[xxx:11318] *** MPI_ERR_TRUNCATE: message truncated
[xxx:11318] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
I have narrowed the issue down to the while loop: adding the tolerance check causes the error, and it occurs only in parallel. Can anyone give me some advice on how to overcome this?

September 18, 2015, 04:59   #2
alexeym (Alexey Matveichev, Senior Member, Nancy, France)
Hi,

In general this happens if the method sdofrbms.motion().mState().aRelErr() does not collect the error over all processes (i.e. on some processes your while loop is still running, while on others it has already finished).

Do you do something like reduce(tolerance_variable, maxOp<scalar>()) (i.e. determine the maximum value over all processes) inside sdofrbms.motion().mState().aRelErr()?
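
For illustration, here is a minimal sketch of the kind of reduction being asked about; localRelativeError() is a hypothetical placeholder for wherever the per-process value actually comes from:

Code:
scalar relErr = localRelativeError();   // hypothetical per-process value
reduce(relErr, maxOp<scalar>());        // every process now holds the global maximum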

September 18, 2015, 05:04   #3
haze_1986 (Senior Member)
Quote:
Originally Posted by alexeym
Hi,

In general this happens if the method sdofrbms.motion().mState().aRelErr() does not collect the error over all processes (i.e. on some processes your while loop is still running, while on others it has already finished).

Do you do something like reduce(tolerance_variable, maxOp<scalar>()) (i.e. determine the maximum value over all processes) inside sdofrbms.motion().mState().aRelErr()?
The variable returned by sdofrbms.motion().mState().aRelErr() is set in a method that contains:
Code:
if (Pstream::master())
{
    // calculate aRelErr
}

Pstream::scatter(motionState_);
Would that be the cause of the problem?

September 18, 2015, 06:07   #4
alexeym (Alexey Matveichev, Senior Member, Nancy, France)
It depends. When do you recalculate the error? When do you scatter the data? Try to check your additional convergence criterion: print the result of sdofrbms.motion().mState().aRelErr() > someTolerance using the Pout stream, for example, to check whether your processes are somehow unsynchronized. And finally, is the error reproducible? Maybe it was just a fluctuation.
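
For illustration, a rough sketch of that diagnostic (Pout prefixes every line with the processor number, so unsynchronised processes are easy to spot; the names follow the thread and may differ in the actual code):

Code:
// Per-processor diagnostic: each rank prints its own value and verdict.
Pout<< "aRelErr = " << sdofrbms.motion().mState().aRelErr()
    << ", above tolerance: "
    << (sdofrbms.motion().mState().aRelErr() > someTolerance ? "yes" : "no")
    << endl;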

September 19, 2015, 10:16   #5
haze_1986 (Senior Member)
I have made some changes based on my intuition about how parallelization works; I am not very confident, but here goes.

I assume the problem lies with the convergence check, so I tried assigning that job to the master core:
Code:
while
(
    pimple.loop()
 || !converged
)
{
    // Original PIMPLE solver here

    if (Pstream::master())
    {
        converged = sdofrbms.motion().mState().aRelErr() > someTolerance;
    }
}
Now it always runs for 9 time steps; on the 10th it results in the following error, also MPI-related:
Code:
[1] 
[1] 
[1] --> FOAM FATAL IO ERROR: 
[1] error in IOstream "IOstream" for operation operator>>(Istream&, List<T>&) : reading first token
[1] 
[1] file: IOstream at line 0.
[1] 
[1]     From function IOstream::fatalCheck(const char*) const
[1]     in file db/IOstreams/IOstreams/IOstream.C at line 114.
[1] 
FOAM parallel run exiting
[1]

September 20, 2015, 05:27   #6
alexeym (Alexey Matveichev, Senior Member, Nancy, France)
Why do you check convergence only on the master node?

September 20, 2015, 05:38   #7
haze_1986 (Senior Member)
Quote:
Originally Posted by alexeym
Why do you check convergence only on the master node?
In the function where aRelErr() is updated (in sixDoFRigidBodyMotion.C), if (Pstream::master()) {} is used as well. My convergence check looks at only one value, which I feel does not need to be spread over the different processors.

However, I am currently trying to remove if (Pstream::master()) {} from both the top-level solver and the function where aRelErr() lives. It has been running for quite a while now, and there seems to be no difference from the results of the serial runs, which leaves me even more confused about why it is needed in the first place.

September 20, 2015, 06:26   #8
alexeym (Alexey Matveichev, Senior Member, Nancy, France)
As I said, the error message you posted usually indicates a synchronization problem between nodes.

Since you check convergence only on the master node, the while loop exits only on the master node; on the others it keeps going. Instead of

Code:
if (Pstream::master())
{
    converged = sdofrbms.motion().mState().aRelErr() > someTolerance;
}
you can use something like:

Code:
scalar aRelErr = sdofrbms.motion().mState().aRelErr();
reduce(aRelErr, maxOp<scalar>());
converged = aRelErr < someTolerance;  // usually convergence means error BELOW certain tolerance
The reduce operation will collect the relative error from all nodes and apply maxOp to the collected data, so converged becomes true only when the relative error drops below someTolerance on ALL nodes.
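
As a sketch of how this might slot back into the modified outer loop from post #1 (names as used in the thread; an illustration rather than tested code):

Code:
bool converged = false;

while (pimple.loop() || !converged)
{
    // Original PIMPLE solver here

    scalar aRelErr = sdofrbms.motion().mState().aRelErr();
    reduce(aRelErr, maxOp<scalar>());       // every rank sees the same global maximum
    converged = aRelErr < someTolerance;    // identical decision on every rank
}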

September 21, 2015, 07:36   #9
haze_1986 (Senior Member)
Quote:
Originally Posted by alexeym
As I said, the error message you posted usually indicates a synchronization problem between nodes.

Since you check convergence only on the master node, the while loop exits only on the master node; on the others it keeps going. Instead of

Code:
if (Pstream::master())
{
    converged = sdofrbms.motion().mState().aRelErr() > someTolerance;
}
you can use something like:

Code:
scalar aRelErr = sdofrbms.motion().mState().aRelErr();
reduce(aRelErr, maxOp<scalar>());
converged = aRelErr < someTolerance;  // usually convergence means error BELOW certain tolerance
The reduce operation will collect the relative error from all nodes and apply maxOp to the collected data, so converged becomes true only when the relative error drops below someTolerance on ALL nodes.
I have just extensively tested the above. Works very well, thank you.


October 14, 2015, 02:22   #10
haze_1986 (Senior Member)
Hi, it turns out that in one of the cases I am currently testing, the output from aRelErr is of the order of 3.52337924472e+246 at every iteration, which does not make sense and differs from the aRelErr value reported by the sixDoFRigidBodyMotion library itself. I do not understand why, since it is just a single value (not a field), or how some of the processors could return such a value.
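
One way to narrow this down, as a rough sketch that reuses the thread's names, is to print each rank's local value before the reduce so the processor holding the spurious number can be identified:

Code:
scalar aRelErr = sdofrbms.motion().mState().aRelErr();
Pout<< "local aRelErr = " << aRelErr << endl;          // one line per processor
reduce(aRelErr, maxOp<scalar>());
Info<< "global (max) aRelErr = " << aRelErr << endl;   // printed on the master only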

October 14, 2015, 03:03   #11
haze_1986 (Senior Member)
Also note that in the sixDoFRigidBodyMotion library, aRelErr is evaluated as below:
Code:
aRelErr() = mag(a() - aPrev())/mag(a());
So I am also not sure whether maxOp<scalar> is the correct way of implementing the reduce.
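
A possible alternative, offered purely as an assumption and not something confirmed in this thread: since the 6-DoF state (and hence aRelErr) appears to be computed only on the master, broadcasting the master's value instead of taking a per-rank maximum would avoid mixing in uninitialised values from the other processors:

Code:
scalar aRelErr = sdofrbms.motion().mState().aRelErr();
Pstream::scatter(aRelErr);                  // broadcast the master's value to all ranks
bool converged = aRelErr < someTolerance;   // same decision everywhere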
