September 18, 2015, 02:30
Problems with parallelization

#1
Senior Member
Join Date: Jul 2011
Posts: 120
Rep Power: 14
Dear All,
I made some modifications to the main solver's PIMPLE loop to take an additional relative error into account, so that convergence is ensured before proceeding to the next time step:
Code:
while (pimple.loop() || sdofrbms.motion().mState().aRelErr() > someTolerance)
{
    // Original PIMPLE solver here
}
When I run the case in parallel, it aborts with:
Code:
[xxx:11318] *** An error occurred in MPI_Recv
[xxx:11318] *** on communicator MPI_COMM_WORLD
[xxx:11318] *** MPI_ERR_TRUNCATE: message truncated
[xxx:11318] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

September 18, 2015, 04:59

#2
Senior Member
Hi,
In general this happens when the method sdofrbms.motion().mState().aRelErr() does not collect the error over all processes, i.e. on some processes your while loop is still running while on others it has already finished. Do you do something like reduce(tolerance_variable, maxOp<scalar>()) (i.e. determine the maximum value over all processes) inside sdofrbms.motion().mState().aRelErr()?
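For reference, a minimal sketch of what such a collective check could look like (sdofrbms and someTolerance are the names from post #1; the rest is only illustrative, not the actual solver code):
Code:
// reduce() and maxOp<> come from the usual Pstream headers
// (PstreamReduceOps.H / ops.H), which solvers normally already pull in

scalar relErr = sdofrbms.motion().mState().aRelErr(); // local value on this rank

reduce(relErr, maxOp<scalar>());                       // maximum over all ranks

bool keepIterating = (relErr > someTolerance);         // identical on every rank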

September 18, 2015, 05:04

#3
Senior Member
Join Date: Jul 2011
Posts: 120
Rep Power: 14
Quote:
Code:
if (Pstream::master())
{
    // calculate aRelErr
}
Pstream::scatter(motionState_); // broadcast the updated state from the master to all processes

September 18, 2015, 06:07

#4
Senior Member
It depends. When do you recalculate the tolerance? When do you scatter the data? Try checking your additional convergence criterion: print the result of sdofrbms.motion().mState().aRelErr() > someTolerance, using the Pout stream for example, to see whether your processes are somehow unsynchronized. And finally, is the error reproducible? Maybe it was just a fluctuation.
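Something along these lines (a debugging sketch only; Pout prefixes every line with the processor number, the other names are the ones used in this thread):
Code:
// print the per-process value and the resulting decision,
// so you can see whether the ranks disagree about the loop condition
Pout<< "aRelErr = " << sdofrbms.motion().mState().aRelErr()
    << ", keep iterating = "
    << (sdofrbms.motion().mState().aRelErr() > someTolerance)
    << endl;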

September 19, 2015, 10:16

#5
Senior Member
Join Date: Jul 2011
Posts: 120
Rep Power: 14
I have made some changes based on my intuition of how parallelization works; I am not very confident, but here goes:
I assume the problem lies with the convergence check, so I tried assigning that job to the master core:
Code:
while (pimple.loop() || !converged)
{
    // Original PIMPLE solver here

    if (Pstream::master())
    {
        converged = sdofrbms.motion().mState().aRelErr() > someTolerance;
    }
}
This now fails with:
Code:
[1]
[1]
[1] --> FOAM FATAL IO ERROR:
[1] error in IOstream "IOstream" for operation operator>>(Istream&, List<T>&) : reading first token
[1]
[1] file: IOstream at line 0.
[1]
[1] From function IOstream::fatalCheck(const char*) const
[1] in file db/IOstreams/IOstreams/IOstream.C at line 114.
[1]
FOAM parallel run exiting
[1]

September 20, 2015, 05:27

#6
Senior Member
Why do you check convergence only on the master node?

September 20, 2015, 05:38

#7
Senior Member
Join Date: Jul 2011
Posts: 120
Rep Power: 14
In the function where aRelErr() is updated (in sixDoFRigidBodyMotion.C),
if (Pstream::master()) {} is used as well. My convergence check only reads a single value, so I felt there was no need to spread it over the different processors. However, I am currently trying to remove if (Pstream::master()) {} from both the top-level solver and the function where aRelErr() is computed. It has been running for quite a while now and there seems to be no difference from the results of the serial runs, which leaves me even more confused about why it is needed in the first place.
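For reference, the guarded pattern in the library (as shown in post #3) is roughly of this shape; this is only a paraphrased sketch, not the actual library code, and the update call is hypothetical:
Code:
if (Pstream::master())
{
    // only the master updates the motion state / aRelErr()
    updateMotionState(motionState_);   // hypothetical stand-in for the real update
}

// broadcast the master's state to the other processes;
// without this, the non-master ranks keep stale (or uninitialized) data
Pstream::scatter(motionState_);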

September 20, 2015, 06:26

#8
Senior Member
As I said, the error message you posted usually indicates a synchronization problem between nodes.
Since you check convergence only on the master node, the while loop exits only on the master node; on the others it keeps going. Instead of
Code:
if (Pstream::master())
{
    converged = sdofrbms.motion().mState().aRelErr() > someTolerance;
}
use something like
Code:
scalar aRelErr = sdofrbms.motion().mState().aRelErr();
reduce(aRelErr, maxOp<scalar>());
converged = aRelErr < someTolerance; // usually convergence means the error is BELOW a certain tolerance
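Put back into the loop from post #1, the whole thing might look roughly like this (only a sketch using the thread's own names; sdofrbms and someTolerance are assumed to be defined elsewhere in the solver):
Code:
bool converged = false;

while (pimple.loop() || !converged)
{
    // Original PIMPLE solver here

    // evaluate the error on every rank, then synchronize it
    scalar aRelErr = sdofrbms.motion().mState().aRelErr();
    reduce(aRelErr, maxOp<scalar>());

    // every process now takes the same branch on the next iteration
    converged = aRelErr < someTolerance;
}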

September 21, 2015, 07:36

#9
Senior Member
Join Date: Jul 2011
Posts: 120
Rep Power: 14
Quote:
Last edited by haze_1986; September 26, 2015 at 01:50. |

October 14, 2015, 02:22

#10
Senior Member
Join Date: Jul 2011
Posts: 120
Rep Power: 14
Hi, it turns out that in one of the cases I am currently testing, the value returned by aRelErr is of the order of 3.52337924472e+246 at every iteration, which does not make sense and differs from the real aRelErr returned by the sixDoFRigidBodyMotion library itself. I do not understand why, since it is just a single value (not a field), nor how some of the processors can end up returning such a value.

October 14, 2015, 03:03

#11
Senior Member
Join Date: Jul 2011
Posts: 120
Rep Power: 14
Also note that in the sixDoFRigidBodyMotion library, aRelErr is evaluated as below:
Code:
aRelErr() = mag(a() - aPrev())/mag(a());
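Just as a thought: if mag(a()) can become very small, a guarded denominator would keep the ratio from blowing up. This is only a sketch using OpenFOAM's SMALL constant, not the library's actual code:
Code:
// protect the denominator so a near-zero acceleration cannot
// produce an absurdly large relative error
aRelErr() = mag(a() - aPrev())/(mag(a()) + SMALL);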