Home > Forums > Software User Forums > OpenFOAM > OpenFOAM Programming & Development

MPI problem turbulentTemperatureCoupledBaffleMixedFvPatchScalarField.C

February 26, 2019, 11:42   #1
Robin Kamenicky (Member; Join Date: Mar 2016; Posts: 65)
Hi guys,

I have been playing around with a BC based on turbulentTemperatureCoupledBaffleMixedFvPatchScalarField.C and temperatureCoupledBase.C.

I have adjusted both files (including the header files) so that temperatureCoupledBase.C now contains two different methods for calculating kappa. In turbulentTemperatureCoupledBaffleMixedFvPatchScalarField.C there is an if condition which decides which kappa function to use; the decision is made based on a user entry.
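For illustration, the dispatch described above can be sketched in plain C++. The function and keyword names here are hypothetical stand-ins, not the actual OpenFOAM entries:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the two kappa evaluations; in the real BC these
// would be patch fields computed from thermophysicalProperties or
// turbulenceProperties respectively.
double kappaFromThermo()     { return 0.025; }
double kappaFromTurbulence() { return 0.030; }

// Dispatch on the user-supplied dictionary entry, mirroring the if condition
// described above (keyword strings are illustrative, not the actual entries).
double kappa(const std::string& method)
{
    if (method == "thermo")
    {
        return kappaFromThermo();
    }
    else if (method == "turbulence")
    {
        return kappaFromTurbulence();
    }
    throw std::invalid_argument("unknown kappa method: " + method);
}
```

Note that this dispatch itself is rank-local and harmless; the parallel trouble discussed below comes from *when* the surrounding BC is called, not from the branch on the user entry.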

My code runs well on 1 CPU, but I am not able to run it on more CPUs. With 2 CPUs the solver completes quite a few PIMPLE iterations, but with 4 CPUs it freezes almost instantly.

By freezing I mean that no backtrace or other error is given. The solver is still running, or rather hanging in a deadlock: I can see it with the top command, but no output is produced any more.

When I run with mpirunDebug, a SIGHUP error occurs:
Code:
        -> Thread 1 "XYZ" received signal SIGHUP, Hangup.
0x00007fffdd069ba8 in ?? () from /usr/lib/openmpi/lib/openmpi/mca_btl_vader.so
#0  0x00007fffdd069ba8 in ?? () from /usr/lib/openmpi/lib/openmpi/mca_btl_vader.so
#1  0x00007fffe901f1ea in opal_progress () from /usr/lib/libopen-pal.so.13
#2  0x00007fffeaceff65 in ompi_request_default_wait_all () from /usr/lib/libmpi.so.12
#3  0x00007fffdbdcd426 in ompi_coll_tuned_allreduce_intra_recursivedoubling () from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#4  0x00007fffeacfff23 in PMPI_Allreduce () from /usr/lib/libmpi.so.12
#5  0x00007ffff041f343 in Foam::allReduce<double, Foam::sumOp<double> > (Value=@0x7fffffff4cb0: -2.4676237571465036e-09, MPICount=1, MPIType=0x7fffeaf6b5e0 <ompi_mpi_double>, MPIOp=0x7fffeaf7c920 <ompi_mpi_op_sum>, bop=..., tag=1, communicator=0) at allReduceTemplates.C:157
#6  0x00007ffff041c7f9 in Foam::reduce (Value=@0x7fffffff4cb0: -2.4676237571465036e-09, bop=..., tag=1, communicator=0) at UPstream.C:223 
#7  0x000000000047e59a in Foam::gSum<double> (f=..., comm=0) at /home/robin/OpenFOAM/OpenFOAM-dev-debug/src/OpenFOAM/lnInclude/FieldFunctions.C:543
#8  0x00007ffff7a26730 in Foam::gSum<double> (tf1=...) at /home/robin/OpenFOAM/OpenFOAM-dev-debug/src/OpenFOAM/lnInclude/FieldFunctions.C:543
#9  0x00007ffff755ddd5 in Foam::fvc::domainIntegrate<double> (vf=...) at /home/robin/OpenFOAM/OpenFOAM-dev-debug/src/finiteVolume/lnInclude/fvcVolumeIntegrate.C:95
This appears at a line where I use an Info statement in my solver. When I comment it out and use the collated fileHandler for parallelisation, I get another one from mpirunDebug:

Code:
-> Thread 1 "XYZ" received signal SIGHUP, Hangup.
0x00007fffdd069a40 in ?? () from /usr/lib/openmpi/lib/openmpi/mca_btl_vader.so
#0  0x00007fffdd069a40 in ?? () from /usr/lib/openmpi/lib/openmpi/mca_btl_vader.so
#1  0x00007fffe901f1ea in opal_progress () from /usr/lib/libopen-pal.so.13
#2  0x00007fffeaceff65 in ompi_request_default_wait_all () from /usr/lib/libmpi.so.12
#3  0x00007fffead20d27 in PMPI_Waitall () from /usr/lib/libmpi.so.12
#4  0x00007ffff041e03a in Foam::UPstream::waitRequests (start=0) at UPstream.C:730
#5  0x000000000052391c in Foam::mapDistributeBase::distribute<double, Foam::flipOp> (commsType=Foam::UPstream::commsTypes::nonBlocking, schedule=..., constructSize=20, subMap=..., subHasFlip=false, constructMap=..., constructHasFlip=false, field=..., negOp=..., tag=2) at /home/robin/OpenFOAM/OpenFOAM-dev-debug/src/OpenFOAM/lnInclude/mapDistributeBaseTemplates.C:587
#6  0x000000000051fb3d in Foam::mapDistributeBase::distribute<double, Foam::flipOp> (this=0x10ab1f8, fld=..., negOp=..., tag=2) at /home/robin/OpenFOAM/OpenFOAM-dev-debug/src/OpenFOAM/lnInclude/mapDistributeBaseTemplates.C:1208
#7  0x00007ffff4cd5880 in Foam::mapDistribute::distribute<double, Foam::flipOp> (this=0x10ab1f0, fld=..., negOp=..., dummyTransform=true, tag=2) at /home/robin/OpenFOAM/OpenFOAM-dev-debug/src/OpenFOAM/lnInclude/mapDistributeTemplates.C:137
#8  0x00007ffff4cd56eb in Foam::mapDistribute::distribute<double> (this=0x10ab1f0, fld=..., dummyTransform=true, tag=2) at /home/robin/OpenFOAM/OpenFOAM-dev-debug/src/OpenFOAM/lnInclude/mapDistributeTemplates.C:155
#9  0x00007ffff4eba4d2 in Foam::mappedPatchBase::distribute<double> (this=0xda9c28, lst=...) at /home/robin/OpenFOAM/OpenFOAM-dev-debug/src/meshTools/lnInclude/mappedPatchBaseTemplates.C:38
#10 0x00007fffd1e67928 in Foam::turbulentTemperatureCoupledBaffleMixedXYZFvPatchScalarField::updateCoeffs (this=0xf14690) at turbulentTemperatureCoupledBaffleMixedXYZ/turbulentTemperatureCoupledBaffleMixedXYZFvPatchScalarField.C:213
I assume it might be an issue with the tag, which is defined at turbulentTemperatureCoupledBaffleMixedFvPatchScalarField.C:166. I define it, as well as mpp, only once in the boundary condition; only then do I split the code with the if condition to choose the kappa() function. I call updateCoeffs() inside another updateCoeffs() inside evaluate(), if this information is of any use.

Some general information:
Code:
mpirun -V
mpirun (Open MPI) 1.10.2
OpenFOAM-dev, compiled on 30 October 2018, not updated since.
Ubuntu 16.04 LTS

My mesh is very small with two regions.

I appreciate any help or comments.
Thank you
Robin

February 26, 2019, 17:42   #2
Andrew Somorjai (massive_turbulence; Senior Member; Join Date: May 2013; Posts: 172)
Quote:
Originally Posted by Robin.Kamenicky
[...] In turbulentTemperatureCoupledBaffleMixedFvPatchScalarField.C there is an if condition which decides which kappa function to use; the decision is made based on a user entry.
Is the kappa function updated every time for each cell or for each processor? Maybe this is causing a weird mismatch between processors?

March 6, 2019, 14:17   #3
Robin Kamenicky (Member; Join Date: Mar 2016; Posts: 65)
Hi Andrew,

Sorry for my late answer, and thank you for your response.

Yes, kappa is updated every time step, but the function is defined in exactly the same manner as the original kappa function; it just looks up thermophysicalProperties or turbulenceProperties.

I debugged it further, and it seems the problem is that I call this new BC (let's call it the mixed BC) from another BC to evaluate the wall temperature. The first pass through the mixed BC is fine, but then control returns to the calling BC. That BC decides, based on the BC residuals it controls, whether it needs one more iteration. Here the problem starts: only some CPUs enter the mixed BC again, while the others do not. With 4 CPUs this means 2 CPUs wait inside the mixed BC for the other 2, while the other 2 wait at some other point in the code.

The problem seems to be at the following lines (turbulentTemperatureCoupledBaffleMixedFvPatchScalarField.C:233):
Code:
this->refValue() = nbrIntFld();
this->refGrad() = 0.0;
this->valueFraction() = nbrKDelta()/(nbrKDelta() + myKDelta());
The code runs without deadlock if I comment them out. The deadlock occurs even when I use the original turbulentTemperatureCoupledBaffleMixedFvPatchScalarField.C without any changes and call it from the other BC.

Thank you again,
Robin

February 12, 2020, 18:08   #4
Benjamin Khoo (benksy; New Member; Join Date: Aug 2018; Posts: 3)
Hi Robin,

Did you manage to solve the issue?

I'm facing a similar issue too.

February 13, 2020, 10:26   #5
Robin Kamenicky (Member; Join Date: Mar 2016; Posts: 65)
Hi Benjamin,


Yes, I was able to solve it. Actually, I saw that the same problem also exists in OF-1906 and OF-1912; I still need to contact the developers and discuss it. There is no bug in the original code!



The main thing is that I call the class's updateCoeffs() from another boundary condition. In that other boundary condition there is an if condition inside a for loop which evaluates an error; based on that evaluation the for loop is either exited or iterated again. The problem occurs when the if condition is evaluated from each processor's separate result: some processors evaluate it as true and some as false, so some stay in the for loop while others exit it. The hang occurs when the processors reach different parts of the code and each encounters an MPI function that requires information from all processors; they then wait for the others forever.



To circumvent the problem, the if condition must be evaluated using information from all processors, not from each of them separately. This can be done with the global reduction functions such as gMax():
Code:
scalar error = gMax(mag(oldValue - newValue));

if (error < 1e-4)
{
    // every processor takes this branch together
}
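The effect of the global reduction can be simulated in plain C++, with each vector entry playing the role of one processor's local value (a sketch of the pattern, not the OpenFOAM implementation — no real MPI is involved):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Each entry of localError is one processor's local residual. With per-rank
// decisions, ranks disagree about whether to iterate again; any rank that
// re-enters the coupled BC then blocks in a collective (the allreduce /
// waitall seen in the backtraces above) that the exited ranks never reach.
std::size_t ranksThatIterateAgain(const std::vector<double>& localError,
                                  double tol)
{
    std::size_t n = 0;
    for (double e : localError)
    {
        if (e > tol) ++n;   // this rank loops again, independently
    }
    return n;
}

// The gMax-style fix: reduce to one global maximum first, so every rank
// evaluates the same condition and takes the same branch.
bool allRanksIterateAgain(const std::vector<double>& localError, double tol)
{
    const double gErr =
        *std::max_element(localError.begin(), localError.end());
    return gErr > tol;      // identical result on every rank
}
```

With four "ranks" holding residuals {2e-5, 3e-4, 8e-6, 5e-4} and a tolerance of 1e-4, the per-rank check sends two ranks back into the loop and lets two exit, while the reduced check sends all four back together.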
I hope this helps. What is your problem?


Have a great day,
Robin

February 13, 2020, 17:46   #6
HPE (Herpes Free Engineer; Senior Member; Join Date: Sep 2019; Location: The Home Under The Ground with the Lost Boys; Posts: 658)
Is there a bug, or is the problem caused by your modifications? If you think there is a bug, I would open a bug ticket on GitLab (considering you are using the ESI OpenFOAM version).

February 13, 2020, 18:23   #7
Robin Kamenicky (Member; Join Date: Mar 2016; Posts: 65)
Hi HPE,


As I have written in the previous post, there is no bug.


The code modifications I made were initially made in OF-6 from openfoam.org. Some time later I found that the same code had been implemented in OF-1906, but part of it was commented out because of hanging when multiple CPUs were used (exactly the reason I encountered during my development in OF-6). That commented part of the code was then deleted in OF-v1912. I find that part of the code quite important, so I opened a ticket a couple of hours ago (https://develop.openfoam.com/Develop...am/issues/1592). I would call it room for improvement rather than a bug.



Thanks
Robin

February 13, 2020, 18:25   #8
HPE (Senior Member; Join Date: Sep 2019; Posts: 658)
Very interesting. Thank you.

February 14, 2020, 18:55   #9
Benjamin Khoo (benksy; New Member; Join Date: Aug 2018; Posts: 3)
Quote:
Originally Posted by Robin.Kamenicky
[...]
I've had a look at the code again, and I realise that my problem is similar but not the same. My code produces a slightly different deltaT on different processors at each time step when adjustTimeStep is turned on. This eventually leads to some processors writing fields while others do not, which results in a hang.
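That failure mode can also be sketched in plain C++: if each rank advances its own deltaT, the simulated times drift apart, so a time-based "write this step?" decision can differ between ranks, and the writers then wait forever for the non-writers. Reducing deltaT to a global minimum keeps the decision identical everywhere. (This is a simulation of the pattern, not OpenFOAM code; OpenFOAM itself synchronises the adjustable deltaT, so divergence usually means rank-local state is leaking into the time-step computation.)

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// t[i] is rank i's simulated time; localDt[i] is the deltaT suggested by
// that rank's local Courant number. Without the reduction, each rank keeps
// its own deltaT and the times drift apart across ranks.
void advanceTime(std::vector<double>& t, const std::vector<double>& localDt,
                 bool synchronise)
{
    const double gDt = *std::min_element(localDt.begin(), localDt.end());
    for (std::size_t i = 0; i < t.size(); ++i)
    {
        t[i] += synchronise ? gDt : localDt[i];
    }
}

// True when every rank agrees on the current time, and hence on any
// decision derived from it (such as whether to write fields this step).
bool ranksAgree(const std::vector<double>& t)
{
    for (double ti : t)
    {
        if (std::abs(ti - t[0]) > 1e-12) return false;
    }
    return true;
}
```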

Tags
boundary condition error, mpi, multiregion, parallel

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
MPI problem when using snappyHexMesh Mohamed Mousa OpenFOAM Running, Solving & CFD 3 September 17, 2017 12:45
MPI problem with fluent daviyu FLUENT 0 July 8, 2017 05:22
Problem running in parralel Val OpenFOAM Running, Solving & CFD 1 June 12, 2014 03:47
MPI problem with fluent alexsatan FLUENT 2 July 9, 2013 05:56
OpenFOAM 1.7.1 installation problem on OpenSUSE 11.3 flakid OpenFOAM Installation 16 December 28, 2010 09:48


All times are GMT -4. The time now is 06:29.