CFD Online Discussion Forums

malaboss April 2, 2013 03:59

Killing a parallel computation with mpirun
 
Hi FOAMers,
I am working on 2D cases in parallel.
The machine I use is shared among several colleagues, so I need to stop my simulations overnight (in my case, at 11 PM) and restart them at 7 AM.

I wrote a script which stops my computations with the command:

Code:

kill $MpirunPID
Everything is fine there

At 7 AM, the script restarts from the latest time (I just call mpirun -np $numberofProcessors $solver).
Sometimes it tells me that the folder for the latest time is incomplete. For example, it says:

Code:

--> FOAM FATAL IO ERROR:
cannot find file

file: /home/OpenFOAM/host-2.1.1/run/cylindre/turbulence/Spalart/pimple/domaine_elargi/cylindreRE1000000_79/68.0095/p at line 0.

    From function regIOobject::readStream()
    in file db/regIOobject/regIOobjectRead.C at line 73.

FOAM exiting

What is really weird is that sometimes everything is fine and sometimes I get this message.
When I get this error, I have to delete the latest time folder and rerun the case.

So I have two questions, and I hope you can help me:
1) How can I tell OpenFOAM to finish the current iteration when I stop the computation?
2) Why does calling "kill" sometimes work and sometimes stop the computation in the middle of an iteration? Am I just lucky?

Thank you all for your help !
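
A restart script could check the latest time folder before calling mpirun, so that an incomplete write is noticed up front. The following is only a minimal sketch under assumptions that are not from the thread: the case is decomposed into processor* directories, time folders have plain decimal names, and fields are written uncompressed. The helper names latest_time and check_latest_complete are hypothetical, not OpenFOAM utilities. The check detects missing field files only, not truncated ones.

```shell
#!/bin/sh
# Hypothetical helper: print the numerically largest time directory in $1.
# Only plain decimal names (e.g. 0, 68.0095) are considered; names in
# scientific notation such as 1e-05 are skipped by this simple filter.
latest_time() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        name=$(basename "$d")
        case "$name" in
            ''|*[!0-9.]*) continue ;;   # skip constant, system, etc.
        esac
        printf '%s\n' "$name"
    done | LC_ALL=C sort -n | tail -n 1
}

# Hypothetical helper: return 0 only if every processor*/<latest time>/
# directory under case directory $1 contains all field files listed after it.
check_latest_complete() {
    case_dir=$1
    shift
    for proc in "$case_dir"/processor*/; do
        [ -d "$proc" ] || return 1          # no processor directories found
        t=$(latest_time "$proc")
        [ -n "$t" ] || return 1             # no time directory found
        for f in "$@"; do
            if [ ! -f "$proc$t/$f" ]; then
                echo "missing: $proc$t/$f" >&2
                return 1
            fi
        done
    done
    return 0
}
```

A restart script might then run something like check_latest_complete "$caseDir" p U before the mpirun call, where $caseDir and the field list are placeholders to adapt to the case.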

haakon April 2, 2013 04:43

By default, killing a process stops it immediately, regardless of what it is doing. If you are lucky, you kill it after it has finished a write but before the next one starts, and you can restart from the last write. If you are unlucky, you kill it in the middle of a write operation, which leaves an incomplete timestep that you cannot restart from.

The solution here is to use OpenFOAM's run-time controls. I suggest you look at these pages:

http://www.openfoam.org/version2.1.0...me-control.php
http://www.openfoam.org/version2.2.0...me-control.php

You will need to set stopAtWriteNowSignal to a positive integer and send that same signal to your process when you want it to stop. It will then write the latest timestep to disk and stop cleanly.

Disclaimer: I have never tried this personally and do not know the details of how it works, but it looks promising.
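
Since the solver only stops at the end of an iteration after receiving the signal, a nightly stop script may want to wait until the process has really exited before doing anything else. Here is a minimal sketch of that idea. The function name stop_and_wait and the timeout handling are illustrative assumptions, not something from OpenFOAM or from this thread, and the signal passed in must match whatever value stopAtWriteNowSignal is set to.

```shell
#!/bin/sh
# Hypothetical helper: stop_and_wait PID SIGNAL TIMEOUT_SECONDS
# Sends SIGNAL to PID, then polls once per second until the process is gone.
# Returns 0 if the process exited, 1 if the signal could not be sent,
# and 2 if the process is still alive after TIMEOUT_SECONDS.
stop_and_wait() {
    pid=$1
    sig=$2
    timeout=$3
    kill -s "$sig" "$pid" 2>/dev/null || return 1
    waited=0
    # kill -0 sends no signal; it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 2
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

For example, stop_and_wait "$MPIRUN_PID" 10 600 would send signal 10 to mpirun and allow up to ten minutes for the final write. The timeout should be longer than the time one iteration plus a write takes for the case at hand.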

malaboss April 2, 2013 10:24

Hi,
Thank you so much for the links.

I added this code to my controlDict file:

Code:

OptimisationSwitches
{
    // Force dumping (at next timestep) upon signal
    writeNowSignal              10;
    // Force dumping (at next timestep) and clean exit upon signal
    stopAtWriteNowSignal        20; //-1;
}

When I signal the process named mpirun with kill -10, everything goes fine and the iteration is completed.
However, when I tried kill -20, it did not kill anything. Nothing happened and the process kept running. I may not have completely understood what stopAtWriteNowSignal does.

If you have the answer, I would be delighted.
Thank you anyway !

haakon April 2, 2013 10:42

I did a quick test and found that if I only set
Code:

OptimisationSwitches
{
    // Force dumping (at next timestep) and clean exit upon signal
    stopAtWriteNowSignal        10;
}

I could send signal 10 to the mpirun process with
Code:

kill -s 10 $MPIRUN_PID
and OpenFOAM nicely writes the solution fields and stops. The terminal log from OpenFOAM is:
Code:

Courant Number mean: 0.0028915643 max: 0.15907544
DILUPBiCG:  Solving for Ux, Initial residual = 6.2541889e-07, Final residual = 6.2541889e-07, No Iterations 0
DILUPBiCG:  Solving for Uz, Initial residual = 9.1704148e-05, Final residual = 6.8430736e-08, No Iterations 1
GAMG:  Solving for p, Initial residual = 0.036746401, Final residual = 0.0012739916, No Iterations 3
GAMG:  Solving for p, Initial residual = 0.0030208887, Final residual = 0.00011714219, No Iterations 7
time step continuity errors : sum local = 5.6965912e-13, global = -7.8049053e-17, cumulative = -2.6564765e-16
GAMG:  Solving for p, Initial residual = 0.0022172423, Final residual = 0.00010582563, No Iterations 10
mpirun: Forwarding signal 10 to job
sigStopAtWriteNow : setting up write and stop at end of the next iteration

GAMG:  Solving for p, Initial residual = 0.00024891902, Final residual = 8.8121468e-08, No Iterations 22
time step continuity errors : sum local = 4.2722368e-16, global = -1.0474626e-18, cumulative = -2.6669511e-16
ExecutionTime = 6.86 s  ClockTime = 9 s

Time = 3.285

Courant Number mean: 0.0028915635 max: 0.15907251
DILUPBiCG:  Solving for Ux, Initial residual = 6.2474242e-07, Final residual = 6.2474242e-07, No Iterations 0
DILUPBiCG:  Solving for Uz, Initial residual = 9.1592459e-05, Final residual = 6.7962237e-08, No Iterations 1
GAMG:  Solving for p, Initial residual = 0.036745335, Final residual = 0.001272349, No Iterations 3
GAMG:  Solving for p, Initial residual = 0.0030201551, Final residual = 0.00011688914, No Iterations 7
time step continuity errors : sum local = 5.6825884e-13, global = -5.5901766e-17, cumulative = -3.2259688e-16
GAMG:  Solving for p, Initial residual = 0.0022140212, Final residual = 0.00010510169, No Iterations 10
GAMG:  Solving for p, Initial residual = 0.00024748843, Final residual = 7.5154998e-08, No Iterations 23
time step continuity errors : sum local = 3.6429105e-16, global = -9.8798764e-19, cumulative = -3.2358487e-16
ExecutionTime = 8.48 s  ClockTime = 11 s

End

Finalising parallel run

I don't know if signal 20 is reserved for something, overridden by the operating system, or something else. Anyway, you only need one signal, right?

malaboss April 2, 2013 11:04

Yup, my problem is already resolved, but I found the behaviour with -20 strange.
I switched the signal values so that I have
Code:

OptimisationSwitches
{
    // Force dumping (at next timestep) upon signal
    writeNowSignal              20;
    // Force dumping (at next timestep) and clean exit upon signal
    stopAtWriteNowSignal        10; //-1;
}

kill -10 still works and kill -20 doesn't...
By "it doesn't work" I mean: when I use -20 in this case, the simulation doesn't stop, but it doesn't write any folders for new times either.

Thank you for your help. At least my first problem is now solved ! :)
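
Signal numbers map to different signals depending on the platform, so it can help to look up what a number means before using it in OptimisationSwitches. The sketch below uses the standard kill -l lookup; the function name describe_signals is just an illustrative choice. On a typical x86-64 Linux system, 10 is SIGUSR1 and 20 is SIGTSTP. The default action of SIGTSTP is to suspend the receiving process rather than terminate it, which would be consistent with a run that neither stops nor writes new time folders, although the thread does not confirm that this is what happened here.

```shell
#!/bin/sh
# Print "<number> <name>" for each signal number given as an argument.
# The mapping is platform-specific; run this on the machine in question.
describe_signals() {
    for n in "$@"; do
        printf '%s %s\n' "$n" "$(kill -l "$n")"
    done
}

describe_signals 10 20
```

If the lookup shows a job-control signal such as TSTP or STOP, picking a user-defined signal such as USR1 or USR2 instead is probably the safer choice.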

malaboss April 3, 2013 03:37

The script ran fine tonight, with kill -10 triggering stopAtWriteNowSignal.
Thanks again !


All times are GMT -4.