Auto restart of crashing simulations in OpenFOAM
Vijaya Kumar. G
I use the 'chtMultiRegionFoam' solver on a large geometry (60 m3). Sometimes the solver crashes with a 'Floating point exception' error during a parallel run (mpirun). But when I restart it, the simulation continues without any issue. This is definitely not a case of a diverging solution, as I get a pretty good match with experimental data.

This problem becomes a nuisance because I cannot hand a simulation to an HPC system and just relax for a day or two: the simulation would have crashed somewhere in between, leaving the HPC idle.

I then decided to use a Python script which kills the process every hour and then restarts it. This way I can limit the idle time of the systems.
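For comparison, a crash-triggered restart (instead of a fixed-interval one) could look like the minimal sketch below. It is only an illustration under assumptions: the function name `supervise` and its parameters are hypothetical, it assumes the solver (or mpirun) exits with a non-zero code when it crashes, and it assumes 'controlDict' contains `startFrom latestTime;` so that a rerun resumes from the last written time step.

```python
import subprocess
import time

def supervise(command, max_restarts=5, pause=0.0):
    """Run `command` in a shell and rerun it while it exits with an error.

    Returns the list of exit codes, one per attempt.
    """
    codes = []
    for attempt in range(max_restarts + 1):
        code = subprocess.call(command, shell=True)
        codes.append(code)
        if code == 0:
            break  # finished normally, nothing to restart
        if attempt < max_restarts:
            time.sleep(pause)  # short pause before the next attempt
    return codes
```

A call might look like `supervise("mpirun -np 8 chtMultiRegionFoam -parallel")`, where the core count 8 is just a placeholder. The restart limit is there so that a genuinely broken case does not loop forever.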

The periodic-restart Python script:
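# Start of script
#!/usr/bin/env python3
import subprocess
import time
import sys

# Usage: <script> <command> <cycle_minutes> <run_minutes> [force]
force = False
args = sys.argv[1:]; app = args[0].replace("'", "")
# Name of the executable: first word of the command, without its path
proc = app.split()[0].split("/")[-1]
# Both times are given in minutes; convert to seconds
cycle = int(args[1])*60; run = int(args[2])*60

# Optional fourth argument "force"
try:
    if args[3] == "force":
        force = True
except IndexError:
    pass

def get_pid(proc_name):
    # Return the pgrep output for proc_name, or None if nothing matches
    try:
        return subprocess.check_output(
            ["pgrep", proc_name]
            ).decode("utf-8").strip()
    except subprocess.CalledProcessError:
        pass

def kill(pid, force):
    if force == False:
        subprocess.Popen(["kill", "-s", "TERM", pid])
    elif force == True:
        # Note: plain "kill" also sends SIGTERM by default
        subprocess.Popen(["kill", pid])

while True:
    # Start the application, let it run, then stop it and wait out the cycle
    subprocess.Popen(["/bin/bash", "-c", app])
    time.sleep(run)
    pid = get_pid(proc)
    if pid != None:
        kill(pid, force)
    time.sleep(cycle - run)
#End of script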

The script works with python3, and the command is as follows:
python3 <script> <command_to_run_application> <cycle_time> <application_run_time>

<script> - Name of the python script
<command_to_run_application> - chtMultiRegionFoam (or any solver)
<cycle_time> - Periodic interval between restarts, in minutes
<application_run_time> - Time for which the application runs within each cycle, in minutes
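To make the timing concrete, here is a small sketch of how the two time arguments translate into seconds. The helper name `schedule` is hypothetical and not part of the script. It follows the script's own arithmetic (minutes times 60, idle time equal to cycle minus run) and adds a check, because `time.sleep()` raises a ValueError for a negative value, which is what would happen if the run time were larger than the cycle time.

```python
def schedule(cycle_minutes, run_minutes):
    """Return (run_seconds, idle_seconds) for one restart cycle."""
    if run_minutes < 0 or run_minutes > cycle_minutes:
        raise ValueError("need 0 <= run time <= cycle time")
    run = run_minutes * 60
    idle = (cycle_minutes - run_minutes) * 60
    return run, idle

# Example: a 60 minute cycle with 55 minutes of run time
print(schedule(60, 55))  # (3300, 300)
```

So with a cycle time of 60 and a run time of 55, the application would run for 3300 s and the system would then sit idle for 300 s before the next start.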

I have tested the script with the 'firefox' browser and it does the intended job. But when I tried it with OpenFOAM, it restarts the solver at the interval given by <cycle_time> but does not kill the process after <application_run_time>. This means the new process starts without closing the old one.
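One possibility, not verified for this case: on Linux, pgrep matches against the process name, which is limited to 15 characters unless the -f option is used. 'firefox' fits within that limit, while 'chtMultiRegionFoam' has 18 characters, so the lookup by name may simply find nothing. Under mpirun there may also be several matching PIDs. The sketch below avoids the name lookup altogether by keeping the handle returned by Popen and signalling the whole process group. The function name `run_for` and its parameters are hypothetical, and the code is POSIX only.

```python
import os
import signal
import subprocess

def run_for(command, run_seconds, grace_seconds=10):
    """Start `command` in its own process group and stop it after run_seconds.

    Returns the exit code (a negative signal number if it was signalled).
    """
    # start_new_session=True makes the child a group leader, so its PID
    # is also the process group ID of everything it spawns.
    proc = subprocess.Popen(command, shell=True, start_new_session=True)
    try:
        return proc.wait(timeout=run_seconds)  # finished on its own
    except subprocess.TimeoutExpired:
        pass
    os.killpg(proc.pid, signal.SIGTERM)  # polite stop for the whole group
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # forced stop as a last resort
        return proc.wait()
```

Because the signal goes to the process group rather than to a PID found by name, the shell, mpirun and the solver ranks started by that one command should all receive it, provided they stay in that group.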

I suppose this has something to do with the 'controlDict'. My guess is that the application run time is dictated by the 'controlDict' file, which overrides the application run time given in the Python script.

Any help would be welcome.

