# problem with interfoam on multiple nodes

June 27, 2016, 11:00
G. K.
Hi,

I am having trouble running interFoam on an HPC cluster and would really appreciate some help.

My case runs in parallel without problems when it is restricted to a single node. However, when I submit the job to multiple nodes, the simulation crashes after an arbitrary number of time steps. The output of a typical case run in parallel (60 cores, i.e. 3 nodes) is given below:

    Courant Number mean: 1.06878e-05 max: 0.000182367
    Interface Courant Number mean: 4.91428e-07 max: 0.000158003
    deltaT = 6.62472e-07
    Time = 3.914837612e-06
    PIMPLE: iteration 1
    MULES: Solving for alpha.water
    Phase-1 volume fraction = 0.0855268 Min(alpha.water) = 0 Max(alpha.water) = 1
    MULES: Solving for alpha.water
    Phase-1 volume fraction = 0.0855268 Min(alpha.water) = 0 Max(alpha.water) = 1
    MULES: Solving for alpha.water
    Phase-1 volume fraction = 0.0855268 Min(alpha.water) = 0 Max(alpha.water) = 1
    GAMG: Solving for p_rgh, Initial residual = 0.00011391, Final residual = 1.10544e-06, No Iterations 6
    time step continuity errors : sum local = 1.68697e-11, global = 1.55809e-13, cumulative = 3.4914e-11
    GAMG: Solving for p_rgh, Initial residual = 7.62066e-05, Final residual = 7.97734e-07, No Iterations 5
    time step continuity errors : sum local = 1.21538e-11, global = 8.31624e-15, cumulative = 3.49224e-11
    GAMG: Solving for p_rgh, Initial residual = 3.03775e-05, Final residual = 8.72768e-09, No Iterations 10
    time step continuity errors : sum local = 1.32912e-13, global = 1.30875e-17, cumulative = 3.49224e-11
    ExecutionTime = 76.1 s ClockTime = 79 s
    Courant Number mean: 1.29647e-05 max: 0.000228018
    Interface Courant Number mean: 5.9077e-07 max: 0.000191552
    deltaT = 7.94966e-07
    Time = 4.709803933e-06
    PIMPLE: iteration 1
    MULES: Solving for alpha.water
    Phase-1 volume fraction = 0.0855268 Min(alpha.water) = 0 Max(alpha.water) = 1
    MULES: Solving for alpha.water
    Phase-1 volume fraction = 0.0855268 Min(alpha.water) = 0 Max(alpha.water) = 1
    MULES: Solving for alpha.water
    Phase-1 volume fraction = 0.0855268 Min(alpha.water) = 0 Max(alpha.water) = 1
    GAMG: Solving for p_rgh, Initial residual = 9.46689e-05, Final residual = 9.36175e-07, No Iterations 6
    time step continuity errors : sum local = 2.03431e-11, global = 1.79796e-13, cumulative = 3.51022e-11
    GAMG: Solving for p_rgh, Initial residual = 6.33264e-05, Final residual = 7.16328e-07, No Iterations 5
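For anyone who wants to check the stability of the run from the log rather than by eye, here is a minimal Python sketch that pulls the maximum Courant number and the time step out of the solver output. The sample text is copied from the output above; the function names (`max_courant`, `delta_t`) are hypothetical helpers of my own and not part of OpenFOAM.

```python
import re

# Sample lines copied from the solver output above.
LOG = """\
Courant Number mean: 1.06878e-05 max: 0.000182367
Interface Courant Number mean: 4.91428e-07 max: 0.000158003
deltaT = 6.62472e-07
Time = 3.914837612e-06
Courant Number mean: 1.29647e-05 max: 0.000228018
Interface Courant Number mean: 5.9077e-07 max: 0.000191552
deltaT = 7.94966e-07
Time = 4.709803933e-06
"""

# A plain or scientific-notation number, e.g. 0.000182367 or 1.06878e-05.
NUM = r"([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)"

# Anchored at line start so "Interface Courant Number" lines are skipped.
COURANT = re.compile(r"^Courant Number mean: " + NUM + r" max: " + NUM, re.M)
DELTA_T = re.compile(r"^deltaT = " + NUM, re.M)


def max_courant(log):
    """Return the max Courant number reported at each time step."""
    return [float(m.group(2)) for m in COURANT.finditer(log)]


def delta_t(log):
    """Return the time step size reported at each time step."""
    return [float(m.group(1)) for m in DELTA_T.finditer(log)]


if __name__ == "__main__":
    for co, dt in zip(max_courant(LOG), delta_t(LOG)):
        print(f"max Co = {co:.3e}, deltaT = {dt:.3e}")
```

In practice the `LOG` string would be replaced by the contents of the real log file; values that stay small right up to the crash would support the impression that the solution itself is not blowing up.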
The simulation does not appear to diverge, but it crashes nonetheless. A typical error message is given below:

    srun: error: node196: task 37: Floating point exception
    srun: Terminating job step 109895.0
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: *** STEP 109895.0 CANCELLED AT 2016-06-23T15:57:08 *** on node019
    srun: error: node196: tasks 20-36,38-39: Killed
    srun: error: node245: tasks 40-59: Killed
    srun: error: node019: tasks 0-19: Killed
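To separate the task that actually failed from the ones that were merely killed afterwards, the srun messages can be filtered with a few lines of Python. This is only a sketch: the regular expression assumes the `srun: error: <node>: task(s) <ids>: <reason>` layout seen in the message above, and `failures` and `first_cause` are hypothetical names of my own.

```python
import re

# Error message copied from the job output above.
SRUN_LOG = """\
srun: error: node196: task 37: Floating point exception
srun: Terminating job step 109895.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: *** STEP 109895.0 CANCELLED AT 2016-06-23T15:57:08 *** on node019
srun: error: node196: tasks 20-36,38-39: Killed
srun: error: node245: tasks 40-59: Killed
srun: error: node019: tasks 0-19: Killed
"""

# Assumed layout: "srun: error: <node>: task(s) <ids>: <reason>"
PATTERN = re.compile(r"^srun: error: (\S+): tasks? ([\d,\-]+): (.+)$", re.M)


def failures(log):
    """Return (node, task ids, reason) for every srun error line."""
    return PATTERN.findall(log)


def first_cause(log):
    """Return only the entries that were not simply killed by the scheduler."""
    return [f for f in failures(log) if f[2] != "Killed"]


if __name__ == "__main__":
    for node, tasks, reason in first_cause(SRUN_LOG):
        print(f"{node}, task {tasks}: {reason}")
```

For the message above this singles out task 37 on node196 (floating point exception); all other tasks were terminated as a consequence.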
In an attempt to resolve this issue I have tried several things: changing the linear solver (e.g. PCG instead of GAMG), using different decomposition methods (scotch, metis, simple, hierarchical), and trying different schemes. In all cases the problem persists.

To check whether this problem affects only my case, I have also tried to run the damBreakWithObstacle tutorial, which is very similar to my case in terms of solution method, etc. For this test I used the interFoam solver (instead of the interDyMFoam solver used by default in the tutorial) with a mesh of (165 165 165) cells and ran it on 60 cores (3 nodes), which corresponds to approximately 75000 cells per processor. Exactly the same problem arises in this case as well!
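The figure of roughly 75000 cells per processor follows directly from the mesh size and the core count. A short Python check, assuming a uniform (165 165 165) block mesh split evenly over the 60 cores (the variable names are mine):

```python
# Mesh and job size as stated in the post.
cells_per_direction = 165
n_cores = 60
n_nodes = 3

total_cells = cells_per_direction ** 3    # 165^3 cells in the block mesh
cells_per_core = total_cells / n_cores    # average load, assuming an even split
cores_per_node = n_cores // n_nodes

print(f"total cells:    {total_cells}")
print(f"cells per core: {cells_per_core:.0f}")
print(f"cores per node: {cores_per_node}")
```

This gives 4492125 cells in total, about 74869 cells per core, and 20 cores per node, which matches the task ranges (0-19, 20-39, 40-59) reported by srun.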

Finally, I should also mention that I use OpenFOAM 2.4.0, which has been compiled against the intelmpi 5.0.3 library.

I would greatly appreciate any ideas on what might cause this problem and how it would be possible to resolve this.

Best regards,

George

 Any suggestions?