CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   CFX (https://www.cfd-online.com/Forums/cfx/)
-   -   Fatal overflow in linear solver occur when execute solution in parallel (https://www.cfd-online.com/Forums/cfx/224753-fatal-overflow-linear-solver-occur-when-execute-solution-parallel.html)

karachun March 3, 2020 10:27

Fatal overflow in linear solver occurs when executing solution in parallel
 
5 Attachment(s)
Hi, folks!

Let me first describe my problem and the physics setup.

I'm performing a sloshing analysis of a 2D rectangular box (256x256 mm) subjected to an impulsive acceleration load. The maximum acceleration is 8 g, the load is ramped, and the load duration is 70 ms.
The working fluids are air and water, both incompressible. The turbulence model is SST. I use the Homogeneous multiphase model with the Standard Free Surface Model option and include the Buoyancy model; the Surface Tension model is set to OFF.
I worked through the "Flow around bump" tutorial before this analysis and mainly use the solver settings from that tutorial.
My computational domain is closed; the boundary conditions consist of four walls and two symmetry planes. I initialize the transient simulation using expressions so that the test water tank is half filled with water.
For convergence control I set 3 to 5 coefficient loops and select the High Resolution scheme for all equations. I also enable Advanced Options -> Volume Fraction Coupling -> Coupled, and use Double Precision for all runs.
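For reference, a half-filled tank can be initialized with CEL step functions. This is only a minimal sketch with assumed names and values (128 mm fill level, nominal water density); the actual expressions are in the attached CCL file:

```
LIBRARY:
  CEL:
    EXPRESSIONS:
      # Assumed fill level: half of the 256 mm tank height
      WaterLevel = 0.128 [m]
      # step() needs a dimensionless argument; returns 1 below the
      # free surface and 0 above it
      WaterVFini = step((WaterLevel - y) / 1 [m])
      AirVFini = 1 - WaterVFini
      # Hydrostatic pressure below the surface, 0 in the air
      # (g is the predefined CEL gravity constant)
      PresIni = WaterVFini * 997 [kg m^-3] * g * (WaterLevel - y)
    END
  END
END
```

These expressions would then be referenced under the domain initialisation for the two volume fractions and the relative pressure.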

Now let's proceed to the failure I encountered.

At the very beginning of the mesh convergence/timestep study I hit a strange error. When I run the problem in serial, I manage to solve it and obtain results. However, when I run the solution in parallel, it diverges at the first timestep. Even if I solve the first timesteps on a single core and then restart in parallel, I still get divergence.

The error message is:
Code:

+--------------------------------------------------------------------+
| ERROR #004100018 has occurred in subroutine FINMES.                 |
| Message:                                                            |
| Fatal overflow in linear solver.                                    |
+--------------------------------------------------------------------+

Therefore I have two questions:
1) What am I doing wrong, and how can I fix the issue with the parallel solution?
2) Are there any other mistakes in my physics/numerics setup?

I attach the CCL file, two meshes (4 and 8 mm element size) and two output files – a successful solution and a failed one.

Thanks in advance!

Opaque March 3, 2020 12:46

If you want to investigate what might be the problem, you can take advantage of the fact you solved the problem in serial (1 core).

Run the problem in parallel using the same initial conditions/guess as in the serial run. Set both simulations to stop before the timestep that had failed in parallel before.

Now you should have two results files, as well as two output files.

Compare the two output files using a graphical file difference tool, so you can compare what is different between the two output files. Ignore the obvious things such as parallel settings, partitioning information (for now), etc.

Are the diagnostics of the solution steps the same, or close enough? If not, then you have something to investigate further. In theory, both solutions should proceed identically if the solutions of the linear equations are identical.

Hope the above helps,

karachun March 3, 2020 13:11

Thanks for the answer.
Unfortunately, my simulation fails at the second coefficient loop of the first timestep.
Code:

 ======================================================================
 TIME STEP =    1 SIMULATION TIME = 1.0000E-04 CPU SECONDS = 1.684E+01
 ----------------------------------------------------------------------
 |                       SOLVING : Wall Scale                         |
 ----------------------------------------------------------------------
 |       Equation       | Rate | RMS Res | Max Res |  Linear Solution |
 +----------------------+------+---------+---------+------------------+
 | Wallscale-Bulk       | 0.00 | 2.7E-04 | 2.7E-04 | 31.8  8.9E-02  OK|
 +----------------------+------+---------+---------+------------------+
 | Wallscale-Bulk       | 0.09 | 2.4E-05 | 1.6E-04 | 39.5  6.3E-02  OK|
 +----------------------+------+---------+---------+------------------+
 | Wallscale-Bulk       | 0.30 | 7.0E-06 | 4.8E-05 | 39.5  6.3E-02  OK|
 +----------------------+------+---------+---------+------------------+
 ----------------------------------------------------------------------
 COEFFICIENT LOOP ITERATION =    1              CPU SECONDS = 1.750E+01
 ----------------------------------------------------------------------
 |       Equation       | Rate | RMS Res | Max Res |  Linear Solution |
 +----------------------+------+---------+---------+------------------+
 | U-Mom-Bulk           | 0.00 | 9.6E-25 | 7.7E-24 |       8.2E+20  * |
 | V-Mom-Bulk           | 0.00 | 2.2E-23 | 1.8E-22 |       1.1E+21  * |
 | W-Mom-Bulk           | 0.00 | 0.0E+00 | 0.0E+00 |       0.0E+00  OK|
 | Mass-Water           | 0.00 | 2.5E-44 | 2.0E-43 |       5.2E+22  * |
 | Mass-Air             | 0.00 | 1.3E-45 | 1.0E-44 | 15.8  9.1E+22  * |
 +----------------------+------+---------+---------+------------------+
 | K-TurbKE-Bulk        | 0.00 | 9.6E-16 | 2.8E-14 | 10.6  4.7E-10  OK|
 | O-TurbFreq-Bulk      | 0.00 | 6.2E-02 | 1.0E+00 | 17.3  8.9E-07  OK|
 +----------------------+------+---------+---------+------------------+
 ----------------------------------------------------------------------
 COEFFICIENT LOOP ITERATION =    2              CPU SECONDS = 1.878E+01
 ----------------------------------------------------------------------
 |       Equation       | Rate | RMS Res | Max Res |  Linear Solution |
 +----------------------+------+---------+---------+------------------+

 +--------------------------------------------------------------------+
 | ERROR #004100018 has occurred in subroutine FINMES.                 |
 | Message:                                                            |
 | Fatal overflow in linear solver.                                    |
 +--------------------------------------------------------------------+

Everything fails right from the beginning.
In addition, I ran the problem on another PC and the error is gone – the solution runs normally in parallel. Maybe I should reinstall Ansys.

Opaque March 3, 2020 13:39

I would compare the output files between the successful run, and the failed run to understand what is different at the start of the run.

Similarly, the suggestion above applies to the diagnostics in the first coefficient loops. In theory they must be identical, but there are subtle differences in parallel that should go away as the solution converges.

Gert-Jan March 3, 2020 13:52

Did you try 2 or 3 partitions?
Did you try an alternative parallelization method like Recursive bisection (instead of Metis)?


Regs, Gert-Jan

karachun March 3, 2020 14:18

Gert-Jan, Opaque.
I will try these recommendations tomorrow.

karachun March 4, 2020 03:44

Opaque,
The output files are different from the beginning. It looks like the solution had already diverged during the first coefficient loop.

Failed parallel output
Code:

 ======================================================================
 TIME STEP =    1 SIMULATION TIME = 1.0000E-04 CPU SECONDS = 1.661E+01
 ----------------------------------------------------------------------
 |                       SOLVING : Wall Scale                         |
 ----------------------------------------------------------------------
 |       Equation       | Rate | RMS Res | Max Res |  Linear Solution |
 +----------------------+------+---------+---------+------------------+
 | Wallscale-Bulk       | 0.00 | 2.1E-04 | 2.1E-04 | 46.9  1.0E-01  ok|
 +----------------------+------+---------+---------+------------------+
 | Wallscale-Bulk       | 0.10 | 2.1E-05 | 1.7E-04 | 46.9  1.3E-01  ok|
 +----------------------+------+---------+---------+------------------+
 | Wallscale-Bulk       | 0.34 | 7.3E-06 | 6.2E-05 | 46.9  1.3E-01  ok|
 +----------------------+------+---------+---------+------------------+
 ----------------------------------------------------------------------
 COEFFICIENT LOOP ITERATION =    1              CPU SECONDS = 1.734E+01
 ----------------------------------------------------------------------
 |       Equation       | Rate | RMS Res | Max Res |  Linear Solution |
 +----------------------+------+---------+---------+------------------+
 | U-Mom-Bulk           | 0.00 | 7.0E-12 | 7.9E-11 |       1.8E+08  F |
 | V-Mom-Bulk           | 0.00 | 1.3E-10 | 1.5E-09 |       1.8E+08  F |
 | W-Mom-Bulk           | 0.00 | 0.0E+00 | 0.0E+00 |       0.0E+00  OK|
 | Mass-Water           | 0.00 | 2.0E-18 | 2.3E-17 |       2.5E+09  F |
 | Mass-Air             | 0.00 | 6.3E-21 | 7.5E-20 | 15.7  4.1E+09  F |
 +----------------------+------+---------+---------+------------------+
 | K-TurbKE-Bulk        | 0.00 | 2.3E-06 | 3.4E-06 | 11.0  7.5E-10  OK|
 | O-TurbFreq-Bulk      | 0.00 | 1.2E-01 | 1.0E+00 | 12.8  3.5E-15  OK|
 +----------------------+------+---------+---------+------------------+

Normal serial output
Code:

 ======================================================================
 TIME STEP =    1 SIMULATION TIME = 1.0000E-04 CPU SECONDS = 2.055E+00
 ----------------------------------------------------------------------
 |                       SOLVING : Wall Scale                         |
 ----------------------------------------------------------------------
 |       Equation       | Rate | RMS Res | Max Res |  Linear Solution |
 +----------------------+------+---------+---------+------------------+
 | Wallscale-Bulk       | 0.00 | 2.1E-04 | 2.1E-04 | 39.1  8.5E-02  OK|
 +----------------------+------+---------+---------+------------------+
 | Wallscale-Bulk       | 0.08 | 1.7E-05 | 3.6E-05 | 46.7  6.6E-02  OK|
 +----------------------+------+---------+---------+------------------+
 | Wallscale-Bulk       | 0.30 | 5.1E-06 | 1.1E-05 | 46.7  6.6E-02  OK|
 +----------------------+------+---------+---------+------------------+
 ----------------------------------------------------------------------
 COEFFICIENT LOOP ITERATION =    1              CPU SECONDS = 2.382E+00
 ----------------------------------------------------------------------
 |       Equation       | Rate | RMS Res | Max Res |  Linear Solution |
 +----------------------+------+---------+---------+------------------+
 | U-Mom-Bulk           | 0.00 | 2.9E-02 | 3.2E-01 |       2.0E-03  OK|
 | V-Mom-Bulk           | 0.00 | 4.1E-02 | 4.7E-01 |       1.8E-03  OK|
 | W-Mom-Bulk           | 0.00 | 0.0E+00 | 0.0E+00 |       0.0E+00  OK|
 | Mass-Water           | 0.00 | 5.8E-06 | 4.8E-05 |       5.1E-03  OK|
 | Mass-Air             | 0.00 | 2.6E-07 | 2.4E-05 | 15.8  1.5E-02  ok|
 +----------------------+------+---------+---------+------------------+
 | K-TurbKE-Bulk        | 0.00 | 3.0E-07 | 2.1E-06 |  8.6  1.5E-15  OK|
 | O-TurbFreq-Bulk      | 0.00 | 1.2E-01 | 1.0E+00 |  9.3  1.3E-16  OK|
 +----------------------+------+---------+---------+------------------+

Gert-Jan,
I have tried different partitioning methods but the solution still diverges at the first iteration.
Some meshes (I have 8, 4 and 2 mm variants) run on 3 cores, and some run on 2 and 3 cores but fail on 4 cores. The coarse 8 mm case can run on all four cores.

Gert-Jan March 4, 2020 04:13

This is strange. I would ask ANSYS.

Also, in Post, I would check how the partitioning is done (look for the partition number). I would partition in the vertical or horizontal direction.

By the way, do you now have 1 element in the 3rd dimension? I would also perform a test with 2 elements.

karachun March 4, 2020 06:06

2 Attachment(s)
I now use four elements across the thickness to be sure, but I still get divergence.
I checked the partition numbers on the mesh – they look adequate.

ghorrocks March 4, 2020 18:03

If areas of very high gradients (such as free surfaces) align with partition boundaries you can get convergence problems. It is best to make sure partition boundaries do not align with free surfaces. Based on your images of the partitions you are using it appears this is contributing.

I would try other partitioning algorithms (eg recursive bisection) and check that they give you a better partition pattern. I would think vertical stripes would probably be a good pattern for you. But as your free surface sloshes around all over the place it might be challenging to find a partition shape which avoids the free surface for the entire run, you will have to compromise a bit there.

karachun March 5, 2020 04:45

Thanks.
Today, in one of the test runs, I observed an error that may confirm your statement. I ran the model on 3 cores and the solution ran fine for some time, but then I got sudden divergence (at one timestep the model ran as usual, and at the next everything diverged). When I changed back to the serial solver, the error disappeared.
I will try different partition methods and post here if I have success.
BTW, if the problem is in large gradients, is it possible to reduce these gradients somehow?
The goal of my calculation is to obtain a pressure time history to use in a Finite Element Analysis. Therefore I can neglect some of the physics that has a minor impact on the pressure at the wall.
As I understand it, for this problem I should account for two main features:
-) bulk flow of the water;
-) pressure change inside the tank.
I have already performed a convergence study and can say that I can neglect turbulence effects and use a laminar viscous model.
Next, I am studying the homogeneous vs. inhomogeneous multiphase model. Best practices recommend the inhomogeneous model for problems where the interface does not remain well defined, but again, this interphase interaction may not affect the results I want to obtain.

ghorrocks March 5, 2020 05:37

Quote:

I observe error that may confirm your statement.
I am not just a pretty face, you know :)

If your simulation is super-sensitive to the free surface lining up with the partition boundary this suggests your model is very numerically unstable. A free surface simulation in a square box should not be very numerically unstable - so your problem is likely to actually be poor model setup causing instability. So to fix the root cause you should improve the numerical stability.

Here are some tips:
* Double precision numerics
* Smaller timestep (how did you set the time step? Did you guess? If so then you guessed wrong)
* Improve mesh quality
* Better initial conditions
* Check the physics is correctly configured
* Tighter convergence tolerance.
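On the time step question, a defensible first guess (standard linear sloshing theory, not from this thread) comes from the lowest natural sloshing mode of a rectangular tank of width $L$ and fill depth $h$:

```latex
% Lowest-mode natural sloshing frequency (linear theory)
\omega_1^2 = \frac{\pi g}{L}\,\tanh\!\left(\frac{\pi h}{L}\right)
% For L = 0.256 m, h = 0.128 m, g = 9.81 m/s^2:
%   \omega_1^2 \approx 120.4 \cdot \tanh(\pi/2) \approx 110
%   \;\Rightarrow\; \omega_1 \approx 10.5\ \mathrm{rad/s},
%   T = 2\pi/\omega_1 \approx 0.6\ \mathrm{s}
% The 70 ms impulse is the shorter timescale here, so
%   \Delta t \lesssim 70\,\mathrm{ms}/100 \approx 0.7\ \mathrm{ms}
% is a safer starting point than T/100 \approx 6 ms.
```

Resolving the shorter of the two timescales (forcing duration vs. natural period) is the usual rule of thumb; a time step sensitivity study still has the final say.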

ghorrocks March 5, 2020 05:44

Just had a look at your setup.
* You have a fixed time step size. Unless this is the result of a time step sensitivity study this will be wrong. I recommend you change to adaptive time stepping, converging on 3-5 coeff loops per iteration.
(Actually, your simulation reaches convergence later on in 3 or 4 coeff loops so your time step probably is not too far off for this convergence tolerance)
* You have min 3, max 5 coeff loops per iteration. Why have you done this? Set this to no minimum and max 10.
* Have you checked your convergence tolerance is adequate? You should do a sensitivity check on this.
* I see this is pseudo-2D simulation. In that case make the thickness in the z direction equal to the element size in the X or Y directions. This will make your elements closer to aspect ratio 1.

karachun March 6, 2020 03:08

5 Attachment(s)
I tried partitioning the domain into four vertical stripes and it failed, but a solution with three horizontal partitions (so the whole initial free surface belongs to one partition) runs fine. However, I cannot be sure that at some later time during the simulation the free surface location won't cause this error again.

* Increase the geometry thickness to make elements close to 1:1 – done.

* I have changed the timestepping control to adaptive timestepping. Here are my timestep controls.
Code:

TIME STEPS:
  First Update Time = 0.0 [s]
  Initial Timestep = 1e-6 [s]
  Option = Adaptive
  Timestep Update Frequency = 1
  TIMESTEP ADAPTION:
    Maximum Timestep = 0.001 [s] # based on the input data discretization level I need in the FEA analysis
    Minimum Timestep = 1e-10 [s]
    Option = Number of Coefficient Loops
    Target Maximum Coefficient Loops = 5
    Target Minimum Coefficient Loops = 3
    Timestep Decrease Factor = 0.8
    Timestep Increase Factor = 1.06
  END
...
SOLVER CONTROL:
...
CONVERGENCE CONTROL:
  Maximum Number of Coefficient Loops = 10
  Minimum Number of Coefficient Loops = 1
  Timescale Control = Coefficient Loops
END

I have attached ccl file with new model setup.

* I ran a sensitivity study to determine an adequate RMS residual level.
Results at 1e-3 and 1e-4 are pretty close. I use the time history of the force acting on the side wall and the pressure at one point as convergence parameters.
The case with 5e-5 is still solving; I will update this question later, when I have the results.

BTW, I have noticed that the solver controls convergence only for the main flow parameters like mass, momentum and volume fraction, while CFX allows the turbulence residuals to remain much coarser. I did not assign any special controls to the turbulence residuals. Is this the intended solver behavior?
For example, I have a target RMS of 1e-5; the flow parameters are converged, but the turbulence residuals are 5e-5 for K and 3e-4 for Omega at the third coefficient loop, and the solver does not iterate further but starts a new timestep.

Gert-Jan March 6, 2020 03:28

My two cents:

If you run these kinds of multiphase simulations, then convergence on residuals is quite hard, so 1e-4 might be hard to reach. Better to add multiple monitor points, like your pressure point, and monitor pressure, velocity and volume fraction.
To make sure that you reach convergence within a timestep, switch on the option "Monitor Coefficient Loop Convergence". You can find it at the top of the CFX-Pre tab Output Control > Monitor. This will give you the progression of the variables within a timestep. Best results are obtained if you have flatliners everywhere.

This also allows you to create a graph in the Solver Manager to plot these coefficient loops. I would also recommend plotting the time step size. You can do this by creating a monitor in CFX-Pre with the option "Expression" and the variable "Time Step Size". These things won't help the solver, but they graphically show you what the solver is doing and where it has difficulties.
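In CCL, such a monitor would look roughly like the sketch below. The monitor name is made up, and exact parameter names can vary by release, so it is easiest to create the monitor in CFX-Pre and inspect the CCL it writes:

```
OUTPUT CONTROL:
  MONITOR OBJECTS:
    # Report within-timestep (coefficient loop) values of all monitors
    Monitor Coefficient Loop Convergence = On
    MONITOR POINT: Timestep Size Monitor   # hypothetical name
      Option = Expression
      Expression Value = Time Step Size
    END
  END
END
```

A flat line across the coefficient loops of each timestep is the sign that the timestep has actually converged, not just stopped iterating.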

ghorrocks March 6, 2020 04:27

Note that for this sensitivity analysis, rather than the normal approach of comparing important variables (which you seem to be doing quite nicely), you should look at how numerically stable the result is.

Maybe consider choosing the most unstable configuration - 4 partitions with METIS appears to crash your initial setup very early - and try tighter convergence and smaller time steps on that configuration to see if it no longer crashes.

karachun March 6, 2020 05:31

5 Attachment(s)
To Gert-Jan.
Thanks for the advice, I'll use Coefficient Loop Convergence from now on.
I have already plotted the timestep. For RMS 1e-4, most of the time the timestep is equal to or larger than 1e-4, but sometimes it falls to 1e-5.
I also monitor the residual history of the main flow quantities (mass, momentum, volume fraction). Most of the time convergence is OK and the residuals are below the desired level, but sometimes there are "spikes" where the solver cannot converge within 10 coefficient loops. As I mentioned before, it looks like the solver does not consider the turbulence residuals, or uses a much looser convergence criterion for them.
Here are my additional convergence statistics.

To ghorrocks.
Unfortunately, with four partitions the solution diverges, in 95% of cases, at the second coefficient loop of the first timestep, before any residual metrics can be applied.
At this point I assume that it is "unsafe" to launch the solution on many cores, even with three cores and "horizontal stripes" partitions. Some variants (different mesh size, physics, residuals) run normally on multiple cores, and some can fail somewhere in the middle of the solution time. I cannot see a system in these failures.

Gert-Jan March 6, 2020 05:39

You can also partition in a radial direction, or use a specified direction. Why not try (1,1,0) or (1,2,0)? Then it is not in line with your free surface. Alternatively, use more elements.....
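For reference, a directional partitioning setup would look roughly like the sketch below. The parameter names here are assumptions from memory and may differ between releases, so it is safest to pick the method in the Solver Manager's Partitioner tab and read back the CCL it generates:

```
PARTITIONER STEP CONTROL:
  Multidomain Option = Independent Partitioning
  PARTITIONING TYPE:
    Option = Directional     # instead of the default MeTiS
    Direction = 1, 1, 0      # assumed parameter name; a vector
                             # not aligned with the free surface
  END
END
```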

karachun March 6, 2020 08:55

3 Attachment(s)
While trying to launch the solution on many cores, I found another issue. According to the documentation, I set Pressure Level Information (placing the point inside the air phase) and my results changed dramatically. Yet when I check the pressure contour, I don't see any difference. It is strange, because I set the pressure distribution using an expression (hydrostatic pressure), and I supposed that initialization with the expression would be enough.
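For reference, the setting in question sits under Solver Control. A minimal sketch with assumed coordinates (a point chosen to stay inside the air phase for the whole run):

```
SOLVER CONTROL:
  ADVANCED OPTIONS:
    PRESSURE LEVEL INFORMATION:
      Option = Cartesian Coordinates
      # Assumed location: near the top of the 256 mm tank, always in air
      Cartesian Coordinates = 0.01 [m], 0.25 [m], 0.002 [m]
      Pressure Level = 0 [Pa]
    END
  END
END
```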

Opaque March 6, 2020 12:15

Have you looked in the previous output file for a warning regarding the pressure level information?

If you have a closed system, the pressure level is undefined. Some setups may get away without it, but the initial conditions are not guaranteed to define the level.
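The reason is the standard incompressible-flow argument, not anything CFX-specific: pressure enters the incompressible momentum equation only through its gradient,

```latex
\rho\left(\frac{\partial \mathbf{u}}{\partial t}
  + \mathbf{u}\cdot\nabla\mathbf{u}\right)
  = -\nabla p + \mu\,\nabla^{2}\mathbf{u} + \rho\,\mathbf{g}
```

so $p$ and $p + C$ give the same velocity field for any constant $C$. With only walls and symmetry planes there is no opening boundary to pin down $C$, which is exactly what the Pressure Level Information point does.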

