CFD Online Discussion Forums - Performance Issue

Good evening all,

I am hoping someone can point me in the right direction to figure out why a particular model is running slowly.

While solving (on a single local machine, or over msmpi with the MS job scheduler), the fl_mpi23* process works for a few seconds, but then stops until the next iteration.

During this "wait" period, the cx23* process is very busy doing I don't know what. Is there any way I can find out what it is actually trying to do?

Things we have tried:

Running locally only, and also through the MS job scheduler (msmpi)
Running with and without the GUI
Limiting the CPU cores per node to different values, from 4 to 40
Limiting the number of nodes to different values, from 1 to 12
Disabling DPM in the model

All other benchmarks that we have tried perform at 100%. Our system is 12 nodes of dual 44-core Intel Xeon 6152 CPUs with 6 memory channels per CPU. Storage is SSD. Network is 100 Gb IB.

The "pausing" shows up when comparing against the performance timer: it reports about 4 seconds per iteration, yet 30 iterations take 1276 seconds of wall-clock time.
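To put numbers on that gap, here is a quick Python sketch that uses only figures copied from the timer output in this post. The variable names are my own, and it assumes the "Total wall-clock time" line covers the same 30 iterations as the "Simulation wall-clock time" line.

```python
# Figures copied from the performance timer output in this post.
iterations = 30
timed_total_s = 120.947        # "Total wall-clock time"
simulation_total_s = 1276.0    # "Simulation wall-clock time for 30 iterations"

# Time that elapsed during the run but is not covered by the solver timer.
untimed_s = simulation_total_s - timed_total_s
untimed_per_iter_s = untimed_s / iterations

# Fraction of the elapsed time that the solver timer accounts for.
timed_share = timed_total_s / simulation_total_s

print(f"observed per iteration:          {simulation_total_s / iterations:.1f} s")
print(f"outside the timer per iteration: {untimed_per_iter_s:.1f} s")
print(f"share covered by the timer:      {timed_share:.1%}")
```

So roughly 38.5 seconds of every iteration fall outside what the solver timer measures, and the timer covers only about 9.5% of the elapsed time.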

There are no errors in the Fluent output to indicate that anything is wrong, and the disk queues are empty. The "average wall-clock time per iteration" seems to scale with the number of nodes/cores given to the job, but the waiting remains.

If anyone can suggest how we can find the root cause, I'd really appreciate it.

Code:

Performance timer output:

Performance Timer for 30 iterations on 240 compute nodes
  Average wall-clock time per iteration:                4.032 sec
  Global reductions per iteration:                        443 ops
  Global reductions time per iteration:                 0.000 sec (0.0%)
  Message count per iteration:                         760147 messages
  Data transfer per iteration:                       3612.734 MB
  LE solves per iteration:                                  5 solves
  LE wall-clock time per iteration:                     0.607 sec (15.1%)
  LE global solves per iteration:                           2 solves
  LE global wall-clock time per iteration:              0.025 sec (0.6%)
  LE global matrix maximum size:                          355
  AMG cycles per iteration:                             6.000 cycles
  Relaxation sweeps per iteration:                        436 sweeps
  Relaxation exchanges per iteration:                       0 exchanges
  LE early protections (stall) per iteration:           0.000 times
  LE early protections (divergence) per iteration:      0.000 times
  Total SVARS touched:                                    398
  DPM updates per iteration:                           0.5000 updates
  DPM wall-clock time per iteration:                    1.334 sec (33.1%)
  Time-step updates per iteration:                       0.50 updates
  Time-step wall-clock time per iteration:              2.448 sec (60.7%)

  Total wall-clock time:                              120.947 sec
  Total dpm solve time:                                40.029 sec
  Total dpm i/o time:                                   0.000 sec

Simulation wall-clock time for 30 iterations             1276 sec

Thank you very much in advance.