Slow calculation time on CFD server

killian153 · June 18, 2021, 09:05

Hello,

We currently have a dual-CPU server with:

- CentOS 7
- 2x AMD EPYC 7352 24 cores
- 256 GB of ECC RAM
- 1x RTX 3090

We run our calculations remotely with MobaXterm and use the command "workbench&" to launch it.

The issue:

We noticed that calculations seemed very slow when two different Fluent processes were running (even with only half of the cores in use).

So I ran a benchmark of the same case, at the same time, on my computer and on the server, with the same conditions: PBNS simulation of a nozzle in 2D axisymmetric, 12 cores.

In the time my computer completed 3076 iterations, the server completed only 1699. But when I check the CPU info, here's what I get:

My computer :

---------------------------------------------------------------------------------------
          | CPU                                     | System Mem (GB)
Hostname  | Sock x Core x HT  Clock (MHz)  Load (%) | Total     Available
---------------------------------------------------------------------------------------
#######   | 1 x 8 x 2         3792         2.44111  | 31.8535   14.192
---------------------------------------------------------------------------------------
Total     | 16                -            -        | 31.8535   14.192
---------------------------------------------------------------------------------------

---------------------------------------------
| CPU Time Usage (Seconds)
ID | User Kernel Elapsed
---------------------------------------------
host | 187.141 29.9219 6098.92
n0 | 1754.94 31.6563 6097.94
n1 | 1999.02 49 6097.94
n2 | 2042.19 44.1875 6097.93
n3 | 2069.17 45.0938 6097.92
n4 | 2055.78 51.4844 6097.91
n5 | 2071.81 47.2656 6097.91
n6 | 2085.63 37.7813 6097.9
n7 | 2081.78 41.1875 6097.89
n8 | 2015.33 45.3594 6097.88
n9 | 2049.89 46.7031 6097.87
n10 | 2048.98 52.3125 6097.85
n11 | 2055.55 58.5313 6097.84
---------------------------------------------
Total | 24517.2 580.484 -
---------------------------------------------

Model Timers (Host)
Flow Model Time: 31.281 sec (CPU), count 3076
Other Models Time: 0.344 sec (CPU)
Total Time: 31.625 sec (CPU)

Model Timers
Flow Model Time: 999.346 sec (WALL), 1000.984 sec (CPU), count 3076
Turbulence Model Time: 278.553 sec (WALL), 276.125 sec (CPU), count 3076
Temperature Model Time: 211.580 sec (WALL), 211.203 sec (CPU), count 3076
Other Models Time: 0.533 sec (WALL)
Total Time: 1490.012 sec (WALL)

Performance Timer for 3076 iterations on 12 compute nodes
Average wall-clock time per iteration: 0.489 sec
Global reductions per iteration: 69 ops
Global reductions time per iteration: 0.000 sec (0.0%)
Message count per iteration: 28044 messages
Data transfer per iteration: 16.082 MB
LE solves per iteration: 4 solves
LE wall-clock time per iteration: 0.239 sec (49.0%)
LE global solves per iteration: 4 solves
LE global wall-clock time per iteration: 0.001 sec (0.3%)
LE global matrix maximum size: 589
AMG cycles per iteration: 7.876 cycles
Relaxation sweeps per iteration: 974 sweeps
Relaxation exchanges per iteration: 0 exchanges
LE early protections (stall) per iteration: 0.007 times
LE early protections (divergence) per iteration: 0.000 times
Total SVARS touched: 364

Total wall-clock time: 1503.849 sec

The server:

---------------------------------------------------------------------------------------
          | CPU                                     | System Mem (GB)
Hostname  | Sock x Core x HT  Clock (MHz)  Load (%) | Total     Available
---------------------------------------------------------------------------------------
##        | 2 x 24 x 1        2300         21.6     | 251.845   125.986
---------------------------------------------------------------------------------------
Total     | 48                -            -        | 251.845   125.986
---------------------------------------------------------------------------------------

---------------------------------------------
| CPU Time Usage (Seconds)
ID | User Kernel Elapsed
---------------------------------------------
host | 75 87 -
n0 | 389 88 -
n1 | 2737 243 -
n2 | 2741 237 -
n3 | 2746 244 -
n4 | 2756 235 -
n5 | 2746 224 -
n6 | 2757 209 -
n7 | 2747 216 -
n8 | 2740 207 -
n9 | 2752 221 -
n10 | 2749 211 -
n11 | 2750 205 -
---------------------------------------------
Total | 30685 2627 -
---------------------------------------------

Model Timers (Host)
Flow Model Time: 3.001 sec (CPU), count 1699
Other Models Time: 0.178 sec (CPU)
Total Time: 3.179 sec (CPU)

Model Timers
Flow Model Time: 190.640 sec (WALL), 191.337 sec (CPU), count 1699
Turbulence Model Time: 59.533 sec (WALL), 59.675 sec (CPU), count 1699
Temperature Model Time: 51.040 sec (WALL), 51.256 sec (CPU), count 1699
Other Models Time: 0.283 sec (WALL)
Total Time: 301.496 sec (WALL)

Performance Timer for 1699 iterations on 12 compute nodes
Average wall-clock time per iteration: 0.182 sec
Global reductions per iteration: 68 ops
Global reductions time per iteration: 0.000 sec (0.0%)
Message count per iteration: 25853 messages
Data transfer per iteration: 14.436 MB
LE solves per iteration: 4 solves
LE wall-clock time per iteration: 0.081 sec (44.7%)
LE global solves per iteration: 4 solves
LE global wall-clock time per iteration: 0.001 sec (0.7%)
LE global matrix maximum size: 614
AMG cycles per iteration: 7.741 cycles
Relaxation sweeps per iteration: 961 sweeps
Relaxation exchanges per iteration: 0 exchanges
LE early protections (stall) per iteration: 0.007 times
LE early protections (divergence) per iteration: 0.000 times
Total SVARS touched: 364

Total wall-clock time: 308.849 sec

---------------------------------------

So, the time/iteration is 0.182s for the server and 0.489s for my computer and yet my calculation is 45% slower on the server.
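The contradiction described above can be put into numbers. The following is a minimal sketch that uses only the figures from the two performance reports; the variable names are illustrative, not taken from Fluent.

```python
# Figures copied from the two Fluent performance reports above.
local_iters, local_s_per_iter = 3076, 0.489
server_iters, server_s_per_iter = 1699, 0.182

# Speed ratio implied by the solver's own per-iteration timing.
reported_speedup = local_s_per_iter / server_s_per_iter

# Progress ratio actually observed over the same physical time.
observed_ratio = server_iters / local_iters

print(f"reported: server is {reported_speedup:.2f}x faster per iteration")
print(f"observed: server reached {observed_ratio:.1%} of the local iteration count")
print(f"observed shortfall: {1 - observed_ratio:.1%}")
```

The shortfall comes out at roughly 45%, which matches the figure quoted above, while the per-iteration timer suggests the server should be well over twice as fast.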

I just don't understand.

Thanks!

For information:

We also get a warning message each time we launch Fluent (maybe it's related):

Warning:
Direct rendering unavailable, hardware acceleration will be disabled.
In the absence of hardware-accelerated drivers, the performance of all graphics operations will be severely affected. Make sure you have a supported graphics card, latest graphics driver, and a supported remote visualization tool with direct server-side rendering enabled. If you feel your system meets these requirements, try forcing the accelerated driver by using the command line flag (-driver <name>) or setting the HOOPS_PICTURE environment variable. Refer to the documentation for more details.

We never solved this problem. The GPU is recognized by Fluent and all drivers are installed, but maybe the problem comes from MobaXterm?

wkernkamp · June 18, 2021, 17:39

The wall clock for the server is 309 sec and your local computer 1504 sec. Correcting for number of iterations, we would get 309*3076/1699 is 559 sec if the server had performed the same 3076 iterations. So the server is performing better than your home computer. What am I missing?
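The normalization in this reply can be reproduced with a few lines. This is only a sketch of the arithmetic, using the unrounded wall-clock totals from the two reports; the variable names are made up for illustration.

```python
# Total wall-clock times and iteration counts from the two reports.
local_wall_s, local_iters = 1503.849, 3076
server_wall_s, server_iters = 308.849, 1699

# Scale the server's wall-clock time to the local iteration count,
# assuming the cost per iteration stays constant.
server_wall_scaled_s = server_wall_s * local_iters / server_iters

print(f"server time for {local_iters} iterations: {server_wall_scaled_s:.0f} s")
print(f"local time for {local_iters} iterations:  {local_wall_s:.0f} s")
```

With the unrounded inputs the scaled server time is still about 559 s, against about 1504 s locally.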

killian153 · June 18, 2021, 19:01

Quote:

Originally Posted by wkernkamp

The wall clock for the server is 309 sec and your local computer 1504 sec. Correcting for number of iterations, we would get 309*3076/1699 is 559 sec if the server had performed the same 3076 iterations. So the server is performing better than your home computer. What am I missing?

In "real time" (by which I mean physical time), when both were launched at the same time, the calculation completed 3076 iterations on my computer and only 1699 on the server. That's what I don't understand: the results show that the server is faster, but in reality it's roughly 50% slower. (I can prove this because I launched both simulations at the same time, and when I stopped them, my computer was at 3076 while the server was still at 1699.)

wkernkamp · June 19, 2021, 17:03

Looks like the server is performing OK when running, but most of the available time is consumed by something else. I assume there is nothing else running on the machine so that leaves the error message and a possible interface wait as the only possibility. Have you tried running fluent from the command line:
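One rough way to quantify "most of the available time is consumed by something else" is to compare the solver's timed wall-clock total with the elapsed time of the session. The sketch below rests on two assumptions that the reports do not confirm: the performance timer covers the whole run, and the server session lasted as long as the local one (the server report shows "-" for Elapsed, so the local value is reused because both jobs were started and stopped together).

```python
def solver_share(solver_wall_s: float, elapsed_s: float) -> float:
    """Fraction of the elapsed session time spent inside timed iterations."""
    return solver_wall_s / elapsed_s

# Local machine: "Elapsed" for node n0 and "Total wall-clock time".
local_elapsed_s = 6097.94
local_share = solver_share(1503.849, local_elapsed_s)

# ASSUMPTION: the server session lasted as long as the local one,
# since its report does not give an elapsed time.
server_elapsed_s = local_elapsed_s
server_share = solver_share(308.849, server_elapsed_s)

print(f"local:  {local_share:.1%} of elapsed time in timed iterations")
print(f"server: {server_share:.1%} of elapsed time in timed iterations")
```

Under those assumptions the local run spends about a quarter of the session iterating and the server only about 5%, which would be consistent with the server waiting on something outside the solver.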

fluent <version> -g -t<nprocs> -gpgpu=<ngpgpus> -i journalfile > outputfile
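As an illustration, the placeholders in that command line could be filled in as follows. This is a sketch only: the helper name, the journal and log file names, and the choice of "2ddp" (2D double precision) are assumptions, since the thread states neither the precision nor any file names. The script only prints the command and does not launch Fluent.

```python
import shlex

def fluent_batch_command(version: str, nprocs: int, journal: str,
                         ngpgpus: int = 0) -> list[str]:
    """Build the argument list for a batch Fluent run without the GUI.

    Hypothetical helper; it uses only the flags from the command above.
    """
    args = ["fluent", version, "-g", f"-t{nprocs}"]
    if ngpgpus > 0:
        args.append(f"-gpgpu={ngpgpus}")
    args += ["-i", journal]
    return args

# ASSUMPTIONS: "2ddp", "nozzle.jou" and "nozzle.log" are placeholders.
cmd = fluent_batch_command("2ddp", 12, "nozzle.jou", ngpgpus=1)
print(shlex.join(cmd), "> nozzle.log")
```

Running the same 12-core case this way and comparing the iteration count against the Workbench run over the same physical time would show whether the graphical interface is where the time goes.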

Baum · July 27, 2021, 07:28

This is interesting because I have only seen this behaviour on our PCs when I maxed them out at 100%: the same PC finished a task faster with e.g. 28/32 cores running than with 32/32, even though the Fluent process claims that the iterations are solved more quickly in the second case. I always wrote this off as CPU overhead needed for other (smaller) tasks, like plotting the graphs, writing report files, etc. However, if you have the same problem with 12/24 cores running, I guess that assumption was wrong. Have you tried contacting your Fluent representative to check with them?




