|
January 27, 2023, 18:52 |
CPUs vs GPUs for CFD?
|
#1 |
New Member
Prince Edward Island
Join Date: May 2021
Posts: 26
Rep Power: 5 |
Given that STAR-CCM+ and Fluent both support GPU acceleration now, why don't more users opt for GPU-accelerated computing over CPU-only computation? GPUs supposedly offer a big speedup over CPUs in CFD, and honestly I don't see a significant drawback that would prevent anyone from using them. Is there a reason they are not mainstream, or will they eventually take over from CPUs?
|
|
January 27, 2023, 19:00 |
|
#2 | |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,426
Rep Power: 49 |
The main reason in my opinion: GPU acceleration does not have feature parity with CPU implementations of the same code. For cases where it has everything you need, it can work as advertised. For other cases, you are opening a can of worms.
See also: General recommendations for CFD hardware [WIP]
|
||
January 27, 2023, 20:54 |
|
#3 |
New Member
Prince Edward Island
Join Date: May 2021
Posts: 26
Rep Power: 5 |
It looks like two premier CFD solvers (Fluent and STAR-CCM+) now support both AMD and Nvidia GPUs, so from what I understand GPUs should have an advantage over CPUs on those platforms?
|
|
January 30, 2023, 17:07 |
|
#4 |
Member
Matt
Join Date: May 2011
Posts: 44
Rep Power: 15 |
GPUs offer more computational throughput for a given hardware cost, but less memory capacity. So they can be a good fit if you are running lots of smaller simulations and/or have access to a massive GPU cluster.
For instance, I often run simulations requiring over 1 TB of RAM. That means I would need over a dozen 80 GB A100s (at $18k+ apiece, over $220k total) to run my simulations on a GPU cluster. Meanwhile, you can build a single 2P EPYC Genoa node with 128 cores and 1.5 TB of DDR5 RAM for under $30k. In this comparison, the GPU cluster would have about 24x the memory bandwidth of the CPU node while costing about 7x as much, so at scale the economics favor GPUs. But in the real world, many of us have to run large simulations on smaller/cheaper clusters, and there CPUs win out. Not to mention, commercial CFD vendors structure their licensing so that GPUs and CPUs cost about the same to license for a given computational throughput, so it really comes down to whether you'd rather spend your hardware dollars on speed or on memory capacity. |
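The cost/bandwidth trade-off above can be sketched as a back-of-envelope calculation. The prices and bandwidth figures below are rough assumptions for illustration (vendor list prices and peak bandwidths vary), not verified quotes; with slightly different assumptions you land near the ~24x bandwidth / ~7x cost ratios quoted in the post.

```python
# Back-of-envelope check of the GPU-cluster-vs-CPU-node comparison.
# All prices/bandwidths are assumptions for illustration, not vendor data.
A100_MEM_GB = 80
A100_PRICE = 18_000         # USD apiece (assumed)
A100_BW_GBS = 2_000.0       # GB/s HBM bandwidth per A100 (approximate)

CPU_NODE_PRICE = 30_000     # 2P EPYC Genoa node, 1.5 TB DDR5 (assumed)
CPU_NODE_BW_GBS = 24 * 4.8 * 8  # 24 DDR5-4800 channels x 8 bytes ~ 921.6 GB/s

sim_ram_gb = 1000                        # the ">1 TB" case from the post
n_gpus = -(-sim_ram_gb // A100_MEM_GB)   # ceiling division
gpu_cluster_price = n_gpus * A100_PRICE
bw_ratio = n_gpus * A100_BW_GBS / CPU_NODE_BW_GBS
cost_ratio = gpu_cluster_price / CPU_NODE_PRICE

print(n_gpus)                 # 13 GPUs needed just to hold the case
print(gpu_cluster_price)      # 234000
print(round(bw_ratio))        # 28 (x the CPU node's bandwidth)
print(round(cost_ratio, 1))   # 7.8 (x the CPU node's price)
```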
|
January 31, 2023, 03:01 |
|
#5 |
Member
EM
Join Date: Sep 2019
Posts: 59
Rep Power: 7 |
Despite the spectacular theoretical GPU GFLOPS, realizing a fair fraction of them is subject to less-known obstacles, above and beyond there being enough parallelizable operation count. One of the most important is data structures: ideally, GPU data must be contiguously stored and accessed (i.e. unit stride), but this is largely not possible for 3D data structures, and it gets much worse for complicated geometries. An even lesser-known limitation (unless you have programmed in OpenCL/CUDA) is that the programmer is responsible for managing the data transfers between the video buffer and on-chip memory. Here you have to keep count of the number of registers in use, local memory fill-up, and bank access patterns to avoid having read/write access serialized and/or your data shoved off to the video buffer, resulting in a two-orders-of-magnitude slowdown. It may also happen that some computation cannot be made to fit within the on-chip memory. Lesser known still is that some operations are one to two orders of magnitude slower than + or -: division, trig functions, logs, %, if() and some others. Do not trust manufacturers' benchmarks; they silently get around all of this to show the GFLOPS figures that impress. Overall, GPU acceleration is not a given, especially for general-purpose commercial codes. |
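The gap between advertised and realized GFLOPS can be illustrated with a minimal roofline-model sketch: for a memory-bound kernel, attainable performance is capped by bandwidth times arithmetic intensity, not by peak FLOPS. The numbers below are illustrative assumptions (the 1660 SUPER figures come from later in this thread; the ~0.25 flops/byte intensity is a typical ballpark for sparse matrix-vector products, the core of implicit CFD solvers).

```python
# Minimal roofline sketch: why a bandwidth-bound CFD kernel realizes only a
# small fraction of a GPU's theoretical FLOPS. Numbers are illustrative.
def attainable_gflops(peak_gflops, mem_bw_gbs, flops_per_byte):
    """Roofline model: performance is capped by compute OR by memory traffic."""
    return min(peak_gflops, mem_bw_gbs * flops_per_byte)

peak = 5000.0   # GTX 1660 SUPER FP32 peak, ~5 TFLOPS (figure from this thread)
bw = 320.0      # GB/s vRAM bandwidth (figure from this thread)

# Sparse matrix-vector product: on the order of 0.25 flops per byte moved.
spmv = attainable_gflops(peak, bw, 0.25)
print(spmv)                # 80.0 GFLOPS
print(100 * spmv / peak)   # 1.6 -> only ~1.6% of the advertised peak
```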
|
February 4, 2023, 13:28 |
|
#6 |
New Member
Prince Edward Island
Join Date: May 2021
Posts: 26
Rep Power: 5 |
I did suspect that the advertised speedup numbers were not as great as claimed. Do you know of any place where actual measured GPU vs CPU speedups are listed?
|
|
February 6, 2023, 05:19 |
|
#7 | |
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,283
Rep Power: 34 |
Quote:
Ansys was involved in the development of AmgX, if I understand correctly. It is possible that Ansys has developed an in-house alternative to AmgX that supports AMD GPUs too. Star-CCM+ is definitely not there yet as of today (they are still on AmgX); my impression from what I hear is that they will develop an in-house AMG as well. |
||
February 19, 2023, 06:23 |
Performance tests of Fluent Native GPU Solver
|
#8 |
New Member
Dmitry
Join Date: Feb 2013
Posts: 29
Rep Power: 13 |
Here are some test results comparing CPU+GPU vs CPU-only performance in ANSYS Fluent.
Formulation of the task: simulation of water flow in a circular pipe. BCs: inlet V=1 m/s, T=300 K; outlet static pressure 0 Pa; sidewall heat flux 1e5 W/m2. SIMPLE algorithm. Mesh size: 2.1, 3.7 or 12.4 million cells. Mesh type: unstructured mesh made by the Sweep method with inflation layers near the sidewall. Initial conditions: V=(0,0,1) m/s, P=0 Pa, T=300 K. Number of iterations: 10. All performance results below are wall-clock time for 10 iterations.

3.7 million cell results (at this task size the GTX 1660 SUPER has enough vRAM in SINGLE precision, but not in DOUBLE precision):

DOUBLE precision solver:
* ANSYS Fluent 2023 R1 Native GPU Solver: AMD Ryzen 5900X, 12 cores, SMT off (4x32 GB DDR4-3000 MT/s ECC unbuffered, dual channel) + NVIDIA GeForce 1660 Super 6 GB vRAM (vRAM bandwidth reported by Fluent: 320 GB/s): ~7500 sec; vRAM usage 5.9 GB.
* ANSYS Fluent 2023 R1 CPU Solver: AMD Ryzen 5900X, 12 cores: 53.43 sec; peak RAM usage 10.24 GB.
* ANSYS Fluent 17.0 CPU Solver: 2 servers of dual AMD EPYC 7532, 128 cores total, SMT off; per processor 8x64 GB DDR4-2933 MT/s ECC Reg 2R, 32 memory channels total across 4 processors; custom liquid cooling: 6.67 sec; peak resident RAM 16.6 GB; peak virtual RAM 181.2 GB.
* ANSYS Fluent 17.0 CPU Solver: 4 servers of dual AMD Opteron 6380, 128 cores total (64 FPU, 128 IPU); per processor 4x16 GB DDR3-1600 MT/s ECC Reg, 32 memory channels total across 8 processors: 17.01 sec; peak resident RAM 17.8 GB; peak virtual RAM 180.8 GB.
* (Coupled/pseudo-transient) ANSYS Fluent 17.0 CPU Solver, the EPYC 7532 cluster above: 13.3 sec.
* (Coupled/pseudo-transient) ANSYS Fluent 17.0 CPU Solver, the Opteron 6380 cluster above: 38.2 sec.

SINGLE precision solver:
* Native GPU Solver: AMD Ryzen 5900X 12 cores + NVIDIA GeForce 1660 Super 6 GB vRAM: 17.0 sec.
* Native GPU Solver: AMD Ryzen 5900X 2 cores + NVIDIA GeForce 1660 Super 6 GB vRAM: 7.9 sec; vRAM usage 4.2 GB; peak RAM usage 8.2 GB.
* CPU Solver: AMD Ryzen 5900X 12 cores: 77.88 sec; peak RAM usage 7.53 GB.
* ANSYS Fluent 17.0 CPU Solver, the EPYC 7532 cluster above: 6.33 sec; peak resident RAM 13.4 GB; peak virtual RAM 177.6 GB.
* ANSYS Fluent 17.0 CPU Solver, the Opteron 6380 cluster above: 15.12 sec; peak resident RAM 14.6 GB; peak virtual RAM 177.1 GB.

2.1 million cell results, ANSYS Fluent 2023 R1 (at this task size the GTX 1660 SUPER has enough vRAM in both SINGLE and DOUBLE precision):
* Native GPU Solver, SINGLE precision, Ryzen 5900X 2 cores (SMT off) + 1660 Super: 5.08 sec; peak vRAM 2.47 GB; peak RAM 5.40 GB.
* Native GPU Solver, DOUBLE precision, Ryzen 5900X 2 cores (SMT off) + 1660 Super: 8.17 sec; peak vRAM 3.54 GB; peak RAM 6.01 GB.
* Native GPU Solver, SINGLE precision, Ryzen 5900X 6 cores (SMT off) + 1660 Super: 7.37 sec; peak vRAM 3.66 GB; peak RAM 7.29 GB.
* Native GPU Solver, DOUBLE precision, Ryzen 5900X 6 cores (SMT off) + 1660 Super: 9.40 sec; peak vRAM 4.47 GB; peak RAM 8.55 GB.
* CPU Solver, SINGLE precision, Ryzen 5900X 6 cores (SMT off): 44.15 sec; peak RAM 4.12 GB.
* CPU Solver, DOUBLE precision, Ryzen 5900X 6 cores (SMT off): 30.96 sec; peak RAM 5.58 GB.
* CPU Solver, SINGLE precision, Ryzen 5900X 12 cores (SMT off): 35.67 sec; peak RAM 4.81 GB.
* CPU Solver, DOUBLE precision, Ryzen 5900X 12 cores (SMT off): 27.39 sec; peak RAM 6.22 GB.

12.4 million cell results, ANSYS Fluent 2023 R1 (at this task size the GTX 1660 SUPER does not have enough vRAM in either precision):
* Native GPU Solver, SINGLE precision: Ryzen 5900X 2 cores + 1660 Super: 23760 sec; vRAM usage 5.9 GB.
* CPU Solver, SINGLE precision: Ryzen 5900X 12 cores: 301 sec; peak RAM usage 20.1 GB.

Conclusions: when there is enough vRAM for the task, the performance of the Fluent GPU solver on a low/mid-range graphics card in a gaming PC is equivalent to a dual-server Zen 2 EPYC cluster whose system price is about 45 times higher. When the task requests more vRAM than the GPU has, 30-40% PCI-E bus loading was observed during computation, with extremely low overall simulation performance; data transfer over x16 PCI-E 4.0 almost stops the simulation. So the new Native GPU Solver is aimed, first of all, at modern GPU clusters with multiple NVIDIA H100 80 GB cards connected via NVLink/NVSwitch (900 GB/s), where performance will be huge. With GeForce/Quadro GPUs it is possible to solve small tasks with high performance. According to NVIDIA.com, the compute performance of the GeForce GTX 1660 SUPER is 5.0 TFLOPS in FP32 and 0.15 TFLOPS in FP64, yet in ANSYS the performance difference between single and double precision was less than 1.6x.

Last edited by techtuner; March 6, 2023 at 10:47. Reason: New data was added. The text was restructured. |
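The headline ratios in these results can be pulled out directly from the reported wall-clock times. The snippet below only uses numbers quoted in the post (single precision, 10 iterations); the labels are shorthand for the configurations above.

```python
# Speedups computed from the wall-clock times reported in the post
# (single precision, 3.7M-cell case, 10 iterations).
times = {
    "GPU 1660S + 2 CPU cores": 7.9,
    "CPU Ryzen 5900X 12c":     77.88,
    "CPU 2x dual EPYC 7532":   6.33,
}
base = times["CPU Ryzen 5900X 12c"]
for name, t in times.items():
    print(f"{name}: {base / t:.1f}x vs the 12-core desktop CPU")

# The out-of-vRAM 12.4M-cell case shows the cliff once PCIe transfers dominate:
print(round(23760 / 301))  # 79 -> the GPU run is ~79x SLOWER than the CPU run
```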
|
February 21, 2023, 09:43 |
|
#9 |
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,283
Rep Power: 34 |
The numbers are good and thank you for this detailed post, but I am curious about what these timings show.
Are they the timing of 1 iteration, of converging the same case, or of running a certain fixed number of iterations? I ask because I am working on a GPU framework myself now, and one of the tests I use is 4.1 million cells, so your mesh is similar to mine. For me, in the Wildkatze solver, a single-process solve on CPU averages 15 seconds per iteration for the segregated flow model. The GPU version takes an average of 0.35 seconds (average of 500 iterations). The GPU I have is an RTX 2080 Ti. I wanted to relate this to your numbers, but I couldn't. PS: I don't use AmgX or any other outside library; I am writing everything for my solver myself. |
|
February 21, 2023, 13:44 |
|
#10 | |
Member
Matt
Join Date: May 2011
Posts: 44
Rep Power: 15 |
Quote:
Typically you'd be comparing, say, A100s or A6000s versus ~24-64 core server CPUs. In those comparisons, each GPU has more memory bandwidth than even a 2P CPU node, albeit much less total memory. GPUs offer WAY more memory bandwidth for a given cost vs. CPUs, but WAY less memory capacity for a given cost. So solver compatibility issues aside, it comes down to where the scale of your cluster/simulations falls on the RAM bandwidth vs. capacity spectrum. Last edited by the_phew; February 22, 2023 at 08:57. |
||
February 25, 2023, 15:00 |
|
#11 | |
New Member
Dmitry
Join Date: Feb 2013
Posts: 29
Rep Power: 13 |
Quote:
|
||
February 25, 2023, 16:07 |
|
#12 | |
New Member
Dmitry
Join Date: Feb 2013
Posts: 29
Rep Power: 13 |
Quote:
I completely agree with you that CFD on modern HPC hardware is mostly RAM-bandwidth limited. According to my tests on different CPUs, ANSYS Fluent and Siemens Simcenter STAR-CCM+ are more CPU-frequency dependent than, for example, ANSYS CFX or OpenFOAM; but even in Fluent there is still a strong dependency on RAM bandwidth. Performance comparison of CFD products on GPUs is something new for me. The outcome of this series of GPU simulations is that we have to use data-center-class GPUs, due to the RAM capacity and bandwidth limitations of GeForce/Quadro products; the double precision performance of GeForce-class GPUs is also poor. The memory bandwidth of the Nvidia 1660 SUPER (320 GB/s) is equivalent to a dual AMD EPYC 7532 server with RAM limited to 2933 MT/s (2x188 GB/s ~ 375 GB/s), so for small tasks a direct comparison of this GPU against those CPUs is fair from the RAM-bandwidth point of view. Currently, CFD codes use GPUs in a very limited way, which is why GPU simulation still looks like future work. The ability to perform HPC simulation in the near future with the Fluent Native GPU solver in ANSYS Discovery Live using dedicated cluster GPUs looks fantastic; if ANSYS realizes this idea it will move CFD simulation to a new level of performance/quality. |
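The CPU-side bandwidth figure quoted above follows from the standard DDR formula: transfer rate x 8 bytes per channel x number of channels. A quick check of the EPYC 7532 vs GTX 1660 SUPER comparison (theoretical peaks only; sustained bandwidth is lower on both sides):

```python
# Theoretical DDR bandwidth: transfer rate (MT/s) x 8 bytes x channel count.
def ddr_bw_gbs(mts, channels):
    return mts * 8 * channels / 1000  # GB/s

epyc_7532_socket = ddr_bw_gbs(2933, 8)  # 8 DDR4 channels per socket
dual_socket = 2 * epyc_7532_socket
print(round(epyc_7532_socket, 1))  # 187.7 GB/s per socket
print(round(dual_socket, 1))       # 375.4 GB/s, vs 320 GB/s for the 1660 SUPER
```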
||
February 26, 2023, 08:12 |
|
#13 |
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,283
Rep Power: 34 |
Thank you for the reply. Let me normalise my results so that some comparison can be made.
First, I don't have a single precision version; I write double precision only. Also, the GPU I am using has roughly 3 times as many CUDA cores. My mesh is a 4.15 million cell trimmer mesh generated by STAR-CCM+ (the mesher). Since I average about 0.35 seconds per iteration, my timing in your reference frame would be roughly: 10 (iterations) x 0.35 sec x 3 (ratio of CUDA cores) ~ 10.5 seconds. And since this segregated model is a simplified one and still needs more things, like the second-order interpolation Fluent does, realistically it would end up around 15 seconds in your frame, which is about double your single precision result. PS: these are GPU-native Wildkatze results, meaning everything is solved on the GPU. |
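The normalisation above, spelled out. The per-iteration time and the CUDA-core ratio are the poster's own rough estimates, so the result is only a ballpark figure.

```python
# Normalising the Wildkatze GPU timing into the other poster's reference frame.
# Both scaling factors are the poster's own estimates, not measured values.
iterations = 10          # the benchmark above ran 10 iterations
sec_per_iter = 0.35      # RTX 2080 Ti, double precision, 4.15M cells
cuda_core_ratio = 3      # assumed 2080 Ti vs GTX 1660 SUPER core-count ratio
normalized = iterations * sec_per_iter * cuda_core_ratio
print(round(normalized, 2))  # 10.5 seconds in the 1660 SUPER "reference frame"
```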
|
August 8, 2023, 09:16 |
|
#14 |
New Member
Saratov Oblast
Join Date: Mar 2020
Posts: 1
Rep Power: 0 |
Hi all!
I have several single-socket and several dual-socket machines, and some GPUs. Perhaps off topic, but a question: is it possible in Fluent to bind each GPU to its own socket on a dual-socket machine? Or can a dual-socket machine not be treated in Fluent as two separate single-socket machines, each with its own GPU? |
|
March 28, 2024, 08:36 |
|
#15 |
New Member
|
@techtuner, I am interested in comparing my computer configuration's performance against your results. I could use a model of similar cell count, but it would still differ in sizes and angles. Could you share your model so that I can compare more accurately, and have real data to support my request for a new hardware budget?
|
|
April 1, 2024, 11:23 |
|
#16 |
New Member
Artem
Join Date: Aug 2023
Location: USA
Posts: 8
Rep Power: 3 |
I tested out my 4090 GPU in Fluent and, while it was indeed about 8x-10x faster than my 16-core Ryzen 9 7950X, it couldn't do DPM or Eulerian VOF. Not only that, but when I tested a simple case on both CPU and GPU, the GPU gave me a different answer than my CPU. My vendor calls it "cutting edge" technology, while I call it "barebones" early-adopter technology.
|
|
May 12, 2024, 11:15 |
SP vs DP
|
#17 | |
New Member
Join Date: Jun 2018
Posts: 7
Rep Power: 8 |
Quote:
Are you saying we should ignore using double precision?
||
May 12, 2024, 11:18 |
vram
|
#18 |
New Member
Join Date: Jun 2018
Posts: 7
Rep Power: 8 |
Quote: (techtuner)
"When the task requests more vRAM than the GPU has, 30-40% PCI-E bus loading was observed during computation, with extremely low overall simulation performance. Data transfer via x16 PCI-E 4.0 almost stops the simulation."
So PCI-E 5.0 should allow for smaller vRAM usage?
|
May 12, 2024, 11:20 |
4090
|
#19 | |
New Member
Join Date: Jun 2018
Posts: 7
Rep Power: 8 |
Quote:
|
||
May 12, 2024, 12:16 |
|
#20 |
Senior Member
Join Date: Jun 2011
Posts: 206
Rep Power: 16 |
Quote:
"I tested out my 4090 GPU on Fluent and while yes it was about 8x-10x faster than my Ryzen 9 7950X 16 core, it couldn't do DPM or Eulerian VOF. Not only that, but when I tested a simple case between CPU and GPU, the GPU gave me a different answer than my CPU. My vendor calls it 'cutting edge' technology, while I call it 'barebones' early adopter technology."
I observed the same weird behavior (different answers, or solver divergence when using the GPU) with FEA codes for electromagnetic simulations as well. Obviously not all the bugs have been ironed out, and "fast" doesn't necessarily mean "accurate". |
|
|
|
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
OpenFOAM benchmarks on various hardware | eric | Hardware | 799 | Today 05:15 |
General recommendations for CFD hardware [WIP] | flotus1 | Hardware | 19 | June 23, 2024 18:02 |
AMD Epyc 9004 "Genoa" buyers guide for CFD | flotus1 | Hardware | 8 | January 16, 2023 05:23 |
CPU for Flow3d | mik_urb | Hardware | 4 | December 4, 2022 22:06 |
Parallel speedup scales better than number of CPUs | MikeWorth | OpenFOAM Running, Solving & CFD | 5 | August 21, 2020 17:30 |