CFD Online — www.cfd-online.com
Home > Forums > General Forums > Hardware

CPUs vs GPUs for CFD?

Old   January 27, 2023, 18:52
Default CPUs vs GPUs for CFD?
  #1
New Member
 
Prince Edward Island
Join Date: May 2021
Posts: 26
Rep Power: 4
hami11 is on a distinguished road
Given that STAR-CCM+ and Fluent now both support GPU acceleration, why don't more users opt for GPU-accelerated computing over CPU-only computation? From what I've heard, GPUs offer a large speedup over CPUs in CFD, and honestly I don't see a significant drawback that would prevent anyone from using them. Is there a reason they are not mainstream, or will they eventually take over from CPUs?
hami11 is offline   Reply With Quote

Old   January 27, 2023, 19:00
Default
  #2
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,398
Rep Power: 46
flotus1 has a spectacular aura about
The main reason in my opinion: GPU acceleration does not have feature parity with CPU implementations of the same code. For cases where it has everything you need, it can work as advertised. For other cases, you are opening a can of worms.

See also General recommendations for CFD hardware [WIP]
Quote:
If you are still determined to leverage GPU acceleration for your CFD workstation, you need to do your own research. Important points to answer before buying a GPU for computing are:

Does your CFD package support GPU acceleration?
Do the solvers you intend to run benefit from GPU acceleration?
Are your models small enough to fit into GPU memory? Or vice versa, how much VRAM would you need to run your models?
Does GPU acceleration for your code work via CUDA (Nvidia only) or OpenCL (both AMD and Nvidia)?
Which GPUs are allowed for GPU acceleration? Some commercial software comes with whitelists for supported GPUs for acceleration, and refuses to work with other GPUs.
Single or double precision? All GPUs have tons of single precision floating point performance. But especially for Nvidia, only a few GPUs at the very top end also have noteworthy double precision floating point performance.
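One of the checklist items above — "are your models small enough to fit into GPU memory?" — can at least be sanity-checked on the back of an envelope. The sketch below is a rough sizing helper; the bytes-per-cell figure is a hypothetical rule of thumb (actual usage varies a lot with solver, enabled models, and precision), not a vendor-published number.

```python
# Rough vRAM sizing check for a CFD mesh. bytes_per_cell is an assumed
# rule-of-thumb figure, NOT a number published by any vendor; real
# usage depends on the solver, physics models, and precision.

def vram_needed_gb(n_cells, bytes_per_cell):
    """Rough vRAM estimate (in GB) for a mesh of n_cells."""
    return n_cells * bytes_per_cell / 1e9

# Example: ~1.6 kB/cell in double precision (roughly in line with the
# ~5.9 GB reported later in this thread for a 3.7M-cell case) against
# a 6 GB card:
need = vram_needed_gb(3.7e6, 1600)
print(f"{need:.2f} GB needed of 6 GB available")  # 5.92 GB needed
```

Estimates this crude only tell you when a card is clearly too small; anywhere near the limit, the only reliable answer is to load the case and watch actual vRAM usage.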
flotus1 is offline   Reply With Quote

Old   January 27, 2023, 20:54
Default
  #3
New Member
 
Prince Edward Island
Join Date: May 2021
Posts: 26
Rep Power: 4
hami11 is on a distinguished road
It looks like two premier CFD solvers (Fluent and STAR-CCM+) support both AMD and Nvidia GPUs now, so from what I understand GPUs should have an advantage over CPUs on those platforms?
hami11 is offline   Reply With Quote

Old   January 30, 2023, 17:07
Default
  #4
Member
 
Matt
Join Date: May 2011
Posts: 43
Rep Power: 14
the_phew is on a distinguished road
GPUs offer more computational throughput for a given hardware cost, but less memory capacity. So they can be a good fit if you are running lots of smaller simulations and/or have access to a massive GPU cluster.

For instance, I often run simulations requiring over 1TB of RAM. That means I would need over a dozen 80GB A100s (at a cost of $18k+ apiece, over $220k total) to run my simulations on a GPU cluster. Meanwhile, you can build a single 2P EPYC Genoa node with 128 cores and 1.5TB of DDR5 RAM for under $30k.

In this comparison, the GPU cluster would have about 24x the memory bandwidth of the CPU node, while costing about 7x as much; so scale favors GPUs. But in the real world, many of us have to run large simulations on smaller/cheaper clusters, so CPU wins out in those scenarios. Not to mention, commercial CFD vendors structure their licensing so that GPUs and CPUs cost about the same to license for a given computational throughput, so it really comes down to whether you'd rather spend your hardware dollars on speed or memory capacity.
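The arithmetic behind that trade-off can be sketched as follows. The bandwidth figures are rough public specs, not measurements — roughly 1.9 TB/s HBM2e per 80 GB A100 and roughly 0.46 TB/s per 12-channel DDR5-4800 Genoa socket — and the exact ratios shift with the card variant and memory speed you price.

```python
# Sketch of the bandwidth-vs-cost trade-off described above.
# All specs are approximate public figures, not measurements.

a100_count      = 13          # "over a dozen" 80 GB cards to exceed 1 TB
a100_price      = 18_000      # USD apiece (approximate)
a100_bw_tbs     = 1.9         # HBM2e bandwidth per card, TB/s (approx.)

cpu_node_price  = 30_000      # 2P EPYC Genoa, 128 cores, 1.5 TB DDR5
cpu_node_bw_tbs = 2 * 0.46    # two 12-channel DDR5-4800 sockets (approx.)

gpu_cluster_price = a100_count * a100_price
bw_ratio   = a100_count * a100_bw_tbs / cpu_node_bw_tbs
cost_ratio = gpu_cluster_price / cpu_node_price

# Comes out near the ~24x bandwidth / ~7x cost quoted in the post,
# with the spread due to which card variant and RAM speed you assume.
print(f"bandwidth ratio ~{bw_ratio:.0f}x, cost ratio ~{cost_ratio:.1f}x")
```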
CorbinMG likes this.
the_phew is offline   Reply With Quote

Old   January 31, 2023, 03:01
Default
  #5
Member
 
EM
Join Date: Sep 2019
Posts: 58
Rep Power: 6
gnwt4a is on a distinguished road
Despite the spectacular theoretical GPU GFLOPS, realizing a fair fraction of them is subject to less well-known obstacles, above and beyond having a large enough parallelizable operation count. One of the most important is data structures: ideally, GPU data must be stored and accessed contiguously (i.e. with unit stride), but this is largely impossible for 3D data structures, and it gets much worse for complicated geometries.

An even lesser-known limitation (unless you have programmed in OpenCL/CUDA) is that the programmer is responsible for managing data transfers between the video buffer and on-chip memory. Here you have to keep count of the number of registers in use, local-memory occupancy, and the bank access pattern, to avoid having read/write accesses serialized and/or your data shoved off to the video buffer, resulting in a two-orders-of-magnitude slowdown. It may also happen that some computation simply cannot be made to fit within on-chip memory.

Lesser known still is that some operations are one to two orders of magnitude slower than + or -: division, trig functions, logs, %, if() branches, and some others. Do not trust manufacturers' benchmarks; they quietly work around all of these to show the GFLOPS figures that impress.

Overall, GPU acceleration is not a given, especially for general-purpose commercial codes.
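The unit-stride point can be illustrated even from the host side. The NumPy sketch below shows how the same 3D field has wildly different strides per axis: traversing the last axis touches memory sequentially, while traversing the first axis jumps tens of kilobytes per element. The same principle governs GPU global-memory coalescing, just with harsher penalties.

```python
import numpy as np

# Unit-stride vs strided access on a 3D field. Summing along the last
# (C-contiguous) axis walks memory sequentially in 8-byte steps;
# summing along the first axis strides ny*nz elements per step.
# The results are identical, but on real hardware (CPU or GPU) the
# strided walk wastes memory bandwidth on data it doesn't use.
nx = ny = nz = 64
field = np.arange(nx * ny * nz, dtype=np.float64).reshape(nx, ny, nz)

contiguous_sum = field.sum(axis=2)   # unit stride: 8-byte steps
strided_sum    = field.sum(axis=0)   # stride: ny*nz*8 = 32768 bytes/step

# Per-axis strides in bytes for a C-ordered 64x64x64 float64 array:
print(field.strides)  # (32768, 512, 8)
```

For a structured box this is just an axis choice; the post's point is that for unstructured meshes around complicated geometries there is often no ordering that makes the hot loops unit-stride at all.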
oswald, wkernkamp, hurd and 1 others like this.
gnwt4a is offline   Reply With Quote

Old   February 4, 2023, 13:28
Default
  #6
New Member
 
Prince Edward Island
Join Date: May 2021
Posts: 26
Rep Power: 4
hami11 is on a distinguished road
I did suspect that the advertised speedup numbers were not as great as claimed. Do you know of any place where actual GPU vs CPU speedup numbers are listed?
hami11 is offline   Reply With Quote

Old   February 6, 2023, 05:19
Default
  #7
Senior Member
 
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,273
Rep Power: 34
arjun will become famous soon enough
Quote:
Originally Posted by hami11 View Post
It looks like two premier cfd solvers (fluent and ccm+) support both AMD and Nvidia GPUs now, so from what I understand gpus should have an advantage over cpus on those platforms?
Do they really support AMD GPUs? I ask because I know that Fluent and STAR-CCM+ are both using AmgX, which I assume is CUDA-only.

Ansys was involved in the development of AmgX, if I understand correctly. It is possible that Ansys has developed an in-house alternative to AmgX that supports AMD too. STAR-CCM+ is definitely not there as of today (they are still on AmgX).

(My impression is that they will develop an in-house AMG too, based on what I hear.)
arjun is offline   Reply With Quote

Old   February 19, 2023, 06:23
Default Performance tests of Fluent Native GPU Solver
  #8
New Member
 
Dmitry
Join Date: Feb 2013
Posts: 28
Rep Power: 13
techtuner is on a distinguished road
Here are some test results comparing the performance of CPU+GPU vs CPU-only in ANSYS Fluent.

Task formulation: simulation of water flow in a circular pipe.
BCs: inlet - V = 1 m/s, T = 300 K; outlet - static pressure 0 Pa; sidewall - wall heat flux 1e5 W/m2.
SIMPLE algorithm.
Mesh size: 2.1, 3.7 or 12.4 million cells.
Mesh type: unstructured mesh made by the Sweep method, with inflation layers near the sidewall.
Initial conditions: V = (0,0,1) m/s, P = 0 Pa, T = 300 K.
Number of iterations: 10.

Performance results (wall-clock time for 10 iterations):

(3.7 mln cells) task results (at this size, the GTX 1660 SUPER's vRAM is sufficient in SINGLE precision, but not in DOUBLE precision):
DOUBLE precision solver:
* ANSYS Fluent 2023 R1 Native GPU Solver: AMD Ryzen 5900x 12 cores SMT off (4x32 GB DDR4-3000 MT/s ECC Unbuffered, dual channel) + NVIDIA Geforce 1660 Super 6 GB vRAM (vRAM bandwidth computed by ANSYS Fluent: 320 GB/s): ~7500 sec, vRAM usage 5.9 GB.
* ANSYS Fluent 2023 R1 CPU Solver: AMD Ryzen 5900x 12 cores: 53.43 sec; Peak RAM usage - 10.24 GB.
* ANSYS Fluent 17.0 CPU Solver: 2 servers of dual AMD EPYC 7532, 128 cores total, SMT off; per processor: 8x64 GB DDR4-2933 MT/s ECC Reg, 2R, total 32 memory channels in 4 processors; custom liquid cooling of CPU's: 6.67 sec; Peak Resident RAM usage - 16.6 GB; Peak Virtual RAM usage - 181.2 GB.
* ANSYS Fluent 17.0 CPU Solver: 4 servers of dual AMD Opteron 6380, 128 cores total (64 FPU, 128 IPU); per processor: 4x16 GB DDR3-1600 MT/s ECC Reg, total 32 memory channels in 8 processors: 17.01 sec; Peak Resident RAM usage - 17.8 GB; Peak Virtual RAM usage - 180.8 GB.
* (Coupled/Pseudo transient method) ANSYS Fluent 17.0 CPU Solver: 2 servers of dual AMD EPYC 7532, 128 cores total, SMT off; per processor: 8x64 GB DDR4-2933 MT/s ECC Reg, 2R, total 32 memory channels in 4 processors; custom liquid cooling of CPU's: 13.3 sec.
* (Coupled/Pseudo transient method) ANSYS Fluent 17.0 CPU Solver: 4 servers of dual AMD Opteron 6380, 128 cores total (64 FPU, 128 IPU); per processor: 4x16 GB DDR3-1600 MT/s ECC Reg, total 32 memory channels in 8 processors: 38.2 sec.

SINGLE precision solver:
* Native GPU Solver: AMD Ryzen 5900x 12 cores + NVIDIA Geforce 1660 Super 6 GB vRAM: 17.0 sec.
* Native GPU Solver: AMD Ryzen 5900x 2 cores + NVIDIA Geforce 1660 Super 6 GB vRAM: 7.9 sec, vRAM usage - 4.2 GB, peak RAM usage - 8.2 GB.
* CPU Solver: AMD Ryzen 5900x 12 cores: 77.88 sec; Peak RAM usage - 7.53 GB.
* ANSYS Fluent 17.0 CPU Solver: 2 servers of dual AMD EPYC 7532, 128 cores total, SMT off; per processor: 8x64 GB DDR4-2933 MT/s ECC Reg, 2R, total 32 memory channels in 4 processors; custom liquid cooling of CPU's: 6.33 sec; Peak Resident RAM usage - 13.4 GB; Peak Virtual RAM usage - 177.6 GB.
* ANSYS Fluent 17.0 CPU Solver: 4 servers of dual AMD Opteron 6380, 128 cores total (64 FPU, 128 IPU); per processor: 4x16 GB DDR3-1600 MT/s ECC Reg, total 32 memory channels in 8 processors: 15.12 sec; Peak Resident RAM usage - 14.6 GB; Peak Virtual RAM usage - 177.1 GB.

(2.1 mln cells) task results, ANSYS Fluent 2023 R1 (at this size, the GTX 1660 SUPER's vRAM is sufficient in both SINGLE and DOUBLE precision):
* Native GPU Solver, SINGLE precision, AMD Ryzen 5900x 2 cores (SMT off) + NVIDIA Geforce 1660 Super 6 GB vRAM: 5.08 sec; Peak vRAM usage by Solver - 2.47 GB; Peak RAM usage by Solver - 5.40 GB.
* Native GPU Solver, DOUBLE precision, AMD Ryzen 5900x 2 cores (SMT off) + NVIDIA Geforce 1660 Super 6 GB vRAM: 8.17 sec; Peak vRAM usage by Solver - 3.54 GB; Peak RAM usage by Solver - 6.01 GB.
* Native GPU Solver, SINGLE precision, AMD Ryzen 5900x 6 cores (SMT off) + NVIDIA Geforce 1660 Super 6 GB vRAM: 7.37 sec; Peak vRAM usage by Solver - 3.66 GB; Peak RAM usage by Solver - 7.29 GB.
* Native GPU Solver, DOUBLE precision, AMD Ryzen 5900x 6 cores (SMT off) + NVIDIA Geforce 1660 Super 6 GB vRAM: 9.40 sec; Peak vRAM usage by Solver - 4.47 GB; Peak RAM usage by Solver - 8.55 GB.
* CPU Solver, SINGLE precision, AMD Ryzen 5900x 6 cores (SMT off): 44.15 sec; Peak RAM usage by Solver - 4.12 GB.
* CPU Solver, DOUBLE precision, AMD Ryzen 5900x 6 cores (SMT off): 30.96 sec; Peak RAM usage by Solver - 5.58 GB.
* CPU Solver, SINGLE precision, AMD Ryzen 5900x 12 cores (SMT off): 35.67 sec; Peak RAM usage by Solver - 4.81 GB.
* CPU Solver, DOUBLE precision, AMD Ryzen 5900x 12 cores (SMT off): 27.39 sec; Peak RAM usage by Solver - 6.22 GB.

(12.4 mln cells) task results, ANSYS Fluent 2023 R1 (at this size, the GTX 1660 SUPER's vRAM is insufficient in both SINGLE and DOUBLE precision):
* Native GPU Solver, SINGLE precision solver: AMD Ryzen 5900x 2 cores + NVIDIA Geforce 1660 Super 6 GB vRAM: 23760 sec, vRAM usage - 5.9 GB.
* CPU Solver, SINGLE precision solver: AMD Ryzen 5900x 12 cores: 301 sec, peak RAM usage - 20.1 GB.

When there is enough vRAM for the task, the performance of ANSYS Fluent's GPU solver on a low/mid-range graphics card in a gaming PC is equivalent to a dual-server cluster based on Zen 2 EPYC with a system price about 45 times higher!

When the task requires more vRAM than is available on the GPU, 30-40% PCIe bus load was observed during computation, with extremely low overall simulation performance. Data transfer over x16 PCIe 4.0 nearly stops the simulation.

So the new Native GPU solver is aimed, first of all, at modern GPU clusters with multiple NVIDIA H100 80 GB cards connected via NVLink/NVSwitch (900 GB/s). In that case performance will be huge.
With GeForce/Quadro GPUs, it is possible to solve small tasks with high performance.

According to NVIDIA.COM, the compute performance of the NVIDIA GeForce GTX 1660 SUPER 6 GB is 5.0 TFLOPS in FP32 and 0.15 TFLOPS in FP64. Yet in ANSYS the performance difference between single and double precision was less than 1.6x.
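That last observation — a 32:1 FP32:FP64 hardware ratio producing under a 1.6x runtime gap — is what you'd expect if the solver is memory-bandwidth bound rather than compute bound. A minimal roofline-style sketch, using the spec numbers quoted in this post (Nvidia lists 336 GB/s for the 1660 SUPER; Fluent reported 320), shows why: sparse CFD kernels sit far below the flops-per-byte ridge point of either precision, so both precisions ride the bandwidth roof and FP64 mainly pays for moving twice the bytes.

```python
# Simple roofline reasoning for the GTX 1660 SUPER figures above.
# A bandwidth-bound solver is limited by bytes moved, not flops, so
# doubling the word size caps the slowdown near 2x regardless of the
# card's weak FP64 throughput.

bw_gbs  = 336.0    # memory bandwidth, GB/s (Nvidia spec; Fluent reported 320)
fp32_tf = 5.0      # FP32 throughput, TFLOPS
fp64_tf = 0.15     # FP64 throughput, TFLOPS

# Arithmetic intensity (flops per byte) at which each precision stops
# being bandwidth-bound and becomes compute-bound:
ridge_fp32 = fp32_tf * 1e12 / (bw_gbs * 1e9)   # ~14.9 flops/byte
ridge_fp64 = fp64_tf * 1e12 / (bw_gbs * 1e9)   # ~0.45 flops/byte

# Sparse matrix-vector kernels typical of CFD run well below
# ~1 flop/byte, i.e. under both ridge points: bandwidth-bound either way.
print(round(ridge_fp32, 1), round(ridge_fp64, 2))
```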

Last edited by techtuner; March 6, 2023 at 10:47. Reason: New data was added. The text was restructured.
techtuner is offline   Reply With Quote

Old   February 21, 2023, 09:43
Default
  #9
Senior Member
 
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,273
Rep Power: 34
arjun will become famous soon enough
The numbers are good, and thank you for this detailed post, but I am curious as to what these timings show.

Are they timings of one iteration? Or timings to converge the same case?

Or are they timings for running a certain fixed number of iterations?

I ask because I am working on a GPU framework myself now, and one of the tests I use is 4.1 million cells, so your mesh is similar to mine.

For me, in the Wildkatze solver, a single-process solve on CPU averages 15 seconds for the segregated flow model. The GPU version takes an average of 0.35 seconds (average over 500 iterations). The GPU I have is an RTX 2080 Ti.

I wanted to relate this to your numbers but couldn't.

PS: I don't use AmgX or any other outside library; I am writing everything for my solver myself.
arjun is offline   Reply With Quote

Old   February 21, 2023, 13:44
Default
  #10
Member
 
Matt
Join Date: May 2011
Posts: 43
Rep Power: 14
the_phew is on a distinguished road
Quote:
Originally Posted by techtuner View Post
So, performance in SINGLE precision mode of the GPU solver on a low/mid-range graphics card in a gaming PC is equivalent to a dual-server cluster based on Zen 2 EPYC with a system price about 45 times higher!
In most CFD benchmarks I've seen, simulation throughput mostly scales with memory bandwidth (for any CPU or GPU that would be used for CFD). Considering that a Geforce 1660 has roughly the same memory bandwidth as a single EPYC Rome/Milan CPU, I'd expect them to perform similarly. Although it would be difficult to do an apples-to-apples comparison, since a 1660 only has 6GB of RAM (not many practical CFD cases will fit inside 6GB of RAM).

Typically you'd be comparing say A100s or A6000s versus ~24-64 core server CPUs. In those comparisons, each GPU has more memory bandwidth than even a 2P CPU node, albeit much less total memory. GPUs offer WAY more memory bandwidth for a given cost vs. CPUs, but WAY less memory capacity for a given cost. So solver compatibility issues aside, it comes down to where the scale of your cluster/simulations falls on the RAM bandwidth vs. capacity spectrum.
wkernkamp likes this.

Last edited by the_phew; February 22, 2023 at 08:57.
the_phew is offline   Reply With Quote

Old   February 25, 2023, 15:00
Default
  #11
New Member
 
Dmitry
Join Date: Feb 2013
Posts: 28
Rep Power: 13
techtuner is on a distinguished road
Quote:
Originally Posted by arjun View Post
The numbers are good, and thank you for this detailed post, but I am curious as to what these timings show.

Are they timings of one iteration? Or timings to converge the same case?

Or are they timings for running a certain fixed number of iterations?

I ask because I am working on a GPU framework myself now, and one of the tests I use is 4.1 million cells, so your mesh is similar to mine.

For me, in the Wildkatze solver, a single-process solve on CPU averages 15 seconds for the segregated flow model. The GPU version takes an average of 0.35 seconds (average over 500 iterations). The GPU I have is an RTX 2080 Ti.

I wanted to relate this to your numbers but couldn't.

PS: I don't use AmgX or any other outside library; I am writing everything for my solver myself.
The post has been rewritten. I hope it is now clear that the figures are wall-clock times for 10 iterations.
techtuner is offline   Reply With Quote

Old   February 25, 2023, 16:07
Default
  #12
New Member
 
Dmitry
Join Date: Feb 2013
Posts: 28
Rep Power: 13
techtuner is on a distinguished road
Quote:
Originally Posted by the_phew View Post
In most CFD benchmarks I've seen, simulation throughput mostly scales with memory bandwidth (for any CPU or GPU that would be used for CFD). Considering that a Geforce 1660 has roughly the same memory bandwidth as a single EPYC Rome/Milan CPU, I'd expect them to perform similarly. Although it would be difficult to do an apples-to-apples comparison, since a 1660 only has 6GB of RAM (not many practical CFD cases will fit inside 6GB of RAM).

Typically you'd be comparing say A100s or A6000s versus ~24-64 core server CPUs. In those comparisons, each GPU has more memory bandwidth than even a 2P CPU node, albeit much less total memory. GPUs offer WAY more memory bandwidth for a given cost vs. CPUs, but WAY less memory capacity for a given cost. So solver compatibility issues aside, it comes down to where the scale of your cluster/simulations falls on the RAM bandwidth vs. capacity spectrum.
Thanks for your comment.
I completely agree that CFD on modern HPC hardware is mostly RAM-bandwidth limited.
According to my tests on different CPUs, the architecture of ANSYS Fluent and Siemens Simcenter STAR-CCM+ is more CPU-frequency dependent than, for example, ANSYS CFX or OpenFOAM. But in Fluent we still have a strong dependency on RAM bandwidth.

Performance comparison of CFD products on GPUs is something new for me. From this series of numerical simulations on GPU, it was clear that we have to use data-center-class GPUs for serious simulations, due to the RAM amount and bandwidth limitations of GeForce/Quadro products. The double-precision performance of GeForce-class GPUs is also poor.

The memory bandwidth of the Nvidia 1660 SUPER (320 GB/s) is equivalent to a dual AMD EPYC 7532 server with RAM limited to 2933 MT/s (2 x 188 GB/s ~ 375 GB/s). For small tasks, a direct comparison of this GPU and those CPUs looks fair from the standpoint of RAM bandwidth.

Currently, CFD codes use GPUs in a very limited way, which is why GPU simulation is still seen as future work. The prospect of running HPC simulations in the near future on the Fluent Native GPU solver in ANSYS Discovery Live, using separate cluster GPUs, looks fantastic. Realizing this idea would move CFD simulation to a new level of performance and quality.
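The GPU-vs-CPU bandwidth equivalence above comes straight from the standard peak-DDR formula: channels x transfer rate (MT/s) x 8 bytes per transfer. A short sketch of that arithmetic, using the figures quoted in this post:

```python
# Peak DDR bandwidth = channels x transfer rate (MT/s) x 8 bytes.
# Numbers match the comparison above: GTX 1660 SUPER vs dual EPYC 7532.

def ddr_bandwidth_gbs(channels, mts):
    """Theoretical peak DDR bandwidth in GB/s."""
    return channels * mts * 8 / 1000

epyc_7532_socket = ddr_bandwidth_gbs(8, 2933)   # 8ch DDR4-2933: ~188 GB/s
dual_socket      = 2 * epyc_7532_socket         # ~375 GB/s
gtx_1660_super   = 320.0                        # GB/s, as reported by Fluent

print(round(epyc_7532_socket), round(dual_socket))  # 188 375
```

These are theoretical peaks; sustained bandwidth on both the GPU and the CPUs lands noticeably lower, but the rough parity holds either way.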
techtuner is offline   Reply With Quote

Old   February 26, 2023, 08:12
Default
  #13
Senior Member
 
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,273
Rep Power: 34
arjun will become famous soon enough
Thank you for the reply. Now I can normalise my results so that some comparison can be made.

First, I don't have a single-precision version; I write double precision only.

Also, the GPU I am using has roughly 3 times as many CUDA cores.

My mesh is a 4.15-million-cell trimmer mesh generated by STAR-CCM+ (the mesher).

Since I get on average 0.35 seconds per iteration, my timing in your comparison would be:

10 (iterations) x 0.35 x 3 (ratio of CUDA cores), which works out to around 10.5 seconds in your reference frame.

Since this segregated model is a simplified one and still needs more things, like the second-order interpolation that Fluent does, realistically it would end up around 15 seconds in your frame.

Which is double your single-precision result.

PS: these are GPU-native Wildkatze results, meaning everything is solved on the GPU.
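The normalisation above is simple enough to write down. Note the CUDA-core ratio is the poster's rough scaling factor (RTX 2080 Ti has 4352 cores vs 1408 on a GTX 1660 SUPER, so ~3x), not a calibrated performance model:

```python
# Normalising Wildkatze per-iteration GPU timings to the 10-iteration
# Fluent benchmarks above. The 3x CUDA-core ratio is a rough scaling
# assumption, not a measured performance ratio.

sec_per_iter = 0.35   # Wildkatze GPU solver, double precision
iterations   = 10     # to match the Fluent benchmark runs
core_ratio   = 3      # ~4352 / 1408 CUDA cores (2080 Ti vs 1660 SUPER)

normalised = sec_per_iter * iterations * core_ratio
print(round(normalised, 1))  # 10.5
```

Scaling linearly by CUDA-core count ignores the memory-bandwidth differences discussed earlier in the thread, so this should be read as an order-of-magnitude comparison only.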
arjun is offline   Reply With Quote

Old   August 8, 2023, 09:16
Default
  #14
New Member
 
Saratov Oblast
Join Date: Mar 2020
Posts: 1
Rep Power: 0
ANEgorov is on a distinguished road
Hi all!
I have several single-socket and several dual-socket machines, and some GPUs. Perhaps off topic, but a question: is it possible in Fluent to bind each GPU to its own socket on a dual-socket machine? Or can a dual-socket machine not be treated in Fluent as two separate single-socket machines, each with its own GPU?
ANEgorov is offline   Reply With Quote

Old   March 28, 2024, 08:36
Default
  #15
New Member
 
Ricki
Join Date: Mar 2024
Posts: 1
Rep Power: 0
rikkiflow is on a distinguished road
@techtuner, I am interested in comparing my computer configuration's performance to your results. I could use a model similar in cell count, but it would still differ in dimensions and angles. Could you share your model so that I can compare more accurately, and have real data to support my request for a new hardware budget?
rikkiflow is offline   Reply With Quote

Old   April 1, 2024, 11:23
Default
  #16
New Member
 
northstrider's Avatar
 
Artem
Join Date: Aug 2023
Location: USA
Posts: 7
Rep Power: 2
northstrider is on a distinguished road
I tested my 4090 GPU in Fluent, and while it was about 8-10x faster than my Ryzen 9 7950X 16-core, it couldn't do DPM or Eulerian VOF. Not only that, but when I tested a simple case on both CPU and GPU, the GPU gave a different answer than the CPU. My vendor calls it "cutting-edge" technology; I call it bare-bones early-adopter technology.
northstrider is offline   Reply With Quote
