|
January 27, 2023, 18:52 |
CPUs vs GPUs for CFD?
|
#1 |
New Member
Prince Edward Island
Join Date: May 2021
Posts: 26
Rep Power: 5 |
Given that STAR-CCM+ and Fluent both support GPU acceleration now, why don't more users opt for GPU-accelerated computing over CPU-only computation? GPUs supposedly offer a big speedup over CPUs in CFD, and honestly I don't see a significant drawback that would prevent anyone from using them. Is there a reason they are not mainstream, or will they eventually take over from CPUs?
|
|
January 27, 2023, 19:00 |
|
#2 | |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,426
Rep Power: 49 |
The main reason in my opinion: GPU acceleration does not have feature parity with CPU implementations of the same code. For cases where it has everything you need, it can work as advertised. For other cases, you are opening a can of worms.
See also: General recommendations for CFD hardware [WIP]
|
||
January 27, 2023, 20:54 |
|
#3 |
New Member
Prince Edward Island
Join Date: May 2021
Posts: 26
Rep Power: 5 |
It looks like two premier CFD solvers (Fluent and STAR-CCM+) now support both AMD and Nvidia GPUs, so from what I understand GPUs should have an advantage over CPUs on those platforms?
|
|
January 30, 2023, 17:07 |
|
#4 |
Member
Matt
Join Date: May 2011
Posts: 44
Rep Power: 15 |
GPUs offer more computational throughput for a given hardware cost, but less memory capacity. So they can be a good fit if you are running lots of smaller simulations and/or have access to a massive GPU cluster.
For instance, I often run simulations requiring over 1 TB of RAM. That means I would need over a dozen 80 GB A100s (at $18k+ apiece, over $220k total) to run my simulations on a GPU cluster. Meanwhile, you can build a single 2P EPYC Genoa node with 128 cores and 1.5 TB of DDR5 RAM for under $30k. In this comparison, the GPU cluster would have about 24x the memory bandwidth of the CPU node while costing about 7x as much, so at scale the economics favor GPUs. But in the real world, many of us have to run large simulations on smaller/cheaper clusters, and there CPUs win out. Not to mention, commercial CFD vendors structure their licensing so that GPUs and CPUs cost about the same to license for a given computational throughput, so it really comes down to whether you'd rather spend your hardware dollars on speed or on memory capacity. |
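The cost/bandwidth trade-off above can be sketched as a back-of-envelope calculation. The prices and bandwidth figures below are rough assumptions for illustration (vendor list prices and peak bandwidths vary), not verified quotes; with slightly different assumptions you land near the ~24x bandwidth / ~7x cost ratios quoted in the post.

```python
# Back-of-envelope check of the GPU-cluster-vs-CPU-node comparison.
# All prices/bandwidths are assumptions for illustration, not vendor data.
A100_MEM_GB = 80
A100_PRICE = 18_000         # USD apiece (assumed)
A100_BW_GBS = 2_000.0       # GB/s HBM bandwidth per A100 (approximate)

CPU_NODE_PRICE = 30_000     # 2P EPYC Genoa node, 1.5 TB DDR5 (assumed)
CPU_NODE_BW_GBS = 24 * 4.8 * 8  # 24 DDR5-4800 channels x 8 bytes ~ 921.6 GB/s

sim_ram_gb = 1000                        # the ">1 TB" case from the post
n_gpus = -(-sim_ram_gb // A100_MEM_GB)   # ceiling division
gpu_cluster_price = n_gpus * A100_PRICE
bw_ratio = n_gpus * A100_BW_GBS / CPU_NODE_BW_GBS
cost_ratio = gpu_cluster_price / CPU_NODE_PRICE

print(n_gpus)                 # 13 GPUs needed just to hold the case
print(gpu_cluster_price)      # 234000
print(round(bw_ratio))        # 28 (x the CPU node's bandwidth)
print(round(cost_ratio, 1))   # 7.8 (x the CPU node's price)
```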
|
January 31, 2023, 03:01 |
|
#5 |
Member
EM
Join Date: Sep 2019
Posts: 59
Rep Power: 7 |
Despite the spectacular theoretical GPU GFLOPS, realizing a fair fraction of them is subject to less-known obstacles, above and beyond there being enough parallelizable operation count. One of the most important is data structures: ideally, GPU data must be contiguously stored and accessed (i.e. unit stride), but this is largely not possible for 3D data structures, and it gets much worse for complicated geometries. An even lesser-known limitation (unless you have programmed in OpenCL/CUDA) is that the programmer is responsible for managing the data transfers between the video buffer and on-chip memory. Here you have to keep count of the number of registers in use, local memory fill-up, and bank access patterns to avoid having read/write access serialized and/or your data shoved off to the video buffer, resulting in a two-orders-of-magnitude slowdown. It may also happen that some computation cannot be made to fit within the on-chip memory. Lesser known still is that some operations are one to two orders of magnitude slower than + or -: division, trig functions, logs, %, if() and some others. Do not trust manufacturers' benchmarks; they silently get around all of this to show the GFLOPS figures that impress. Overall, GPU acceleration is not a given, especially for general-purpose commercial codes. |
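The gap between advertised and realized GFLOPS can be illustrated with a minimal roofline-model sketch: for a memory-bound kernel, attainable performance is capped by bandwidth times arithmetic intensity, not by peak FLOPS. The numbers below are illustrative assumptions (the 1660 SUPER figures come from later in this thread; the ~0.25 flops/byte intensity is a typical ballpark for sparse matrix-vector products, the core of implicit CFD solvers).

```python
# Minimal roofline sketch: why a bandwidth-bound CFD kernel realizes only a
# small fraction of a GPU's theoretical FLOPS. Numbers are illustrative.
def attainable_gflops(peak_gflops, mem_bw_gbs, flops_per_byte):
    """Roofline model: performance is capped by compute OR by memory traffic."""
    return min(peak_gflops, mem_bw_gbs * flops_per_byte)

peak = 5000.0   # GTX 1660 SUPER FP32 peak, ~5 TFLOPS (figure from this thread)
bw = 320.0      # GB/s vRAM bandwidth (figure from this thread)

# Sparse matrix-vector product: on the order of 0.25 flops per byte moved.
spmv = attainable_gflops(peak, bw, 0.25)
print(spmv)                # 80.0 GFLOPS
print(100 * spmv / peak)   # 1.6 -> only ~1.6% of the advertised peak
```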
|
February 4, 2023, 13:28 |
|
#6 |
New Member
Prince Edward Island
Join Date: May 2021
Posts: 26
Rep Power: 5 |
I did suspect that the advertised speedup numbers were not as great as claimed. Do you know of any place where actual measured GPU vs CPU speedups are listed?
|
|
February 6, 2023, 05:19 |
|
#7 | |
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,283
Rep Power: 34 |
Quote:
Ansys was involved in the development of AmgX, if I understand correctly. It is possible that Ansys has developed an in-house alternative to AmgX that supports AMD GPUs too. Star-CCM+ is definitely not there yet as of today (they are still on AmgX); my impression from what I hear is that they will develop an in-house AMG as well. |
||
February 19, 2023, 06:23 |
Performance tests of Fluent Native GPU Solver
|
#8 |
New Member
Dmitry
Join Date: Feb 2013
Posts: 29
Rep Power: 13 |
Here are some test results comparing CPU+GPU vs CPU-only performance in ANSYS Fluent.
Formulation of the task: simulation of water flow in a circular pipe. BCs: inlet V=1 m/s, T=300 K; outlet static pressure 0 Pa; sidewall heat flux 1e5 W/m2. SIMPLE algorithm. Mesh size: 2.1, 3.7 or 12.4 million cells. Mesh type: unstructured mesh made by the Sweep method with inflation layers near the sidewall. Initial conditions: V=(0,0,1) m/s, P=0 Pa, T=300 K. Number of iterations: 10. All performance results below are wall-clock time for 10 iterations.

3.7 million cell results (at this task size the GTX 1660 SUPER has enough vRAM in SINGLE precision, but not in DOUBLE precision):

DOUBLE precision solver:
* ANSYS Fluent 2023 R1 Native GPU Solver: AMD Ryzen 5900X, 12 cores, SMT off (4x32 GB DDR4-3000 MT/s ECC unbuffered, dual channel) + NVIDIA GeForce 1660 Super 6 GB vRAM (vRAM bandwidth reported by Fluent: 320 GB/s): ~7500 sec; vRAM usage 5.9 GB.
* ANSYS Fluent 2023 R1 CPU Solver: AMD Ryzen 5900X, 12 cores: 53.43 sec; peak RAM usage 10.24 GB.
* ANSYS Fluent 17.0 CPU Solver: 2 servers of dual AMD EPYC 7532, 128 cores total, SMT off; per processor 8x64 GB DDR4-2933 MT/s ECC Reg 2R, 32 memory channels total across 4 processors; custom liquid cooling: 6.67 sec; peak resident RAM 16.6 GB; peak virtual RAM 181.2 GB.
* ANSYS Fluent 17.0 CPU Solver: 4 servers of dual AMD Opteron 6380, 128 cores total (64 FPU, 128 IPU); per processor 4x16 GB DDR3-1600 MT/s ECC Reg, 32 memory channels total across 8 processors: 17.01 sec; peak resident RAM 17.8 GB; peak virtual RAM 180.8 GB.
* (Coupled/pseudo-transient) ANSYS Fluent 17.0 CPU Solver, the EPYC 7532 cluster above: 13.3 sec.
* (Coupled/pseudo-transient) ANSYS Fluent 17.0 CPU Solver, the Opteron 6380 cluster above: 38.2 sec.

SINGLE precision solver:
* Native GPU Solver: AMD Ryzen 5900X 12 cores + NVIDIA GeForce 1660 Super 6 GB vRAM: 17.0 sec.
* Native GPU Solver: AMD Ryzen 5900X 2 cores + NVIDIA GeForce 1660 Super 6 GB vRAM: 7.9 sec; vRAM usage 4.2 GB; peak RAM usage 8.2 GB.
* CPU Solver: AMD Ryzen 5900X 12 cores: 77.88 sec; peak RAM usage 7.53 GB.
* ANSYS Fluent 17.0 CPU Solver, the EPYC 7532 cluster above: 6.33 sec; peak resident RAM 13.4 GB; peak virtual RAM 177.6 GB.
* ANSYS Fluent 17.0 CPU Solver, the Opteron 6380 cluster above: 15.12 sec; peak resident RAM 14.6 GB; peak virtual RAM 177.1 GB.

2.1 million cell results, ANSYS Fluent 2023 R1 (at this task size the GTX 1660 SUPER has enough vRAM in both SINGLE and DOUBLE precision):
* Native GPU Solver, SINGLE precision, Ryzen 5900X 2 cores (SMT off) + 1660 Super: 5.08 sec; peak vRAM 2.47 GB; peak RAM 5.40 GB.
* Native GPU Solver, DOUBLE precision, Ryzen 5900X 2 cores (SMT off) + 1660 Super: 8.17 sec; peak vRAM 3.54 GB; peak RAM 6.01 GB.
* Native GPU Solver, SINGLE precision, Ryzen 5900X 6 cores (SMT off) + 1660 Super: 7.37 sec; peak vRAM 3.66 GB; peak RAM 7.29 GB.
* Native GPU Solver, DOUBLE precision, Ryzen 5900X 6 cores (SMT off) + 1660 Super: 9.40 sec; peak vRAM 4.47 GB; peak RAM 8.55 GB.
* CPU Solver, SINGLE precision, Ryzen 5900X 6 cores (SMT off): 44.15 sec; peak RAM 4.12 GB.
* CPU Solver, DOUBLE precision, Ryzen 5900X 6 cores (SMT off): 30.96 sec; peak RAM 5.58 GB.
* CPU Solver, SINGLE precision, Ryzen 5900X 12 cores (SMT off): 35.67 sec; peak RAM 4.81 GB.
* CPU Solver, DOUBLE precision, Ryzen 5900X 12 cores (SMT off): 27.39 sec; peak RAM 6.22 GB.

12.4 million cell results, ANSYS Fluent 2023 R1 (at this task size the GTX 1660 SUPER does not have enough vRAM in either precision):
* Native GPU Solver, SINGLE precision: Ryzen 5900X 2 cores + 1660 Super: 23760 sec; vRAM usage 5.9 GB.
* CPU Solver, SINGLE precision: Ryzen 5900X 12 cores: 301 sec; peak RAM usage 20.1 GB.

Conclusions: when there is enough vRAM for the task, the performance of the Fluent GPU solver on a low/mid-range graphics card in a gaming PC is equivalent to a dual-server Zen 2 EPYC cluster whose system price is about 45 times higher. When the task requests more vRAM than the GPU has, 30-40% PCI-E bus loading was observed during computation, with extremely low overall simulation performance; data transfer over x16 PCI-E 4.0 almost stops the simulation. So the new Native GPU Solver is aimed, first of all, at modern GPU clusters with multiple NVIDIA H100 80 GB cards connected via NVLink/NVSwitch (900 GB/s), where performance will be huge. With GeForce/Quadro GPUs it is possible to solve small tasks with high performance. According to NVIDIA.com, the compute performance of the GeForce GTX 1660 SUPER is 5.0 TFLOPS in FP32 and 0.15 TFLOPS in FP64, yet in ANSYS the performance difference between single and double precision was less than 1.6x.

Last edited by techtuner; March 6, 2023 at 10:47. Reason: New data was added. The text was restructured. |
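The headline ratios in these results can be pulled out directly from the reported wall-clock times. The snippet below only uses numbers quoted in the post (single precision, 10 iterations); the labels are shorthand for the configurations above.

```python
# Speedups computed from the wall-clock times reported in the post
# (single precision, 3.7M-cell case, 10 iterations).
times = {
    "GPU 1660S + 2 CPU cores": 7.9,
    "CPU Ryzen 5900X 12c":     77.88,
    "CPU 2x dual EPYC 7532":   6.33,
}
base = times["CPU Ryzen 5900X 12c"]
for name, t in times.items():
    print(f"{name}: {base / t:.1f}x vs the 12-core desktop CPU")

# The out-of-vRAM 12.4M-cell case shows the cliff once PCIe transfers dominate:
print(round(23760 / 301))  # 79 -> the GPU run is ~79x SLOWER than the CPU run
```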
|
February 21, 2023, 09:43 |
|
#9 |
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,283
Rep Power: 34 |
The numbers are good and thank you for this detailed post, but I am curious about what these timings show.
Are they the timing of 1 iteration, of converging the same case, or of running a certain fixed number of iterations? I ask because I am working on a GPU framework myself now, and one of the tests I use is 4.1 million cells, so your mesh is similar to mine. For me, in the Wildkatze solver, a single-process solve on CPU averages 15 seconds per iteration for the segregated flow model. The GPU version takes an average of 0.35 seconds (average of 500 iterations). The GPU I have is an RTX 2080 Ti. I wanted to relate this to your numbers, but I couldn't. PS: I don't use AmgX or any other outside library; I am writing everything for my solver myself. |
|
February 21, 2023, 13:44 |
|
#10 | |
Member
Matt
Join Date: May 2011
Posts: 44
Rep Power: 15 |
Quote:
Typically you'd be comparing, say, A100s or A6000s versus ~24-64 core server CPUs. In those comparisons, each GPU has more memory bandwidth than even a 2P CPU node, albeit much less total memory. GPUs offer WAY more memory bandwidth for a given cost vs. CPUs, but WAY less memory capacity for a given cost. So solver compatibility issues aside, it comes down to where the scale of your cluster/simulations falls on the RAM bandwidth vs. capacity spectrum. Last edited by the_phew; February 22, 2023 at 08:57. |
||
February 25, 2023, 15:00 |
|
#11 | |
New Member
Dmitry
Join Date: Feb 2013
Posts: 29
Rep Power: 13 |
Quote:
|
||
February 25, 2023, 16:07 |
|
#12 | |
New Member
Dmitry
Join Date: Feb 2013
Posts: 29
Rep Power: 13 |
Quote:
I completely agree with you that CFD on modern HPC hardware is mostly RAM-bandwidth limited. According to my tests on different CPUs, ANSYS Fluent and Siemens Simcenter STAR-CCM+ are more CPU-frequency dependent than, for example, ANSYS CFX or OpenFOAM; but even in Fluent there is still a strong dependency on RAM bandwidth. Performance comparison of CFD products on GPUs is something new for me. The outcome of this series of GPU simulations is that we have to use data-center-class GPUs, due to the RAM capacity and bandwidth limitations of GeForce/Quadro products; the double precision performance of GeForce-class GPUs is also poor. The memory bandwidth of the Nvidia 1660 SUPER (320 GB/s) is equivalent to a dual AMD EPYC 7532 server with RAM limited to 2933 MT/s (2x188 GB/s ~ 375 GB/s), so for small tasks a direct comparison of this GPU against those CPUs is fair from the RAM-bandwidth point of view. Currently, CFD codes use GPUs in a very limited way, which is why GPU simulation still looks like future work. The ability to perform HPC simulation in the near future with the Fluent Native GPU solver in ANSYS Discovery Live using dedicated cluster GPUs looks fantastic; if ANSYS realizes this idea it will move CFD simulation to a new level of performance/quality. |
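The CPU-side bandwidth figure quoted above follows from the standard DDR formula: transfer rate x 8 bytes per channel x number of channels. A quick check of the EPYC 7532 vs GTX 1660 SUPER comparison (theoretical peaks only; sustained bandwidth is lower on both sides):

```python
# Theoretical DDR bandwidth: transfer rate (MT/s) x 8 bytes x channel count.
def ddr_bw_gbs(mts, channels):
    return mts * 8 * channels / 1000  # GB/s

epyc_7532_socket = ddr_bw_gbs(2933, 8)  # 8 DDR4 channels per socket
dual_socket = 2 * epyc_7532_socket
print(round(epyc_7532_socket, 1))  # 187.7 GB/s per socket
print(round(dual_socket, 1))       # 375.4 GB/s, vs 320 GB/s for the 1660 SUPER
```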
||
February 26, 2023, 08:12 |
|
#13 |
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,283
Rep Power: 34 |
Thank you for the reply. Let me normalise my results so that some comparison can be made.
First, I don't have a single precision version; I write double precision only. Also, the GPU I am using has roughly 3 times as many CUDA cores. My mesh is a 4.15 million cell trimmer mesh generated by STAR-CCM+ (the mesher). Since I average about 0.35 seconds per iteration, my timing in your reference frame would be roughly: 10 (iterations) x 0.35 sec x 3 (ratio of CUDA cores) ~ 10.5 seconds. And since this segregated model is a simplified one and still needs more things, like the second-order interpolation Fluent does, realistically it would end up around 15 seconds in your frame, which is about double your single precision result. PS: these are GPU-native Wildkatze results, meaning everything is solved on the GPU. |
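The normalisation above, spelled out. The per-iteration time and the CUDA-core ratio are the poster's own rough estimates, so the result is only a ballpark figure.

```python
# Normalising the Wildkatze GPU timing into the other poster's reference frame.
# Both scaling factors are the poster's own estimates, not measured values.
iterations = 10          # the benchmark above ran 10 iterations
sec_per_iter = 0.35      # RTX 2080 Ti, double precision, 4.15M cells
cuda_core_ratio = 3      # assumed 2080 Ti vs GTX 1660 SUPER core-count ratio
normalized = iterations * sec_per_iter * cuda_core_ratio
print(round(normalized, 2))  # 10.5 seconds in the 1660 SUPER "reference frame"
```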
|
August 8, 2023, 09:16 |
|
#14 |
New Member
Saratov Oblast
Join Date: Mar 2020
Posts: 1
Rep Power: 0 |
Hi all!
I have several single-socket and several dual-socket machines, and some GPUs. Perhaps off topic, but a question: is it possible in Fluent to bind each GPU to its own socket on a dual-socket machine? Or can a dual-socket machine not be treated in Fluent as two separate single-socket machines, each with its own GPU? |
|
March 28, 2024, 08:36 |
|
#15 |
New Member
|
@techtuner, I am interested in comparing my computer configuration's performance against your results. I could use a model of similar cell count, but it would still differ in sizes and angles. Could you share your model so that I can compare more accurately, and have real data to support my request for a new hardware budget?
|
|
April 1, 2024, 11:23 |
|
#16 |
New Member
Artem
Join Date: Aug 2023
Location: USA
Posts: 8
Rep Power: 3 |
I tested out my 4090 GPU in Fluent and, while it was indeed about 8x-10x faster than my 16-core Ryzen 9 7950X, it couldn't do DPM or Eulerian VOF. Not only that, but when I tested a simple case on both CPU and GPU, the GPU gave me a different answer than my CPU. My vendor calls it "cutting edge" technology, while I call it "barebones" early-adopter technology.
|
|
May 12, 2024, 11:15 |
SP vs DP
|
#17 | |
New Member
Join Date: Jun 2018
Posts: 7
Rep Power: 8 |
Quote:
Are you saying we should ignore using double precision?
||
May 12, 2024, 11:18 |
vram
|
#18 |
New Member
Join Date: Jun 2018
Posts: 7
Rep Power: 8 |
Quote: (techtuner)
"When the task requests more vRAM than the GPU has, 30-40% PCI-E bus loading was observed during computation, with extremely low overall simulation performance. Data transfer via x16 PCI-E 4.0 almost stops the simulation."
So PCI-E 5.0 should allow for smaller vRAM usage?
|
May 12, 2024, 11:20 |
4090
|
#19 | |
New Member
Join Date: Jun 2018
Posts: 7
Rep Power: 8 |
Quote:
|
||
May 12, 2024, 12:16 |
|
#20 |
Senior Member
Join Date: Jun 2011
Posts: 206
Rep Power: 16 |
Quote:
"I tested out my 4090 GPU on Fluent and while yes it was about 8x-10x faster than my Ryzen 9 7950X 16 core, it couldn't do DPM or Eulerian VOF. Not only that, but when I tested a simple case between CPU and GPU, the GPU gave me a different answer than my CPU. My vendor calls it 'cutting edge' technology, while I call it 'barebones' early adopter technology."
I observed the same weird behavior (different answers, or solver divergence when using the GPU) with FEA codes for electromagnetic simulations as well. Obviously not all the bugs have been ironed out, and "fast" doesn't necessarily mean "accurate". |
|
|
|
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
OpenFOAM benchmarks on various hardware | eric | Hardware | 799 | Today 05:15 |
General recommendations for CFD hardware [WIP] | flotus1 | Hardware | 19 | June 23, 2024 18:02 |
AMD Epyc 9004 "Genoa" buyers guide for CFD | flotus1 | Hardware | 8 | January 16, 2023 05:23 |
CPU for Flow3d | mik_urb | Hardware | 4 | December 4, 2022 22:06 |
Parallel speedup scales better than number of CPUs | MikeWorth | OpenFOAM Running, Solving & CFD | 5 | August 21, 2020 17:30 |