January 17, 2019, 09:29 |
GPU Parallelization in OpenFOAM
#1
Senior Member
krishna kant
Join Date: Feb 2016
Location: Hyderabad, India
Posts: 133
Rep Power: 10
Hello Foamers,
Our research group has a solver in OF 2.3.1 which needs to be parallelized on the GPU. I am new to GPU parallelization, so can anyone guide me on where to start? I did some searching on this topic and came across RapidCFD (this is good, but not open source) and GPGPU libraries (these run only the linear solver on the GPU, but I need to make the whole solver run on the GPU).
January 17, 2019, 21:45 |
#2
Senior Member
Andrew Somorjai
Join Date: May 2013
Posts: 175
Rep Power: 12
February 6, 2019, 19:41 |
#3
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22
Hello Krishna,
there are a number of technical aspects which need clarification, because the solution could be pretty simple or incredibly difficult depending on what exactly you want to do. Do you want to run the entire CFD solver (by solver I mean here something like simpleFoam or pimpleFoam) on a GPU, or just the matrix solvers like PCG or BiCGStab?

For ONE Nvidia or ONE AMD GPU, a solution is the PARALUTION library plus its OpenFoam plugin.

The foam-extend project implements the CUFFLINK library, which supports multi-GPU computations; this could possibly be ported to other OpenFoam versions. In this case, the matrix computations (PCG, BiCGStab, ...) would be moved to Nvidia GPUs. For multi-GPU computations on one node, it might be easier to implement the MAGMA library, which has options to span/compute a matrix across multiple GPUs. If you want to use AMD GPUs with CUFFLINK, you would have to port CUFFLINK using the hipify tool.

There are many more scenarios based on your requirements/plans and the specific hardware you want to use, i.e. Nvidia or AMD GPUs, which GPU generation, and how much GPU memory (often the showstopper). And depending on your programming skills, why not go for hybrid computations using the CPU and GPU cores?

Tell us a little more about what exactly you want to compute, and on which GPU models.

Klaus
February 10, 2019, 23:49 |
#4
Senior Member
krishna kant
Join Date: Feb 2016
Location: Hyderabad, India
Posts: 133
Rep Power: 10
Hello Klaus,
Sorry for the late reply. I would like to port the entire solver (icoFoam) to the Nvidia Tesla K20 and K80 GPUs. I am aware of, and have used, the linear-solver libraries, i.e. cufflink, but I also want to get rid of the time spent transferring data to the GPU. So I would like to port the whole solver onto the GPU using the CUDA programming model.
February 11, 2019, 03:57 |
#5
Member
W.T
Join Date: Oct 2012
Posts: 35
Rep Power: 13
In the RapidCFD GitHub repo there is a COPYING file which contains the GNU GPL license. Also, on the website of sim-flow (the developers of RapidCFD, https://sim-flow.com/rapid-cfd-gpu/) I've found a statement to the same effect.
February 11, 2019, 05:15 |
#6
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22
Hello Krishna,
the COPYING file in https://github.com/Atizar/RapidCFD-dev talks about GPL 3.0. Does this version include the 100,000-node limit of the free version, or is this the free version?

Here's an explanation of how icoFoam works: https://openfoamwiki.net/index.php/IcoFoam

An alternative approach could be to use MAGMA functionality to implement the above, as there are MAGMA solvers and BLAS routines working with multiple GPUs. This means you would convert the LDU matrix to CSR format as a starting point. I think MAGMA would relieve you from dealing with multiple GPUs, MPI and domain decomposition, and handle that for you.

The paper "Complete PISO and SIMPLE solvers on Graphics Processing Units" might give you some hints too. Let me know how you plan to approach it and I'll have another look into my archive.

Klaus
February 11, 2019, 05:26 |
#7
Member
W.T
Join Date: Oct 2012
Posts: 35
Rep Power: 13
To clear things up: there are two products developed by simFlow:

* simFlow - an OpenFOAM GUI; it has two licensing options, FREE (up to 100,000 nodes, up to 2 or 4 cores) and PRO (unlimited mesh size etc.)
* RapidCFD - a GPU port of OpenFOAM 2.3; open source (GNU GPL), with no limits on mesh size etc.
June 4, 2019, 12:02 |
#8
Senior Member
Agustín Villa
Join Date: Apr 2013
Location: Alcorcón
Posts: 313
Rep Power: 15
Hi,
do you have any news on this topic? I would like to do some tests on a graphics card I have. I was thinking of using RapidCFD and comparing it with standard OpenFOAM.
August 25, 2020, 16:20 |
#9
Super Moderator
Tobias Holzmann
Join Date: Oct 2010
Location: Tussenhausen
Posts: 2,708
Blog Entries: 6
Rep Power: 51
Hi Klaus, just one question as I came across your post. Have you already used OpenFOAM with the matrix calculations done on the GPU? I would be highly interested in that.
E.g. I have a 32-core machine and I am going to buy an Nvidia 30xx GPU card soon. Hence, I would be interested in using the GPU cores for the matrix operations in the Foundation version. If I got you right, I would need to check the cufflink library of the extend project and port it to the Foundation version, right? However, I cannot believe that it is that simple without any bottleneck, but I am not an expert in this topic.
__________________
Keep foaming, Tobias Holzmann

Last edited by Tobi; August 26, 2020 at 04:26.
August 26, 2020, 05:20 |
#10
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22
Hi Tobi,
let me start with some general background information. There's no need to port the cufflink library; it was one of the first implementations many years ago, hence it's well known, but it is not the latest approach. Implementations work with both mainstream OpenFOAM versions, OF7 and OF1812, the ones I last worked with. You're also NOT limited to Nvidia GPUs; AMD GPUs are fine as well! The upcoming Founders Edition from Nvidia is probably not a good choice, as it's unlikely that it will support fp64 / double-precision computations. Think of the "AMD Radeon Pro VII", or two of them.

Many people who worked on GPU implementations did it commercially, so they didn't share implementation details. With GPU computations becoming mainstream, this situation has changed.

There are three concepts for an implementation:
1: Solve the linear system on the GPU ("transfer" the matrix solvers PCG, BiCGStab, ... to the GPU; in practice, use GPU solver libraries)
2: Use the GPU as an additional "big" core (create hybrid CPU/GPU matrix solvers, as suggested by Amani AlOnazi in her 2013 thesis, again PCG, ...)
3: Move the entire solver (simpleFoam, pisoFoam, ...) to the GPU, as is done in RapidCFD, to avoid extensive, slow data transfers between CPU and GPU

Pros and cons:
1: "Easier" to implement, but often limited by GPU memory size and data transfer speed; PCIe 4.0 should help with performance, as data transfers should be less of a bottleneck.
2: Probably best for engineering workstations with many CPU cores, using the GPU as a "turbo booster"; needs a hybrid CPU/GPU solver implementation from scratch.
3: Removes most of the data transfer bottleneck, as updates are done on the GPU, but probably not worth the effort with PCIe 5.0/6.0 and CPU/GPU data coherency on the horizon.

Concept 1 is probably not worth the effort for you, as you would simply transfer computations from the 32-core CPU to a GPU, and the 32-core CPU would sit unused. If you invest in a dual-socket motherboard and another CPU, you might end up with a cheaper, faster solution.

Implementation:

Concept 1: Convert the matrix to CSR format (with global indices!), then solve the linear system with the solvers provided by one of the following LA libraries: MAGMA or hipMAGMA (for AMD), rocALUTION (for AMD, which can also be compiled for Nvidia as it's coded in HIP), AMGCL, GASPI's linear algebra solvers, or maybe AmgX. The OpenMP backends for CPU computations are usually a lot slower than using MPI; GPU computations are "automatic", as that's what the libraries are designed for. Most of your CPU cores will be wasted: usually 1 CPU core supports 1 GPU, which is why GPU servers in data centers have low-core-count CPUs. Engineering workstations are different.

Concept 2: I have started looking deeper into that, but need to upgrade my hardware to move forward. The idea is to use the GPU as a big core or "turbo booster". The above-mentioned LA libraries don't support hybrid CPU/GPU backend usage. Load balancing will be key for good performance (a performance factor comparing CPU core performance with GPU performance could be a simple solution, applied to an uneven matrix decomposition). One core per GPU will be needed to "support" the GPU. The coding challenge is how to link/integrate/interface the "big core" (supporting CPU core) <-> GPU computations with the OpenFOAM MPI communication of the remaining CPU cores.

Concept 3: Use RapidCFD. I see a high risk that development work will be a waste of time with faster PCIe and CPU/GPU data coherency on the horizon.

Klaus
August 26, 2020, 08:50 |
#11
Super Moderator
Tobias Holzmann
Join Date: Oct 2010
Location: Tussenhausen
Posts: 2,708
Blog Entries: 6
Rep Power: 51
Hi Klaus,
thank you for your comprehensive reply regarding GPU implementation and OpenFOAM (or CFD in general, or CPU/GPU in general). What is your conclusion regarding solving the momentum and pressure equations as a coupled system on the GPU rather than the CPU? Shouldn't it be much faster? Okay, I am not talking about test cases with 500,000 cells, more in the direction of > 20,000,000 cells. Out of the box, without any experience, I would expect that solving the linear system on the GPU should be faster, as the GPU architecture is designed for parallel calculations compared to CPUs.

However, you are right, the 32-core AMD TR is a good one. Probably one has more benefit using either two 32-core CPUs on one motherboard, or a 64-core one out of the box, as there is no need for new programming. I will think about that, but I will probably buy the Founders 3090 card, as I am also doing renderings and ray-tracing stuff.

Tobi
__________________
Keep foaming, Tobias Holzmann
August 26, 2020, 10:37 |
#12
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22
Hi Tobi,
I don't have any benchmarks on recent hardware with PCIe 4.0 comparing a recent 16-core (or better, 32-core) CPU with a recent PCIe 4.0 GPU like the AMD Radeon Pro VII.

Real benefits for workstations should come from concept 2. Concept 1 will be beneficial for cases of a hardware-specific size that fit into GPU memory and where the memory size is in tune with the PCIe speed. Asynchronous data transfer and other features that come with the latest LA libraries are also beneficial.

When PCIe 2.0 was standard, speedups of 1.2x-1.8x compared to the old 6-core Xeons (or two of them), in rare cases up to 3.5x, were realistic. Claims of 10x...40x speedups were comparisons between 4-core gaming processors with only 2 memory channels and a then-professional GPU like the Nvidia K80, or neglected the slow PCIe data transfer, comparing raw CPU hardware performance to GPU hardware performance rather than the time needed to solve a CFD simulation.

GPU computation should improve stability, as reductions on the GPU produce smaller errors.

There are many forms of hybridisation of CPU/GPU computations that could be beneficial, but the cost/benefit calculation for an expensive GPU in an OpenFOAM workstation is difficult. Nowadays you can buy a lot of CPU performance for the price of a GPU, and the CPU should be more versatile for a wider range of cases.

If you want to leverage the 3080, check whether your cases are suitable for mixed-precision fp64/fp32 computations, but that's a different topic.

Klaus
September 14, 2020, 09:08 |
#13
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22
@Tobi,
to use OpenFOAM with a new Nvidia card and your 32-core processor, you could try the following: install "PETSc4FOAM" (a library to plug PETSc into the OpenFOAM framework) and extend it with the "AmgXWrapper" for PETSc.

A nice feature of the AmgXWrapper is: "... when the number of MPI processes is greater than the number of GPU devices, this wrapper will do the system consolidation/data scattering/data gathering automatically. ..."

See also "Integrating OpenFOAM and GPUs using AmgX" by Rathnayake, T., where you can find details about the AmgX setup which might be helpful.

It would be great if you could provide a benchmark.

Klaus
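For orientation, with PETSc4FOAM installed, selecting PETSc from fvSolution looks roughly like the fragment below. Treat the keywords as an assumption recalled from the petscFoam examples, not as a verified reference; the exact dictionary entries (and any AmgX-specific options) should be checked against the PETSc4FOAM and AmgXWrapper documentation:

```
p
{
    solver          petsc;          // provided by the petscFoam plugin
    petsc
    {
        options
        {
            ksp_type    cg;         // PETSc Krylov solver
            pc_type     gamg;       // algebraic multigrid preconditioner
        }
    }
    tolerance       1e-06;
    relTol          0.01;
}
```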
December 24, 2020, 12:48 |
#14
New Member
kenan
Join Date: Sep 2012
Posts: 9
Rep Power: 13
Hi Klaus,
I have been experimenting with the CUDA solvers of foam-extend-4.0 (where the linear systems are solved on the GPU only). Naturally, I imagined one GPU needs one core, and I set up everything like that. As I observed that the GPU memory was far from full and the rest of the CPU cores were wasted, I wanted to try more MPI tasks per GPU. Surprisingly, it worked, and the resulting performance seemed to be better. Is this approach something to advise against? I don't know anything about the details of CPU-GPU memory copying, but I wonder whether the MPI tasks end up competing for the CPU-GPU bandwidth.

Note: this approach was still not good enough for me, because I could not manage to use the two GPU cards in the machine. All the MPI tasks happened to use the first card only, simultaneously.
December 26, 2020, 04:47 |
#15
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22
Hello,
I think foam-extend still uses the CUFFLINK library. Check here how to enable multi-GPU mode: https://code.google.com/archive/p/cu...artedPage.wiki - sections "Multi-GPU Implementation" and "Possible changes in compute mode".

If you find the time, maybe you could run and share a benchmark of a larger tutorial case, like the motorBike tutorial, on a current CPU vs. your GPU. It would be especially interesting if you have a system supporting PCIe 4.0.

Klaus
December 5, 2022, 11:46 |
"Integrating OpenFOAM and GPUs using AmgX" by Rathnayake, T.
#16
New Member
Join Date: Aug 2022
Posts: 16
Rep Power: 3
Thanks
January 16, 2023, 14:08 |
Integrating OpenFOAM and GPUs using amgX
#17
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22
See:
https://github.com/barbagroup/AmgXWrapper

AmgXWrapper library details from the author: https://on-demand.gputechconf.com/gt...-cfd-codes.pdf