GPU Parallelization in OpenFOAM

January 17, 2019, 09:29   #1
krishna kant (kk415), Senior Member
Hello Foamers,

Our research group has a solver in OF 2.3.1 which needs to be parallelized on the GPU. I am new to GPU parallelization, so can anyone guide me on where to start? I did some Google searching on this topic and came across RapidCFD (this is good, but not open source) and GPGPU (this runs the linear solver on the GPU, but I need the solver to run fully on the GPU).

January 17, 2019, 21:45   #2
Andrew Somorjai (massive_turbulence), Senior Member
Quote:
Originally Posted by kk415 View Post (see post #1 above)
AFAIK, OpenFOAM uses Open MPI, and that's CPU only; you'd have to port OpenFOAM yourself to use GPU calculations. With the CPU and RAM you have more memory anyway for bigger meshes. GPU memory is expensive.

February 6, 2019, 19:41   #3
Klaus (klausb), Senior Member
Hello Krishna,

there are a number of technical aspects which need clarification because the solution could be pretty simple or incredibly difficult depending on what exactly you want to do.

Do you want to run the entire CFD solver (by solver I mean here something like simpleFoam or pimpleFoam) on a GPU, or just the matrix solvers like PCG or BiCGStab? The solution for ONE Nvidia or ONE AMD GPU = the PARALUTION library + the OpenFOAM plugin.

Then, the foam-extend project implements the CUFFLINK library, which supports multi-GPU computations. This could possibly be ported to other OpenFOAM versions. In this case, the matrix computations (PCG, BiCGStab...) would be moved to Nvidia GPUs. For multi-GPU computations on one node, it might be easier to implement the MAGMA library, which has options to span/compute a matrix across multiple GPUs. If you want to use AMD GPUs + CUFFLINK, you would have to port CUFFLINK using the hipify tool.

There are many more scenarios depending on your requirements/plans and the specific hardware you want to use, i.e. Nvidia or AMD GPUs, which GPU generation, and how much GPU memory (often the showstopper). And, depending on your programming skills, why not go for hybrid computations using the CPU and GPU cores?


Tell us a little more about what exactly you want to compute and on which GPU models.

Klaus

February 10, 2019, 23:49   #4
krishna kant (kk415), Senior Member
Hello Klaus,

Sorry for the late reply.

I would like to port the entire solver (icoFoam) onto Nvidia Tesla K20 and K80 GPUs.
I am aware of and have used the linear solver libraries, i.e. cufflink,
but I also want to get rid of the time spent transferring data to the GPU. So I would like to port the whole solver onto the GPU using the CUDA programming model.

February 11, 2019, 03:57   #5
W.T (dybuk), Member
Quote:
Originally Posted by kk415 View Post
Hello Foamers,
[...] I did some google search on this topic and came across RapidCFD(this is good but not an opensource)
What do you mean by "it's not open source"?
In the RapidCFD GitHub repo there is a COPYING file which contains the GNU GPL license. Also on the sim-flow (developers of RapidCFD) website (https://sim-flow.com/rapid-cfd-gpu/) I've found:
Quote:
RapidCFD is distributed by SIMFLOW Technologies and is freely available and open source, licensed under the GNU General Public Licence. This offering is not approved or endorsed by ESI-Group, the producer of the OpenFOAM® software and owner of the OPENFOAM® trade mark.

February 11, 2019, 05:15   #6
Klaus (klausb), Senior Member
Hello Krishna,

the COPYING file in https://github.com/Atizar/RapidCFD-dev refers to GPL 3.0. Does this version include the 100,000-node limit of the free version, or is this the free version?

Here's an explanation how icoFoam works: https://openfoamwiki.net/index.php/IcoFoam

An alternative approach could be to use MAGMA functionality to implement the above, as there are MAGMA solvers and BLAS routines working with multiple GPUs. This means you would convert the LDU matrix to CSR format as a starting point. I think MAGMA would relieve you from dealing with multiple GPUs, MPI and domain decomposition, and handle that for you.
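A minimal sketch of that LDU-to-CSR step, written against plain arrays so it stays self-contained; in OpenFOAM the inputs would come from lduMatrix::diag(), upper(), lower() and the lduAddr().lowerAddr()/upperAddr() addressing (accessor names worth double-checking against your version). Boundary/interface contributions and the global indices needed for multi-GPU runs are left out:

Code:
// Minimal single-process LDU -> CSR conversion sketch (boundary/interface
// coefficients and global indices for parallel runs are NOT handled here).
#include <vector>
#include <utility>
#include <algorithm>
#include <cstddef>

struct CSRMatrix
{
    std::vector<int>    rowPtr;   // size nCells + 1
    std::vector<int>    colInd;   // size nnz, sorted within each row
    std::vector<double> val;      // size nnz
};

CSRMatrix lduToCsr
(
    int nCells,
    const std::vector<double>& diag,   // diagonal coefficients, one per cell
    const std::vector<double>& upper,  // upper-triangle coefficients, one per face
    const std::vector<double>& lower,  // lower-triangle coefficients, one per face
    const std::vector<int>&    lAddr,  // row index of each upper coefficient
    const std::vector<int>&    uAddr   // column index of each upper coefficient
)
{
    // Collect (column, value) pairs per row, starting with the diagonal.
    std::vector<std::vector<std::pair<int, double>>> rows(nCells);
    for (int i = 0; i < nCells; ++i)
    {
        rows[i].emplace_back(i, diag[i]);
    }
    for (std::size_t f = 0; f < upper.size(); ++f)
    {
        rows[lAddr[f]].emplace_back(uAddr[f], upper[f]);  // upper triangle
        rows[uAddr[f]].emplace_back(lAddr[f], lower[f]);  // lower triangle
    }

    // Flatten into CSR with columns sorted within each row.
    CSRMatrix csr;
    csr.rowPtr.assign(nCells + 1, 0);
    for (int i = 0; i < nCells; ++i)
    {
        std::sort(rows[i].begin(), rows[i].end());
        csr.rowPtr[i + 1] = csr.rowPtr[i] + static_cast<int>(rows[i].size());
        for (const auto& entry : rows[i])
        {
            csr.colInd.push_back(entry.first);
            csr.val.push_back(entry.second);
        }
    }
    return csr;
}

For a symmetric matrix the lower coefficients simply equal the upper ones, so lower() can be replaced by upper().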

The paper "Complete PISO and SIMPLE solvers on Graphics Processing Units" might give you some hints too.

Let me know how you plan to approach it and I'll have another look into my archive.

Klaus

February 11, 2019, 05:26   #7
W.T (dybuk), Member
To "clear things out" there are two products develpoed by simflow:
*simFlow - an OpenFOAM GUI, it has two licensing options FREE(up to 100 000 nodes, up to 2 or 4 cores) and PRO(unlimited mesh size etc)
*rapidCFD - gpu port of openfoam 2.3 - opensource (GNU GPL) and doesn't have any limits regarding mesh size etc.

June 4, 2019, 12:02   #8
Agustín Villa (agustinvo), Senior Member
Hi,

do you have any news about this topic? I would like to do some tests on a graphics card I have. I was thinking of using RapidCFD and comparing it with standard OpenFOAM.

August 25, 2020, 16:20   #9
Tobias Holzmann (Tobi), Super Moderator
Hi Klaus, just one question as I came across your post. Have you already used OpenFOAM with the matrix calculations done on the GPU? I would be highly interested in that.

E.g. I have a 32-core machine and I am going to buy an Nvidia 30xx GPU card soon. Hence, I would be interested in using the GPU cores for the matrix operations in the Foundation version.

If I got you correctly, I would need to check the cufflink library of the foam-extend project and port it to the Foundation version, right? However, I cannot believe that it is so simple without any bottleneck, but I am not an expert in this topic.
__________________
Keep foaming,
Tobias Holzmann

Last edited by Tobi; August 26, 2020 at 04:26.

August 26, 2020, 05:20   #10
Klaus (klausb), Senior Member
Hi Tobi,

let me start with some general background information. There's no need to port the cufflink library; it was one of the first implementations many years ago, hence it's well known, but it is not the latest approach. Implementations work with both mainstream OpenFOAM versions, OF7 and OF1812, the ones I last worked with. You're also NOT limited to Nvidia GPUs; AMD GPUs are fine as well! The upcoming Founders Edition card from Nvidia is probably not a good choice, as it's unlikely that it will support fp64 / double precision computations. Think of the "AMD Radeon Pro VII", or two of them.

Many people who worked on GPU implementations did it commercially so they didn't share implementation details. With GPU computations becoming mainstream, this situation has changed.

There are three concepts for an implementation:

1: Solve the linear system on the GPU ("transfer" the matrix solvers PCG, BiCGStab... to the GPU - in practice, use GPU solver libraries)
2: Use the GPU as an additional "big" core (create hybrid CPU/GPU matrix solvers as suggested by Amani AlOnazi in her 2013 thesis, again PCG...)
3: Move the entire solver (simpleFOAM, pisoFOAM...) to the GPU as it's done in RapidCFD to avoid extensive, slow data transfers between CPU and GPU

Pros and Cons:

1: "easier to implement" but often limited by GPU memory size and data transfer speed, PCIE 4.0 should help with performance as data transfers should be less of a bottleneck
2: Probably best for engineering workstations with many CPU cores using the GPU as a "turbo booster", needs a hybrid CPU/GPU solver implementation from scratch
3: Removes most of the data transfer bottleneck as updates are done on the GPU, but probably not worth the effort with PCIE 5.0/6.0 and CPU/GPU data coherency on the horizon

Concept 1 is probably not worth the effort for you, as you would simply transfer computations from the 32-core CPU to a GPU and the 32 CPU cores would be unused. If you invest in a dual-socket motherboard and another CPU, you might end up with a cheaper, faster solution.

Implementation:

Concept 1: Convert the matrix to CSR format (with global indices!), then solve the linear system with the solvers provided by one of the following LA libraries: MAGMA or hipMAGMA (for AMD), rocALUTION (for AMD, which can also be compiled for Nvidia as it's coded in HIP), AMGCL, GASPI's linear algebra solvers, maybe also AmgX. The OpenMP backends for CPU computations are usually a lot slower than using MPI, while GPU computations are "automatic", as that's what the libraries are designed for. Most of your CPU cores will be wasted; usually 1 CPU core supports 1 GPU, which is why GPU servers in data centers have low-core-count CPUs. Engineering workstations are different.
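To make the "use a solver library" step concrete, here is a minimal sketch along the lines of the AMGCL getting-started example (header and template names from memory, so check them against the AMGCL documentation). The builtin backend shown runs on the CPU; swapping the Backend typedef for AMGCL's CUDA or VexCL backend is how the same solve is moved to a GPU:

Code:
#include <vector>
#include <tuple>

#include <amgcl/backend/builtin.hpp>
#include <amgcl/adapter/crs_tuple.hpp>
#include <amgcl/make_solver.hpp>
#include <amgcl/amg.hpp>
#include <amgcl/coarsening/smoothed_aggregation.hpp>
#include <amgcl/relaxation/spai0.hpp>
#include <amgcl/solver/bicgstab.hpp>

// Solve A x = rhs for a CSR matrix (n rows, ptr/col/val) with AMG-preconditioned BiCGStab.
int solveCsr
(
    int n,
    const std::vector<int>&    ptr,
    const std::vector<int>&    col,
    const std::vector<double>& val,
    const std::vector<double>& rhs,
    std::vector<double>&       x      // initial guess in, solution out
)
{
    typedef amgcl::backend::builtin<double> Backend;   // CPU backend; a GPU backend goes here instead

    typedef amgcl::make_solver<
        amgcl::amg<
            Backend,
            amgcl::coarsening::smoothed_aggregation,
            amgcl::relaxation::spai0
        >,
        amgcl::solver::bicgstab<Backend>
    > Solver;

    Solver solve(std::tie(n, ptr, col, val));   // setup: AMG hierarchy is built here

    int iters = 0;
    double error = 0;
    std::tie(iters, error) = solve(rhs, x);     // actual solve

    return iters;
}

The other libraries listed above follow the same basic pattern: hand over the CSR arrays, build the preconditioner, solve.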

Concept 2: I have started looking deeper into that but need to upgrade my hardware to move forward. The idea is to use the GPU as a big core or "turbo booster". The above-mentioned LA libraries don't support hybrid CPU/GPU backend usage. Load balancing will be key for good performance; a performance factor comparing CPU core performance with GPU performance could be a simple solution, applied as an uneven matrix decomposition. One core per GPU will be needed to "support" the GPU, and the coding challenge is how to link/integrate/interface the "big core" (supporting CPU core) <-> GPU computations with the OpenFOAM MPI communication of the remaining CPU cores.
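As a toy illustration of such a performance factor (my own example, not from any library): if the GPU is measured to be, say, 40x faster than a single CPU core on the relevant kernels, the matrix rows could be split unevenly like this:

Code:
#include <vector>
#include <iostream>

// Split nRows between one GPU and nCpuCores CPU cores, where the GPU is
// gpuFactor times faster than a single CPU core (factor measured beforehand).
// Index 0 of the result is the GPU's share, indices 1..nCpuCores the CPU cores'.
std::vector<int> splitRows(int nRows, int nCpuCores, double gpuFactor)
{
    const double totalWeight = gpuFactor + nCpuCores;
    std::vector<int> share(nCpuCores + 1, 0);

    share[0] = static_cast<int>(nRows * gpuFactor / totalWeight);

    const int remaining = nRows - share[0];
    for (int i = 1; i <= nCpuCores; ++i)
    {
        share[i] = remaining / nCpuCores + (i <= remaining % nCpuCores ? 1 : 0);
    }
    return share;
}

int main()
{
    // Hypothetical case: 20 million cells, 31 CPU cores doing matrix work
    // (one core feeds the GPU), GPU assumed ~40x faster than one CPU core.
    for (int rows : splitRows(20000000, 31, 40.0))
    {
        std::cout << rows << '\n';
    }
}

In practice the factor would be measured on the actual kernels (SpMV, preconditioner application) rather than guessed, but the decomposition arithmetic stays this simple.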

Concept 3: Use RapidCFD. I see a high risk that this development work will be a waste of time, with faster PCIE and CPU/GPU data coherency on the horizon.

Klaus

August 26, 2020, 08:50   #11
Tobias Holzmann (Tobi), Super Moderator
Hi Klaus,

thank you for your comprehensive reply regarding GPU implementation and OpenFOAM (or CFD in general, or CPU/GPU in general).

What is your conclusion regarding solving the momentum and pressure equations as a coupled system on the GPU rather than the CPU? Shouldn't it be much faster? Okay, I am not talking about test cases with 500,000 cells, more in the direction of > 20,000,000 cells. Out of the box, without any experience, I would expect that solving the linear system on the GPU should be faster, as the architecture of the GPU is already made for parallel calculations compared to CPUs.


However, you are right, the 32-core AMD TR is a good one. Probably one has more benefit using either two 32-core CPUs on one motherboard or a 64-core one out of the box, as there is no need for new programming.


I will think about that but I will probably buy the founders 3090 card as I am also doing renderings and X-Ray-Tracing stuff.

Tobi
__________________
Keep foaming,
Tobias Holzmann

August 26, 2020, 10:37   #12
Klaus (klausb), Senior Member
Hi Tobi,

I don't have any benchmarks with recent hardware and PCIE 4.0 comparing a recent 16- or better 32-core CPU with a recent PCIE 4.0 GPU like the AMD Radeon Pro VII. Real benefits for workstations should come from concept 2; concept 1 will be beneficial for cases of a GPU-hardware-specific size that fit into GPU memory and where the memory size is in tune with the PCIE speed. Asynchronous data transfer and other features that come with the latest LA libraries are also beneficial. When PCIE 2.0 was standard, speedups of 1.2x-1.8x compared to the old 6-core XEONs (or two of them), in rare cases up to 3.5x, were realistic... Claims of speedups of 10x...40x were comparisons between 4-core gaming processors with only 2 memory channels and a then professional GPU like the Nvidia K80, or they neglected the slow PCIE data transfer, comparing CPU hardware performance to GPU hardware performance but not the time needed to solve a CFD simulation.

GPU computation should improve stability as reductions on the GPU produce smaller errors. There are many forms of hybridisation of CPU/GPU computations that could be beneficial but the cost/benefit calculation for an expensive GPU in an OpenFOAM workstation is difficult. Nowadays you can buy a lot of CPU performance for the price of a GPU and the CPU should be more versatile for a wider range of cases.

If you want to leverage the 3080, check whether your cases are suitable for mixed-precision fp64/fp32 computations, but that's a different topic.

Klaus

September 14, 2020, 09:08   #13
Klaus (klausb), Senior Member
@Tobi,

to use OpenFOAM with a new Nvidia card and your 32 core processor, you could try the following:

Install "PETSc4FOAM" (a library to plug-in PETSc into the OpenFOAM framework) and extend it with the "AmgXWrapper" for PETSc.

A nice feature of the "AmgXWrapper" is:

"... when the number of MPI processes is greater than the number of GPU devices, this wrapper will do the system consolidation/data scattering/data gathering automatically. ..."


See also "Integrating OpenFOAM and GPUs using amgX" by Rathnayake, T where you can find details about the AMGX setup which might be helpful.

It would be great if you could provide a benchmark.

Klaus

December 24, 2020, 12:48   #14
kenan (kcengiz), New Member
Hi Klaus,
I have been experimenting with the cudasolvers of foam-extend-4.0 (where the linear systems are solved on the GPU only). Naturally, I imagined 1 GPU needs one core, and I set everything up like that. As I observed that the GPU memory was far from full, and the rest of the CPU cores were wasted, I wanted to try more MPI tasks per GPU. Surprisingly it worked, and the resulting performance seemed to be better. Is this approach something advised against? I don't know anything about the essence of CPU-GPU memory copying, but I wonder whether the MPI tasks end up competing for the CPU-GPU bandwidth.

Note: this approach was still not good enough for me, because I could not manage to use the two GPU cards in the machine. All the MPI tasks happened to use only the first card simultaneously.
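Regarding the second point, the usual trick (not specific to foam-extend, and the cudasolvers code may or may not already expose a setting for it) is to have every MPI rank select its own device before the first CUDA allocation, roughly like this:

Code:
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

// Bind each MPI rank to one of the GPUs visible on the node (round robin).
// Must run after MPI_Init and before any other CUDA call.
void bindRankToGpu()
{
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);

    if (nDevices > 0)
    {
        const int device = rank % nDevices;   // with two cards: even ranks -> GPU 0, odd ranks -> GPU 1
        cudaSetDevice(device);
        std::printf("rank %d -> GPU %d of %d\n", rank, device, nDevices);
    }
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    bindRankToGpu();
    // ... set up and run the GPU solver as usual ...
    MPI_Finalize();
    return 0;
}

On a single node the global rank is enough; across several nodes you would use the node-local rank instead (e.g. via MPI_Comm_split_type with MPI_COMM_TYPE_SHARED, or the launcher's local-rank environment variable). Several MPI ranks per GPU do share the PCIe link and the device, so the competition you suspect is real, but as you observed it can still pay off as long as the GPU is underutilised.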







Quote:
Originally Posted by klausb View Post (see post #10 above)

December 26, 2020, 04:47   #15
Klaus (klausb), Senior Member
Hello,

I think Foam-extend still uses the CUFFLINK library.

Check here for how to enable multi-GPU mode: https://code.google.com/archive/p/cu...artedPage.wiki

Sections "Multi-GPU Implementation" and "Possible changes in compute mode".

If you find the time, maybe you could run and share a benchmark of a larger tutorial case like the motorBike tutorial on a current CPU vs your GPU. It would especially be interesting if you have a system supporting PCIE 4.0.

Klaus

December 5, 2022, 11:46   #16
Integrating OpenFOAM and GPUs using AmgX by Rathnayake, T.
Dcn, New Member
Quote:
Originally Posted by klausb View Post (see post #13 above)
Hi Klaus, can you kindly tell me how to get "Integrating OpenFOAM and GPUs using AmgX" by Rathnayake, T.?
Thanks

January 16, 2023, 14:08   #17
Integrating OpenFOAM and GPUs using AmgX
Klaus (klausb), Senior Member
See:

https://github.com/barbagroup/AmgXWrapper

AmgXWrapper library details from the Author:

https://on-demand.gputechconf.com/gt...-cfd-codes.pdf