CFD Online Discussion Forums > OpenFOAM Verification & Validation
OpenFOAM on AMD GPUs. Container from Infinity Hub: user experiences and performance
(https://www.cfd-online.com/Forums/openfoam-verification-validation/248079-openfoam-amd-gpus-container-infinity-hub-user-experiences-performance.html)

alexisespinosa February 23, 2023 03:28

OpenFOAM on AMD GPUs. Container from Infinity Hub: user experiences and performance
 
AMD recently provided an OpenFOAM container capable of running on AMD GPUs.


It is in their Infinity Hub:
https://www.amd.com/en/technologies/...y-hub/openfoam

My questions are:

- What have the community's experiences been using this OpenFOAM container on AMD GPUs?
- Are you seeing worthwhile performance improvements compared with CPU-only solvers?

Thanks a lot,
Alexis

(PS. I will start using it and post my experiences too)

be_inspired February 9, 2024 03:51

Hi,

Were you able to launch any simulation using the GPU version? Is it 100% GPU, or is only the pressure solver run on the GPU?

Do you know whether it would also be compatible with an Nvidia GPU, so I could test it?

Best Regards
Marcelino

mesh-monkey April 28, 2024 09:15

OpenFOAM on AMD GPUs. Container from Infinity Hub: Experiences with Radeon VII
 
Thought I'd share my experiences with this! :D

Unfortunately, my finding with this setup has been that it remains much faster to solve on the CPU than on the GPU. :(

I used the HPC_Motorbike example and code provided by AMD in the Docker container (no longer available at the link above, by the way) as-is on my Radeon VII. For the CPU examples, I modified the run to suit a typical CPU-based set of solvers using the standard tutorial fvSolution files.


Results are as follows. Times shown are simpleFoam total ClockTime to 20 iterations, and time per iteration excluding the first time step:
  • GPU: 473 seconds; 20.8s per iteration
  • CPU, with 'GPU-aligned' solvers: 343 seconds; 16.7s per iteration
  • CPU, with 'normal' solvers: 205 seconds; 9.9s per iteration
Velocity and pressure solvers for each were as follows (a sketch of the corresponding fvSolution entries follows the list):
  • GPU: PETSc-bcgs & PETSc-cg
  • CPU, with 'GPU-aligned' solvers: DILUPBiCGStab & DICPCG
  • CPU, with 'normal' solvers, per tutorial: smoothSolver & GAMG
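
For reference, here is a rough sketch of what the two CPU configurations look like in system/fvSolution. It follows the standard motorBike tutorial layout; the exact tolerances and sweep counts shown here are illustrative assumptions rather than the precise values I ran with:

Code:

// 'normal' CPU solvers (per the tutorial)
p
{
    solver          GAMG;
    smoother        GaussSeidel;
    tolerance       1e-7;
    relTol          0.01;
}

U
{
    solver          smoothSolver;
    smoother        GaussSeidel;
    tolerance       1e-8;
    relTol          0.1;
    nSweeps         1;
}

// 'GPU-aligned' CPU solvers (DICPCG / DILUPBiCGStab written out explicitly)
p
{
    solver          PCG;
    preconditioner  DIC;
    tolerance       1e-7;
    relTol          0.01;
}

U
{
    solver          PBiCGStab;
    preconditioner  DILU;
    tolerance       1e-8;
    relTol          0.1;
}
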
The GPU appears seldom used, with only sporadic spikes in utilisation that barely exceed 40% of the GPU pipe. Most of the time within each iteration seems to be spent doing not very much (I/O maybe?). Unsurprisingly, the first iteration is much longer as the model is read into VRAM, which you can see quite easily, but subsequent iterations are also slower than similar solvers on the CPU. To account for this, I have included the time per iteration over iterations 2-20 to illustrate the per-iteration slowdown.


I get that GPUs are made for large models, but I am already nearly hitting the 16 GB of VRAM even with this model (5,223,573 cells). I can't run the medium-sized model (~9M cells, I think) because I run out of VRAM. :( I'm running this on my desktop PC for fun; I don't even want to know how much faster this would be on my usual solving machine (a 48-core Xeon).



So, in summary, based on my experiences with a Radeon VII and the Small HPC_motorbike case:

  • The GPU is roughly half as fast as the CPU when the CPU uses its native solvers
  • The GPU is around 20-25% slower than the CPU even when the CPU is restricted to the less-efficient 'GPU-aligned' solvers
The next step, I think, is to find more GPUs to test how larger models scale (love an excuse to keep scouring eBay for deals hehe)

Cheers,
Tom

mesh-monkey April 28, 2024 09:23

Quote:

Originally Posted by be_inspired (Post 864420)
Hi,

Were you able to launch any simulation using the GPU version? Is it 100% GPU, or is only the pressure solver run on the GPU?

Do you know whether it would also be compatible with an Nvidia GPU, so I could test it?

Best Regards
Marcelino

100% GPU as far as I'm aware. All solvers are petsc.

The initial run script appears to be flexible enough to support CUDA devices too. I've not dug any deeper, and I don't have a suitable GPU to test with, sorry.

Code:

Available Options: HIP or CUDA
Only HIP is mentioned in the fvSolution file though, so I'd guess that the petsc solver has been tuned for AMD.
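
If anyone wants to try it on an Nvidia card, my guess (untested, and assuming PETSc was built with CUDA support) is that the back-end selection in fvSolution would switch from the HIP types to their CUDA counterparts, e.g.:

Code:

// hypothetical CUDA variant of the PETSc back-end options (untested)
mat_type    mpiaijcusparse;   // instead of mpiaijhipsparse
vec_type    cuda;             // instead of hip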

dlahaye April 28, 2024 13:51

Thanks for your input. Much appreciated.

1/ Can you confirm that the bulk of the CPU time goes into the pressure-solve (independent of CPU vs. GPU)?

2/ How do you precondition PETSc-CG for the pressure solve?

3/ Are you willing to go the extra mile and compare two flavours of PETSc-CG?

Flavour-1: using AMG to precondition PETSc-CG, allowing AMG to do a set-up at each linear system solve.

Flavour-2: using AMG to precondition PETSc-CG (so far identical to Flavour-1), but this time freezing the hierarchy that AMG constructs.

mesh-monkey April 29, 2024 04:42

Quote:

Originally Posted by dlahaye (Post 868431)
Thanks for your input. Much appreciated.

1/ Can you confirm that the bulk of the CPU time goes into the pressure-solve (independent of CPU vs. GPU)?

2/ How do you precondition PETSc-CG for the pressure solve?

3/ Are you willing to go the extra mile and compare two flavours of PETSc-CG?

Flavour-1: using AMG to precondition PETSc-CG, allowing AMG to do a set-up at each linear system solve.

Flavour-2: using AMG to precondition PETSc-CG (so far identical to Flavour-1), but this time freezing the hierarchy that AMG constructs.


1) I don't have a specific clocktime breakdown, but it would appear so, yes.
2) PETSc-CG is preconditioned using PETSc's own GAMG (pc_type gamg):

Code:

p
    {
        solver          petsc;
        petsc
        {             
            options
            {
                ksp_type  cg;
                ksp_cg_single_reduction  true;
                ksp_norm_type none;
                mat_type    mpiaijhipsparse; //HIPSPARSE
                vec_type    hip;

                //preconditioner
                pc_type gamg;
                pc_gamg_type "agg"; // smoothed aggregation                                                                           
                pc_gamg_agg_nsmooths "1"; // number of smooths for smoothed aggregation (not smoother iterations)                     
                pc_gamg_coarse_eq_limit "100";
                pc_gamg_reuse_interpolation true;
                pc_gamg_aggressive_coarsening "2"; //square the graph on the finest N levels
                pc_gamg_threshold "-1"; // increase to 0.05 if coarse grids get larger                                               
                pc_gamg_threshold_scale "0.5"; // thresholding on coarse grids
                pc_gamg_use_sa_esteig true;

                // mg_level config
                mg_levels_ksp_max_it "1"; // use 2 or 4 if problem is hard (i.e stretched grids)
                mg_levels_esteig_ksp_type cg; // KSP used for the Chebyshev eigenvalue estimate

                // coarse solve (indefinite PC in parallel with 2 cores)                                                             
                mg_coarse_ksp_type "gmres";
                mg_coarse_ksp_max_it "2";
       
                // smoother (cheby)                                                                                                   
                mg_levels_ksp_type chebyshev;
                mg_levels_ksp_chebyshev_esteig "0,0.05,0,1.1";
                mg_levels_pc_type "jacobi";
               
            }

            caching
            {
                matrix
                {
                    update always;
                }

                preconditioner
                {
                    //update always;   
                    update periodic;

                    periodicCoeffs
                    {
                        frequency  40;
                    }
                }
            }
        }
        tolerance      1e-07;
        relTol          0.1;
    }

3/ Sure, happy to. I'll need some guidance on how to set those flavours up.

dlahaye April 29, 2024 05:16

Thanks again.

It appears that by setting

Code:

periodicCoeffs
    {
      frequency  40;
    }

you already have a blend between Flavour-1 (frequency 1) and Flavour-2 (frequency infinity). My question has thus been answered.
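
Concretely, the two extremes would presumably be expressed in the petsc4Foam caching dictionary roughly as follows (a sketch only, reusing the option names from your fvSolution above):

Code:

// Flavour-1: rebuild the AMG hierarchy at every linear system solve
preconditioner
{
    update always;
}

// Flavour-2: set the hierarchy up once, then keep it effectively frozen
preconditioner
{
    update periodic;
    periodicCoeffs
    {
        frequency  100000;  // very large period ~ "frequency infinity"
    }
}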

To obtain statistics on the OpenFoam-native GAMG coarsening, insert into system/controlDict the DebugSwitches block quoted at the end of this post.

I have two follow-up questions if you allow.

1/ How does runtime of PETSc-GAMG compare with OpenFoam-native-GAMG (the latter used as a preconditioner to be fair)?
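
(For reference, "OpenFoam-native-GAMG used as a preconditioner" would look roughly like the sketch below in fvSolution, i.e. PCG preconditioned by a GAMG cycle; the tolerances here are illustrative assumptions only.)

Code:

p
{
    solver          PCG;
    preconditioner
    {
        preconditioner  GAMG;
        smoother        GaussSeidel;
        tolerance       1e-2;   // loose inner solve: GAMG acts as a preconditioner only
        relTol          0;
    }
    tolerance       1e-07;
    relTol          0.1;
}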

2/ Do you see statistics of the PETSc-GAMG coarsening printed somewhere? It would be interesting to compare these statistics (in particular the geometric and algebraic complexities) with those of OpenFoam-native-GAMG. The latter can easily be obtained by inserting the debug switches below in system/controlDict:

Code:

// see /opt/OpenFOAM/OpenFOAM-v1906/etc/controlDict for a complete list of DebugSwitches
DebugSwitches
{
    GAMG                2;
    GAMGAgglomeration   0;
    GAMGInterface       0;
    GAMGInterfaceField  0;
    GaussSeidel         0;
    fvScalarMatrix      0;
    lduMatrix           0;
    lduMesh             0;
}

