CFD Online Discussion Forums > OpenFOAM Verification & Validation
OpenFOAM on AMD GPUs. Container from Infinity Hub: user experiences and performance
(https://www.cfd-online.com/Forums/openfoam-verification-validation/248079-openfoam-amd-gpus-container-infinity-hub-user-experiences-performance.html)

alexisespinosa February 23, 2023 03:28

OpenFOAM on AMD GPUs. Container from Infinity Hub: user experiences and performance
 
AMD recently provided an OpenFOAM container capable of running on AMD GPUs.


It is in their Infinity Hub:
https://www.amd.com/en/technologies/...y-hub/openfoam

My questions are:

- What have the community's experiences been using this OpenFOAM container on AMD GPUs?
- Are you seeing worthwhile performance improvements compared with CPU-only solvers?

Thanks a lot,
Alexis

(PS. I will start using it and post my experiences too)

be_inspired February 9, 2024 03:51

Hi,

Were you able to launch any simulation using the GPU version? Is it 100% GPU, or is only the pressure solver run on the GPU?

Do you know whether it would also be compatible with an Nvidia GPU, so I could test it?

Best Regards
Marcelino

mesh-monkey April 28, 2024 09:15

OpenFOAM on AMD GPUs. Container from Infinity Hub: Experiences with Radeon VII
 
Thought I'd share my experiences with this! :D

Unfortunately, my finding with this setup has been that it remains much faster to solve on the CPU than on the GPU. :(

I used the HPC_Motorbike example and code provided by AMD in the Docker container (no longer available at the link above, by the way) as-is on my Radeon VII. For the CPU examples, I modified the run to suit a typical CPU-based set of solvers using the standard tutorial fvSolution files.


Results are as follows. Times shown are simpleFoam total ClockTime to 20 iterations, and time per iteration excluding the first time step:
  • GPU: 473 seconds; 20.8s per iteration
  • CPU, with 'GPU-aligned' solvers: 343 seconds; 16.7s per iteration
  • CPU, with 'normal' solvers: 205 seconds; 9.9s per iteration
Velocity and pressure solvers for each were as follows (a sketch of the corresponding fvSolution entries follows the list):
  • GPU: PETSc-bcgs & PETSc-cg
  • CPU, with 'GPU-aligned' solvers: DILUPBiCGStab & DICPCG
  • CPU, with 'normal' solvers, per tutorial: smoothSolver & GAMG
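
For reference, here is a rough sketch of what the two CPU configurations look like in system/fvSolution. It follows the standard motorBike tutorial layout; the exact tolerances and sweep counts shown here are illustrative assumptions rather than the precise values I ran with:

Code:

// 'normal' CPU solvers (per the tutorial)
p
{
    solver          GAMG;
    smoother        GaussSeidel;
    tolerance       1e-7;
    relTol          0.01;
}

U
{
    solver          smoothSolver;
    smoother        GaussSeidel;
    tolerance       1e-8;
    relTol          0.1;
    nSweeps         1;
}

// 'GPU-aligned' CPU solvers (DICPCG / DILUPBiCGStab written out explicitly)
p
{
    solver          PCG;
    preconditioner  DIC;
    tolerance       1e-7;
    relTol          0.01;
}

U
{
    solver          PBiCGStab;
    preconditioner  DILU;
    tolerance       1e-8;
    relTol          0.1;
}
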
The GPU appears seldom used, with only sporadic spikes in utilisation that barely exceed 40% of the GPU pipe. Most of the time within each iteration seems to be spent doing not very much (I/O maybe?). Unsurprisingly, the first iteration is much longer as the model is read into VRAM, which you can see quite easily, but subsequent iterations are also slower than similar solvers on the CPU. To account for this, I have included the time per iteration over iterations 2-20 to illustrate the per-iteration slowdown.


I get that GPUs are made for large models, but I am already nearly hitting the 16 GB of VRAM even with this model (5,223,573 cells). I can't run the medium-sized model (~9M cells, I think) because I run out of VRAM. :( I'm running this on my desktop PC for fun; I don't even want to know how much faster this would be on my usual solving machine (a 48-core Xeon).



So, in summary, based on my experiences with a Radeon VII and the Small HPC_motorbike case:

  • The GPU is roughly half as fast as the CPU when the CPU uses its native solvers
  • The GPU is around 20-25% slower than the CPU even when the CPU is restricted to the less-efficient 'GPU-aligned' solvers
The next step, I think, is to find more GPUs to test how larger models scale (love an excuse to keep scouring eBay for deals hehe)

Cheers,
Tom

mesh-monkey April 28, 2024 09:23

Quote:

Originally Posted by be_inspired (Post 864420)
Hi,

Were you able to launch any simulation using the GPU version? Is it 100% GPU, or is only the pressure solver run on the GPU?

Do you know whether it would also be compatible with an Nvidia GPU, so I could test it?

Best Regards
Marcelino

100% GPU as far as I'm aware. All solvers are petsc.

The initial run script appears to be flexible enough to support CUDA devices too. I've not dug any deeper, and I don't have a suitable GPU to test with, sorry.

Code:

Available Options: HIP or CUDA
Only HIP is mentioned in the fvSolution file though, so I'd guess that the petsc solver has been tuned for AMD.
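
If anyone wants to try it on an Nvidia card, my guess (untested, and assuming PETSc was built with CUDA support) is that the back-end selection in fvSolution would switch from the HIP types to their CUDA counterparts, e.g.:

Code:

// hypothetical CUDA variant of the PETSc back-end options (untested)
mat_type    mpiaijcusparse;   // instead of mpiaijhipsparse
vec_type    cuda;             // instead of hip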

dlahaye April 28, 2024 13:51

Thanks for your input. Much appreciated.

1/ Can you confirm that the bulk of the CPU time goes into the pressure-solve (independent of CPU vs. GPU)?

2/ How do you precondition PETSc-CG for the pressure solve?

3/ Are you willing to go the extra mile and compare two flavours of PETSc-CG?

Flavour-1: using AMG to precondition PETSc-CG, allowing AMG to do a set-up at each linear system solve.

Flavour-2: using AMG to precondition PETSc-CG (so far identical to Flavour-1), but this time freezing the hierarchy that AMG constructs.

mesh-monkey April 29, 2024 04:42

Quote:

Originally Posted by dlahaye (Post 868431)
Thanks for your input. Much appreciated.

1/ Can you confirm that the bulk of the CPU time goes into the pressure-solve (independent of CPU vs. GPU)?

2/ How do you precondition PETSc-CG for the pressure solve?

3/ Are you willing to go the extra mile and compare two flavours of PETSc-CG?

Flavour-1: using AMG to precondition PETSc-CG, allowing AMG to do a set-up at each linear system solve.

Flavour-2: using AMG to precondition PETSc-CG (so far identical to Flavour-1), but this time freezing the hierarchy that AMG constructs.


1) I don't have a specific clocktime breakdown, but it would appear so, yes.
2) PETSc-CG is preconditioned using PETSc's own GAMG (pc_type gamg):

Code:

p
    {
        solver          petsc;
        petsc
        {             
            options
            {
                ksp_type  cg;
                ksp_cg_single_reduction  true;
                ksp_norm_type none;
                mat_type    mpiaijhipsparse; //HIPSPARSE
                vec_type    hip;

                //preconditioner
                pc_type gamg;
                pc_gamg_type "agg"; // smoothed aggregation                                                                           
                pc_gamg_agg_nsmooths "1"; // number of smooths for smoothed aggregation (not smoother iterations)                     
                pc_gamg_coarse_eq_limit "100";
                pc_gamg_reuse_interpolation true;
                pc_gamg_aggressive_coarsening "2"; //square the graph on the finest N levels
                pc_gamg_threshold "-1"; // increase to 0.05 if coarse grids get larger                                               
                pc_gamg_threshold_scale "0.5"; // thresholding on coarse grids
                pc_gamg_use_sa_esteig true;

                // mg_level config
                mg_levels_ksp_max_it "1"; // use 2 or 4 if problem is hard (i.e stretched grids)
                mg_levels_esteig_ksp_type cg; // KSP used for the Chebyshev eigenvalue estimate

                // coarse solve (indefinite PC in parallel with 2 cores)                                                             
                mg_coarse_ksp_type "gmres";
                mg_coarse_ksp_max_it "2";
       
                // smoother (cheby)                                                                                                   
                mg_levels_ksp_type chebyshev;
                mg_levels_ksp_chebyshev_esteig "0,0.05,0,1.1";
                mg_levels_pc_type "jacobi";
               
            }

            caching
            {
                matrix
                {
                    update always;
                }

                preconditioner
                {
                    //update always;   
                    update periodic;

                    periodicCoeffs
                    {
                        frequency  40;
                    }
                }
            }
        }
        tolerance      1e-07;
        relTol          0.1;
    }

3/ Sure, happy to. I'll need some guidance on how to set those flavours up.

dlahaye April 29, 2024 05:16

Thanks again.

It appears that by setting

Code:

periodicCoeffs
    {
      frequency  40;
    }

you already have a blend between Flavour-1 (frequency 1) and Flavour-2 (frequency infinity). My question has thus been answered.
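
Concretely, the two extremes would presumably be expressed in the petsc4Foam caching dictionary roughly as follows (a sketch only, reusing the option names from your fvSolution above):

Code:

// Flavour-1: rebuild the AMG hierarchy at every linear system solve
preconditioner
{
    update always;
}

// Flavour-2: set the hierarchy up once, then keep it effectively frozen
preconditioner
{
    update periodic;
    periodicCoeffs
    {
        frequency  100000;  // very large period ~ "frequency infinity"
    }
}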

To obtain statistics on the OpenFoam-native GAMG coarsening, insert into system/controlDict the DebugSwitches block quoted at the end of this post.

I have two follow-up questions if you allow.

1/ How does runtime of PETSc-GAMG compare with OpenFoam-native-GAMG (the latter used as a preconditioner to be fair)?
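
(For reference, "OpenFoam-native-GAMG used as a preconditioner" would look roughly like the sketch below in fvSolution, i.e. PCG preconditioned by a GAMG cycle; the tolerances here are illustrative assumptions only.)

Code:

p
{
    solver          PCG;
    preconditioner
    {
        preconditioner  GAMG;
        smoother        GaussSeidel;
        tolerance       1e-2;   // loose inner solve: GAMG acts as a preconditioner only
        relTol          0;
    }
    tolerance       1e-07;
    relTol          0.1;
}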

2/ Do you see statistics of the PETSc-GAMG coarsening printed somewhere? It would be interesting to compare these statistics (in particular the geometric and algebraic complexities) with those of OpenFoam-native-GAMG. The latter can easily be obtained by inserting the debug switches below in system/controlDict:

Code:

// see /opt/OpenFOAM/OpenFOAM-v1906/etc/controlDict for a complete list of DebugSwitches
DebugSwitches
{
    GAMG                2;
    GAMGAgglomeration   0;
    GAMGInterface       0;
    GAMGInterfaceField  0;
    GaussSeidel         0;
    fvScalarMatrix      0;
    lduMatrix           0;
    lduMesh             0;
}

