how do we unify cpu+gpu development?

aerosayan · March 25, 2022, 02:25

Hello everyone,

Developing separate code bases for CPUs and GPUs might become a limiting factor in the future.

OpenHyperFLOW2D, for example, has separate branches set up for its CPU and CUDA code: https://github.com/sergeas67/OpenHyperFLOW2D

Maintaining such a codebase becomes difficult in the long run, and if possible it would be good to have the same code run on both CPUs and GPUs.

Is this a worthwhile effort?

Another dev asked this same question before: https://stackoverflow.com/questions/9631833/ and the main answer stated that, due to the architectural differences in the hardware, the code needs to be optimized for the device it runs on.

That's true. The differences will likely become even more severe as GPU architecture evolves. CPU architecture has been more or less stable for one or two decades, but GPU architecture is still changing continuously, and that will push the coding strategies for the two even further apart.

So I don't know if it's possible to have the same code run on CPUs and GPUs.

Should we just stick with MPI/OpenMP-based parallelization for CPUs?

OpenMP can now technically be used to program GPUs: https://stackoverflow.com/questions/28962655/ but since this new feature is not widely used today, we still don't know what the limiting factors are.
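To make the OpenMP option concrete, here is a minimal sketch of what a single-source loop could look like with the `target` construct (OpenMP 4.0 and later). The function name `add_arrays_omp` is made up for illustration. Whether the loop really runs on a GPU depends on the compiler and its offload flags, which vary and are not shown here; when built without OpenMP support the pragma is simply ignored and the loop runs serially on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: element-wise c[i] = a[i] + b[i].
// With an offload-capable OpenMP compiler the loop may execute on a GPU.
// Without OpenMP the pragma is ignored and this is a plain serial CPU loop.
std::vector<float> add_arrays_omp(const std::vector<float>& a,
                                  const std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<float> c(n);

    // Raw pointers are used because map clauses work on array sections.
    const float* pa = a.data();
    const float* pb = b.data();
    float* pc = c.data();

    // Copy the inputs to the device, run the loop there, copy the result back.
    #pragma omp target teams distribute parallel for \
        map(to: pa[0:n], pb[0:n]) map(from: pc[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        pc[i] = pa[i] + pb[i];
    }
    return c;
}
```

The same source file serves both targets, but the explicit `map` clauses already hint at one likely limiting factor: host-to-device data movement has to be managed by hand.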

Currently I can only see one strategy that might be useful: represent the equations in standard BLAS or LAPACK form, and use libraries that are either CPU- or GPU-accelerated.

For example, we can create separate CPU- and GPU-accelerated functions to add two arrays of floats. Then we would just need to call the appropriate function in our code.
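A rough sketch of that idea, assuming a hypothetical `Backend` tag and a single entry point `add_arrays` that the solver code calls. Every name here is invented for illustration. Only the CPU path is implemented; the GPU path is a stub that throws, standing in for a CUDA kernel or an accelerated library call that would live in its own source file.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical backend selector; the names are illustrative only.
enum class Backend { Cpu, Gpu };

// CPU implementation: a plain loop that the compiler can vectorize.
void add_arrays_cpu(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// GPU implementation placeholder. A real build would launch a kernel or
// call a GPU library here; this sketch throws instead of pretending.
void add_arrays_gpu(const float*, const float*, float*, std::size_t) {
    throw std::runtime_error("GPU backend is not built in this sketch");
}

// Single entry point: solver code calls this and never touches
// device-specific code directly.
void add_arrays(Backend backend,
                const std::vector<float>& a,
                const std::vector<float>& b,
                std::vector<float>& c) {
    assert(a.size() == b.size());
    c.resize(a.size());
    switch (backend) {
        case Backend::Cpu:
            add_arrays_cpu(a.data(), b.data(), c.data(), a.size());
            break;
        case Backend::Gpu:
            add_arrays_gpu(a.data(), b.data(), c.data(), a.size());
            break;
    }
}
```

The solver would then call `add_arrays(Backend::Cpu, a, b, c)` or the GPU variant, so the device-specific code stays confined to a small set of kernels like this one.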

I'm looking for other strategies that might be good, but this is the only strategy which looks feasible to me currently. Since we need parallelization to do large amounts of work, it makes sense to write our code to allow better parallelization of these simple but very widely used mathematical methods.

Other examples of such widely used methods, some of them outside BLAS/LAPACK, would be Gauss-Seidel iteration, LU decomposition, Cholesky factorization, Fast Fourier Transforms, etc.
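As an illustration of one such kernel, here is a minimal serial Gauss-Seidel sketch for a small dense system, assuming row-major storage and a hypothetical function name `gauss_seidel`. It is only a CPU reference version. Each row update reads values already updated earlier in the same sweep, so a GPU version would usually need a reordering such as red-black coloring rather than a direct port.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: solve A x = b with Gauss-Seidel sweeps.
// A is dense, row-major, n x n; x holds the initial guess on entry
// and the approximate solution on return.
// Returns the number of sweeps performed (max_sweeps if not converged).
int gauss_seidel(const std::vector<double>& A,
                 const std::vector<double>& b,
                 std::vector<double>& x,
                 int max_sweeps,
                 double tol) {
    const std::size_t n = b.size();
    assert(A.size() == n * n && x.size() == n);

    for (int sweep = 1; sweep <= max_sweeps; ++sweep) {
        double max_change = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            // Uses x[j] values already updated in this sweep for j < i.
            double sum = b[i];
            for (std::size_t j = 0; j < n; ++j) {
                if (j != i) sum -= A[i * n + j] * x[j];
            }
            const double x_new = sum / A[i * n + i];
            max_change = std::max(max_change, std::fabs(x_new - x[i]));
            x[i] = x_new;
        }
        // Stop once no unknown changed by more than tol in a full sweep.
        if (max_change < tol) return sweep;
    }
    return max_sweeps;
}
```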

Is this the correct approach? What can we do better?

Thanks
~sayan

