CFD Online Discussion Forums > Main CFD Forum
how do we unify cpu+gpu development? (https://www.cfd-online.com/Forums/main/241885-how-do-we-unify-cpu-gpu-development.html)

aerosayan March 25, 2022 02:25

how do we unify cpu+gpu development?
 
Hello everyone,

Developing separate code bases for CPUs and GPUs might be a limiting factor in the future.

OpenHyperFLOW2D, for example, has separate branches set up for its CPU and CUDA code: https://github.com/sergeas67/OpenHyperFLOW2D

Maintaining such a codebase becomes difficult in the long run, and if possible it would be good to have the same code run on both CPUs and GPUs.

Is this a worthwhile effort?

Another dev asked this same question before: https://stackoverflow.com/questions/9631833/ and the main answer stated that, due to hardware architectural differences, the code needs to be optimized for the device it runs on.

That's true. The differences will only become more severe as GPU architecture evolves. CPU architecture has been more or less stable for one or two decades, but GPU architecture is still changing rapidly, and that will push the coding strategies for the two even further apart.

So I don't know if it's possible to have the same code run on CPUs and GPUs.

Should we just stick with MPI/OpenMP based parallelization for CPUs?

OpenMP can now technically be used for programming GPUs: https://stackoverflow.com/questions/28962655/ but since this feature is not widely used today, we still don't know what the limiting factors are.
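For example, a vector add with OpenMP's target directives would look roughly like this (untested sketch; whether the loop actually runs on the GPU depends entirely on the compiler and its offload flags, otherwise it falls back to the host):

Code:

#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
    float* pa = a.data();
    float* pb = b.data();
    float* pc = c.data();

    // Offload the loop to the default device if one is available;
    // map() moves the data to/from the device.
    #pragma omp target teams distribute parallel for \
        map(to: pa[0:n], pb[0:n]) map(from: pc[0:n])
    for (int i = 0; i < n; ++i)
        pc[i] = pa[i] + pb[i];

    std::printf("c[0] = %f\n", pc[0]); // expect 3.0
    return 0;
}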

Currently I can only see one strategy that might be useful: represent the equations in standard BLAS or LAPACK form, and use libraries that are CPU- or GPU-accelerated.

For example, we can create separate CPU- and GPU-accelerated functions to add two arrays of floats. Then we would just need to call the appropriate function in our code.
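A minimal sketch of what I mean, assuming a CBLAS implementation (e.g. OpenBLAS) on the CPU side and cuBLAS on the GPU side, with error handling left out and the back end chosen by a simple boolean:

Code:

#include <vector>
#include <cstdio>
#include <cblas.h>          // CPU BLAS (e.g. OpenBLAS)
#include <cuda_runtime.h>   // GPU path
#include <cublas_v2.h>

// y = a*x + y on the host, using CBLAS
void saxpy_cpu(int n, float a, const float* x, float* y) {
    cblas_saxpy(n, a, x, 1, y, 1);
}

// y = a*x + y on the device, using cuBLAS (naive: copies data every call)
void saxpy_gpu(int n, float a, const float* x, float* y) {
    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    cublasSaxpy(h, n, &a, dx, 1, dy, 1);
    cublasDestroy(h);

    cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
}

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    bool use_gpu = false;   // pick the back end however you like (flag, env var, ...)
    if (use_gpu) saxpy_gpu(n, 3.0f, x.data(), y.data());
    else         saxpy_cpu(n, 3.0f, x.data(), y.data());
    std::printf("y[0] = %f\n", y[0]); // expect 5.0
    return 0;
}

Of course the GPU path here naively copies the arrays back and forth on every call, which is exactly the kind of device-specific performance concern that a real code would have to manage differently on each architecture.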

I'm looking for other strategies that might be good, but this is the only one that looks feasible to me currently. Since we need parallelization to do large amounts of work, it makes sense to structure our code around better parallelization of these simple but very widely used mathematical operations.

Some other examples of such widely used methods, beyond BLAS/LAPACK, would be Gauss-Seidel iteration, LU decomposition, Cholesky factorization, Fast Fourier Transforms, etc.

Is this the correct approach? What can we do better?

Thanks
~sayan

sbaffini March 25, 2022 04:02

My totally uninformed point of view is that if Fortran was able to bake distributed parallelism into the language, then the real issue for GPUs is not one of language. Indeed, NVIDIA has CUDA Fortran which, roughly speaking, is not that different from an OpenMP-like approach.

The issue, as you recognize, is about performance on a given piece of hardware. I don't think it is really required (or advisable) to actually split repositories as of today. Still, you may want to retain the capability of fine-tuning your application for a given piece of hardware with some sort of compilation flag, in and out of the code.
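Something along these lines, just as a sketch (GPU_OFFLOAD is a made-up flag name; the point is that one repository can carry both paths and you select and tune per target at build time):

Code:

#include <cstddef>

void residual_update(double* r, const double* dx, std::size_t n) {
#if defined(GPU_OFFLOAD)
    // GPU build: offload the loop (compile with -DGPU_OFFLOAD plus the
    // compiler's OpenMP offload options for your device).
    #pragma omp target teams distribute parallel for map(tofrom: r[0:n]) map(to: dx[0:n])
#else
    // CPU build: plain multi-threaded loop.
    #pragma omp parallel for
#endif
    for (std::size_t i = 0; i < n; ++i)
        r[i] += dx[i];
}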

I don't think a library really solves the problem in the question. What if I work on LAPACK? Is there some other library for me to leverage? And the guy working on that library? At some point, someone has to do the heavy lifting.

aerosayan March 25, 2022 04:19

Quote:

Originally Posted by sbaffini (Post 824758)
I don't think a library really solves the problem in the question. What if I work on LAPACK? Is there some other library for me to leverage? And the guy working on that library? At some point, someone has to do the heavy lifting.


I meant that we could create CPU and GPU versions of the code for methods that do heavy number crunching: things like vector addition, matrix-matrix multiplication, linear solvers, etc.



These need not be from LAPACK/BLAS or other open source libraries, but can be your own custom library. Using OpenCL might be best, because it's widely supported.
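A minimal OpenCL version of the array add could look roughly like this (untested sketch; error checking and resource release are omitted):

Code:

#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <vector>
#include <cstdio>

// Kernel source: c[i] = a[i] + b[i]
static const char* kSrc =
    "__kernel void vadd(__global const float* a, __global const float* b,"
    "                   __global float* c, const unsigned int n) {"
    "  unsigned int i = get_global_id(0);"
    "  if (i < n) c[i] = a[i] + b[i];"
    "}";

int main() {
    const unsigned int n = 1u << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    cl_platform_id plat; clGetPlatformIDs(1, &plat, nullptr);
    cl_device_id dev;    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, nullptr);

    // Device buffers (inputs copied from the host vectors)
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, n * sizeof(float), a.data(), nullptr);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, n * sizeof(float), b.data(), nullptr);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), nullptr, nullptr);

    // Build the kernel at run time (this is what makes OpenCL so portable)
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "vadd", nullptr);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);
    clSetKernelArg(k, 3, sizeof(unsigned int), &n);

    size_t global = n;
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, n * sizeof(float), c.data(), 0, nullptr, nullptr);

    std::printf("c[0] = %f\n", c[0]); // expect 3.0
    return 0;
}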

LuckyTran March 25, 2022 14:38

I think that it is fundamentally a hardware issue. Although a lot of library issues exist today, I have full confidence that libraries can be developed in a very reasonable amount of time that would let me write high-level code that runs seamlessly on either kind of hardware. The fundamental issue is the difference in hardware performance, and those differences are driven by the CPU and the GPU having different design motivations; it is those that need to be unified.


Heck, I use a cluster that has a large number of fast CPUs and a smaller number of slower CPUs, and job scheduling on these things is already a bit of a nightmare. You either run code exclusively on the fast nodes or exclusively on the slow nodes; any time you mix fast and slow nodes together, you just get a lot of idle time. This situation is practically CPU-accelerated CPU computing and it's already shitty. They all have the same architecture and they're all running the same code optimized for similar hardware. And it occurs in a controlled environment where I have only a binary composition (all my fast nodes are the same hardware and all my slow nodes are the same hardware). Imagine what happens if I make it more heterogeneous, with several CPU speeds and aliens (GPUs).

aerosayan March 26, 2022 06:27

Quote:

Originally Posted by LuckyTran (Post 824783)
Heck, I use a cluster that has a large number of fast CPUs and a smaller number of slower CPUs, and job scheduling on these things is already a bit of a nightmare. You either run code exclusively on the fast nodes or exclusively on the slow nodes; any time you mix fast and slow nodes together, you just get a lot of idle time.


Why? It's possible to distribute the workload based on how much work each processor can do.


1. Run diagnostic iterations (100 or so) to measure the performance of each core. Specifically, note down the average time each core takes per cell, or how much time each core spends idling.



2. Distribute less work to the slower cores and more work to the faster cores, i.e. let the slower cores work on fewer cells and the faster cores on more cells. The idea would be either to reduce the idle time and fill it with work, or to give each processor an approximately equal amount of wall time.



It's a rudimentary method, and we haven't considered the effect of memory bandwidth or latency, but I don't think we have to. In the end, wall time is all that matters. Distributing the work based on the measured time to solve one cell is simple and portable across different machines.
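Something like this, as a rough sketch (the function names and timings are made up; in a real code the seconds-per-cell would come from the diagnostic iterations in step 1):

Code:

#include <vector>
#include <cstddef>
#include <cstdio>

// Hand out cell counts inversely proportional to each rank's measured cost,
// so every rank finishes its share in roughly the same wall time.
std::vector<long> distribute_cells(long total_cells,
                                   const std::vector<double>& sec_per_cell) {
    double total_speed = 0.0;
    for (double t : sec_per_cell) total_speed += 1.0 / t;   // speed = 1 / (time per cell)

    std::vector<long> cells(sec_per_cell.size());
    long assigned = 0;
    for (std::size_t i = 0; i < sec_per_cell.size(); ++i) {
        cells[i] = static_cast<long>(total_cells * (1.0 / sec_per_cell[i]) / total_speed);
        assigned += cells[i];
    }
    cells.back() += total_cells - assigned;  // put the rounding remainder on the last rank
    return cells;
}

int main() {
    // e.g. two fast ranks and one slower rank, timed over ~100 diagnostic iterations
    std::vector<double> sec_per_cell = {1.0e-6, 1.0e-6, 1.5e-6};
    std::vector<long> cells = distribute_cells(300000, sec_per_cell);
    for (std::size_t i = 0; i < cells.size(); ++i)
        std::printf("rank %zu -> %ld cells\n", i, cells[i]);
    return 0;
}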


Most modern CPUs aren't too slow; if the fastest CPU runs at 3.2 GHz, the slower ones will still run at at least 2.1 GHz.


One limiting factor might be RAM. If the faster CPUs have more RAM available, they can solve for a larger number of cells. If the speed difference between the faster and slower CPUs is too large, then more RAM needs to be available to the faster CPUs.


The code will not become complicated. Just monitoring the idle time for each core and redistributing the workload every 1000 or so iterations would be efficient. If you're not using Adaptive Mesh Refinement, in most cases you won't need to redistribute work again and again.

LuckyTran March 26, 2022 07:18

It's just an example of the issues that already arise in an environment where the throughput is similar, and already you have a list of implementations that need to be carried out! CPUs and GPUs have a far more significant difference in throughput, and that gap still has to be bridged.

