
How do we unify CPU+GPU development?

March 25, 2022, 02:25   #1
Senior Member

Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Hello everyone,

Developing separate code bases for CPUs and GPUs might become a limiting factor in the future.

OpenHyperFLOW2D, for example, has separate branches set up for its CPU and CUDA code: https://github.com/sergeas67/OpenHyperFLOW2D

Maintaining such a codebase becomes difficult in the long run, and, if possible, it would be good to have the same code run on both CPUs and GPUs.

Is this a worthwhile effort?

Another dev asked this same question before: https://stackoverflow.com/questions/9631833/ and the main answer stated that, due to the architectural differences between the devices, the code needs to be optimized for the hardware it runs on.

That's true, and the differences will likely become even more severe as GPU architecture evolves. CPU architecture has been more or less stable for one or two decades, but GPU architecture is still changing rapidly, and that will push the coding strategies for the two even further apart.

So I don't know if it's possible to have the same code run on CPUs and GPUs.

Should we just stick with MPI/OpenMP-based parallelization for CPUs?

OpenMP can now technically be used to program GPUs: https://stackoverflow.com/questions/28962655/ but since this feature is not widely used yet, we still don't know what the limiting factors are.
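
As a rough illustration (just a sketch I put together, not from any of the codes mentioned here), this is what the offload path looks like. It assumes a compiler built with GPU offload support (e.g. nvc++ -mp=gpu, or clang++ with -fopenmp -fopenmp-targets=nvptx64); without offload support the directive simply degrades to running the loop on the host:

Code:
// Minimal sketch of OpenMP target offload (assumption: compiler with GPU
// offload support; otherwise the same loop just runs on the host CPU).
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    float* px = x.data();
    float* py = y.data();

    // One loop body for both targets: the map() clauses copy the arrays to
    // the device when a GPU target is active, otherwise the loop runs on the host.
    #pragma omp target teams distribute parallel for map(to: px[0:n]) map(tofrom: py[0:n])
    for (int i = 0; i < n; ++i)
        py[i] += 2.0f * px[i];

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    return 0;
}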

Currently I can see only one strategy that might be useful: represent the equations in standard BLAS or LAPACK form, and use libraries that are CPU- or GPU-accelerated.

For example, we can create separate CPU- and GPU-accelerated functions to add two arrays of floats, and then just call the appropriate one in our code.
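
To make that concrete, here is a minimal sketch of such a wrapper (my own illustration; it assumes a CBLAS library on the CPU side and cuBLAS on the GPU side, and the USE_GPU switch and the axpy() name are made up for this post, not taken from any existing code):

Code:
// One interface, two backends for y = a*x + y. Build with -DUSE_GPU and link
// against cudart/cublas for the GPU path, or link a CBLAS library otherwise.
#include <cstdio>
#include <vector>

#ifdef USE_GPU
#include <cuda_runtime.h>
#include <cublas_v2.h>

void axpy(int n, float a, const float* x, float* y)
{
    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&dx), n * sizeof(float));
    cudaMalloc(reinterpret_cast<void**>(&dy), n * sizeof(float));
    cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    cublasSaxpy(h, n, &a, dx, 1, dy, 1);   // GPU-accelerated BLAS call
    cublasDestroy(h);

    cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
}
#else
#include <cblas.h>

void axpy(int n, float a, const float* x, float* y)
{
    cblas_saxpy(n, a, x, 1, y, 1);         // CPU BLAS call, same math
}
#endif

int main()
{
    std::vector<float> x(1000, 1.0f), y(1000, 2.0f);
    axpy(1000, 3.0f, x.data(), y.data());
    printf("y[0] = %f\n", y[0]);  // expect 5.0
    return 0;
}

(In a real solver you would of course keep the data resident on the device between calls instead of copying it back and forth every time; the point here is only that the calling code does not change.)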

I'm looking for other strategies that might be good, but this is the only one that looks feasible to me at the moment. Since we need parallelism to get through large amounts of work, it makes sense to structure our code so that these simple but very widely used mathematical operations parallelize well.

Some other, non-BLAS/LAPACK examples of such widely used methods would be Gauss-Seidel iteration, LU decomposition, Cholesky factorization, fast Fourier transforms, etc.

Is this the correct approach? What can we do better?

Thanks
~sayan

March 25, 2022, 04:02   #2
Senior Member

Paolo Lampitella
Join Date: Mar 2009
Location: Italy
Posts: 2,152
Blog Entries: 29
My totally uninformed point of view is that, if Fortran was able to bake distributed parallelism into the language, then the real issue for GPUs is not one of language. Indeed, NVIDIA has CUDA Fortran, which, roughly speaking, is not different from an OpenMP-like approach.

The issue, as you recognize, is performance on a given piece of hardware. I don't think it is really required (or suggested) to actually split repositories as of today. Still, you may want to retain the ability to fine-tune your application for a given hardware target, with whatever compilation flags and in-code or out-of-code switches that requires.

I don't think a library really solves the problem in the question. What if I work on LAPACK? Is there some other library for me to leverage? And the guy working on that library? At some point, someone has to do the heavy lifting.

March 25, 2022, 04:19   #3
Senior Member

Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Quote:
Originally Posted by sbaffini
I don't think a library really solves the problem in the question. What if I work on LAPACK? Is there some other library for me to leverage? And the guy working on that library? At some point, someone has to do the heavy lifting.

I meant that we could create CPU and GPU versions of the methods that do the heavy number crunching: vector addition, matrix-matrix multiplication, linear solvers, etc.

These need not come from LAPACK/BLAS or other open-source libraries; they can be part of your own custom library. Using OpenCL might be best, because it's widely supported.
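
For what it's worth, a minimal OpenCL sketch of that idea is below (my own toy example, with error checking and resource releases omitted; it assumes an OpenCL SDK and at least one platform exposing the requested device type). The same kernel source is compiled at run time for whichever device you pick, so changing CL_DEVICE_TYPE_GPU to CL_DEVICE_TYPE_CPU runs the identical code on the host:

Code:
// Minimal OpenCL vector add: one kernel source, CPU or GPU chosen at run time.
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <cstdio>
#include <vector>

static const char* kSrc = R"(
__kernel void vec_add(__global const float* a,
                      __global const float* b,
                      __global float* c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
})";

int main()
{
    const size_t n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);

    // Swap CL_DEVICE_TYPE_GPU for CL_DEVICE_TYPE_CPU to run on the host CPU.
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, device, nullptr, nullptr);

    // The kernel is compiled for the chosen device at run time.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "vec_add", nullptr);

    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), a.data(), nullptr);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), b.data(), nullptr);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float),
                               nullptr, nullptr);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, n * sizeof(float), c.data(),
                        0, nullptr, nullptr);

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    return 0;
}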

March 25, 2022, 14:38   #4
Senior Member

Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,676
I think that it is fundamentally a hardware issue. Although there are a lot of library issues today, I have full confidence that libraries can be developed in a very reasonable amount of time that let me write high-level code that runs seamlessly on either kind of hardware. The fundamental issue is the difference in hardware performance characteristics, and those differences are driven by the CPU and GPU being designed with different goals in mind; it's those that need to be unified.

Heck, I use a cluster that has a large number of fast CPUs and a smaller number of slower CPUs, and job scheduling on these things is already a bit of a nightmare. You either run a job exclusively on the fast nodes or exclusively on the slow nodes; any time you mix fast and slow nodes together, you just get a lot of idle time. This situation is practically CPU-accelerated CPU computing and it's already shitty. They all have the same architecture and they're all running the same code, optimized for similar hardware, and it happens in a controlled environment where I have only a binary composition (all my fast nodes are identical hardware, and so are the slow nodes). Imagine what happens if I make it more heterogeneous, with several CPU speeds plus aliens (GPUs).

March 26, 2022, 06:27   #5
Senior Member

Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Quote:
Originally Posted by LuckyTran
Heck, I use a cluster that has a large number of fast CPUs and a smaller number of slower CPUs, and job scheduling on these things is already a bit of a nightmare. You either run a job exclusively on the fast nodes or exclusively on the slow nodes; any time you mix fast and slow nodes together, you just get a lot of idle time.

Why? It's possible to distribute the workload based on how much work each processor can do.


1. Run diagnostic iterations (100 or so) to measure the performance of each core. Specifically, record the average time each core takes per cell, or record how much time each core spends idling.

2. Distribute less work to the slower cores and more work to the faster cores, i.e. let the slower cores work on fewer cells and the faster cores work on more cells. The idea is either to fill the idle time with useful work, or to give each processor an approximately equal share of the wall time.

It's a rudimentary method, and it doesn't account for memory bandwidth or latency, but I don't think it has to. In the end, wall time is all that matters, and distributing the work based on the measured time per cell is simple and portable across different machines.
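
A minimal sketch of that proportional redistribution (my own illustration; it assumes each rank has already measured its average seconds per cell during the diagnostic iterations, and the balance_cells() name is made up):

Code:
// Reassign cell counts in inverse proportion to measured time per cell, so
// that all ranks should finish an iteration at roughly the same wall time.
#include <cstdio>
#include <vector>

std::vector<long> balance_cells(const std::vector<double>& sec_per_cell,
                                long total_cells)
{
    // new_count[i] ~ total_cells * (1/t_i) / sum_j (1/t_j)
    double inv_sum = 0.0;
    for (double t : sec_per_cell) inv_sum += 1.0 / t;

    std::vector<long> counts(sec_per_cell.size());
    long assigned = 0;
    for (std::size_t i = 0; i < sec_per_cell.size(); ++i) {
        counts[i] = static_cast<long>(total_cells / (sec_per_cell[i] * inv_sum));
        assigned += counts[i];
    }
    counts.back() += total_cells - assigned;  // rounding leftovers go to the last rank
    return counts;
}

int main()
{
    // Example: three fast ranks and one ~50% slower rank, timed over 100 iterations.
    std::vector<double> sec_per_cell = {1.0e-6, 1.0e-6, 1.0e-6, 1.5e-6};
    std::vector<long> counts = balance_cells(sec_per_cell, 1000000);
    for (std::size_t i = 0; i < counts.size(); ++i)
        printf("rank %zu -> %ld cells\n", i, counts[i]);
    return 0;
}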


Most modern CPUs aren't too slow anyway; if the faster CPUs run at 3.2 GHz, the slower ones will still run at least at 2.1 GHz.


One limiting factor might be RAM. If the faster CPUs have more RAM available, they can solve for a larger number of cells; if the speed difference between the faster and slower CPUs is large, you would need correspondingly more RAM on the faster CPUs.


The code will not become complicated. Just monitoring the idle time of each core and redistributing the workload every 1000 or so iterations would be efficient. If you're not using adaptive mesh refinement, in most cases you won't need to redistribute the work again and again.

March 26, 2022, 07:18   #6
Senior Member

Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,676
It's just an example of the issues that already arise in an environment where the throughputs are similar, and already you have a list of schemes that need to be implemented! CPUs and GPUs have a far more significant difference in throughput, and that gap still has to be bridged.
