CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   OpenFOAM Running, Solving & CFD (https://www.cfd-online.com/Forums/openfoam-solving/)
-   -   8x icoFoam speed up with Cufflink CUDA solver library (https://www.cfd-online.com/Forums/openfoam-solving/99828-8x-icofoam-speed-up-cufflink-cuda-solver-library.html)

kmooney April 12, 2012 13:27

8x icoFoam speed up with Cufflink CUDA solver library
 
[Benchmark plot: https://udrive.oit.umass.edu/kmooney...UBenchmark.png]

Howdy foamers,

Last night I took a leap of faith and installed the Cufflink library for OpenFOAM-ext. It appears to reformat OF sparse matrices, send them to an NVIDIA card, and use the built-in CUSP linear algebra functions to accelerate the solution of various flavors of CG solvers.

I found it here:
http://code.google.com/p/cufflink-library/
I had to hack the compile setup a little to avoid any MPI/parallel stuff, as I'm not ready to delve that deep into it quite yet. Other than that, installation was pretty straightforward.

I ran the icoFoam cavity tutorial at various mesh sizes and plotted the execution times. I figured I would share the results with the community. Keep in mind that this was a really quick A-B comparison to satisfy my own curiosity. Solver tolerances were matched between the CPU and GPU runs.
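For anyone who wants to repeat this: the two runs only differ in the linear-solver entries in system/fvSolution. A minimal sketch of what I mean is below; the CPU side is just the stock cavity tutorial settings, and the comment at the end only describes the Cufflink counterparts generically, since the exact registered solver names depend on the Cufflink version you build.

Code:

// system/fvSolution: CPU baseline (stock icoFoam cavity settings)
solvers
{
    p
    {
        solver          PCG;
        preconditioner  DIC;
        tolerance       1e-06;
        relTol          0;
    }

    U
    {
        solver          PBiCG;
        preconditioner  DILU;
        tolerance       1e-05;
        relTol          0;
    }
}

PISO
{
    nCorrectors     2;
    nNonOrthogonalCorrectors 0;
}

// GPU variant: keep the tolerances identical and swap only the solver names
// for their Cufflink counterparts (e.g. a diagonal-preconditioned CG for p
// and a diagonal-preconditioned BiCGStab for U).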

A little bit about my machine:
Intel Core i7, 8 GB RAM
NVIDIA GeForce GTX 260
OpenSuse 11.4

akidess April 12, 2012 13:59

Great, thanks for sharing! Are you using single or double precision?

kmooney April 12, 2012 14:00

It was all run in DP.

vinz April 13, 2012 03:05

Dear Kyle,

How many cores did you use on your CPU and on your GPU? Did you use the same number for each case?
Is N in your graph the number of cells?

alberto April 13, 2012 03:43

Any comparison with SpeedIT plugin for OpenFOAM?

kmooney April 13, 2012 10:02

Hi Vincent, The CPU runs were done on just a single core and yes, N is the number of cells in the domain. I'm actually kind of curious as to how many multiprocessors the video card ended up using. I'm not sure how the library decides on grid-block-thread allocations.

Alberto, I was considering trying the SpeedIT library, but I think only the single-precision version is free of charge. Do you have any experience with it?

lordvon April 13, 2012 18:09

Thanks for posting this! I just bought a GTX 550 to utilize this. I will post some data. I will be doing transient parallel cpu simulations with GGI to compare.

chegdan April 14, 2012 15:54

Very nice Kyle,

I'm glad to see someone using Cufflink. I've been reluctant to post anything since it's still developing. I haven't been able to add more solvers or publicize Cufflink, since I'm writing the last little bit of my thesis and wouldn't have time to fix bugs if everyone started to have problems.

@Kyle
What version of CUSP and CUDA were you using?
What was the type of problem you were solving?
What was your mesh composition (cell shape and number, number of faces, boundary conditions, etc.)?
What is your motherboard, and what is the bandwidth to your GPU?
Which preconditioners did you use?

@those interested
* There are plans to port it to the SGI version in the next few months (unless someone wants to help).
* I had no plans to do anything with Windows or Mac... but if there is interest, this could be a nice project.
* If you want to add more CUDA based linear system solver/preconditioners this can be done by contributing to Cufflink directly (cufflink-library.googlecode.com) or to the CUSP (http://code.google.com/p/cusp-library/) project. If it is contributed to CUSP, then it will be included in Cufflink.
* In general, the CUDA-based solvers work more effectively if your problem is solved using many inner iterations (linear system solver iterations) and less effectively if outer iterations are dominant. This is due to the cost of shooting data back and forth to the GPU. So I would expect different results for a steady-state solver that relies on lots of outer iterations, where you would use relTol rather than an absolute tolerance as your stopping criterion (see the sketch after this list).
* Lastly, I would stay away from single precision, as the smoothed aggregation preconditioner had some single-precision issues in earlier versions of CUSP... so Cufflink (though it can be compiled in single precision) is meant for double precision.
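To make the inner/outer iteration point concrete, here is a rough sketch of the two kinds of pressure-solver entries I have in mind; the solver name is just a placeholder for whichever Cufflink solver you pick, and the numbers are only illustrative.

Code:

// Inner-iteration dominated (typical transient setup): each call to the GPU
// solver does a lot of work, so the host<->device transfer cost is amortized.
p
{
    solver          DiagPCG;   // placeholder Cufflink solver name
    tolerance       1e-06;     // absolute tolerance: many linear iterations per call
    relTol          0;
}

// Outer-iteration dominated (typical steady-state setup): each call stops after
// a modest relative drop, so proportionally more time goes into shipping the
// matrix and vectors to the GPU and back.
p
{
    solver          DiagPCG;
    tolerance       1e-06;
    relTol          0.1;       // stop after a 10x residual drop within this outer iteration
}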

Anyway, Kyle...good work.

Edit: The multi-GPU version works, but it is still in development. I'm not current on the speedup numbers (yes, there is speedup) for the parallel version (already included in the Cufflink download), but it is getting some work from another group to use UVA and GPUDirect, with testing on large clusters. Any multi-GPU improvements will be posted to the Google Code repository.

chegdan April 15, 2012 12:41

Since I'm not selling anything or making money from Cufflink (since it's open source), I think I can make a few comments.

* Though we always compare everything against the base case of the non-preconditioned (bi)conjugate gradient, when looking at the CUDA-based linear system solvers one should be fair: compare the best OpenFOAM solver against the best GPU solver (both, of course, against the non-preconditioned (bi)conjugate gradient baseline). If your metric is how quickly a problem gets solved, will multiple CPUs running in parallel (e.g. using GAMG) be better than a single high-end GPU, several high-end GPUs, or even multiple lower-cost GPUs?

* I definitely think GPU solvers have their place and they will have a tremendous impact, once the bottlenecks are worked through. What is important now is to understand where heterogeneous computing thrives and outperforms our current computing paradigm. I gave an example in the last post about inner iterations and outer iterations.

* The speedup is highly dependent on hardware and the problem being solved. You might even see some variability in the numbers if you ran the test a few times. One can have a really amazing setup, but a mediocre cluster if the communication is slow between nodes.

* There are known differences in precision between the GPU and the CPU, i.e. double precision vs. extended double precision (http://en.wikipedia.org/wiki/Extended_precision), and I have wondered whether this loss of a few decimal places could also be a source of additional speed (I'm no expert; this is just thinking out loud).

* There is a lot of hype used to sell these GPU solvers, so be careful of that. Fact: when the GPU is used in the right situation (in algorithms that can be parallelized, i.e. linear system solvers), there is amazing and real speedup.

I hope people find this helpful.

lordvon April 16, 2012 22:01

Hello all, on the cufflink installation instructions webpage, it says that a complete recompilation of openfoam is required under the heading 'Changes in lduInterface.H'.

Could someone give more details about how to do this?

chegdan April 16, 2012 22:07

Quote:

Originally Posted by lordvon (Post 355013)
Hello all, on the cufflink installation instructions webpage, it says that a complete recompilation of openfoam is required under the heading 'Changes in lduInterface.H'.

Could someone give more details about how to do this?

Yeah, first of all make sure you are using the extend version. If you have compiled OF before, then this will be easy. You just need to take the lduInterface.H provided in the Cufflink folder (maybe save a copy of your old lduInterface.H) and place it in the

OpenFOAM/OpenFOAM-1.6-ext/src/OpenFOAM/matrices/lduMatrix/lduAddressing/lduInterface

folder and recompile. Of course this is only necessary for multi-GPU usage. Then just recompile the OpenFOAM installation, recompile Cufflink, and you should be good (in theory). This may throw off your git or svn repo, so if/when you update the ext you may get some warnings.

Also, I just noticed that you were going to use GGI. This may be a problem, as it will take some more thought to program Cufflink to work with all the interfaces (as of now Cufflink only works with the processor interfaces, i.e. nothing special like cyclics) and the regular boundary conditions.

lordvon April 16, 2012 22:10

Thanks, but the recompiling part is what I was asking about. Just some simple instructions, please.

chegdan April 16, 2012 22:19

Oh... this may be difficult if there are errors. To recompile OF-extend, just type

Code:

foam
and that will take you to the right OpenFOAM directory, and then type

Code:

./Allwmake
and then go get a coffee. If all runs smoothly it will compile fine; if not, you will be an expert by the time you get it working again.

Lukasz April 17, 2012 04:27

Quote:

Originally Posted by alberto (Post 354503)
Any comparison with SpeedIT plugin for OpenFOAM?

Actually, we did compare icoFoam CPU vs. GPU. Here is a link to a more detailed report.

We analyzed cavity3D with up to 9M cells for transient/steady-state flows, run on an Intel Dual Core and an Intel Xeon E5620. Accelerations were up to 33x and 5.4x, respectively.

Larger cases, such as the motorbike (simpleFoam, 32M cells), had to be run on a cluster. If you are interested, you may take a look at this report.

lordvon April 17, 2012 11:23

Lukasz, your presentation link says that a memory bottleneck was the cause of there being no speedup with PISO. However, the reference for that figure says:
Quote:

OpenFOAM implements the PISO method using the GAMG method, which was not ported to the GPU.
Two things:
- I am pretty sure you can just change the solver while still using PISO.
- This means that a memory bottleneck was not the cause; the GPU simply wasn't being used!

Someone verify this please.

Here is a link to the referenced paper:
http://am.ippt.gov.pl/index.php/am/a...ewFile/516/196

Lukasz April 18, 2012 07:39

Thanks for your comments!

There were two tests in our publication. We compared SpeedIT with CG and a diagonal preconditioner on the GPU against 1) pisoFoam with CG and DIC/DILU preconditioners and 2) pisoFoam with GAMG.

The quoted sentence meant that GAMG was used on the CPU. This procedure was not ported to the GPU, and therefore SpeedIT was not so successful in terms of acceleration. Maybe you are right, the CPU should be more emphasized in that sentence.
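For reference, the two CPU baselines correspond roughly to pressure-solver entries like the ones below. This is a sketch from memory, not copied from the paper, so please see the publication for the exact settings.

Code:

// Baseline 1: pisoFoam with conjugate gradient and DIC preconditioning for p
p
{
    solver          PCG;
    preconditioner  DIC;
    tolerance       1e-06;
    relTol          0;
}

// Baseline 2: pisoFoam with GAMG for p (the solver that was not ported to the GPU)
p
{
    solver          GAMG;
    smoother        GaussSeidel;
    cacheAgglomeration true;
    nCellsInCoarsestLevel 10;
    agglomerator    faceAreaPair;
    mergeLevels     1;
    tolerance       1e-06;
    relTol          0;
}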

In a few days we will publish a report where an AMG preconditioner was used, which converges faster than a diagonal preconditioner.

lordvon April 18, 2012 08:33

Hi Lukasz, thanks for the reply. So are you saying that the dark bar ("diagonal vs. diagonal", tiny speedup) represents the CG solver with a diagonal preconditioner, while the lighter bar ("diagonal vs. other", no speedup) represents the GAMG solver with a diagonal preconditioner? It seemed to me from the text that only the preconditioners were varied, not the solvers.

The speedup chart caption:
Quote:

Fig. 6. Acceleration of GPU-accelerated OpenFOAM solvers against the corresponding original CPU implementations with various preconditioners of linear solvers. Dark bars show the acceleration of diagonally preconditioned GPU-accelerated solvers over the CPU implementations with recommended preconditioners – GAMG for the PISO, and DILU/DIC for the simpleFOAM and potentialFOAM solvers.
Here is the whole paragraph of my quote above:
Quote:

The results for the GPU acceleration are presented in Fig. 6 and show that the acceleration of the PISO algorithm is hardly noticeable. This is a result of the fact that OpenFOAM implements the PISO method using the Geometric Agglomerated Algebraic Multigrid (GAMG) method, which was not ported to the GPU.

Lukasz April 18, 2012 14:20

Quote:

Originally Posted by lordvon (Post 355385)
It seemed to me from the text only preconditioners were varied, not the solvers.

You are correct. I asked my colleagues and indeed, the preconditioners were varied, not the solvers.
Sorry about the misleading reply; the tests were done a while ago.

BTW, which solvers do you think are worthwhile to accelerate, with CG being accelerated on the GPU?

lordvon April 22, 2012 13:30

Hmm... The icoFoam tutorial listed on the CUDA installation instructions page has some lines of code to create a custom solver directory named 'my_icoFoam'; I tried it out and it weirdly deleted all of my solvers... No problem, I just had to remove and reinstall OF-1.6-ext.

lordvon April 22, 2012 13:38

Oh, and Lukasz, it seems that the speedup comparison in the presentation you linked, which shows a speedup from porting the matrix solution to the GPU for the SIMPLE and potential solvers but no speedup with PISO, is wrong: the GPU was not even being used in the PISO case. This is good news, because the figure implied that GPU acceleration would give no benefit with PISO owing to its nature. GGI is implemented with PISO/PIMPLE, and that is what I wanted to use Cufflink with. In fact there may still be speedup with PISO if a solver other than GAMG is used (in the implementation you referenced).

Lukasz June 13, 2012 07:17

I fully agree that communication matters. On high-end HPC clusters we were able to get only about 1.5x acceleration (see report) for a motorbike test with 32M cells (n GPUs vs. n CPUs, where n was the number of threads/GPU cards). This is probably due to communication. Please note that GAMG was run on the CPU.

This is also the reason why we implemented PISO and SIMPLE fully on the GPU card. Now we can run a whole case of up to 10M cells on a single GPU card, and the inter-node communication problem becomes an intra-node one (GPU<->CPU on a single node, bounded by memory bandwidth).

kmooney June 13, 2012 08:46

Quote:

Originally Posted by looyirock (Post 366199)
I had gone through the post. The speedup is highly dependent on the hardware and the problem being solved. We might even see some variability in the numbers if we ran the test a few times. One can have a really amazing setup, but a mediocre cluster if the communication is slow between nodes. Please provide some more attachments about the topic for more detailed information.


This was just a quick timing comparison. I wasn't trying to, nor am I planning on, delivering a full benchmark of this stuff.

The point of all of this is that after about 20 minutes of library installation I was able to get a foam solver to run in 1/8th of the run time, on a relatively old GPU, for free.

atg August 12, 2012 14:24

Anyone figure out how to compile the testCufflinkFoam application?

OpenFOAM, CUDA and the nvcc test scripts are all working on Ubuntu 10.04; I'm sure I'm missing something obvious, but I can't see an obvious way to compile the testCufflinkFoam application to run the test examples. What am I missing? Thanks.

PS this is all from the GettingStarted page HERE

chegdan August 12, 2012 14:29

Quote:

Originally Posted by atg (Post 376680)
Openfoam, CUDA and the nvcc test scripts are all working on Ubuntu 10.04; I'm sure I'm missing something obvious but I can't see an obvious way to compile the testCufflinkFoam application to run the test examples. What am I missing? Thanks.

PS this is all from the GettingStarted page HERE

Did you install CUSP and try to compile a CUSP example? Also, make sure you are using the extend version of OpenFOAM.

atg August 12, 2012 14:51

OpenFoam 1.6-ext

Yes it is looking pretty much like the result in the getting started document:

(OF:1.6-ext Opt) caelinux@karl-OF:cufflink-library-read-only$ nvcc -o testcusp testcg.cu
/usr/local/cuda/bin/../include/thrust/detail/tuple_transform.h(130): Warning: Cannot tell what pointer points to, assuming global memory space
/usr/local/cuda/bin/../include/thrust/detail/tuple_transform.h(130): Warning: Cannot tell what pointer points to, assuming global memory space
(OF:1.6-ext Opt) caelinux@karl-OF:cufflink-library-read-only$ ./testcusp
Solver will continue until residual norm 0.01 or reaching 100 iterations
Iteration Number | Residual Norm
0 1.000000e+01
1 1.414214e+01
2 1.093707e+01
3 8.949320e+00
4 6.190057e+00
5 3.835190e+00
6 1.745482e+00
7 5.963549e-01
8 2.371135e-01
9 1.152524e-01
10 3.134469e-02
11 1.144416e-02
12 1.824177e-03
Successfully converged after 12 iterations.

Quote:

Originally Posted by chegdan (Post 376681)
Did you install CUSP and try to compile a CUSP example? Also, make sure you are using the extend version of OpenFOAM.


atg August 12, 2012 14:57

For some reason the test cannot find blockMesh, or at least I think that is what is going on:

(OF:1.6-ext Opt) caelinux@karl-OF:testCases$ sudo ./runSerialTests
./runSerialTests: line 3: /bin/tools/RunFunctions: No such file or directory
/home/caelinux/OpenFOAM/caelinux-1.6-ext/run/cufflinkTest/testCases/cufflink/cufflink_CG/N10
./runSerialTests: line 16: runApplication: command not found
./runSerialTests: line 17: runApplication: command not found
[... the same two "runApplication: command not found" errors repeat for every test case: cufflink/cufflink_CG, cufflink/cufflink_DiagPCG and cufflink/cufflink_SmAPCG at N10 through N2000 (script lines 16-17), and OpenFOAM/CG, OpenFOAM/DPCG and OpenFOAM/GAMG at N10 through N2000 (script lines 34-35) ...]

atg August 12, 2012 15:05

nvccWmakeAll log:

https://dl.dropbox.com/u/34549456/Link%20to%20make.log

chegdan August 12, 2012 15:22

Quote:

Originally Posted by atg (Post 376689)


Looking at your error log... I can tell that you forgot the lduInterface.H step that is discussed on the getting started page. Go there, follow those steps, and you should be good.

atg August 12, 2012 17:03

Sorry, I thought that was only required for multi-GPU. My mistake.

Thanks Very Much!

Karl

atg August 13, 2012 01:35

M2090
Quadro600

Six cores of the E1280 are better than one CPU core plus a Quadro 600 (96 GPU cores). In fact, four CPU cores on the E1280 are faster than six, and either is faster than the Quadro 600.

But with the M2090, 512 CUDA cores (and a Quadro? It reported 2 processes and the Quadro was heating up) beat the six-core CPU by about a factor of five, at least for these test tasks: 75 seconds vs. ~360.

I hope this is of some benefit for an incompressible simpleFoam case. Time will tell I suppose.

Thanks Dan and Kyle for your help.

Karl

chegdan August 13, 2012 05:02

Karl,

I'm glad to help. Just to add a little to your last statement

Quote:

I hope this is of some benefit for an incompressible simpleFoam case. Time will tell I suppose.
From my experience (I may be repeating myself from an earlier post in this thread, but I haven't read it in a while): when you use a solution strategy that relies on a relative residual and many outer iterations, the GPU will lose its speed-up. The reason is that such strategies require more back-and-forth data transfer between the host and the device (GPU). This back and forth is the bottleneck, and it keeps the speed-up minor. For transient cases where you need to drive down the residuals within each solver call (i.e. lots of inner iterations), the GPU is better suited to your problem and you will most likely see better speed-up.

akidess August 13, 2012 05:32

Quote:

Originally Posted by chegdan (Post 376768)
For transient cases where you need to drive down the residuals within each solver call (i.e. lots of inner iterations), the GPU is better suited to your problem and you will most likely see better speed-up.

I have never run GPU simulations myself, but this statement clashes with every published result I've seen so far.

chegdan August 13, 2012 06:39

Quote:

Originally Posted by akidess (Post 376776)
...but this statement clashes with every published result I've seen so far.

You will see in my post

Quote:

From my experience...
I'd be interested to see where they show phenomenal speed-up when a solution strategy using the GPU is dominated by data transfer rather than by actually solving a Navier-Stokes problem. If you have sources, it would be a great thing to share.

Steady-state solution strategies that use relative-residual convergence criteria with dominant outer iterations to drive down convergence (e.g. one could use simpleFoam) will have more instances where data has to be transferred to the GPU; the solver therefore spends less time solving the linear system and more time transferring things back and forth if it is a large system. It is also highly dependent on your hardware setup.

Quote:

I have never run GPU simulations myself,...
Grab SpeedIT, grab the Symscape plugin, download Cufflink and try them all out.

akidess August 13, 2012 07:03

Dan, as an example have a look at this document that compares the performance of simpleFoam and pisoFoam using a GPU accelerated linear solver: http://am.ippt.gov.pl/index.php/am/article/view/516/196 (or also http://www.slideshare.net/LukaszMiro...el-performance).

chegdan August 13, 2012 07:30

1) Areal offloads the pressure-velocity coupling completely onto the GPU, so there is not as much back-and-forth data transfer (this is how they get around the problem I described).

2) The first link makes no mention of relative residuals or of the convergence criteria they were using, so there is no way to know whether they used a strategy of driving down the residual with many outer iterations under a relative-residual convergence criterion, as I am suggesting.

atg August 13, 2012 18:44

Thanks Dan I had read your earlier comment but am just coming to terms with the outer vs inner iteration part. It is just a reflection of my poor grasp of the fundamentals at play here.

Along the CPU/GPU communication line, however, I should add that the Quadro 600 in the above test was only running at 8x, whereas the Tesla was in the 16x slot (I only have one 16x slot on the board). So the Quadro might do better against the CPU in the faster slot.

It will be interesting to see how much better PCIe 3.0 slots perform, though the overall hardware seems likely to take a quantum leap at about the same time with the Kepler stuff.

I will post some comparative run times when I get a case going in simpleFoam.

Quote:

Originally Posted by chegdan (Post 376768)

Karl,

I'm glad to help. Just to add a little to your last statement



From my experience (I may be repeating myself from an earlier post in this thread, but I haven't read it in a while): when you use a solution strategy that relies on a relative residual and many outer iterations, the GPU will lose its speed-up. The reason is that such strategies require more back-and-forth data transfer between the host and the device (GPU). This back and forth is the bottleneck, and it keeps the speed-up minor. For transient cases where you need to drive down the residuals within each solver call (i.e. lots of inner iterations), the GPU is better suited to your problem and you will most likely see better speed-up.


atg September 15, 2012 13:31

In keeping with Dan's advice to compare the best CPU solver with the best GPU solver, I ran a few tests. On the cavity tutorial re-blockMeshed to 1000x1000 cells, I get a speedup of 1.5 (310 vs. 463 s) for one hundred timesteps of this:

p: solver cufflink SmAPCG, preconditioner FDIC
U: solver cufflink DiagPBiCGStab, preconditioner DILU

over a CPU run on a single core of a Xeon E1280 with:
p: FDIC/GAMG
U: DILU/PBiCG

I am still at the stage where I don't really know the solvers very well, so take it with a grain of salt, but both runs were set to the same relative tolerances: 1e-08 for p and 1e-05 for U.
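Spelled out as fvSolution entries, the GPU side of that run looked roughly like the sketch below. This is from memory, so the layout is approximate; in particular I would have to re-check whether the 1e-08/1e-05 values went into tolerance or relTol.

Code:

// GPU run with the Cufflink solvers (OpenFOAM-1.6-ext)
p
{
    solver          SmAPCG;         // Cufflink smoothed-aggregation preconditioned CG
    preconditioner  FDIC;           // as I set it; see Dan's reply below
    tolerance       1e-08;
    relTol          0;
}

U
{
    solver          DiagPBiCGStab;  // Cufflink diagonal-preconditioned BiCGStab
    preconditioner  DILU;
    tolerance       1e-05;
    relTol          0;
}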

I tried the parallel CPU version, but it was slower: 491 vs. 463 s for 4 cores vs. 1 core on the CPU. The GPU is an M2090 Tesla with 512 cores and 6 GB of memory, I think. I didn't try anything with multiple GPUs.

So it would appear that a modest speedup is obtained for this example in icoFoam. I do not know why my parallel CPU run is slower on 4 cores than on 1; I expected it to be at least somewhat faster.

chegdan September 20, 2012 11:28

Quote:

Originally Posted by atg (Post 381933)
p: solver cufflink SmAPCG, preconditioner FDIC
U: solver cufflink DiagPBiCGStab, preconditioner DILU

For this setup, even though you set the preconditioner to DILU, it will actually be diagonally preconditioned. You might get some more speedup if you use the Ainv-preconditioned PBiCGStab, with AinvPBiCGStab as your solver name in Cufflink.

Edit: the preconditioner FDIC for the SmAPCG entry will not actually be used either; it will use the smoothed-aggregation AMG preconditioner. I guess I need to change this in the examples to be more straightforward. The preconditioners aren't runtime-selectable through the preconditioner name, but rather through the name of the solver.
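To make that concrete, an entry for the Ainv-preconditioned solver would look something like the sketch below (keep your usual tolerances); since the preconditioner is implied by the solver name, any preconditioner keyword in the entry is not what Cufflink actually uses.

Code:

// Velocity equation solved on the GPU with the approximate-inverse
// preconditioned BiCGStab; the preconditioner choice is carried by the
// solver name itself.
U
{
    solver          AinvPBiCGStab;
    tolerance       1e-05;
    relTol          0;
}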

atg September 20, 2012 11:50

Quote:

Originally Posted by chegdan (Post 382807)
For this setup, even though you set the preconditioner to DILU, it will actually be diagonally preconditioned. You might get some more speedup if you use the Ainv-preconditioned PBiCGStab, with AinvPBiCGStab as your solver name in Cufflink.

OK I'll try that. It looked like the cufflink solvers were specifying their own preconditioners for most of the runs I did, but I missed that one apparently. Thanks.

alquimista November 5, 2012 19:07

2 Attachment(s)
Hello everyone,

I have been using Cufflink for some incompressible flow applications, with good results using one Tesla C2050. However, I don't obtain any speedup using more than one GPU with the parallel versions of the Cufflink solvers, and unexpectedly a run on 2 GPUs is slower than the serial one.

I discussed some issues on the cufflink-users list linked from the code page.

Since I have seen the same behavior on different machines and GPUs, I want to share a test case with you, in case anyone would like to check it or has experienced the same thing. The test case is the same one provided by Dan with the testCufflinkFoam application. I have fixed some wrong values in some of the blockMeshDict files, as well as the tolerances and maxIter for the parallel solvers, since they were different.

The test case can be reproduced by running the same scripts as in the original folder:

./runParallelGPU
./runSerialGPU
./runGetTimes

I attach the case here, along with some figures for CG and DiagPCG tested on a GeForce GTX 690. For SmAPCG, for some reason, the number of iterations does not match between 1 GPU and 2 GPUs, so I have to check that first.

Thank you very much. I find this library useful; in my experience there are situations where one can't use the GAMG solver for p, especially in applications with bad-quality meshes, so I guess the speedup can't always be compared with CG, since the solver has to be robust. There the speedup is considerable.

Regards

