CFD Online Discussion Forums, OpenFOAM forum: "OpenFOAM and gpgpu"
https://www.cfd-online.com/Forums/openfoam/60982-openfoam-gpgpu.html

dewald January 25, 2006 03:24

I would like to know if the OpenFOAM devs are perhaps considering, or have had a look at, GPGPU: http://www.gpgpu.org ?
From the results obtained by these guys, especially BrookGPU from Stanford, it would seem the way to go to achieve a huge hardware computation speed-up by utilising high-end GPUs. This would of course require that all the solvers be modified to be able to use GPGPU.
Personally I envision that a quad-SLI system, i.e. two dual-core NVIDIA SLI cards (with 1 GB of memory each) on an SLI system (with two PCI Express x16 slots), could be the equivalent of a small cluster! Maybe we will soon get motherboards able to use 6 or more of these monsters.
http://www.pcper.com/article.php?aid=195 and
http://www.tomshardware.com/2005/12/14/sneak_preview_of_the_nvidia_quad_gpu_setup/
Dewald

mattijs January 25, 2006 03:53

Looks very interesting. Hadn't seen that one before. Does it run under Linux (our development platform)? In general these things could speed up simple, often-repeated loops, e.g. in the linear solver. Our indirect addressing, though, tends to make this a bit harder (but probably not impossible).
Should make an interesting research project though!
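As a rough illustration of that indirect addressing (simplified names only, not the actual OpenFOAM source), this is the kind of loop an lduMatrix-style matrix-vector product has to perform; the first loop is the simple, GPU-friendly part, the second is where the face-based indirect addressing comes in:

Code:

// Sketch only, not OpenFOAM source. The off-diagonal contributions are
// gathered through owner[]/neighbour[] index arrays, so the reads of psi are
// indirect (and generally uncoalesced on a GPU), and two faces may update the
// same cell, which a naive GPU port would turn into a write conflict.
#include <cstddef>
#include <vector>

void AmulSketch(std::vector<double>&       Apsi,      // result A*psi
                const std::vector<double>& diag,      // one coefficient per cell
                const std::vector<double>& upper,     // one coefficient per face
                const std::vector<double>& lower,
                const std::vector<int>&    owner,     // cell indices per face
                const std::vector<int>&    neighbour,
                const std::vector<double>& psi)
{
    // Simple, regular loop: maps well onto a GPU.
    for (std::size_t cell = 0; cell < diag.size(); ++cell)
        Apsi[cell] = diag[cell]*psi[cell];

    // Indirectly addressed face loop: the hard part to port.
    for (std::size_t face = 0; face < owner.size(); ++face)
    {
        Apsi[owner[face]]     += upper[face]*psi[neighbour[face]];
        Apsi[neighbour[face]] += lower[face]*psi[owner[face]];
    }
}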

dewald January 25, 2006 04:03

Yes, GPGPU does run under Linux and Windows; BrookGPU and Sh, the two main GPGPU languages (or C libraries), are both Linux-based. I think a combination of GPU and CPU will probably be best. This field looks like it is exploding; there are a lot of papers out, including some CFD and FEM work.

gschaider January 25, 2006 04:14

Of course I'm talking about things I don't know too much about (that's one of the things I'm really good at):

I'd agree with Mattijs that the most promising candidate for this would be the linear solver. If someone does this (introduces "vector-processor-like" architectures to the linear solver), the interface should be general enough to accommodate similar technologies like the forthcoming Cell processor (Sony/IBM).
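As a rough illustration of such a general interface (purely a sketch with made-up names, not OpenFOAM code), the linear-solver backend could be hidden behind a small abstract class so that CPU, GPU or Cell implementations become interchangeable:

Code:

// Illustrative sketch only: keep the linear-solver backend behind a small
// interface so a GPU, Cell or plain-CPU implementation can be swapped in
// without touching the rest of the code base.
#include <memory>
#include <string>
#include <vector>

struct SparseSystem               // matrix coefficients, addressing and RHS
{
    std::vector<double> diag, upper, lower, rhs;
    std::vector<int>    owner, neighbour;
};

class LinearSolverBackend
{
public:
    virtual ~LinearSolverBackend() = default;
    // Solve A*x = b to the requested tolerance; return iterations used.
    virtual int solve(const SparseSystem& A, std::vector<double>& x,
                      double tol, int maxIter) = 0;
};

class CpuSolver : public LinearSolverBackend
{
public:
    int solve(const SparseSystem&, std::vector<double>&, double, int) override
    { /* conventional CG/BiCGStab in plain C++ */ return 0; }
};

class GpuSolver : public LinearSolverBackend
{
public:
    int solve(const SparseSystem&, std::vector<double>&, double, int) override
    { /* copy to the device, run GPU (or Cell SPE) kernels, copy back */ return 0; }
};

// Pick a backend by name, e.g. from a configuration file.
std::unique_ptr<LinearSolverBackend> makeSolver(const std::string& name)
{
    if (name == "gpu") return std::make_unique<GpuSolver>();
    return std::make_unique<CpuSolver>();
}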

JBeilke January 25, 2006 15:00

Hi Bernhard,

the following info might be interesting regarding the Cell-processor and CFD:

Re: StarCD and Cell
Date: 10.06.2005 23:59
From: Stephen R Behling <sbehling@us.ibm.com>
To: Jörn Beilke <joern@beilke-cfd.de>

Hi Jörn,

While the Cell processor is a very interesting processor, we are not yet
planning to do a STAR-CD port. The processor is quite new and to get the
performance out of the SPEs (extra functional units) an application's
source code has to be modified to directly call the SPE access functions.
As the Cell processor matures we might have compilers that do it
automatically and then a port would be possible. However, the SPEs
currently only support 32-bit floating point representations (Fortran
REAL*4 or C float) and much of STAR-CD needs 64-bit floating point (Fortran
REAL*8 or C double).

bmeagle May 20, 2006 04:17

Double precision accuracy with a combination of CPU and GPU.

http://numod.ins.uni-bonn.de/researc...Tu05double.pdf

http://www.gpgpu.org/cgi-bin/blosxom...ing/index.html

This paper by Dominik Göddeke, Robert Strzodka and Stefan Turek describes a preliminary algorithm to achieve double precision results by adding a CPU-based defect correction to iterative linear system solvers on the GPU. We demonstrate that identical accuracy as compared to a full CPU double precision solver is possible while still gaining a factor of 2 in speedup compared to a highly tuned cache-aware CPU reference implementation in double precision. (Accelerating Double Precision FEM Simulations with GPUs. Dominik Göddeke, Robert Strzodka and Stefan Turek. To appear in Proceedings of ASIM 2005 - 18th Symposium on Simulation Technique.)
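The core idea of the paper, sketched in plain C++ for clarity: the single-precision inner solve is the part that runs on the GPU in the paper, while the residual and the correction of x are kept in double precision on the CPU. The helper functions here are hypothetical placeholders, not the authors' code.

Code:

// Sketch of the defect-correction (mixed precision / iterative refinement)
// idea. The two helper functions are hypothetical placeholders.
#include <cmath>
#include <cstddef>
#include <vector>

// r = b - A*x, computed in double precision (CPU).
void computeResidualDouble(const std::vector<double>& A,
                           const std::vector<double>& x,
                           const std::vector<double>& b,
                           std::vector<double>& r);

// Approximately solve A*dx = r in single precision (GPU in the paper).
void solveSinglePrecision(const std::vector<float>& A,
                          const std::vector<float>& r,
                          std::vector<float>& dx);

void defectCorrection(const std::vector<double>& A_d,   // matrix in double
                      const std::vector<float>&  A_s,   // same matrix in single
                      const std::vector<double>& b,
                      std::vector<double>&       x,
                      double tol, int maxOuter)
{
    const std::size_t n = x.size();
    std::vector<double> r(n);
    std::vector<float>  r_s(n), dx(n);

    for (int k = 0; k < maxOuter; ++k)
    {
        computeResidualDouble(A_d, x, b, r);

        double norm2 = 0;
        for (double ri : r) norm2 += ri*ri;
        if (std::sqrt(norm2) < tol) break;          // converged to double accuracy

        for (std::size_t i = 0; i < n; ++i) r_s[i] = static_cast<float>(r[i]);
        solveSinglePrecision(A_s, r_s, dx);         // cheap single-precision solve

        for (std::size_t i = 0; i < n; ++i) x[i] += dx[i];   // correct in double
    }
}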

anothr_acc March 23, 2009 09:40

*bump*

So, has there been any progress with CUDA and OpenFOAM? I'm looking around (search-engine-wise) at the moment in an effort to answer this, but so far, few results...

BR,

Mark.

NickG April 17, 2009 09:17

Hi Mark

I'm interested in using OpenFOAM through CUDA too, so let us know if you find anything.

I'm a mechanical engineer by training and, with programming being a steep learning curve for me, I assumed that coding for the NVIDIA (CUDA) route would be easier than the generic GPGPU route. Could someone let me know if I'm wrong?

Typically I'm doing cross-flow turbine simulations using a rotating domain within a static one, or a rotating annulus in a static domain with a static central domain. How I chop it up for optimal parallelisation I have yet to investigate, although I guess I'll base my decisions on grid density. Any pointers would be welcome.

Cheers

Nick

soeren87 October 12, 2009 05:14

Is anyone of you working on an OpenCL / CUDA solver for OpenFOAM?

If there is more interest, how about a forum/thread for sharing experiences?

I am trying to compile an OpenCL solver, but I am still at the beginning.

niklas October 12, 2009 06:08

I'm gonna stick my chin out... just so that someone can punch me easier :)

CUDA and OpenFOAM... never gonna happen.

I've looked into it, and to my (limited) understanding it's gonna require major recoding at a low level to utilize the architecture.

So, it will require someone with deep FOAM knowledge (not that many)
plus someone with the time and knowledge to recode it for CUDA (not that many).

Has anyone downloaded the CUDA SDK?
These are the 3 files you get in the tutorials to calculate a scalar product
(just so you can get an idea of what kind of work needs to be done).
Oh... and it must be compiled with the CUDA compiler.

Now, I'd like someone to prove me wrong :)

scalarProd.cu
Code:

/*
 * This sample calculates scalar products of a
 * given set of input vector pairs
 */



#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <cutil.h>



///////////////////////////////////////////////////////////////////////////////
// Calculate scalar products of VectorN vectors of ElementN elements on CPU
///////////////////////////////////////////////////////////////////////////////
extern "C"
void scalarProdCPU(
    float *h_C,
    float *h_A,
    float *h_B,
    int vectorN,
    int elementN
);



///////////////////////////////////////////////////////////////////////////////
// Calculate scalar products of VectorN vectors of ElementN elements on GPU
///////////////////////////////////////////////////////////////////////////////
#include "scalarProd_kernel.cu"



////////////////////////////////////////////////////////////////////////////////
// Helper function, returning uniformly distributed
// random float in [low, high] range
////////////////////////////////////////////////////////////////////////////////
float RandFloat(float low, float high){
    float t = (float)rand() / (float)RAND_MAX;
    return (1.0f - t) * low + t * high;
}



///////////////////////////////////////////////////////////////////////////////
// Data configuration
///////////////////////////////////////////////////////////////////////////////

//Total number of input vector pairs; arbitrary
const int VECTOR_N = 256;
//Number of elements per vector; arbitrary,
//but strongly preferred to be a multiple of warp size
//to meet memory coalescing constraints
const int ELEMENT_N = 4096;
//Total number of data elements
const int    DATA_N = VECTOR_N * ELEMENT_N;

const int  DATA_SZ = DATA_N * sizeof(float);
const int RESULT_SZ = VECTOR_N  * sizeof(float);



///////////////////////////////////////////////////////////////////////////////
// Main program
///////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv){
    float *h_A, *h_B, *h_C_CPU, *h_C_GPU;
    float *d_A, *d_B, *d_C;
    double delta, ref, sum_delta, sum_ref, L1norm;
    unsigned int hTimer;
    int i;


    CUT_DEVICE_INIT(argc, argv);
    CUT_SAFE_CALL( cutCreateTimer(&hTimer) );

    printf("Initializing data...\n");
        printf("...allocating CPU memory.\n");
        h_A    = (float *)malloc(DATA_SZ);
        h_B    = (float *)malloc(DATA_SZ);
        h_C_CPU = (float *)malloc(RESULT_SZ);
        h_C_GPU = (float *)malloc(RESULT_SZ);

        printf("...allocating GPU memory.\n");
        CUDA_SAFE_CALL( cudaMalloc((void **)&d_A, DATA_SZ)  );
        CUDA_SAFE_CALL( cudaMalloc((void **)&d_B, DATA_SZ)  );
        CUDA_SAFE_CALL( cudaMalloc((void **)&d_C, RESULT_SZ) );

        printf("...generating input data in CPU mem.\n");
        srand(123);
        //Generating input data on CPU
        for(i = 0; i < DATA_N; i++){
            h_A[i] = RandFloat(0.0f, 1.0f);
            h_B[i] = RandFloat(0.0f, 1.0f);
        }

        printf("...copying input data to GPU mem.\n");
        //Copy options data to GPU memory for further processing
        CUDA_SAFE_CALL( cudaMemcpy(d_A, h_A, DATA_SZ, cudaMemcpyHostToDevice) );
        CUDA_SAFE_CALL( cudaMemcpy(d_B, h_B, DATA_SZ, cudaMemcpyHostToDevice) );
    printf("Data init done.\n");


    printf("Executing GPU kernel...\n");
        CUDA_SAFE_CALL( cudaThreadSynchronize() );
        CUT_SAFE_CALL( cutResetTimer(hTimer) );
        CUT_SAFE_CALL( cutStartTimer(hTimer) );
        scalarProdGPU<<<128, 256>>>(d_C, d_A, d_B, VECTOR_N, ELEMENT_N);
        CUT_CHECK_ERROR("scalarProdGPU() execution failed\n");
        CUDA_SAFE_CALL( cudaThreadSynchronize() );
        CUT_SAFE_CALL( cutStopTimer(hTimer) );
    printf("GPU time: %f msecs.\n", cutGetTimerValue(hTimer));

    printf("Reading back GPU result...\n");
        //Read back GPU results to compare them to CPU results
        CUDA_SAFE_CALL( cudaMemcpy(h_C_GPU, d_C, RESULT_SZ, cudaMemcpyDeviceToHost) );


    printf("Checking GPU results...\n");
        printf("..running CPU scalar product calculation\n");
        scalarProdCPU(h_C_CPU, h_A, h_B, VECTOR_N, ELEMENT_N);

        printf("...comparing the results\n");
        //Calculate max absolute difference and L1 distance
        //between CPU and GPU results
        sum_delta = 0;
        sum_ref  = 0;
        for(i = 0; i < VECTOR_N; i++){
            delta = fabs(h_C_GPU[i] - h_C_CPU[i]);
            ref  = h_C_CPU[i];
            sum_delta += delta;
            sum_ref  += ref;
        }
        L1norm = sum_delta / sum_ref;
    printf("L1 error: %E\n", L1norm);
    printf((L1norm < 1e-6) ? "TEST PASSED\n" : "TEST FAILED\n");


    printf("Shutting down...\n");
        CUDA_SAFE_CALL( cudaFree(d_C) );
        CUDA_SAFE_CALL( cudaFree(d_B)  );
        CUDA_SAFE_CALL( cudaFree(d_A)  );
        free(h_C_GPU);
        free(h_C_CPU);
        free(h_B);
        free(h_A);
        CUT_SAFE_CALL( cutDeleteTimer(hTimer) );

    CUT_EXIT(argc, argv);
}

scalarProd_gold.cpp
Code:



////////////////////////////////////////////////////////////////////////////
// Calculate scalar products of VectorN vectors of ElementN elements on CPU.
// Straight accumulation in double precision.
////////////////////////////////////////////////////////////////////////////
extern "C"
void scalarProdCPU(
    float *h_C,
    float *h_A,
    float *h_B,
    int vectorN,
    int elementN
){
    for(int vec = 0; vec < vectorN; vec++){
        int vectorBase = elementN * vec;
        int vectorEnd  = vectorBase + elementN;

        double sum = 0;
        for(int pos = vectorBase; pos < vectorEnd; pos++)
            sum += h_A[pos] * h_B[pos];

        h_C[vec] = (float)sum;
    }
}

scalarProd_kernel.cu
Code:



///////////////////////////////////////////////////////////////////////////////
// On G80-class hardware 24-bit multiplication takes 4 clocks per warp
// (the same as for floating point  multiplication and addition),
// whereas full 32-bit multiplication takes 16 clocks per warp.
// So if integer multiplication operands are  guaranteed to fit into 24 bits
// (always lie within [-8M, 8M - 1] range in signed case),
// explicit 24-bit multiplication is preferred for performance.
///////////////////////////////////////////////////////////////////////////////
#define IMUL(a, b) __mul24(a, b)



///////////////////////////////////////////////////////////////////////////////
// Calculate scalar products of VectorN vectors of ElementN elements on GPU
// Parameters restrictions:
// 1) ElementN is strongly preferred to be a multiple of warp size to
//    meet alignment constraints of memory coalescing.
// 2) ACCUM_N must be a power of two.
///////////////////////////////////////////////////////////////////////////////
#define ACCUM_N 1024
__global__ void scalarProdGPU(
    float *d_C,
    float *d_A,
    float *d_B,
    int vectorN,
    int elementN
){
    //Accumulators cache
    __shared__ float accumResult[ACCUM_N];

    ////////////////////////////////////////////////////////////////////////////
    // Cycle through every pair of vectors,
    // taking into account that vector counts can be different
    // from total number of thread blocks
    ////////////////////////////////////////////////////////////////////////////
    for(int vec = blockIdx.x; vec < vectorN; vec += gridDim.x){
        int vectorBase = IMUL(elementN, vec);
        int vectorEnd  = vectorBase + elementN;

        ////////////////////////////////////////////////////////////////////////
        // Each accumulator cycles through vectors with
        // stride equal to number of total number of accumulators ACCUM_N
        // At this stage ACCUM_N is only preferred be a multiple of warp size
        // to meet memory coalescing alignment constraints.
        ////////////////////////////////////////////////////////////////////////
        for(int iAccum = threadIdx.x; iAccum < ACCUM_N; iAccum += blockDim.x){
            float sum = 0;

            for(int pos = vectorBase + iAccum; pos < vectorEnd; pos += ACCUM_N)
                sum += d_A[pos] * d_B[pos];

            accumResult[iAccum] = sum;
        }

        ////////////////////////////////////////////////////////////////////////
        // Perform tree-like reduction of accumulators' results.
        // ACCUM_N has to be power of two at this stage
        ////////////////////////////////////////////////////////////////////////
        for(int stride = ACCUM_N / 2; stride > 0; stride >>= 1){
            __syncthreads();
            for(int iAccum = threadIdx.x; iAccum < stride; iAccum += blockDim.x)
                accumResult[iAccum] += accumResult[stride + iAccum];
        }

        if(threadIdx.x == 0) d_C[vec] = accumResult[0];
    }
}


soeren87 October 12, 2009 07:15

Hi Niklas,
I agree that it won't be done in a few minutes.

I read some diploma theses about OpenFOAM and CUDA.
It seems that some universities have already compiled some files, but the Prof. I asked has not answered yet.

I have installed the full OpenCL + CUDA SDK, tried the examples, and they work really fine!

Now to CUDA. The first thing you have to do is tell your machine to use the GPU. That is fairly easy with OpenCL or CUDA.
Now the main problem, I guess: how to use all GPU cores most efficiently?
In one diploma thesis I read that the solver ran many times faster on the GPU, but the data transfer between GPU and host was many times slower, with the consequence that the whole process took the same time.

Getting past this point would be a great success, I think, because the next graphics cards (Fermi) are specifically designed to handle more data.
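The usual remedy for the transfer bottleneck described above is to keep the matrix and vectors resident on the card for the whole solve and only move small results back across the bus. A minimal CUDA sketch of that pattern (the kernel is a placeholder name, not a real library call):

Code:

// Copy the system to the GPU once, run all solver iterations there, and only
// bring the (small) result back. solverIterationKernel stands in for whatever
// iterative scheme is actually used.
#include <cuda_runtime.h>

__global__ void solverIterationKernel(double* x, const double* A,
                                      const double* b, int n)
{
    // one sweep of some iterative scheme on the device (details omitted)
}

void solveOnDevice(const double* h_A, const double* h_b, double* h_x,
                   int n, int nnz, int nIters)
{
    double *d_A, *d_b, *d_x;
    cudaMalloc((void**)&d_A, nnz*sizeof(double));
    cudaMalloc((void**)&d_b, n*sizeof(double));
    cudaMalloc((void**)&d_x, n*sizeof(double));

    // Pay the PCIe cost once, up front...
    cudaMemcpy(d_A, h_A, nnz*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n*sizeof(double),  cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, n*sizeof(double),  cudaMemcpyHostToDevice);

    // ...then keep everything resident on the card for all iterations.
    for (int it = 0; it < nIters; ++it)
        solverIterationKernel<<<(n + 255)/256, 256>>>(d_x, d_A, d_b, n);

    // Only the small result comes back across the bus.
    cudaMemcpy(h_x, d_x, n*sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(d_A); cudaFree(d_b); cudaFree(d_x);
}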

andrewryan October 12, 2009 09:22

> I read some diploma theses about OpenFOAM and CUDA.

Is it available? I would be interested in this.

soeren87 October 12, 2009 09:31

Quote:

Originally Posted by andrewryan (Post 232253)
> I read some diploma theses about OpenFOAM and CUDA.

Is it available? I would be interested in this.


Both are available, but in German.
One of them got engineFoam running with CUDA.

andrewryan October 12, 2009 09:53

I speak German.. link?

soeren87 October 12, 2009 10:02

Quote:

Originally Posted by andrewryan (Post 232262)
I speak German.. link?

link

andrewryan October 12, 2009 10:17

There is another one here:

http://itec.uka.de/capp/diploma/da/doeffinger-2009.pdf

I was unaware of these, thx for posting!

soeren87 October 12, 2009 10:25

Quote:

Originally Posted by andrewryan (Post 232266)
There is another one here:

http://itec.uka.de/capp/diploma/da/doeffinger-2009.pdf

I was unaware of these, thx for posting!

I read this one too, but I think it is not as interesting as the first one.

Are you already working with CUDA + OpenFOAM?

andrewryan October 12, 2009 10:47

> Are you already working with CUDA + OpenFOAM?

No, right now I'm trying to understand how some things work in OF, but later maybe it would be interesting to use CUDA or OpenCL.

Lukasz December 17, 2009 15:50

Quote:

Originally Posted by soeren87 (Post 232203)
Is anyone of you working on an OpenCL / CUDA solver for OpenFOAM?

Well, we have been working on a plugin for OpenFOAM that makes it easy to replace existing solvers with their CUDA versions, such as BiCGStab or CG. It should be ready in Q1 2010. We want to make the installation as easy as possible (1. run the wmake compilation script with our plugin, 2. change two OF configuration files, 3. run the simulation). I hope this is simple enough, but as we are still in the development process any comments will be appreciated.
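For reference, this is roughly what a textbook (unpreconditioned) conjugate gradient looks like; a CUDA version such as the one described would move the sparse matrix-vector product, the dot products and the vector updates onto the GPU. This is a generic CPU sketch, not the plugin's actual code:

Code:

// Generic textbook CG on a CSR matrix. The operations marked "-> GPU" are the
// ones a CUDA port would execute on the device.
#include <cmath>
#include <cstddef>
#include <vector>

struct Csr                         // compressed sparse row storage
{
    std::vector<int>    rowPtr, colInd;
    std::vector<double> val;
};

static void spmv(const Csr& A, const std::vector<double>& x, std::vector<double>& y)
{
    for (std::size_t i = 0; i + 1 < A.rowPtr.size(); ++i)       // -> GPU kernel
    {
        double s = 0;
        for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
            s += A.val[k]*x[A.colInd[k]];
        y[i] = s;
    }
}

static double dot(const std::vector<double>& a, const std::vector<double>& b)
{
    double s = 0;                                               // -> GPU reduction
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i]*b[i];
    return s;
}

int conjugateGradient(const Csr& A, const std::vector<double>& b,
                      std::vector<double>& x, double tol, int maxIter)
{
    const std::size_t n = b.size();
    std::vector<double> r(n), p(n), Ap(n);

    spmv(A, x, Ap);
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
    p = r;
    double rsold = dot(r, r);

    for (int it = 0; it < maxIter; ++it)
    {
        if (std::sqrt(rsold) < tol) return it;                  // converged
        spmv(A, p, Ap);
        double alpha = rsold/dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha*p[i]; r[i] -= alpha*Ap[i]; }
        double rsnew = dot(r, r);
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + (rsnew/rsold)*p[i];
        rsold = rsnew;
    }
    return maxIter;
}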

vinz December 18, 2009 07:21

Hi Lukasz,

We've been working with CUDA for our own code (not OpenFOAM related) inside my company, but for other applications.
Do you think it would be possible for us to help you in your development or testing phase in some way, maybe by testing one of the solvers on our machines?
Really looking forward to seeing your plugin released.

Regards,

Vincent

Lukasz December 18, 2009 15:17

Hi Vincent,

Sounds interesting! We are looking for good testing scenarios where, for example, our BiCGStab solver could be validated. Later we will also have CG and the Jacobi and DILU preconditioners.
The OpenFOAM demos include mostly relatively small and sparse matrices. Also, "the solver" part takes only about 10-20% of the overall time (probably because OF uses the SIMPLE algorithm).

Let me know if you have an application that could benefit from such a solver.

Best wishes,
Lukasz
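For context, a quick back-of-the-envelope with Amdahl's law shows why that 10-20% figure matters (the numbers below are only an example, not a measurement): if the linear solver takes a fraction p = 0.2 of the runtime and the GPU makes that part s = 10 times faster, the overall speedup is

  1 / ((1 - p) + p/s) = 1 / (0.8 + 0.02) ≈ 1.22,

i.e. only about 20% overall, no matter how fast the GPU solver is. Cases where the pressure solve dominates the runtime benefit far more.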

zeliboba December 27, 2009 09:46

Quote:

Originally Posted by Lukasz (Post 240354)
We want to make the installation as easy as possible

I would suggest trying the FreeFOAM distribution; it closely follows upstream development and has a CMake-based build system.

leejc January 21, 2010 15:36

a little more info would be appreciated
 
How very exciting. Could you let us know more about this effort, such as the scope and test cases, and where it will be released? Also, the double-precision boards from NVIDIA are coming out in April; any thoughts/plans on using them? Many thanks. J.

Quote:

Originally Posted by Lukasz (Post 240354)
Well, we have been working on a plugin for OpenFOAM that makes it easy to replace existing solvers with their CUDA versions, such as BiCGStab or CG. It should be ready in Q1 2010. We want to make the installation as easy as possible (1. run the wmake compilation script with our plugin, 2. change two OF configuration files, 3. run the simulation). I hope this is simple enough, but as we are still in the development process any comments will be appreciated.


Lukasz January 26, 2010 16:52

Dear Jerry,

Right now we are in the testing phase. The testing matrices were collected from the University of Florida Sparse Matrix Collection. Since 90% of the solving process is devoted to sparse matrix-vector multiplication, we focused on testing this operation first. For around 20 different matrices with different sizes, numbers of non-zeros (NNZ) and structures, the average performance is about 7-8 GFLOPS (>10 GFLOPS peak) in double precision and 9-10 GFLOPS (13 GFLOPS peak) in single precision. Of course, everything depends on the matrix structure: the denser the matrix, the better. Please also note that because of the memory transfers / PCIe bottleneck it is not worth using our solvers for only a few iterations; the more iterations are needed, the better the performance we can get.
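For reference, the simplest GPU form of that sparse matrix-vector product is the "scalar" CSR kernel below, with one thread per row. Tuned implementations (vector kernels, other storage formats) do considerably better; this is only meant to show the operation being benchmarked, not the library's actual kernel.

Code:

// Minimal CUDA CSR sparse matrix-vector product, one thread per row.
__global__ void csrSpmv(int nRows,
                        const int*    rowPtr,
                        const int*    colInd,
                        const double* val,
                        const double* x,
                        double*       y)
{
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    if (row < nRows)
    {
        double sum = 0.0;
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
            sum += val[k]*x[colInd[k]];      // irregular, indirect reads of x
        y[row] = sum;
    }
}

// launch, e.g.: csrSpmv<<<(nRows + 255)/256, 256>>>(nRows, rowPtr, colInd, val, x, y);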

If you give me your personal email address I could send you more diagrams with preliminary results and more information about the scope of the library.

Best wishes, Lukasz

cknopp February 19, 2010 22:00

Hello! I am very new to using OpenFOAM, and I was reading about the Live-SUSE USB drive version for my Win7-based computer. I have been following the gpgpu.org site for quite some time and, being an AMD solutions provider, I came across this in an email released a few weeks ago:

http://developer.amd.com/Assets/ACML-GPUreadme.pdf

http://developer.amd.com/GPU/ACMLGPU/Pages/default.aspx


1. Is it even possible to compile this lib into OF as a replacement?

2. If it is possible, would it even help?

3. If it is possible, is there anyone out there that can direct me to how to implement it into a Live-USB build to run on my desktop?

I have an AMD 955BE quad-core CPU, 8 GB of DDR3-1600 RAM, and dual ATI 4870X2 video cards with 3200 double-precision stream cores and 4 GB of GDDR5 RAM. If anyone can make this happen, I would be MORE THAN HAPPY to test it out!

PS: With the new cards and motherboards, it would be possible to scale this distro to run on 12800 double-precision parallel stream processors in a single desktop. I'm sure that anyone can see where that would take the possibilities of today's CFD solver times.

PPS: With this, would it also be possible to include a fractal turbulence modeller? Maybe that would be a bit more accurate than the common "guesstimate" models?

THANKS!

I can be reached at cknopp@gmail.com if anyone would like to contact me directly about this!

Thanks again!

Chris

Lukasz March 3, 2010 15:39

Hi,

You asked me once about the plugin for OpenFOAM with CUDA-based solvers.

Drop me a line and I will send you the documentation (Reference Manual + Programming Guide). We have just finished the implementation and testing (flow in human aorta).

We also have a GPL version for playing around.

Best wishes,
Lukasz

cknopp March 4, 2010 10:30

With the release of OpenCL, I suspect that all GPU-accelerated applications (especially commercial code) will be ported, due to its multi-manufacturer support, including Linux/Windows/MacOS operating system support.

I would like to try your plugin as well, and I would LOVE to try any OpenCL implementation of OpenFOAM that some developers may be working on!

OpenCL is the way of the future. If you are not sold yet, check out the LuxRender forums. They are seeing a 1600% speedup in some cases, and that is utterly ridiculous for software that has been optimized over the last few years. The first CFD solution written in CL is going to dominate the commercial market... especially since the F1 teams that are currently CPU-FLOPS-limited by regulation could add GPUs and still get around the regulations!

I hope to see more soon!

Lukasz March 4, 2010 10:48

Please introduce yourself and send me your private email address.

Lukasz March 4, 2010 10:52

Quote:

Originally Posted by cknopp (Post 248579)
With the release of OpenCL, I suspect that all GPU-accelerated applications (especially commercial code) will be ported, due to its multi-manufacturer support, including Linux/Windows/MacOS operating system support.

I would like to try your plugin as well, and I would LOVE to try any OpenCL implementation of OpenFOAM that some developers may be working on!

We also considered OpenCL as a potential programming platform. However, it doesn't support the STL, which is a serious limitation for optimal programming.

We will wait for the next release.

cknopp March 4, 2010 11:12

Can you clarify what STL is? I know that there are a few devs out there using CL for CFD (COMSOL, SYMSCAPE). I have word from members of both teams that they are working on a CL implementation.

If STL is a limiting factor, can you explain how? Maybe I can ask my contacts to see how they are working around it!

THANKS!

Lukasz March 23, 2010 15:57

The STL (Standard Template Library) makes your code more optimal:
http://pl.wikipedia.org/wiki/Standard_Template_Library

If you have some contacts with OpenCL developers let them know this is a feature request.
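For context, the STL is the C++ Standard Template Library: generic containers and algorithms such as std::vector and std::inner_product. A tiny host-side example of what it provides is shown below; inside an OpenCL C kernel, which is essentially C99, none of this is available, which is the limitation being discussed.

Code:

// Host-side C++ using the STL: containers plus a generic algorithm.
// OpenCL C kernels (C99-like) cannot use any of this.
#include <numeric>
#include <vector>

double dotProduct(const std::vector<double>& a, const std::vector<double>& b)
{
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}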

gundamhero April 14, 2010 02:34

Hi Lukasz, could you please send me a copy of the documents? Thank you.

I'm doing research in CFD on OpenFOAM and CUDA. Thanks.
My email address is:
mengweichao999@gmail.com



Quote:

Originally Posted by Lukasz (Post 248417)
Hi,

You asked me once about the plugin for OpenFOAM with CUDA-based solvers.

Drop me a line and I will send you the documentation (Reference Manual + Programming Guide). We have just finished the implementation and testing (flow in human aorta).

We also have a GPL version for playing around.

Best wishes,
Lukasz


PaGgiE June 16, 2010 05:44

Hi Lukasz,

I am working on accelerating OpenFOAM on heterogeneous systems with OpenCL, and I would like to have a look at the plugin for OpenFOAM that allows replacing existing solvers with their CUDA versions.
It would be nice if you could send me the manual and programming guide for that.

Also, can you tell me how much progress has been made by other people working on the same thing as me, i.e. porting OpenFOAM to OpenCL, if you have any information about it?

Regards,
Gaurav

Lukasz July 15, 2010 05:06

Hi Gaurav,

After registering at speedit.vratis.com you can download the OpenFOAM plugin that allows you to replace the standard solvers with their GPU versions. Also, see the documentation for the performance of our CG and BiCGStab solvers. To my knowledge we are the fastest so far.

Best wishes,
Lukasz

leejc August 16, 2010 13:08

Lukasz,

Do I just sign up at speedit.vratis.com to test out the OpenFOAM plugin? Is it part of the eXtreme package? Is there a tutorial or a test case I can use to try out the OpenFOAM plugin? Also, I am interested in using it to accelerate the Lagrangian solver in OpenFOAM (for particle tracking); is there a plugin for that?

Jerry


Quote:

Originally Posted by leejc (Post 243352)
How very exciting. Could you let us know more about this effort, such as the scope and test cases, and where it will be released? Also, the double-precision boards from NVIDIA are coming out in April; any thoughts/plans on using them? Many thanks. J.


andrea.pasquali September 14, 2010 09:58

Hi,
I'm interested in GPU acceleration for OpenFOAM.
I installed the SpeedIT Classic version, but I have a problem, which you can see here:

http://www.cfd-online.com/Forums/ope...-openfoam.html

Could anyone help me?

Thanks


Andrea

Lukasz November 19, 2010 13:49

Dear All,

We are happy to announce a new release of the OpenFOAM plugin 1.1 (GPL License).
Here is the list of features:

- Multi-GPU support.
- Tested on the Fermi architecture (GTX 460 and Tesla C2050).
- Automated submission of the domain to the GPU cards (using decomposePar from OpenFOAM).
- Optimized submission of computational tasks to the best GPU card in the system, for any number of computational threads.
- The plugin picks the most powerful GPU card for single-thread cases (see the sketch below).

You can freely download it at speedit.vratis.com. Enjoy!
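Regarding the "picks the most powerful GPU card" item above: a minimal sketch (not the plugin's actual code) of how such a heuristic can be written with the CUDA runtime API, ranking devices by multiprocessor count and clock rate:

Code:

// Rank the available CUDA devices and select the "strongest" one for a
// single-threaded run. Error checking omitted for brevity.
#include <cuda_runtime.h>

int pickBestDevice()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    int best = 0;
    long long bestScore = -1;
    for (int dev = 0; dev < count; ++dev)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        long long score =
            static_cast<long long>(prop.multiProcessorCount)*prop.clockRate;
        if (score > bestScore) { bestScore = score; best = dev; }
    }
    cudaSetDevice(best);
    return best;
}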

Lukasz August 29, 2014 09:32

SIMPLE and PISO on GPU

We now solve selected industry-relevant OpenFOAM cases on the GPU about 3x faster than an Intel Xeon E5649 running 12 cores, e.g. external aerodynamics around a motorbike with 6M cells.

See this presentation for the latest results.

