CFD Online Logo CFD Online URL
www.cfd-online.com
[Sponsors]
Home > Forums > Software User Forums > OpenFOAM

OpenFOAM and gpgpu

Register Blogs Members List Search Today's Posts Mark Forums Read

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
Old   January 25, 2006, 04:24
Default I would like to know if the op
  #1
dewald
Guest
 
Posts: n/a
I would like to know if the openFOAM dev's are perhaps considering or have had a look at gpgpu: http://www.gpgpu.org ?
From the results obtained by these guys, ecspecially Brookgpu from stanford, it would seem the way to go to achieve a huge hardware computation speed up by utilising high end gpu's. This would of course require that all the solvers have to be modified to be able to use gpgpu.
Personally I envision that a quad sli system, ie two dual core nvidia sli cards (with 1gig of memory each) on a sli system (with two iexpress x16 slots) could be the equivalent of a small cluster! Maybe we could soon get motherboards able to use 6 or more of these monsters.
http://www.pcper.com/article.php?aid=195 and
http://www.tomshardware.com/2005/12/14/sneak_preview_of_the_nvidia_quad_gpu_setu p/
Dewald
  Reply With Quote

Old   January 25, 2006, 04:53
Default Looks very interesting. Hadn't
  #2
Senior Member
 
Mattijs Janssens
Join Date: Mar 2009
Posts: 1,419
Rep Power: 26
mattijs is on a distinguished road
Looks very interesting. Hadn't seen that one before. Does it run under Linux (our development platform)? In general these things could speed up simple, often repeated loops like e.g. in the linear solver. Our indirect addressing though tends to make this a bit harder (but probably not impossible)
Should make an interesting research project though!
mattijs is offline   Reply With Quote

Old   January 25, 2006, 05:03
Default Yes gpgpu does run under linux
  #3
dewald
Guest
 
Posts: n/a
Yes gpgpu does run under linux and windows, brookgpu and Sh, the two main gpgpu langauges or c libraries both are linux based. I think a combination between gpu and cpu will probably be the best, this things looks like it is exploding, there is a lot of papers out, including some cfd and fem work.
  Reply With Quote

Old   January 25, 2006, 05:14
Default Of course I'm talking about th
  #4
Assistant Moderator
 
Bernhard Gschaider
Join Date: Mar 2009
Posts: 4,225
Rep Power: 51
gschaider will become famous soon enoughgschaider will become famous soon enough
Of course I'm talking about things I don't know too much about (that's one of the things I'm really good at):

I'd agree with Mattjis that the most promising candidate for this would be the linear solver. If someone does this (introduce "vector-processor-like"-architectures to the linear solver) the interface should be general enough to accomodate similar technologies like the forthcoming Cell-processor (Sony/IBM)
__________________
Note: I don't use "Friend"-feature on this forum out of principle. Ah. And by the way: I'm not on Facebook either. So don't be offended if I don't accept your invitation/friend request
gschaider is offline   Reply With Quote

Old   January 25, 2006, 16:00
Default Hi Bernhard, the following
  #5
Senior Member
 
Joern Beilke
Join Date: Mar 2009
Location: Dresden
Posts: 497
Rep Power: 20
JBeilke is on a distinguished road
Hi Bernhard,

the following info might be interesting regarding the Cell-processor and CFD:

Re: StarCD and Cell
Datum: 10.06.2005 23:59
Von: Stephen R Behling <sbehling@us.ibm.com>
An: Jörn Beilke <joern@beilke-cfd.de>

Hi Jörn,

While the Cell processor is a very interesting processor, we are not yet
planning to do a STAR-CD port. The processor is quite new and to get the
performance out of the SPEs (extra functional units) an application's
source code has to be modified to directly call the SPE access functions.
As the Cell processor matures we might have compilers that do it
automatically and then a port would be possible. However, the SPEs
currently only support 32-bit floating point representations (Fortran
REAL*4 or C float) and much of STAR-CD needs 64-bit floating point (Fortran
REAL*8 or C double).
JBeilke is offline   Reply With Quote

Old   May 20, 2006, 05:17
Default Double precision accuracy with
  #6
bmeagle
Guest
 
Posts: n/a
Double precision accuracy with a combination of cpu and gpu.

http://numod.ins.uni-bonn.de/researc...Tu05double.pdf

http://www.gpgpu.org/cgi-bin/blosxom...ing/index.html

This paper by Dominik Göddeke, Robert Strzodka and Stefan Turek describes a preliminary algorithm to achieve double precision results by adding a CPU-based defect correction to iterative linear system solvers on the GPU. We demonstrate that identical accuracy as compared to a full CPU double precision solver is possible while still gaining a factor of 2 in speedup compared to a highly tuned cache-aware CPU reference implementation in double precision. (Accelerating Double Precision FEM Simulations with GPUs. Dominik Göddeke, Robert Strzodka and Stefan Turek. To appear in Proceedings of ASIM 2005 - 18th Symposium on Simulation Technique.)
  Reply With Quote

Old   March 23, 2009, 10:40
Default
  #7
Member
 
ms
Join Date: Mar 2009
Location: West London
Posts: 47
Rep Power: 17
anothr_acc is on a distinguished road
*bump*

So, has there been any progress with CUDA and OpenFOAM? I'm looking around (search enginewise) at the moment, in an effort to answer this but so far, few results......

BR,

Mark.
anothr_acc is offline   Reply With Quote

Old   April 17, 2009, 10:17
Default
  #8
Member
 
Nick Gardiner
Join Date: Apr 2009
Location: Chichester, UK
Posts: 94
Rep Power: 16
NickG is on a distinguished road
Hi Mark

I'm interested in using OpenFOAM through CUDA too so let us know if you find anything.

I'm a mechanical engineer by training and with programming being a steep learning curve for me I assumed that coding for the nVidia route would be easier than the gpgpu route. Could someone let me know if I'm wrong.

Typically I'm doing crossflow turbine simulations using a rotating domain within a static one or a rotating annulus in a static domain with a static central domain. How I chop it up for optimal parallelisation I have yet to investigate, although I guess that I base my decisions on grid density. Any pointers would be welcome

Cheers

Nick
NickG is offline   Reply With Quote

Old   October 12, 2009, 06:14
Default
  #9
New Member
 
Sören
Join Date: Oct 2009
Location: Bremen, Germany
Posts: 15
Rep Power: 16
soeren87 is on a distinguished road
is anyone of you working on openCL / CUDA Solver for OpenFoam ?

If there is more interest, how about a forum/thread for sharing experiences ?

I am trying to compile an OpenCL Solver, but I am still at the beginning
soeren87 is offline   Reply With Quote

Old   October 12, 2009, 07:08
Default
  #10
Super Moderator
 
niklas's Avatar
 
Niklas Nordin
Join Date: Mar 2009
Location: Stockholm, Sweden
Posts: 693
Rep Power: 29
niklas will become famous soon enoughniklas will become famous soon enough
Im gonna stick my chin out...just so that someone can punch me easier

CUDA and OpenFOAM...never gonna happen.

I've looked into it and to my (limited) understanding its gonna require major recoding on a bottom level to utilize the architecture.

So, it will require someone with deep foam knowledge (not that many)
plus someone with the time and knowledge in recoding it for CUDA (not that many).

Has anyone downloaded the CUDA SDK?
These are the 3 files you get in the tutorials to calculate a scalar product.
(just so you can get an idea of what kind of work that needs to be done)
oh...and it must be compiled with the cuda compiler.

Now, Id like someone to prove me wrong

scalarProd.cu
Code:
/*
 * This sample calculates scalar products of a 
 * given set of input vector pairs
 */



#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <cutil.h>



///////////////////////////////////////////////////////////////////////////////
// Calculate scalar products of VectorN vectors of ElementN elements on CPU
///////////////////////////////////////////////////////////////////////////////
extern "C"
void scalarProdCPU(
    float *h_C,
    float *h_A,
    float *h_B,
    int vectorN,
    int elementN
);



///////////////////////////////////////////////////////////////////////////////
// Calculate scalar products of VectorN vectors of ElementN elements on GPU
///////////////////////////////////////////////////////////////////////////////
#include "scalarProd_kernel.cu"



////////////////////////////////////////////////////////////////////////////////
// Helper function, returning uniformly distributed
// random float in [low, high] range
////////////////////////////////////////////////////////////////////////////////
float RandFloat(float low, float high){
    float t = (float)rand() / (float)RAND_MAX;
    return (1.0f - t) * low + t * high;
}



///////////////////////////////////////////////////////////////////////////////
// Data configuration
///////////////////////////////////////////////////////////////////////////////

//Total number of input vector pairs; arbitrary
const int VECTOR_N = 256;
//Number of elements per vector; arbitrary, 
//but strongly preferred to be a multiple of warp size
//to meet memory coalescing constraints
const int ELEMENT_N = 4096;
//Total number of data elements
const int    DATA_N = VECTOR_N * ELEMENT_N;

const int   DATA_SZ = DATA_N * sizeof(float);
const int RESULT_SZ = VECTOR_N  * sizeof(float);



///////////////////////////////////////////////////////////////////////////////
// Main program
///////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv){
    float *h_A, *h_B, *h_C_CPU, *h_C_GPU;
    float *d_A, *d_B, *d_C;
    double delta, ref, sum_delta, sum_ref, L1norm;
    unsigned int hTimer;
    int i;


    CUT_DEVICE_INIT(argc, argv);
    CUT_SAFE_CALL( cutCreateTimer(&hTimer) );

    printf("Initializing data...\n");
        printf("...allocating CPU memory.\n");
        h_A     = (float *)malloc(DATA_SZ);
        h_B     = (float *)malloc(DATA_SZ);
        h_C_CPU = (float *)malloc(RESULT_SZ);
        h_C_GPU = (float *)malloc(RESULT_SZ);

        printf("...allocating GPU memory.\n");
        CUDA_SAFE_CALL( cudaMalloc((void **)&d_A, DATA_SZ)   );
        CUDA_SAFE_CALL( cudaMalloc((void **)&d_B, DATA_SZ)   );
        CUDA_SAFE_CALL( cudaMalloc((void **)&d_C, RESULT_SZ) );

        printf("...generating input data in CPU mem.\n");
        srand(123);
        //Generating input data on CPU
        for(i = 0; i < DATA_N; i++){
            h_A[i] = RandFloat(0.0f, 1.0f);
            h_B[i] = RandFloat(0.0f, 1.0f);
        }

        printf("...copying input data to GPU mem.\n");
        //Copy options data to GPU memory for further processing 
        CUDA_SAFE_CALL( cudaMemcpy(d_A, h_A, DATA_SZ, cudaMemcpyHostToDevice) );
        CUDA_SAFE_CALL( cudaMemcpy(d_B, h_B, DATA_SZ, cudaMemcpyHostToDevice) );
    printf("Data init done.\n");


    printf("Executing GPU kernel...\n");
        CUDA_SAFE_CALL( cudaThreadSynchronize() );
        CUT_SAFE_CALL( cutResetTimer(hTimer) );
        CUT_SAFE_CALL( cutStartTimer(hTimer) );
        scalarProdGPU<<<128, 256>>>(d_C, d_A, d_B, VECTOR_N, ELEMENT_N);
        CUT_CHECK_ERROR("scalarProdGPU() execution failed\n");
        CUDA_SAFE_CALL( cudaThreadSynchronize() );
        CUT_SAFE_CALL( cutStopTimer(hTimer) );
    printf("GPU time: %f msecs.\n", cutGetTimerValue(hTimer));

    printf("Reading back GPU result...\n");
        //Read back GPU results to compare them to CPU results
        CUDA_SAFE_CALL( cudaMemcpy(h_C_GPU, d_C, RESULT_SZ, cudaMemcpyDeviceToHost) );


    printf("Checking GPU results...\n");
        printf("..running CPU scalar product calculation\n");
        scalarProdCPU(h_C_CPU, h_A, h_B, VECTOR_N, ELEMENT_N);

        printf("...comparing the results\n");
        //Calculate max absolute difference and L1 distance
        //between CPU and GPU results
        sum_delta = 0;
        sum_ref   = 0;
        for(i = 0; i < VECTOR_N; i++){
            delta = fabs(h_C_GPU[i] - h_C_CPU[i]);
            ref   = h_C_CPU[i];
            sum_delta += delta;
            sum_ref   += ref;
        }
        L1norm = sum_delta / sum_ref;
    printf("L1 error: %E\n", L1norm);
    printf((L1norm < 1e-6) ? "TEST PASSED\n" : "TEST FAILED\n");


    printf("Shutting down...\n");
        CUDA_SAFE_CALL( cudaFree(d_C) );
        CUDA_SAFE_CALL( cudaFree(d_B)   );
        CUDA_SAFE_CALL( cudaFree(d_A)   );
        free(h_C_GPU);
        free(h_C_CPU);
        free(h_B);
        free(h_A);
        CUT_SAFE_CALL( cutDeleteTimer(hTimer) );

    CUT_EXIT(argc, argv);
}
scalarProd_gold.cpp
Code:

////////////////////////////////////////////////////////////////////////////
// Calculate scalar products of VectorN vectors of ElementN elements on CPU.
// Straight accumulation in double precision.
////////////////////////////////////////////////////////////////////////////
extern "C"
void scalarProdCPU(
    float *h_C,
    float *h_A,
    float *h_B,
    int vectorN,
    int elementN
){
    for(int vec = 0; vec < vectorN; vec++){
        int vectorBase = elementN * vec;
        int vectorEnd  = vectorBase + elementN;

        double sum = 0;
        for(int pos = vectorBase; pos < vectorEnd; pos++)
            sum += h_A[pos] * h_B[pos];

        h_C[vec] = (float)sum;
    }
}
scalarProd_kernel.cu
Code:

///////////////////////////////////////////////////////////////////////////////
// On G80-class hardware 24-bit multiplication takes 4 clocks per warp
// (the same as for floating point  multiplication and addition),
// whereas full 32-bit multiplication takes 16 clocks per warp.
// So if integer multiplication operands are  guaranteed to fit into 24 bits
// (always lie withtin [-8M, 8M - 1] range in signed case),
// explicit 24-bit multiplication is preferred for performance.
///////////////////////////////////////////////////////////////////////////////
#define IMUL(a, b) __mul24(a, b)



///////////////////////////////////////////////////////////////////////////////
// Calculate scalar products of VectorN vectors of ElementN elements on GPU
// Parameters restrictions:
// 1) ElementN is strongly preferred to be a multiple of warp size to 
//    meet alignment constraints of memory coalescing.
// 2) ACCUM_N must be a power of two.
///////////////////////////////////////////////////////////////////////////////
#define ACCUM_N 1024
__global__ void scalarProdGPU(
    float *d_C,
    float *d_A,
    float *d_B,
    int vectorN,
    int elementN
){
    //Accumulators cache
    __shared__ float accumResult[ACCUM_N];

    ////////////////////////////////////////////////////////////////////////////
    // Cycle through every pair of vectors,
    // taking into account that vector counts can be different
    // from total number of thread blocks
    ////////////////////////////////////////////////////////////////////////////
    for(int vec = blockIdx.x; vec < vectorN; vec += gridDim.x){
        int vectorBase = IMUL(elementN, vec);
        int vectorEnd  = vectorBase + elementN;

        ////////////////////////////////////////////////////////////////////////
        // Each accumulator cycles through vectors with
        // stride equal to number of total number of accumulators ACCUM_N
        // At this stage ACCUM_N is only preferred be a multiple of warp size
        // to meet memory coalescing alignment constraints.
        ////////////////////////////////////////////////////////////////////////
        for(int iAccum = threadIdx.x; iAccum < ACCUM_N; iAccum += blockDim.x){
            float sum = 0;

            for(int pos = vectorBase + iAccum; pos < vectorEnd; pos += ACCUM_N)
                sum += d_A[pos] * d_B[pos];

            accumResult[iAccum] = sum;
        }

        ////////////////////////////////////////////////////////////////////////
        // Perform tree-like reduction of accumulators' results.
        // ACCUM_N has to be power of two at this stage
        ////////////////////////////////////////////////////////////////////////
        for(int stride = ACCUM_N / 2; stride > 0; stride >>= 1){
            __syncthreads();
            for(int iAccum = threadIdx.x; iAccum < stride; iAccum += blockDim.x)
                accumResult[iAccum] += accumResult[stride + iAccum];
        }

        if(threadIdx.x == 0) d_C[vec] = accumResult[0];
    }
}
niklas is offline   Reply With Quote

Old   October 12, 2009, 08:15
Default
  #11
New Member
 
Sören
Join Date: Oct 2009
Location: Bremen, Germany
Posts: 15
Rep Power: 16
soeren87 is on a distinguished road
Hi Niklas,
I agree, that it wont be done in a few minutes.

I read some diploma thesis about openFoam and Cuda.
It seems that some universities have already compiled some files, but the Prof. I had asked, has not answered yet.

I have installed the full OpenCL + Cuda SDK, tried the examples and they work really fine !

Now to Cuda. The first you have to do is to tell your machine, that it has to use the GPU. That is almost easy with openCL or Cuda.
Now the main problem I guess: How to use all GPU cores with the most efficiency ?
In a diploma thesis I read that the solver ran many times faster on the GPU but the data sharing between GPU and Host was many times slower with the consequence, that the full progress needs the same time.

To come to this point would be a great success i think, because the next graphic cards (fermi) are special developed to handle more data.

Last edited by soeren87; October 12, 2009 at 09:52.
soeren87 is offline   Reply With Quote

Old   October 12, 2009, 10:22
Default
  #12
Member
 
Andrew Ryan
Join Date: Mar 2009
Posts: 47
Rep Power: 17
andrewryan is on a distinguished road
> I read some diploma thesis about openFoam and Cuda.

Is it available? I would be interested in this.
andrewryan is offline   Reply With Quote

Old   October 12, 2009, 10:31
Default
  #13
New Member
 
Sören
Join Date: Oct 2009
Location: Bremen, Germany
Posts: 15
Rep Power: 16
soeren87 is on a distinguished road
Quote:
Originally Posted by andrewryan View Post
> I read some diploma thesis about openFoam and Cuda.

Is it available? I would be interested in this.

Both available but in german.
One of them got the engineFoam running with CUDA
soeren87 is offline   Reply With Quote

Old   October 12, 2009, 10:53
Default
  #14
Member
 
Andrew Ryan
Join Date: Mar 2009
Posts: 47
Rep Power: 17
andrewryan is on a distinguished road
I speak German.. link?
andrewryan is offline   Reply With Quote

Old   October 12, 2009, 11:02
Default
  #15
New Member
 
Sören
Join Date: Oct 2009
Location: Bremen, Germany
Posts: 15
Rep Power: 16
soeren87 is on a distinguished road
Quote:
Originally Posted by andrewryan View Post
I speak German.. link?
link
soeren87 is offline   Reply With Quote

Old   October 12, 2009, 11:17
Default
  #16
Member
 
Andrew Ryan
Join Date: Mar 2009
Posts: 47
Rep Power: 17
andrewryan is on a distinguished road
There is another one here:

http://itec.uka.de/capp/diploma/da/doeffinger-2009.pdf

I was unaware of these, thx for posting!
andrewryan is offline   Reply With Quote

Old   October 12, 2009, 11:25
Default
  #17
New Member
 
Sören
Join Date: Oct 2009
Location: Bremen, Germany
Posts: 15
Rep Power: 16
soeren87 is on a distinguished road
Quote:
Originally Posted by andrewryan View Post
There is another one here:

http://itec.uka.de/capp/diploma/da/doeffinger-2009.pdf

I was unaware of these, thx for posting!
I read this too but I think it is not as interesting as the first one.

Are you already working with Cuda+OpenFoam ?
soeren87 is offline   Reply With Quote

Old   October 12, 2009, 11:47
Default
  #18
Member
 
Andrew Ryan
Join Date: Mar 2009
Posts: 47
Rep Power: 17
andrewryan is on a distinguished road
> Are you already working with Cuda+OpenFoam ?

No right now I'm trying to understand how some things work in OF, but later maybe it would be interesting to use CUDA or OpenCL.
andrewryan is offline   Reply With Quote

Old   December 17, 2009, 16:50
Default
  #19
Member
 
Lukasz Miroslaw
Join Date: Dec 2009
Location: Poland
Posts: 66
Rep Power: 16
Lukasz is on a distinguished road
Send a message via Skype™ to Lukasz
Quote:
Originally Posted by soeren87 View Post
is anyone of you working on openCL / CUDA Solver for OpenFoam ?
Well, we have been working on a plugin to OpenFOAM that allows easily for replacing existing solvers with their CUDA versions, such as BiCGStab or CG. It should be ready in 1Q of 2010. We want to make the installation as easy as possible (1. run wmake compilation script with our plugin 3. change two OF configuration files 3. run the simulation). I hope this is simple enough but as we are still in the development process any comments will be appreciated.
Lukasz is offline   Reply With Quote

Old   December 18, 2009, 08:21
Default
  #20
Senior Member
 
Vincent RIVOLA
Join Date: Mar 2009
Location: France
Posts: 283
Rep Power: 18
vinz is on a distinguished road
Hi Lukasz,

We've been working with cuda for our own code (not OpenFOAM related) inside my company but for other applications.
Do you think it would be possible to try to help you in your development or testing phase in some way, maybe testing one of the solver on our machines?
Really waiting forward to see your plugin released.

Regards,

Vincent

Last edited by vinz; December 18, 2009 at 09:11.
vinz is offline   Reply With Quote

Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
OpenFoam vs CFX5 mass balance in OpenFoam tangd OpenFOAM Running, Solving & CFD 33 May 23, 2010 17:36
[blockMesh] CheckMesh error using a tutorial from OpenFOAM 114 with openFOAM 13 martapajon OpenFOAM Meshing & Mesh Conversion 7 January 21, 2008 13:52
OpenFOAM users in Munich OpenFOAM benutzer in M%c3%bcnchen jaswi OpenFOAM 0 August 3, 2007 14:11
New Nvidia gpu aimed at gpgpu bmeagle OpenFOAM 0 November 9, 2006 10:41
A new Howto on the OpenFOAM Wiki Compiling OpenFOAM under Unix mbeaudoin OpenFOAM Installation 2 April 28, 2006 09:54


All times are GMT -4. The time now is 14:33.