CFD Online Logo CFD Online URL
www.cfd-online.com
[Sponsors]
Home > Forums > Software User Forums > OpenFOAM > OpenFOAM Programming & Development

Parallel programming in OpenFOAM

Register Blogs Community New Posts Updated Threads Search

Like Tree8Likes
  • 6 Post By dogukanteber
  • 1 Post By dogukanteber

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
Old   August 28, 2022, 14:45
Default Parallel programming in OpenFOAM
  #1
New Member
 
Doğukan Teber
Join Date: Jul 2022
Location: Izmir, Türkiye
Posts: 13
Rep Power: 3
dogukanteber is on a distinguished road
Dear Foamers,

This summer I interned in a program called Summer of HPC. During this internship, I wrote a tool to parallelize decomposePar utility on structured meshes. When I started the internship, I was lost and did not know what to do. I had no idea what OpenFOAM was and barely know parallel programming. I could have used a blog that teaches the basics of writing parallel applications on OpenFOAM.

To help others and make their OpenFOAM journey less hard, I have written a blog post that teaches the basics of developing parallel programs in OpenFOAM. It is not advanced or complicated. It contains the things I wish I know when I started the internship. If there is anything that can be improved, I am open to any criticism.

If you are experiencing difficulty with parallel programming in OpenFOAM, I would like to help you best of my abilities. But keep in mind I am not a physicist or mechanical engineer, I am just a software engineering student

Blog link: https://summerofhpc.prace-ri.eu/para...g-in-openfoam/
Project link: https://github.com/dogukanteber/hydr...llelProcessing

Kind regards,

dogukanteber
rmaries, jherb, dlahaye and 3 others like this.
dogukanteber is offline   Reply With Quote

Old   August 28, 2022, 15:12
Default
  #2
Senior Member
 
Domenico Lahaye
Join Date: Dec 2013
Posts: 722
Blog Entries: 1
Rep Power: 17
dlahaye is on a distinguished road
Well done!

I was unaware of ops.H

Possibly you want to embed in your post more references using MPI for scientific computing (in general) and this tutorial https://github.com/UnnamedMoose/Basi.../OFtutorial5.C is particular.

Congrats!
dlahaye is offline   Reply With Quote

Old   August 28, 2022, 15:25
Default
  #3
New Member
 
Doğukan Teber
Join Date: Jul 2022
Location: Izmir, Türkiye
Posts: 13
Rep Power: 3
dogukanteber is on a distinguished road
Thank you dlahaye. I will do that.
dlahaye likes this.
dogukanteber is offline   Reply With Quote

Old   August 28, 2022, 16:38
Default
  #4
Senior Member
 
Domenico Lahaye
Join Date: Dec 2013
Posts: 722
Blog Entries: 1
Rep Power: 17
dlahaye is on a distinguished road
Are you in any way able to measure the wall clock time for global and point-to-point communications?

Are you subsequently able to express this wall clock time on a given processor with the wall clock time for a floating point operation on that processor?

The idea here is to compare time required for intra-processor communication with on-processor computation.

Take e.g. a large vector (volScalarField in OpenFoam) shared by a number of processors. Imagine scaling this vector, adding two such vectors or computing the norm of such a vector. Suppose that this computation takes N units of wall clock time. How is N subdivided in time for computing and time for communication?

This question is important to answer in the debate on HPC for partial differential equations.
dlahaye is offline   Reply With Quote

Old   August 29, 2022, 07:14
Default
  #5
New Member
 
Doğukan Teber
Join Date: Jul 2022
Location: Izmir, Türkiye
Posts: 13
Rep Power: 3
dogukanteber is on a distinguished road
I actually did not need to measure it. However, according to you question and my understanding I would come up with this kind of solution:


Code:
int main(int argc, char* argv[]) {

    fileHandler().distributed(true);

    #include "setRootCase.H"

    #include "createTime.H"
    runTime.functionObjects().off();

    label len = 10000;

    labelList data(len, 9);
    labelList recvData(len, 0);

    Pout << "clock time1 : " << runTime.elapsedClockTime() << endl;
    Pout << "cpu time1 : " << runTime.elapsedCpuTime() << endl;

    if (Pstream::myProcNo() < Pstream::nProcs() / 2)
    {
        label destination = Pstream::nProcs() - (Pstream::myProcNo() + 1);
        OPstream send(Pstream::commsTypes::blocking, destination);

        send << data;
    }
    else
    {
        label target = Pstream::nProcs() - (Pstream::myProcNo() + 1);
        IPstream recv(Pstream::commsTypes::blocking, target);
        recv >> recvData;

        Pout << "clock time2 : " << runTime.elapsedClockTime() << endl;
        Pout << "cpu time2 : " << runTime.elapsedCpuTime() << endl;

        label sum = 0;
        for (label i=0; i<len; ++i)
        {
            for (label j=0; j<len; ++j)
            {
                sum += recvData[j];
            }
        }

        Pout << "clock time3 : " << runTime.elapsedClockTime() << endl;
        Pout << "cpu time3 : " << runTime.elapsedCpuTime() << endl;
    }

    Pstream::gather(data, sumOp<labelList>());
    Pstream::scatter(data);

    Pout << "clock time4 : " << runTime.elapsedClockTime() << endl;
    Pout << "cpu time4 : " << runTime.elapsedCpuTime() << endl;

 }
That may be too basic but if you measure the clock time and cpu time every time before computation and communication, I would you calculate how N is divided for computing and communication.


What do you think?
dogukanteber is offline   Reply With Quote

Old   October 7, 2022, 05:50
Default
  #6
Senior Member
 
Dongyue Li
Join Date: Jun 2012
Location: Beijing, China
Posts: 838
Rep Power: 17
sharonyue is on a distinguished road
dogukanteber,

I tried your decompose utility, it is much faster. But it does not support scotch. Is it possible to support scotch partition?

Thanks.
__________________
My OpenFOAM algorithm website: http://dyfluid.com
By far the largest Chinese CFD-based forum: http://www.cfd-china.com/category/6/openfoam
We provide lots of clusters to Chinese customers, and we are considering to do business overseas: http://dyfluid.com/DMCmodel.html
sharonyue is offline   Reply With Quote

Old   October 9, 2022, 15:58
Default A request:
  #7
Senior Member
 
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22
klausb will become famous soon enough
Regarding your offer, I would be interested in a step-by-step explanation of how the DIC preconditioner is calculated and how it`s applied, followed by a tutorial of how to implement the "Incomplete Poisson Preconditioner (IP)" (i.e. how to compute the required matrix operations and how to apply the preconditioner). The IP preconditioner (sort of an approximate inverse) is defined as: M exp-1 = (I - L*D exp-1)(I - D exp-1*L expT) where I is the identity matrix, D is the Diagonal and L is the strictly lower triangle. T means transpose, exp is used to express exponent.
klausb is offline   Reply With Quote

Old   October 14, 2022, 06:41
Default
  #8
New Member
 
Jackson
Join Date: Oct 2022
Posts: 2
Rep Power: 0
mikesie is on a distinguished road
A successful implementation team should include the people responsible for planning and the post-deployment stages. These individuals should be experts in the field they are responsible for. In other words, they should be knowledgeable about every aspect of the implementation process. In addition, they should have the appropriate qualifications and experience to handle the job https://mlsdev.com

Last edited by mikesie; October 16, 2022 at 06:17.
mikesie is offline   Reply With Quote

Old   October 14, 2022, 08:39
Default
  #9
New Member
 
Doğukan Teber
Join Date: Jul 2022
Location: Izmir, Türkiye
Posts: 13
Rep Power: 3
dogukanteber is on a distinguished road
Hello, sharonyue,

Thank you for taking the time and try out the tool. I developed this within the scope of my summer internship two months ago. Since the time was limited, the aim was to support only simple decomposition. Unfortunately, it is not possible to support scotch partition, at least for now.

All the best,

dogukanteber
dogukanteber is offline   Reply With Quote

Old   October 14, 2022, 08:40
Default
  #10
New Member
 
Doğukan Teber
Join Date: Jul 2022
Location: Izmir, Türkiye
Posts: 13
Rep Power: 3
dogukanteber is on a distinguished road
Quote:
Originally Posted by sharonyue View Post
dogukanteber,

I tried your decompose utility, it is much faster. But it does not support scotch. Is it possible to support scotch partition?

Thanks.
Hello, sharonyue,

Thank you for taking the time and try out the tool. I developed this within the scope of my summer internship two months ago. Since the time was limited, the aim was to support only simple decomposition. Unfortunately, it is not possible to support scotch partition, at least for now.

All the best,

dogukanteber
dogukanteber is offline   Reply With Quote

Old   October 19, 2022, 08:34
Default
  #11
Senior Member
 
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22
klausb will become famous soon enough
Is it possible to decompose the matrix horizontally in a way, that n rows are allocated to the individial processes i.e. can a 240x240 matrix be allocated to 3 processes in such way, that rows 0-79 and I mean L, D and U values as well as the vectors psi and source are allocated to process 0, rows 80-159 are allocated to process 1 and rows 160-239 are allocated to process 3?


Such an allocation is used by many potential external linear systems solvers.
klausb is offline   Reply With Quote

Old   October 19, 2022, 10:20
Default
  #12
Senior Member
 
Domenico Lahaye
Join Date: Dec 2013
Posts: 722
Blog Entries: 1
Rep Power: 17
dlahaye is on a distinguished road
Not sure.

How about making the mesh in essence 1D in x-direction (i.e., 1 cell in both y and z direction) and specifying the decomposition to be manual as (3 1 1)?

With more details on what you are trying to accomplish, I can try to supply more details.

Cheers, Domenico.
dlahaye is offline   Reply With Quote

Old   October 19, 2022, 12:40
Default
  #13
Senior Member
 
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22
klausb will become famous soon enough
It`s very simple, I need the data of the linear system allocated to the processes by sets of rows:

The matrix data allocation to processes should be like in this example:

Process 0
row 0: 1 2 2 0 0 0
row 1: 0 1 3 5 3 0

-----------------------------------
Process 1
row 2: 1 2 1 0 0 0
row 3: 0 1 3 1 3 0

-----------------------------------
Process 2
row 4: 0 0 2 0 1 0
row 5: 0 0 0 0 0 1

The first n rows go to process 0, the following n rows go to process 1 ....

I don´t want to "do" anything with it, it`s simply the defined input data distribution required by many external solvers. If it`s not provided, data gets re-allocated i.e. it`s moved between MPI processes during the setup phase which can be time consuming so allocating the data by rows to processes in the first place makes the initial setup more efficient as the setup would be limited to local operations.
klausb is offline   Reply With Quote

Old   October 20, 2022, 07:22
Default
  #14
Senior Member
 
Domenico Lahaye
Join Date: Dec 2013
Posts: 722
Blog Entries: 1
Rep Power: 17
dlahaye is on a distinguished road
Aha.

For proc0:
The interface between proc0 and proc1 is stored in LDU format with L and D left empty. Only U remains. Idem for interface between proc0 and proc2.

proc0 thus stores:
its local block L_0, D_0 and U_0 and its interfaces U_{01} and U_{02}.

thus for proc0
row0: concat( row0(L_0), row0(D_0), row0(U_0), row0(U_{01}), row0(U_{02}))
row1: concat( row1(L_0), row1(D_0), row1(U_0), row1(U_{01}), row1(U_{02}))

For proc1:
The interface between proc1 and proc0 is stored in LDU format with D and U left empty. Only L remains. The interface between proc1 and proc2 is stored in LDU format with L and D left empty. Only U remains.

proc1 thus stores
its local block L_1, D_1 and U_1 and its interfaces L_{10} and U_{12}.

thus for proc1
row2: concat( row2(L_{10}), row2(L_1), row2(D_1), row2(U_1), row2(U_{12}))
row3: concat( row3(L_{10}), row3(L_1), row3(D_1), row3(U_1), row3(U_{12}))


For proc2:
opposite proc2.

proc1 thus stores
its local block L_2, D_2 and U_2 and its interfaces L_{20} and L_{21}.

thus for proc2
row4: concat( row4(L_{20}), row4(L_{12}), row4(L_2), row4(D_2), row4(U_2))
row5: concat( row5(L_{20}), row5(L_{12}), row5(L_2), row5(D_2), row5(U_2))

I need to elaborate details.

A consensus on above would be wonderful to have.

See also buildMat() in L476 of https://develop.openfoam.com/modules.../petscSolver.C

Talk soon, D.

Last edited by dlahaye; October 20, 2022 at 07:55. Reason: Added information.
dlahaye is offline   Reply With Quote

Old   October 20, 2022, 10:33
Default
  #15
Senior Member
 
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22
klausb will become famous soon enough
No, that`s not what I mean, it`s much easier.


OpenFOAM normally allocates cells to a process/processor.


The data is stored locally in LDU format.


What I want is that I have the data of the GLOBAL matrix rows 0..n on processor 0, then the data of the GLOBAL matrix rows n+1..m on processor 1 etc..
klausb is offline   Reply With Quote

Old   October 20, 2022, 11:36
Default
  #16
Senior Member
 
Domenico Lahaye
Join Date: Dec 2013
Posts: 722
Blog Entries: 1
Rep Power: 17
dlahaye is on a distinguished road
I agree that OpenFOAM assign groups of cells to processors.

I, however, fail to understand your interpretation of the LDU matrix format in case that the matrix is decomposed according to groups of cells.

Can you pls. elaborate? Thx. D.
dlahaye is offline   Reply With Quote

Old   October 22, 2022, 05:27
Default
  #17
Senior Member
 
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22
klausb will become famous soon enough
I have been looking for a simple way to allocate local matrix data to processes as it would be required by some external solvers. LDU format wouldn`t matter as it would have to be converted anyway. May question was about getting the raw data to the right process (for external solvers) to avoid data transfers between processes. I did more reading on it and it turns out that I would have to start making changes at mesh level which would be too much effort. I`ll focus on external libraries which are compatible.
klausb is offline   Reply With Quote

Old   October 23, 2022, 14:02
Default
  #18
Senior Member
 
Domenico Lahaye
Join Date: Dec 2013
Posts: 722
Blog Entries: 1
Rep Power: 17
dlahaye is on a distinguished road
I agree that in OpenFOAM the linear algebra zero-halo approach (i.e. the hierarchy of lduMatrix and fvMatrix) and the mesh interfaces (i.e. the hierarchy of primitiveMesh, lduMesh and fvMesh) are closely intertwined.

This is my understanding leaves us two approaches.

(1/2) Not care care about the OpenFoam implementation of all and compare the same test case with other solver (fully PETSc based or otherwise);

(2/2) Go through the entire parallel LDU implementation in OpenFoam and make sense of it;

I am open to discuss both in more details.
dlahaye is offline   Reply With Quote

Old   October 26, 2022, 10:02
Default
  #19
Senior Member
 
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22
klausb will become famous soon enough
I have done a lot of what would be (1/2) of your suggestions in recent years, always with an eye on GPU computing - to make progress in that direction, I need a better practical understanding of the preconditioner implementations for "p", especially a complete understanding of the DIC implementation would help as a foundation for the implementation of preconditioners which are suitable for GPU computing (DIC is NOT suitable).

Why such a focus on preconditioners - because it`s the preconditioner that makes the solver fast.

Next would be topics like the implementation of diagonal scaling and later deflation as first stage preconditioners for a two stage preconditioning approach which is often used on GPUs.

In short, I would prefer (2/2) of your suggestions with a focus on particular aspects rather than in a broad, general way.

What are your priorities?
klausb is offline   Reply With Quote

Old   November 2, 2022, 03:53
Default
  #20
Senior Member
 
Domenico Lahaye
Join Date: Dec 2013
Posts: 722
Blog Entries: 1
Rep Power: 17
dlahaye is on a distinguished road
Apologies for the delay in the reply.

Focussing on the p-solve makes obvious sense. I am not sure how much sparse linear algebra can be accelerated on GPU.

I tend to focus on algorithmic aspects, understanding what is currently there and how to implement it.

I am currently looking GAMG as solver and as preconditioner in laplacianFoam.

Has your view on how to access the rows of the matrix changed in the meantime?

Cheers. Domenico.
dlahaye is offline   Reply With Quote

Reply


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
Frequently Asked Questions about Installing OpenFOAM wyldckat OpenFOAM Installation 3 November 14, 2023 11:58
Getting Started with OpenFOAM wyldckat OpenFOAM 25 August 14, 2022 13:55
Map of the OpenFOAM Forum - Understanding where to post your questions! wyldckat OpenFOAM 10 September 2, 2021 05:29
OpenFOAM can't be run in parallel in cluster sibo OpenFOAM Running, Solving & CFD 4 February 21, 2017 16:29
How can i make a parallel programming in OpenFOAM? jignesh_thaker2007 OpenFOAM Running, Solving & CFD 3 July 2, 2014 09:37


All times are GMT -4. The time now is 18:23.