|
August 28, 2022, 14:45 |
Parallel programming in OpenFOAM
|
#1 |
New Member
Doğukan Teber
Join Date: Jul 2022
Location: Izmir, Türkiye
Posts: 13
Rep Power: 3 |
Dear Foamers,
This summer I interned in a program called Summer of HPC. During this internship, I wrote a tool to parallelize the decomposePar utility on structured meshes. When I started the internship, I was lost and did not know what to do: I had no idea what OpenFOAM was and barely knew parallel programming. I could have used a blog that teaches the basics of writing parallel applications in OpenFOAM. To help others and make their OpenFOAM journey easier, I have written a blog post that teaches the basics of developing parallel programs in OpenFOAM. It is not advanced or complicated; it contains the things I wish I had known when I started the internship. If there is anything that can be improved, I am open to any criticism. If you are experiencing difficulty with parallel programming in OpenFOAM, I would like to help you to the best of my ability. But keep in mind that I am not a physicist or a mechanical engineer, just a software engineering student.
Blog link: https://summerofhpc.prace-ri.eu/para...g-in-openfoam/
Project link: https://github.com/dogukanteber/hydr...llelProcessing
Kind regards, dogukanteber |
|
August 28, 2022, 15:12 |
|
#2 |
Senior Member
|
Well done!
I was unaware of ops.H. You might want to embed in your post more references on using MPI for scientific computing (in general) and this tutorial https://github.com/UnnamedMoose/Basi.../OFtutorial5.C in particular. Congrats! |
|
August 28, 2022, 15:25 |
|
#3 |
New Member
Doğukan Teber
Join Date: Jul 2022
Location: Izmir, Türkiye
Posts: 13
Rep Power: 3 |
Thank you dlahaye. I will do that.
|
|
August 28, 2022, 16:38 |
|
#4 |
Senior Member
|
Are you in any way able to measure the wall clock time for global and point-to-point communications?
Are you subsequently able to express this wall clock time on a given processor in terms of the wall clock time for a floating point operation on that processor? The idea here is to compare the time required for inter-processor communication with on-processor computation. Take e.g. a large vector (a volScalarField in OpenFOAM) shared by a number of processors. Imagine scaling this vector, adding two such vectors, or computing the norm of such a vector. Suppose that this computation takes N units of wall clock time. How is N subdivided into time for computing and time for communication? This question is important to answer in the debate on HPC for partial differential equations. |
|
August 29, 2022, 07:14 |
|
#5 |
New Member
Doğukan Teber
Join Date: Jul 2022
Location: Izmir, Türkiye
Posts: 13
Rep Power: 3 |
I actually did not need to measure it. However, based on your question and my understanding, I would come up with this kind of solution:
Code:
int main(int argc, char* argv[])
{
    fileHandler().distributed(true);

    #include "setRootCase.H"
    #include "createTime.H"

    runTime.functionObjects().off();

    label len = 10000;
    labelList data(len, 9);
    labelList recvData(len, 0);

    Pout << "clock time1 : " << runTime.elapsedClockTime() << endl;
    Pout << "cpu time1 : " << runTime.elapsedCpuTime() << endl;

    if (Pstream::myProcNo() < Pstream::nProcs() / 2)
    {
        // lower half of the ranks sends to its mirror rank
        label destination = Pstream::nProcs() - (Pstream::myProcNo() + 1);
        OPstream send(Pstream::commsTypes::blocking, destination);
        send << data;
    }
    else
    {
        // upper half receives from its mirror rank
        label target = Pstream::nProcs() - (Pstream::myProcNo() + 1);
        IPstream recv(Pstream::commsTypes::blocking, target);
        recv >> recvData;

        Pout << "clock time2 : " << runTime.elapsedClockTime() << endl;
        Pout << "cpu time2 : " << runTime.elapsedCpuTime() << endl;

        // artificial O(len^2) workload to time on-processor computation
        label sum = 0;
        for (label i = 0; i < len; ++i)
        {
            for (label j = 0; j < len; ++j)
            {
                sum += recvData[j];
            }
        }

        Pout << "clock time3 : " << runTime.elapsedClockTime() << endl;
        Pout << "cpu time3 : " << runTime.elapsedCpuTime() << endl;
    }

    Pstream::gather(data, sumOp<labelList>());
    Pstream::scatter(data);

    Pout << "clock time4 : " << runTime.elapsedClockTime() << endl;
    Pout << "cpu time4 : " << runTime.elapsedCpuTime() << endl;
}
What do you think? |
|
October 7, 2022, 05:50 |
|
#6 |
Senior Member
Dongyue Li
Join Date: Jun 2012
Location: Beijing, China
Posts: 838
Rep Power: 17 |
dogukanteber,
I tried your decompose utility; it is much faster. However, it does not support scotch. Is it possible to support scotch partitioning? Thanks.
__________________
My OpenFOAM algorithm website: http://dyfluid.com
By far the largest Chinese CFD-based forum: http://www.cfd-china.com/category/6/openfoam
We provide lots of clusters to Chinese customers, and we are considering doing business overseas: http://dyfluid.com/DMCmodel.html |
|
October 9, 2022, 15:58 |
A request:
|
#7 |
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22 |
Regarding your offer, I would be interested in a step-by-step explanation of how the DIC preconditioner is calculated and how it's applied, followed by a tutorial on how to implement the "Incomplete Poisson Preconditioner (IP)" (i.e. how to compute the required matrix operations and how to apply the preconditioner). The IP preconditioner (sort of an approximate inverse) is defined as: M^-1 = (I - L*D^-1)(I - D^-1*L^T), where I is the identity matrix, D is the diagonal, L is the strictly lower triangle, and ^T denotes the transpose.
|
|
|
October 14, 2022, 08:39 |
|
#9 |
New Member
Doğukan Teber
Join Date: Jul 2022
Location: Izmir, Türkiye
Posts: 13
Rep Power: 3 |
Hello, sharonyue,
Thank you for taking the time to try out the tool. I developed this within the scope of my summer internship two months ago. Since time was limited, the aim was to support only simple decomposition. Unfortunately, it is not possible to support scotch partitioning, at least for now. All the best, dogukanteber |
|
October 19, 2022, 08:34 |
|
#11 |
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22 |
Is it possible to decompose the matrix horizontally such that n rows are allocated to the individual processes? I.e., can a 240x240 matrix be allocated to 3 processes in such a way that rows 0-79 (and I mean the L, D and U values as well as the vectors psi and source) are allocated to process 0, rows 80-159 are allocated to process 1, and rows 160-239 are allocated to process 2?
Such an allocation is used by many potential external linear systems solvers. |
|
October 19, 2022, 10:20 |
|
#12 |
Senior Member
|
Not sure.
How about making the mesh in essence 1D in the x-direction (i.e., 1 cell in both the y and z directions) and specifying the decomposition to be manual as (3 1 1)? If you give more details on what you are trying to accomplish, I can try to help further. Cheers, Domenico. |
|
October 19, 2022, 12:40 |
|
#13 |
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22 |
It's very simple: I need the data of the linear system allocated to the processes by sets of rows.
The matrix data allocation to processes should be like in this example:

Process 0
row 0: 1 2 2 0 0 0
row 1: 0 1 3 5 3 0
-----------------------------------
Process 1
row 2: 1 2 1 0 0 0
row 3: 0 1 3 1 3 0
-----------------------------------
Process 2
row 4: 0 0 2 0 1 0
row 5: 0 0 0 0 0 1

The first n rows go to process 0, the following n rows go to process 1, and so on. I don't want to "do" anything with it; it's simply the defined input data distribution required by many external solvers. If it's not provided, data gets re-allocated, i.e. moved between MPI processes during the setup phase, which can be time consuming. Allocating the data by rows to processes in the first place makes the initial setup more efficient, as the setup would then be limited to local operations. |
|
October 20, 2022, 07:22 |
|
#14 |
Senior Member
|
Aha.
For proc0: the interface between proc0 and proc1 is stored in LDU format with L and D left empty; only U remains. Idem for the interface between proc0 and proc2. proc0 thus stores its local blocks L_0, D_0 and U_0 and its interfaces U_{01} and U_{02}. Thus for proc0:
row0: concat( row0(L_0), row0(D_0), row0(U_0), row0(U_{01}), row0(U_{02}) )
row1: concat( row1(L_0), row1(D_0), row1(U_0), row1(U_{01}), row1(U_{02}) )

For proc1: the interface between proc1 and proc0 is stored in LDU format with D and U left empty; only L remains. The interface between proc1 and proc2 is stored in LDU format with L and D left empty; only U remains. proc1 thus stores its local blocks L_1, D_1 and U_1 and its interfaces L_{10} and U_{12}. Thus for proc1:
row2: concat( row2(L_{10}), row2(L_1), row2(D_1), row2(U_1), row2(U_{12}) )
row3: concat( row3(L_{10}), row3(L_1), row3(D_1), row3(U_1), row3(U_{12}) )

For proc2: the opposite. proc2 thus stores its local blocks L_2, D_2 and U_2 and its interfaces L_{20} and L_{21}. Thus for proc2:
row4: concat( row4(L_{20}), row4(L_{21}), row4(L_2), row4(D_2), row4(U_2) )
row5: concat( row5(L_{20}), row5(L_{21}), row5(L_2), row5(D_2), row5(U_2) )

I need to elaborate details. A consensus on the above would be wonderful to have. See also buildMat() in L476 of https://develop.openfoam.com/modules.../petscSolver.C

Talk soon, D.

Last edited by dlahaye; October 20, 2022 at 07:55. Reason: Added information. |
|
October 20, 2022, 10:33 |
|
#15 |
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22 |
No, that's not what I mean; it's much easier.
OpenFOAM normally allocates cells to a process/processor, and the data is stored locally in LDU format. What I want is to have the data of the GLOBAL matrix rows 0..n on processor 0, then the data of the GLOBAL matrix rows n+1..m on processor 1, etc. |
|
October 20, 2022, 11:36 |
|
#16 |
Senior Member
|
I agree that OpenFOAM assigns groups of cells to processors.
I, however, fail to understand your interpretation of the LDU matrix format in the case that the matrix is decomposed according to groups of cells. Can you please elaborate? Thx. D. |
|
October 22, 2022, 05:27 |
|
#17 |
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22 |
I have been looking for a simple way to allocate local matrix data to processes as required by some external solvers. The LDU format wouldn't matter, as it would have to be converted anyway. My question was about getting the raw data to the right process (for external solvers) to avoid data transfers between processes. I did more reading on it, and it turns out that I would have to start making changes at the mesh level, which would be too much effort. I'll focus on external libraries which are compatible.
|
|
October 23, 2022, 14:02 |
|
#18 |
Senior Member
|
I agree that in OpenFOAM the linear algebra zero-halo approach (i.e. the hierarchy of lduMatrix and fvMatrix) and the mesh interfaces (i.e. the hierarchy of primitiveMesh, lduMesh and fvMesh) are closely intertwined.
My understanding is that this leaves us two approaches: (1/2) not care about the OpenFOAM implementation at all and compare the same test case with another solver (fully PETSc based or otherwise); (2/2) go through the entire parallel LDU implementation in OpenFOAM and make sense of it. I am open to discussing both in more detail. |
|
October 26, 2022, 10:02 |
|
#19 |
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22 |
I have done a lot of what would be (1/2) of your suggestions in recent years, always with an eye on GPU computing. To make progress in that direction, I need a better practical understanding of the preconditioner implementations for "p"; especially, a complete understanding of the DIC implementation would help as a foundation for implementing preconditioners that are suitable for GPU computing (DIC is NOT suitable).
Why such a focus on preconditioners? Because it's the preconditioner that makes the solver fast. Next would be topics like the implementation of diagonal scaling and, later, deflation as first-stage preconditioners for a two-stage preconditioning approach, which is often used on GPUs. In short, I would prefer (2/2) of your suggestions, with a focus on particular aspects rather than in a broad, general way. What are your priorities? |
|
November 2, 2022, 03:53 |
|
#20 |
Senior Member
|
Apologies for the delay in the reply.
Focusing on the p-solve makes obvious sense. I am not sure how much sparse linear algebra can be accelerated on GPUs. I tend to focus on algorithmic aspects: understanding what is currently there and how to implement it. I am currently looking at GAMG as a solver and as a preconditioner in laplacianFoam. Has your view on how to access the rows of the matrix changed in the meantime? Cheers. Domenico. |
|
|
|