|
August 28, 2022, 14:45 |
Parallel programming in OpenFOAM
|
#1 |
New Member
Doğukan Teber
Join Date: Jul 2022
Location: Izmir, Türkiye
Posts: 13
Rep Power: 3 |
Dear Foamers,
This summer I interned in a program called Summer of HPC. During this internship, I wrote a tool to parallelize the decomposePar utility on structured meshes. When I started the internship, I was lost and did not know what to do: I had no idea what OpenFOAM was and barely knew parallel programming. I could have used a blog that teaches the basics of writing parallel applications in OpenFOAM. To help others and make their OpenFOAM journey easier, I have written a blog post that teaches the basics of developing parallel programs in OpenFOAM. It is not advanced or complicated; it contains the things I wish I had known when I started the internship. If there is anything that can be improved, I am open to any criticism. If you are experiencing difficulty with parallel programming in OpenFOAM, I would like to help you to the best of my ability. But keep in mind that I am not a physicist or a mechanical engineer, just a software engineering student.
Blog link: https://summerofhpc.prace-ri.eu/para...g-in-openfoam/
Project link: https://github.com/dogukanteber/hydr...llelProcessing
Kind regards, dogukanteber |
|
August 28, 2022, 15:12 |
|
#2 |
Senior Member
|
Well done!
I was unaware of ops.H. You might want to embed in your post more references on using MPI for scientific computing (in general) and this tutorial https://github.com/UnnamedMoose/Basi.../OFtutorial5.C in particular. Congrats! |
|
August 28, 2022, 15:25 |
|
#3 |
New Member
Doğukan Teber
Join Date: Jul 2022
Location: Izmir, Türkiye
Posts: 13
Rep Power: 3 |
Thank you dlahaye. I will do that.
|
|
August 28, 2022, 16:38 |
|
#4 |
Senior Member
|
Are you in any way able to measure the wall clock time for global and point-to-point communications?
Are you subsequently able to express this wall clock time on a given processor in terms of the wall clock time for a floating point operation on that processor? The idea here is to compare the time required for inter-processor communication with on-processor computation. Take e.g. a large vector (a volScalarField in OpenFOAM) shared by a number of processors. Imagine scaling this vector, adding two such vectors, or computing the norm of such a vector. Suppose that this computation takes N units of wall clock time. How is N subdivided into time for computing and time for communication? This question is important to answer in the debate on HPC for partial differential equations. |
|
August 29, 2022, 07:14 |
|
#5 |
New Member
Doğukan Teber
Join Date: Jul 2022
Location: Izmir, Türkiye
Posts: 13
Rep Power: 3 |
I actually did not need to measure it. However, based on your question and my understanding, I would come up with this kind of solution:
Code:
int main(int argc, char* argv[])
{
    fileHandler().distributed(true);

    #include "setRootCase.H"
    #include "createTime.H"

    runTime.functionObjects().off();

    label len = 10000;
    labelList data(len, 9);
    labelList recvData(len, 0);

    Pout << "clock time1 : " << runTime.elapsedClockTime() << endl;
    Pout << "cpu time1 : " << runTime.elapsedCpuTime() << endl;

    if (Pstream::myProcNo() < Pstream::nProcs() / 2)
    {
        // lower half of the ranks sends to its mirror rank
        label destination = Pstream::nProcs() - (Pstream::myProcNo() + 1);
        OPstream send(Pstream::commsTypes::blocking, destination);
        send << data;
    }
    else
    {
        // upper half receives from its mirror rank
        label target = Pstream::nProcs() - (Pstream::myProcNo() + 1);
        IPstream recv(Pstream::commsTypes::blocking, target);
        recv >> recvData;

        Pout << "clock time2 : " << runTime.elapsedClockTime() << endl;
        Pout << "cpu time2 : " << runTime.elapsedCpuTime() << endl;

        // artificial O(len^2) workload to time on-processor computation
        label sum = 0;
        for (label i = 0; i < len; ++i)
        {
            for (label j = 0; j < len; ++j)
            {
                sum += recvData[j];
            }
        }

        Pout << "clock time3 : " << runTime.elapsedClockTime() << endl;
        Pout << "cpu time3 : " << runTime.elapsedCpuTime() << endl;
    }

    Pstream::gather(data, sumOp<labelList>());
    Pstream::scatter(data);

    Pout << "clock time4 : " << runTime.elapsedClockTime() << endl;
    Pout << "cpu time4 : " << runTime.elapsedCpuTime() << endl;
}
What do you think? |
|
October 7, 2022, 05:50 |
|
#6 |
Senior Member
Dongyue Li
Join Date: Jun 2012
Location: Beijing, China
Posts: 838
Rep Power: 17 |
dogukanteber,
I tried your decompose utility; it is much faster. However, it does not support scotch. Is it possible to support scotch partitioning? Thanks.
__________________
My OpenFOAM algorithm website: http://dyfluid.com
By far the largest Chinese CFD-based forum: http://www.cfd-china.com/category/6/openfoam
We provide lots of clusters to Chinese customers, and we are considering doing business overseas: http://dyfluid.com/DMCmodel.html |
|
October 9, 2022, 15:58 |
A request:
|
#7 |
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22 |
Regarding your offer, I would be interested in a step-by-step explanation of how the DIC preconditioner is calculated and how it's applied, followed by a tutorial on how to implement the "Incomplete Poisson Preconditioner (IP)" (i.e. how to compute the required matrix operations and how to apply the preconditioner). The IP preconditioner (sort of an approximate inverse) is defined as: M^-1 = (I - L*D^-1)(I - D^-1*L^T), where I is the identity matrix, D is the diagonal, L is the strictly lower triangle, and ^T denotes the transpose.
|
|
|
October 14, 2022, 08:39 |
|
#9 |
New Member
Doğukan Teber
Join Date: Jul 2022
Location: Izmir, Türkiye
Posts: 13
Rep Power: 3 |
Hello, sharonyue,
Thank you for taking the time to try out the tool. I developed this within the scope of my summer internship two months ago. Since time was limited, the aim was to support only simple decomposition. Unfortunately, it is not possible to support scotch partitioning, at least for now. All the best, dogukanteber |
|
October 19, 2022, 08:34 |
|
#11 |
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22 |
Is it possible to decompose the matrix horizontally such that n rows are allocated to the individual processes? I.e., can a 240x240 matrix be allocated to 3 processes in such a way that rows 0-79 (and I mean the L, D and U values as well as the vectors psi and source) are allocated to process 0, rows 80-159 are allocated to process 1, and rows 160-239 are allocated to process 2?
Such an allocation is used by many potential external linear systems solvers. |
|
October 19, 2022, 10:20 |
|
#12 |
Senior Member
|
Not sure.
How about making the mesh in essence 1D in the x-direction (i.e., 1 cell in both the y and z directions) and specifying the decomposition to be manual as (3 1 1)? If you give more details on what you are trying to accomplish, I can try to help further. Cheers, Domenico. |
|
October 19, 2022, 12:40 |
|
#13 |
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22 |
It's very simple: I need the data of the linear system allocated to the processes by sets of rows.
The matrix data allocation to processes should be like in this example:

Process 0
row 0: 1 2 2 0 0 0
row 1: 0 1 3 5 3 0
-----------------------------------
Process 1
row 2: 1 2 1 0 0 0
row 3: 0 1 3 1 3 0
-----------------------------------
Process 2
row 4: 0 0 2 0 1 0
row 5: 0 0 0 0 0 1

The first n rows go to process 0, the following n rows go to process 1, and so on. I don't want to "do" anything with it; it's simply the defined input data distribution required by many external solvers. If it's not provided, data gets re-allocated, i.e. moved between MPI processes during the setup phase, which can be time consuming. Allocating the data by rows to processes in the first place makes the initial setup more efficient, as the setup would then be limited to local operations. |
|
October 20, 2022, 07:22 |
|
#14 |
Senior Member
|
Aha.
For proc0: the interface between proc0 and proc1 is stored in LDU format with L and D left empty; only U remains. Idem for the interface between proc0 and proc2. proc0 thus stores its local blocks L_0, D_0 and U_0 and its interfaces U_{01} and U_{02}. Thus for proc0:
row0: concat( row0(L_0), row0(D_0), row0(U_0), row0(U_{01}), row0(U_{02}) )
row1: concat( row1(L_0), row1(D_0), row1(U_0), row1(U_{01}), row1(U_{02}) )

For proc1: the interface between proc1 and proc0 is stored in LDU format with D and U left empty; only L remains. The interface between proc1 and proc2 is stored in LDU format with L and D left empty; only U remains. proc1 thus stores its local blocks L_1, D_1 and U_1 and its interfaces L_{10} and U_{12}. Thus for proc1:
row2: concat( row2(L_{10}), row2(L_1), row2(D_1), row2(U_1), row2(U_{12}) )
row3: concat( row3(L_{10}), row3(L_1), row3(D_1), row3(U_1), row3(U_{12}) )

For proc2: the opposite. proc2 thus stores its local blocks L_2, D_2 and U_2 and its interfaces L_{20} and L_{21}. Thus for proc2:
row4: concat( row4(L_{20}), row4(L_{21}), row4(L_2), row4(D_2), row4(U_2) )
row5: concat( row5(L_{20}), row5(L_{21}), row5(L_2), row5(D_2), row5(U_2) )

I need to elaborate details. A consensus on the above would be wonderful to have. See also buildMat() in L476 of https://develop.openfoam.com/modules.../petscSolver.C

Talk soon, D.

Last edited by dlahaye; October 20, 2022 at 07:55. Reason: Added information. |
|
October 20, 2022, 10:33 |
|
#15 |
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22 |
No, that's not what I mean; it's much easier.
OpenFOAM normally allocates cells to a process/processor, and the data is stored locally in LDU format. What I want is to have the data of the GLOBAL matrix rows 0..n on processor 0, then the data of the GLOBAL matrix rows n+1..m on processor 1, etc. |
|
October 20, 2022, 11:36 |
|
#16 |
Senior Member
|
I agree that OpenFOAM assigns groups of cells to processors.
I, however, fail to understand your interpretation of the LDU matrix format in the case that the matrix is decomposed according to groups of cells. Can you please elaborate? Thx. D. |
|
October 22, 2022, 05:27 |
|
#17 |
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22 |
I have been looking for a simple way to allocate local matrix data to processes as required by some external solvers. The LDU format wouldn't matter, as it would have to be converted anyway. My question was about getting the raw data to the right process (for external solvers) to avoid data transfers between processes. I did more reading on it, and it turns out that I would have to start making changes at the mesh level, which would be too much effort. I'll focus on external libraries which are compatible.
|
|
October 23, 2022, 14:02 |
|
#18 |
Senior Member
|
I agree that in OpenFOAM the linear algebra zero-halo approach (i.e. the hierarchy of lduMatrix and fvMatrix) and the mesh interfaces (i.e. the hierarchy of primitiveMesh, lduMesh and fvMesh) are closely intertwined.
My understanding is that this leaves us two approaches: (1/2) not care about the OpenFOAM implementation at all and compare the same test case with another solver (fully PETSc based or otherwise); (2/2) go through the entire parallel LDU implementation in OpenFOAM and make sense of it. I am open to discussing both in more detail. |
|
October 26, 2022, 10:02 |
|
#19 |
Senior Member
Klaus
Join Date: Mar 2009
Posts: 250
Rep Power: 22 |
I have done a lot of what would be (1/2) of your suggestions in recent years, always with an eye on GPU computing. To make progress in that direction, I need a better practical understanding of the preconditioner implementations for "p"; especially, a complete understanding of the DIC implementation would help as a foundation for implementing preconditioners that are suitable for GPU computing (DIC is NOT suitable).
Why such a focus on preconditioners? Because it's the preconditioner that makes the solver fast. Next would be topics like the implementation of diagonal scaling and, later, deflation as first-stage preconditioners for a two-stage preconditioning approach, which is often used on GPUs. In short, I would prefer (2/2) of your suggestions, with a focus on particular aspects rather than in a broad, general way. What are your priorities? |
|
November 2, 2022, 03:53 |
|
#20 |
Senior Member
|
Apologies for the delay in the reply.
Focusing on the p-solve makes obvious sense. I am not sure how much sparse linear algebra can be accelerated on GPUs. I tend to focus on algorithmic aspects: understanding what is currently there and how to implement it. I am currently looking at GAMG as a solver and as a preconditioner in laplacianFoam. Has your view on how to access the rows of the matrix changed in the meantime? Cheers. Domenico. |
|
|
|