Parallel processing of OpenFOAM cases on a multicore processor?
Dear experts,
I'm going to run a huge OpenFOAM case on a quad-core processor. I don't know whether OpenFOAM can use parallel processing on shared-memory computers. As far as I know, OpenFOAM uses MPI for parallel processing, and MPI is intended for distributed-memory computers. I do not have access to a cluster. Is there any way to run in parallel on my shared-memory computer and use all CPU cores in the calculation? Best regards. |
Greetings g.akbari,
OpenFOAM's ThirdParty package comes with OpenMPI. As far as I know, OpenMPI can automatically decide which communication protocol to use when communicating between processes, whether they are on the same computer or on different computers. Nonetheless, you can search the OpenMPI manual for how to specify the protocol explicitly, including "shared memory". Then you can edit the script $WM_PROJECT_DIR/etc/foamJob and tweak the run options for OpenMPI! Sadly I don't have much experience with OpenMPI, so I only know it's possible. Best regards, Bruno Santos |
You just need to use the decomposeParDict to make the partitions.
The user manual has a good tutorial on page 63. I run all my cases on my mini cluster with two quad-core Xeon processors and 8 GB RAM. Good luck, CX |
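As a rough illustration of the decomposeParDict step mentioned above, here is a minimal Python sketch that produces the text of a four-way decomposition using the `simple` method. The 2 x 2 x 1 split, the `delta` value and the helper name `decompose_par_dict` are assumptions picked for a quad-core machine, not something taken from the posts; check the dictionary against the user guide of your OpenFOAM version before relying on it.

```python
def decompose_par_dict(n_xyz):
    """Return the text of a system/decomposeParDict for the 'simple' method.

    n_xyz holds the number of pieces along x, y and z. Their product must
    match the process count later passed to mpirun -np.
    """
    nx, ny, nz = n_xyz
    total = nx * ny * nz
    return f"""FoamFile
{{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}}

numberOfSubdomains {total};

method          simple;

simpleCoeffs
{{
    n           ({nx} {ny} {nz});
    delta       0.001;
}}
"""


if __name__ == "__main__":
    # Assumed split for a quad-core machine: 2 pieces in x, 2 in y, 1 in z.
    print(decompose_par_dict((2, 2, 1)))
```

With the printed text saved as system/decomposeParDict, the usual sequence would be `decomposePar`, then `mpirun -np 4 <solver> -parallel`, then `reconstructPar`.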
Thanks a lot. I ran the damBreak tutorial and it worked fine. Now my problem is that I gain no speed-up: the serial execution time is shorter than the parallel execution time. What is the reason for this behavior?
|
I don't have an answer to that question.
The only thing I can say is that you will certainly attain faster convergence with parallel execution. Did you try running the damBreak case without MPI? CX |
Yes, I ran damBreak twice.
Using MPI with the number of processors set to 4, the calculation takes 122 s; without MPI, in serial mode, it takes 112 s. Maybe I should use a finer mesh to obtain a better speed-up. |
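For reference, the two timings reported above can be turned into a speed-up and a parallel efficiency. This is only a small sketch of the standard definitions (speed-up = serial time / parallel time, efficiency = speed-up / number of processes); the function name is made up for illustration.

```python
def speedup_and_efficiency(t_serial, t_parallel, n_procs):
    """Return (speed-up, parallel efficiency) from two wall-clock timings."""
    speedup = t_serial / t_parallel
    efficiency = speedup / n_procs
    return speedup, efficiency


# Timings quoted in the post: 112 s serial, 122 s on 4 processes.
s, e = speedup_and_efficiency(112.0, 122.0, 4)
print(f"speed-up = {s:.2f}, efficiency = {e:.0%}")
# prints: speed-up = 0.92, efficiency = 23%
```

A speed-up below 1 on such a small case is consistent with the communication overhead outweighing the work saved per process.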
After using a finer mesh, the speed-up becomes larger than unity. Thank you very much for your comments, wyldckat and xisto.
Sincerely |
Multi-Core Processors: Execution Time vs Clock Time
Hello all,
I have a single dual-core processor, and I am attempting to determine if it is possible to use parallel processing with multiple cores vice multiple processors. I ran wingMotion2D_pimpleDyMFoam three times, with 1, 2 and 3 "processors" identified each time in the /system/decomposeParDict. The results were predictable, in that the single "processor" run required 1.5x the time required by the dual "processor" run. I then tried 3 processors just to see what would happen, and found that this required 99.9% of the execution time of the dual-core run, but 150% of the clock time of the dual-core run. I did not know what to expect, running a 3-processor parallel simulation on a dual-core single-processor system, but this anomaly was definitely not expected. Can anyone tell me why the clock and execution times would be so different, but only when #processors in decomposeParDict > #cores in the computer? Sure, it was a silly test, but now that I have strange results it does make me wonder. Thanks, Dan |
My guess: you're still doing the same amount of computation, so the CPU time is similar, but you're wasting lots of time on communication and process switching, so the wall-clock time is larger.
|
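The difference between CPU (execution) time and wall-clock time can be seen in a few lines of Python. This is only an analogy, not a model of MPI: `time.sleep` stands in for the time a process spends descheduled or blocked while it waits for a core or for its neighbours, and the amount of work is arbitrary.

```python
import time


def measure(work):
    """Run work() and return (cpu_seconds, wall_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu, wall


def compute_then_wait():
    sum(i * i for i in range(200_000))  # real computation: counted as CPU time
    time.sleep(0.2)                      # waiting: counted only as wall-clock time


cpu, wall = measure(compute_then_wait)
print(f"CPU time  : {cpu:.3f} s")
print(f"wall time : {wall:.3f} s")
```

The CPU time stays close to the cost of the computation alone, while the wall-clock time also includes the wait, which is the pattern described in the post when more processes than cores are requested.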
Hey Guys ,
I am facing a similar problem here: my parallel run with 6 processors is taking far more time than the serial one. Does anyone have a clue what might be happening? |
Greetings Prashant,
Off the top of my head, here are a few pointers:
Bruno |
hi
I decided to buy a multi-core computer (Intel, 6 cores, 12 threads). My question: does OpenFOAM make use of the CPU's threads for processing? |
Greetings Ali Jafari,
I know this has been discussed here on the forum, but I'm not in the mood to go searching :rolleyes: A summary is as follows:
Best regards, Bruno |
Dear all,
it is a little bit late, but I want to share my findings on parallel runs in OpenFOAM. I used the lid-driven cavity for benchmarking and found that, for this case, the RAM-CPU communication was the bottleneck. The best scaling I found was with only 2 cores per CPU (8 CPUs with 4 cores each, no HT) and core binding, especially if you have dual-core machines. Doing so, the case scaled almost linearly up to 16 cores and was faster than using all 32 available cores. Kind regards, Edmund |
Hi Edmund,
Can you share some more information about the characteristics of the machines you've used? Such as:
Bruno |
Hi Bruno,
The processors are Xeon W5580 (4 cores) or X5680 (6 cores) at 3.2 and 3.33 GHz respectively, without overclocking. The memory is DDR3-1333. I used normal 1 Gbps Ethernet. The model size was 1 million cells. Running on 2 cores, the speedup is 2 as well. Using 4 cores on the same CPU, the speedup is only ~2.6, but using 4 cores on 2 CPUs the speedup is nearly 4! It seems to be the same when looking at the unofficial benchmarks (http://code.google.com/p/bluecfd-sin...SE_12.1_x86_64). So on a cluster with machines that have more than one CPU, you need to do core binding and define which task runs on which CPU core with an additional rankfile in Open MPI. Example for 2 dual-CPU machines (no matter if 4 or 6 cores):
mpirun -np 8 -hostfile ./hostfile.txt -rankfile ./rankfile.txt icoFoam -parallel
hostfile:
host_1
host_2
rankfile:
rank 0=host_1 slot=0:0
rank 1=host_1 slot=0:1
rank 2=host_1 slot=1:0
rank 3=host_1 slot=1:1
rank 4=host_2 slot=0:0
rank 5=host_2 slot=0:1
rank 6=host_2 slot=1:0
rank 7=host_2 slot=1:1
Now the job runs on 8 cores, distributed over cores 0 and 1 of each of the 4 CPUs, with a speedup of more than 7. Perhaps the newest generation of CPUs has faster CPU-RAM communication, so that one can use 3 cores per CPU. Kind regards, Edmund |
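Writing a rankfile by hand gets tedious for more hosts or cores, so here is a small Python sketch that generates lines in the same `rank N=host slot=socket:core` layout as the example in the post above. The function name and its parameters are made up for illustration; the host names are the placeholders used in the post.

```python
def make_rankfile(hosts, sockets_per_host, cores_per_socket):
    """Build rankfile text with one line per rank: 'rank N=host slot=socket:core'.

    Ranks are numbered host by host, socket by socket, core by core,
    using only the first cores_per_socket cores of every socket.
    """
    lines = []
    rank = 0
    for host in hosts:
        for socket in range(sockets_per_host):
            for core in range(cores_per_socket):
                lines.append(f"rank {rank}={host} slot={socket}:{core}")
                rank += 1
    return "\n".join(lines)


# Two dual-socket hosts, using 2 cores per socket: 8 ranks in total.
print(make_rankfile(["host_1", "host_2"], sockets_per_host=2, cores_per_socket=2))
```

The printed text can be saved as rankfile.txt and passed to mpirun with `-rankfile`, with `-np` set to the number of generated ranks.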
Hi Edmund,
Many thanks for sharing the information! But I'm still wondering if there isn't some specific detail we're missing. I did some searching and:
Because from correlating all of this information, my guess is that your machines only have 2 RAM modules assigned per socket... 4 modules in total per machine. Either that or 1 million cells is not enough for a full test! ;) Best regards, Bruno |
Hi Bruno,
the slots are all filled with equal-sized DIMMs. What do you mean by "full test"? With up to 20 cores, 1 million cells gives at least 50k cells per core. I don't want to tell stories, so attached you can find an overview of the timings I found and the test case I used, in zip format. Could you please cross-check the speedup from 1 to 2 and 4 cores, to see whether it is only my hardware that behaves badly? Kind regards, Edmund |
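The cells-per-core figure quoted above is simple arithmetic; a short Python sketch makes it easy to repeat for other core counts. The list of core counts is an arbitrary choice for illustration, and the 1 million cell mesh size is the one from the post.

```python
def cells_per_core(n_cells, n_cores):
    """Average number of cells each core has to handle."""
    return n_cells / n_cores


# 1 million cell case from the post, for a range of core counts.
for cores in (1, 2, 4, 8, 16, 20, 32):
    load = cells_per_core(1_000_000, cores)
    print(f"{cores:>2} cores -> {load:>9,.0f} cells per core")
```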
Hi Edmund,
Thanks for sharing. I'll give it a try when I get an opening on our clusters. In the meantime, check the following:
Bruno |
I'm not sure why you are seeing a longer computation time, but I have a guess:
Your longer processing time could be due to a poorly chosen number of subdomains. When you decompose the domain, the communication between processes during the parallel computation also takes time, which adds to the total run time. In your case, I believe that if you decompose your domain into 5 or 3 parts instead of 4, you should see a different processing time, because the communication between processes might decrease or increase. It is not always efficient to decompose the domain into several parts. |
Dear Edmund and Bruno, It seems that an Open MPI rankfile cannot address hardware threads; I mean that when you have cores with HT enabled, you can only include physical processors in a rankfile. Is there any solution? Regards, Ali |
Beyond that, a very quick search led me to this answer: http://stackoverflow.com/a/11761943 |
Thanks for the quick response. It was helpful. I know that the maximum speedup would be 10-30% in some cases, when some processors become idle, e.g. in combustion problems. I refer you to this paper: "An Empirical Study of Hyper-Threading in High Performance Computing Clusters". OK, let's forget HT for the moment. I have another question: is there any report of OpenFOAM scalability above 32 processors like this one, "https://www.hpc.ntnu.no/display/hpc/...mance+on+Vilje", but without InfiniBand, i.e. with Ethernet communication among the nodes? The question may seem weird, so let me describe it in more detail. I'm not a pro in computer science, so excuse me for any mistakes. We have 3 Supermicro servers, each with 2 Intel Xeon E5-2690 (2*10 cores). I connected them via Ethernet with Cat6 cables and a high-speed switch. The problem is that I can't reproduce the results of "https://www.hpc.ntnu.no/display/hpc/...mance+on+Vilje" for the 1M-cell cavity case using 32 processors. The solution scales within 1 node; however, when increasing the number of nodes to 2 and 3 (40 and 60 processors respectively), there is no substantial speedup. When I change the problem to the combustion case (PDE + ODE solutions), an interesting behavior is seen. The ODE solution part scales linearly, but the PDE solution time stays the same, as in the cavity case. So it occurs to me that maybe this is a problem of communication among the nodes, since the ODE solution part doesn't need any synchronization while the PDEs do. The conclusion: since the only major difference between my setup and the cluster in "https://www.hpc.ntnu.no/display/hpc/...mance+on+Vilje" is the type of interconnect (Ethernet vs InfiniBand), it seems that this is the source of the lack of scalability under otherwise identical conditions. Is that true? Is there any report of significant speedup using Ethernet communication among nodes in clusters? Regards, Ali |
Hi Ali,
http://www.cfd-online.com/Forums/har...tml#post518234 - post #8 Your cluster already falls within the details given in the image, namely that 1Gbps connection is not enough to support so many processors. Best regards, Bruno |
Would you mind explaining this part of your quote in more detail? How can you tell whether it's a CPU cache problem? What should be stored in the cache? I can't imagine that even the L3 cache is big enough to hold the whole mesh. Do you know of a tutorial or description of how to use the hierarchical decomposition method? I searched the user guide and the forum but didn't find a clue. Best regards, Kate |
Hi Kate,
Best regards, Bruno |
Hi Bruno,
I understand your thought process. But what does this mean for a real simulation? The problem is that you can't actually see what is slowing down your parallel simulation, can you? My current procedure on a 2-socket machine, with 6 cores and 3 memory channels per socket, is the following:
1) Run the case in serial to have a reference.
2) Run 2 threads on different sockets, core-bound.
3) Run 4 threads, 2 on each socket, core-bound.
4) Do the same with 6, 8, 10 and 12 threads.
I run these test cases for 10 iterations each (is that enough?), see which one finishes fastest, and go with that configuration for the case. Is there any other method? Regarding the hierarchical decomposition method: not really. I don't understand what it is supposed to do. A quick example: Code:
28 hierarchicalCoeffs Code:
---------------------- How does the order of splitting affect the outcome? Best regards, Kate |
Hi Kate,
Keep in mind that OpenFOAM technically uses boundary conditions of type "processor" for communicating the data between subdomains. And since small changes in a boundary condition can affect the solution, more or fewer iterations might be needed to reach convergence. Keep in mind that these can be iterations either at the level of the matrix solvers (e.g. GAMG) or at the level of the outer iterations of the application solver (e.g. simpleFoam).
To a lesser extent, the other objective is to have the simulation solved in the most efficient way possible, simultaneously if possible. This can be tested by modifying the "incompressible/icoFoam/cavity" tutorial case to be 3D and then testing the various orders of decomposition. In theory, if all of the subdomains can work through the equation matrices in exactly the same order in parallel, this should be the optimal way to process the data. From your ASCII drawing, the efficient way would be to have all 6 processes work from left to right, then one line down and left to right again, within their own subdomains, so that they are working side by side on solving the same parts of the matrices, at least for each pair of processes. I'm oversimplifying this, but it should become more apparent when testing with a 3D cavity case with a uniform mesh and uniform mesh distribution between processes. Translating this to a real simulation isn't as straightforward, but it can at least help you reduce the number of tests you need to do when looking for the best decomposition. But for more complex meshes, the usual decomposition to go with is Scotch or Metis, since they use graph theory (I can't remember the exact terminology) to try to minimize the number of faces needed for communication between subdomains. Best regards, Bruno |
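To put a number on "faces needed for communication between subdomains", here is a minimal Python sketch for the simplest possible situation: a uniform structured box mesh cut into equal blocks by planes. The mesh size of 200 x 100 x 50 cells and the function name are assumptions for illustration only. The sketch counts faces on the cuts for a given number of pieces per direction; it does not model the `order` setting of the hierarchical method, nor what Scotch or Metis do on an unstructured mesh.

```python
from itertools import permutations


def processor_faces(cells, splits):
    """Count cell faces lying on the cuts between subdomains.

    cells  = (Nx, Ny, Nz): cells of a uniform structured box mesh
    splits = (nx, ny, nz): number of pieces along each direction
    Each of the (n - 1) cutting planes normal to a direction contains
    as many faces as there are cells in the other two directions.
    """
    Nx, Ny, Nz = cells
    nx, ny, nz = splits
    return (nx - 1) * Ny * Nz + (ny - 1) * Nx * Nz + (nz - 1) * Nx * Ny


cells = (200, 100, 50)  # assumed 1 million cell box, longest in x
candidates = sorted(set(permutations((3, 2, 1)))) + [(6, 1, 1)]
for splits in candidates:
    print(splits, processor_faces(cells, splits))
```

For this assumed box, splitting as (3 2 1) puts 20,000 faces on processor boundaries, while (1 2 3) puts 50,000 there, so even with 6 subdomains in both cases the amount of data to exchange differs by a factor of 2.5.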
Can you help me? Errors appear when I run in parallel and use the command "reconstructPar"
hi, everyone.
I am running in parallel in OpenFOAM. When I run the command "reconstructPar -latestTime", errors appear. First, the face coordinates in the polyMesh contain "words" among the numbers. Second, in the P file, symbols such as "^", "$" and "&" appear among the numbers. I hope someone can help me. Attachment 58859 |
What solver did you use? It appears to me that your mesh has changed during the run; in that case you need to reconstruct the mesh first, then reconstruct the fields. |
Hi Edmund, I tried to run a parallel calculation on two networked PCs, but the simulation does not proceed; it gets stuck as shown below. Please help me find my mistake.
[15:18][tec0683@rue-l020:/disk1/krishna/EinfacheRohre/bendtubeparalle/bendingtube]$ mpirun -np 8 -hostfile machines simpleFoam -parallel
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  2.1.1                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : 2.1.1-221db2718bbb
Exec   : simpleFoam -parallel
Date   : Nov 01 2017
Time   : 15:18:49
Host   : "linxuman"
PID    : 13714
With regards, Anna |