
Parallel processing of OpenFOAM cases on multicore processor???

November 20, 2013, 20:06   #21
smraniaki (New Member)
I'm not sure why you are seeing longer computation times, but here is a guess:
The longer run time could be due to a poorly chosen number of decomposition domains. Once the domain is decomposed, the communication between processes during the parallel run also takes time, and that overhead grows with the number of subdomains. In your case, decomposing the domain into 3 or 5 parts instead of 4 may well give a different run time, because the inter-process communication pattern changes. It is not always more efficient to decompose the domain into more parts.
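For anyone wanting to test this, the number of subdomains is set in system/decomposeParDict. A minimal sketch with illustrative values only (standard keywords; numberOfSubdomains and n must be changed together for each test):

Code:
// system/decomposeParDict (sketch, illustrative values only)
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}

numberOfSubdomains  3;       // try e.g. 2, 3, 4, 5 and compare wall-clock times

method              simple;  // split along coordinate directions

simpleCoeffs
{
    n               ( 3 1 1 );  // pieces in x, y, z; product must equal numberOfSubdomains
    delta           0.001;
}
Then rerun decomposePar and time the solver for each setting.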

May 30, 2015, 03:15   #22
Alish1984 (Ali Shamooni, Member)
Quote: Originally Posted by eddi0907
Hi Bruno,

The processors are Xeon W5580 (4 cores) or X5680 (6 cores) at 3.2 and 3.33 GHz respectively, without overclocking.
The memory is DDR3-1333.
I used normal 1 Gbps Ethernet.

The model size was 1 million cells.

Running on 2 cores, the speedup is 2 as well.
Using 4 cores on the same CPU, the speedup is only ~2.6, but using 4 cores spread over 2 CPUs the speedup is nearly 4!
It seems to match the unofficial benchmarks (http://code.google.com/p/bluecfd-sin...SE_12.1_x86_64).

So on a cluster where the machines have more than one CPU, you need to do core binding and define which task runs on which CPU core via an additional rank file in Open MPI.

Example: 2 dual-CPU machines (no matter whether they have 4 or 6 cores):

Code:
mpirun -np 8 -hostfile ./hostfile.txt -rankfile ./rankfile.txt icoFoam -parallel
hostfile:

Code:
host_1
host_2
rankfile:

Code:
rank 0=host_1 slot=0:0
rank 1=host_1 slot=0:1
rank 2=host_1 slot=1:0
rank 3=host_1 slot=1:1
rank 4=host_2 slot=0:0
rank 5=host_2 slot=0:1
rank 6=host_2 slot=1:0
rank 7=host_2 slot=1:1
Now the job runs on 8 cores, distributed over cores 0 and 1 of each of the 4 CPUs, with a speedup of more than 7.

Perhaps the newest generation of CPUs has faster CPU-RAM communication and one could use 3 cores per CPU.

Kind regards,

Edmund

Dear Edmund and Bruno,

It seems that the Open MPI rank file cannot address hardware threads; I mean, when you have cores with HyperThreading enabled, a rankfile can only reference the physical processors. Is there any solution?

Regards,
Ali

May 30, 2015, 08:20   #23
wyldckat (Bruno Santos, Retired Super Moderator)
Quote: Originally Posted by Alish1984
It seems that the Open MPI rank file cannot address hardware threads; I mean, when you have cores with HyperThreading enabled, a rankfile can only reference the physical processors. Is there any solution?
Quick answer: You will not see a substantial performance increase when using HyperThreading with OpenFOAM. It's best that you only use the physical cores.

Beyond that, a very quick search led me to this answer: http://stackoverflow.com/a/11761943
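As a side note, a quick way to see which logical CPUs are HyperThreading siblings, and to pin one rank per physical core, is something like the following (a sketch; the binding options are those of Open MPI 1.8 and newer, so check your mpirun man page):

Code:
# list logical CPUs with their physical core and socket (HT siblings share a CORE id)
lscpu --extended=CPU,CORE,SOCKET

# bind one MPI rank per physical core, leaving the HT siblings idle
mpirun -np 8 --map-by core --bind-to core simpleFoam -parallel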

May 31, 2015, 07:35   #24
Alish1984 (Ali Shamooni, Member)
Quote: Originally Posted by wyldckat
Quick answer: You will not see a substantial performance increase when using HyperThreading with OpenFOAM. It's best that you only use the physical cores.

Beyond that, a very quick search led me to this answer: http://stackoverflow.com/a/11761943
Dear Bruno,

Thanks for the quick response. It was helpful.
I know that the maximum speedup from HyperThreading would be 10-30% in some cases, when some processors become idle, e.g. in combustion problems. I refer you to the paper "An Empirical Study of Hyper-Threading in High Performance Computing Clusters".

OK, let's forget HT for the moment. I have another question: is there any report of OpenFOAM scaling above 32 processors, like "https://www.hpc.ntnu.no/display/hpc/...mance+on+Vilje", but without InfiniBand communication? I mean with Ethernet communication among the nodes?

The question may seem weird, but let me describe it in more detail; I'm not a pro in computer science, so excuse my probable mistakes. We have 3 Supermicro servers, each with 2 Intel Xeon E5-2690 CPUs (2 x 10 cores). I connected them via Ethernet with Cat6 cables and a high-speed switch.
The problem is that I can't reproduce the results of "https://www.hpc.ntnu.no/display/hpc/...mance+on+Vilje" for the 1M-cell cavity case on 32 processors.
The solution scales well within 1 node, but increasing to 2 and 3 nodes (40 and 60 processors respectively) gives no substantial speedup.

When I switch to a combustion case (PDE + ODE solutions), an interesting behaviour appears: the ODE solution part scales linearly, but the PDE solution time behaves the same as in the cavity case.

So it occurs to me that this may be a problem of communication among the nodes, since the ODE solution part doesn't need any synchronization while the PDEs do.

The conclusion: since the only major difference between my setup and the cluster in "https://www.hpc.ntnu.no/display/hpc/...mance+on+Vilje" is the type of interconnect (Ethernet vs. InfiniBand), it seems that this is the source of the lack of scalability under otherwise identical conditions.

Is this true? Is there any report of significant speedup using Ethernet communication among cluster nodes?

Regards,

Ali

May 31, 2015, 17:58   #25
wyldckat (Bruno Santos, Retired Super Moderator)
Hi Ali,

Quote: Originally Posted by Alish1984
Is there any report of OpenFOAM scaling above 32 processors, like "https://www.hpc.ntnu.no/display/hpc/...mance+on+Vilje", but without InfiniBand communication? I mean with Ethernet communication among the nodes?
I know there are more examples on the Hardware forum, but I can't find them right now. The one I did find after a quick search is in the image attached to this post:
http://www.cfd-online.com/Forums/har...tml#post518234 - post #8
Your cluster already falls within the details given in that image, namely that a 1 Gbps connection is not enough to feed so many processors.
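As an aside, anyone wanting to sanity-check their own interconnect before blaming OpenFOAM can do something like this (a sketch; assumes iperf is installed on both nodes, and host_2 reuses the hostname from the rankfile example above):

Code:
# on host_2: start an iperf server
iperf -s

# on host_1: measure throughput to host_2
iperf -c host_2

# latency matters at least as much as bandwidth for tightly coupled solvers
ping -c 100 host_2 | tail -1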

Best regards,
Bruno

October 29, 2015, 06:31   #26
KateEisenhower (Senior Member)
Quote: Originally Posted by wyldckat
  1. I would suggest splitting the case into 2, 3, 4, 5, 6 and 12 sub-domains, to try to isolate whether it's a CPU cache problem. I've had a situation where a 6-core CPU was faster with 16 sub-domains than with 6 sub-domains.
Hi Bruno,

Would you mind explaining this part of your quote in more detail? How can you tell whether it's a CPU cache problem? What would be kept in the cache? I can't imagine even the L3 cache being big enough to hold the whole mesh.

Do you know of a tutorial or description of how to use the hierarchical decomposition method? I searched the user guide and the forum but didn't get a clue.

Best regards,

Kate

October 31, 2015, 08:44   #27
wyldckat (Bruno Santos, Retired Super Moderator)
Hi Kate,

Quote: Originally Posted by KateEisenhower
Would you mind explaining this part of your quote in more detail? How can you tell whether it's a CPU cache problem? What would be kept in the cache? I can't imagine even the L3 cache being big enough to hold the whole mesh.
The logic in my thought process is that when we have over-scheduling going on, it can eventually end up in a "least effort" situation as a result of the bottlenecking effect, namely where:
  • all processes are accessing neighbouring memory regions that are common to several processes;
  • or all processes mainly need to access one particular memory region per process, the one needed for communicating with the other processes.
For example, if 4 processes are dealing with a corner of their decompositions that is common to all sub-domains, that memory region would be used as the data source for each process to communicate with 2 or more processes at the same time.



Quote: Originally Posted by KateEisenhower
Do you know of a tutorial or description of how to use the hierarchical decomposition method? I searched the user guide and the forum but didn't get a clue.
Fortunately, I believe/hope you've already found some more details about this: http://www.cfd-online.com/Forums/ope...mulations.html

Best regards,
Bruno

November 2, 2015, 04:28   #28
KateEisenhower (Senior Member)
Hi Bruno,

I understand your thought process. But what does this mean for a real simulation? The problem is that you can't actually see what is slowing down your parallel simulation, can you?
My current procedure on a 2-socket machine, each socket having 6 cores and 3 memory channels, is the following:

1) Run the case in serial to have a reference
2) Run 2 processes on different sockets, core-bound
3) Run 4 processes, 2 on each socket, core-bound
4) The same with 6, 8, 10 and 12 processes

I run these test cases for 10 iterations each (is that enough?), see which one finishes fastest, and go with that configuration for this case. Is there any other method?
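Such a sweep can also be scripted. A sketch, assuming the case is already set up with a decomposition method that needs no per-direction coefficients (e.g. scotch), and that your OpenFOAM version ships the foamDictionary utility (older versions do not, in which case edit decomposeParDict by hand):

Code:
#!/bin/bash
# time the same case with several subdomain counts (adjust solver and counts)
for n in 2 4 6 8 10 12; do
    foamDictionary -entry numberOfSubdomains -set $n system/decomposeParDict
    decomposePar -force > log.decomposePar.$n
    /usr/bin/time -f "$n processes: %e s" \
        mpirun -np $n --bind-to core simpleFoam -parallel > log.simpleFoam.$n
done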

Regarding the hierarchical decomposition method: not really. I don't understand what it is supposed to do. A quick example:

Code:
hierarchicalCoeffs
{
    n               ( 3 1 2 );
    delta           0.001;
    order           xyz;
}
would look as follows:


Code:
----------------------
I      I      I      I
----------------------
I      I      I      I
----------------------
↑: z-direction, →: x-direction

How does the order of splitting affect the outcome?

Best regards,

Kate


November 2, 2015, 17:34   #29
wyldckat (Bruno Santos, Retired Super Moderator)
Hi Kate,

Quote: Originally Posted by KateEisenhower
The problem is that you can't actually see what is slowing down your parallel simulation, can you?
I know that there are MPI profiling tools that can try to give you this kind of information, but I've never used them myself.

Quote: Originally Posted by KateEisenhower
My current procedure on a 2-socket machine, each socket having 6 cores and 3 memory channels, is the following:

1) Run the case in serial to have a reference
2) Run 2 processes on different sockets, core-bound
3) Run 4 processes, 2 on each socket, core-bound
4) The same with 6, 8, 10 and 12 processes
For this particular type of test, this is usually the way to do it. Your mileage will vary depending on the type of simulation (e.g. simpleFoam vs. reactingFoam), the mesh configuration, and the matrix solver settings defined in "fvSolution".

Quote: Originally Posted by KateEisenhower
I run these test cases for 10 iterations each (is that enough?), see which one finishes fastest, and go with that configuration for this case. Is there any other method?
The number of subdomains, and the way the subdomains are divided, can affect the number of iterations the simulation needs to converge. That is to say, 10 iterations might not be enough for a fair comparison; for example, comparing 10 vs 11 vs 12 seconds isn't as informative as comparing 101 vs 113 vs 118 seconds.

Keep in mind that OpenFOAM technically uses boundary conditions of type "processor" to communicate data between subdomains, and since small changes in a boundary condition can affect the solution, more or fewer iterations might be needed to reach convergence. These can be iterations at the level of the matrix solvers (e.g. GAMG) or outer iterations of the application solver (e.g. simpleFoam).
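For context, those iteration counts are governed by entries like the following in "system/fvSolution" (a generic sketch with illustrative values, not tuned for any particular case):

Code:
solvers
{
    p
    {
        solver                  GAMG;        // geometric-algebraic multigrid
        smoother                GaussSeidel;
        agglomerator            faceAreaPair;
        nCellsInCoarsestLevel   100;
        mergeLevels             1;
        cacheAgglomeration      true;
        tolerance               1e-06;       // absolute residual target
        relTol                  0.01;        // relative reduction per solve
    }
}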

Quote: Originally Posted by KateEisenhower
Regarding the hierarchical decomposition method: not really. I don't understand what it is supposed to do. A quick example:

Code:
hierarchicalCoeffs
{
    n               ( 3 1 2 );
    delta           0.001;
    order           xyz;
}
would look as follows:

Code:
----------------------
I      I      I      I
----------------------
I      I      I      I
----------------------
↑: z-direction, →: x-direction

How does the order of splitting affect the outcome?
The standard objective is simple enough: keep the number of faces shared between subdomains as small as possible, because the fewer the shared faces, the less time is spent communicating between processes.

To a lesser extent, the other objective is to have the simulation solved in the most efficient way possible, simultaneously if possible. This can be tested by modifying the "incompressible/icoFoam/cavity" tutorial case to be 3D and then testing the various decomposition orders. In theory, if all domains work through their equation matrices in exactly the same order in parallel, that should be the optimal way to process the data.
In your ASCII drawing, the efficient arrangement would be for all 6 processes to work from left to right, then one line down and left to right again, within their own subdomains, so that they are working side by side on the same parts of the matrices, at least for each pair of processes.

I'm oversimplifying, but this should become more apparent when testing a 3D cavity case with a uniform mesh and a uniform mesh distribution between processes.

Translating this to a real simulation isn't as straightforward, but it can at least help reduce the number of tests needed when looking for the best decomposition.
For more complex meshes, the usual decomposition to go with is Scotch or METIS, since they use graph partitioning to try to minimize the number of faces needed for communication between subdomains.
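In practice, that keeps the decomposition dictionary very short; a sketch (Scotch needs no per-direction coefficients, and the weights entry is optional):

Code:
numberOfSubdomains  12;

method              scotch;   // graph partitioning; minimises inter-processor faces

// optional: bias cell counts per processor, e.g. for heterogeneous nodes
//scotchCoeffs
//{
//    processorWeights ( 1 1 1 1 1 1 1 1 1 1 1 1 );
//}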

Best regards,
Bruno

October 8, 2017, 00:08   #30
ht2017 (ESI, Member)
Errors appear when I run "reconstructPar" on a parallel case
Hi everyone,
I am running a case in parallel in OpenFOAM. When I run the command "reconstructPar -latestTime", errors appear.

First: some of the face coordinates in the polyMesh files have words mixed in among the numbers.
Second: in the p file, symbols such as "^", "$" and "&" appear among the numbers.

I hope someone can help me.

thanh.jpg

October 8, 2017, 13:18   #31
smraniaki (New Member)
Quote: Originally Posted by ht2017
Hi everyone,
I am running a case in parallel in OpenFOAM. When I run the command "reconstructPar -latestTime", errors appear.

First: some of the face coordinates in the polyMesh files have words mixed in among the numbers.
Second: in the p file, symbols such as "^", "$" and "&" appear among the numbers.

I hope someone can help me.

Attachment 58859

What solver did you use? It appears to me that your mesh changed during the run; in that case you need to reconstruct the mesh first, and then reconstruct the fields.
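If the mesh really did change during the run (e.g. parallel snappyHexMesh or a dynamic mesh), the usual sequence is roughly the following (a sketch; check "reconstructParMesh -help" on your OpenFOAM version for the exact options):

Code:
# rebuild the decomposed mesh first (use -latestTime instead if the mesh changed over time)
reconstructParMesh -constant

# then rebuild the fields
reconstructPar -latestTime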

November 1, 2017, 09:25   #32
OpenFoamlove (New Member)
Quote: Originally Posted by eddi0907
[core-binding example with hostfile and rankfile; quoted in full in post #22 above]

Hi Edmund, I tried to run a parallel calculation across two networked PCs, but the simulation does not run any further; it is stuck as shown below. Please help me find my mistake.


[15:18][tec0683@rue-l020:/disk1/krishna/EinfacheRohre/bendtubeparalle/bendingtube]$ mpirun -np 8 -hostfile machines simpleFoam -parallel
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  2.1.1                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build : 2.1.1-221db2718bbb
Exec : simpleFoam -parallel
Date : Nov 01 2017
Time : 15:18:49
Host : "linxuman"
PID : 13714


With regards, Anna
