CFD Online (www.cfd-online.com), Forums > OpenFOAM

Scale-Up Study in Parallel Processing with OpenFoam

April 8, 2010, 16:12 | #1
S. Ali H.M. (sahm), Member, Chicago
Hello everyone,
Recently I ran a case with different numbers of processors to check parallel performance in OpenFOAM, but I got a strange result.
I made a simple case and used the Scotch decomposition method, which behaves much like the Metis method. Then I used the cluster in our lab (CMTL at UIC) to solve these cases. The cluster has 4 nodes with 8 processors on each node.
The case is a 3D backward-facing step with 748K cells and Re = 500, and the solver is rhoPisoFoam.
I used 2, 4, 6, 8, 16, 24, and 32 processors at a time to solve the same case, but the 32-processor run turned out slower than the 16-processor run. The attached file shows the solution speed (time steps/minute) versus the number of processors.
Can anyone tell me what I did wrong, or whether they have had this problem before?
Also, can anyone tell me about the time OpenFOAM reports in the log file?
Attached Files
File Type: zip Parallel Scale up Study.zip (9.6 KB, 154 views)
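On the question about the time reported in the log: OpenFOAM solvers print an `ExecutionTime` (accumulated CPU time) and `ClockTime` (accumulated wall-clock time) line every time step, and those lines are an easy way to build the speed data for a study like this. A minimal sketch, using a made-up log file for illustration:

```shell
#!/bin/sh
# Sketch: pull the per-step timing an OpenFOAM solver writes to its log.
# Each time step the solver prints a line like
#   ExecutionTime = 12.34 s  ClockTime = 13 s
# ExecutionTime is accumulated CPU time, ClockTime accumulated wall time.
# The log below is made up purely for illustration.
cat > log.sample <<'EOF'
Time = 0.001
ExecutionTime = 1.2 s  ClockTime = 2 s
Time = 0.002
ExecutionTime = 2.5 s  ClockTime = 4 s
EOF
# Last ExecutionTime line = total time so far:
grep ExecutionTime log.sample | tail -1
```

For a scale-up study, the final `ClockTime` divided by the number of steps gives a comparable wall-time-per-step figure across runs.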

April 9, 2010, 00:57 | #2
Andrew King (andersking), Member, Perth, Western Australia
Hi Sahm,

Your case is too small to stress your cluster: with 32 procs you only have about 23k cells per proc, so the cluster is probably spending most of its time communicating. Try a larger mesh (or a faster interconnect).

Cheers
Andrew
__________________
Dr Andrew King
Fluid Dynamics Research Group
Curtin University
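Andrew's point can be checked with quick arithmetic: the thread's 748K-cell mesh spread over N ranks. The ~50k-cells-per-rank threshold in the comment is a commonly quoted rule of thumb, not an OpenFOAM-specified number:

```shell
#!/bin/sh
# Cells per MPI rank for the 748k-cell case in this thread.
# Rule of thumb (an assumption, not from this thread): keep at least
# ~50k cells per rank, or communication starts to dominate compute.
CELLS=748000
for NP in 2 4 8 16 24 32; do
    echo "$NP ranks: $((CELLS / NP)) cells/rank"
done
```

At 32 ranks this gives 23375 cells/rank, well under that threshold, which is consistent with the observed slowdown.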

April 14, 2010, 16:43 | #3 | With Metis Method
S. Ali H.M. (sahm), Member, Chicago
I changed the decomposition method to Metis, and my solution speeds increased. The attached file shows the speed-up.
Attached Files
File Type: zip Parallel Scale up Study-2.zip (9.7 KB, 118 views)

April 19, 2010, 08:46 | #4
Fábio César Canesin (Canesin), Member, Florianópolis
Can you upload it in PDF or OpenDocument format? I'm running Linux, so no .xlsx for me =(

April 20, 2010, 15:22 | #5 | New Results
S. Ali H.M. (sahm), Member, Chicago
Sorry about that.
This file covers two cases: one is the backward-facing step flow with the rhoPisoFoam solver, at 750K and 1.5M cells; the other is a cavity case with 1M cells. Any comments on this are appreciated.
Attached Files
File Type: pdf Parallel Scale up Study.pdf (94.8 KB, 359 views)

April 22, 2010, 14:59 | #6
S. Ali H.M. (sahm), Member, Chicago
Is anybody going to help?

Last edited by sahm; April 26, 2010 at 15:16. Reason: To admin: please delete this message if you can.

April 23, 2010, 04:50 | #7
Simon Rees (sjrees), New Member
What is your question?

I would say you should get better scalability than this, but you will have to say more about the system: cores per node, interconnect type, etc.

April 23, 2010, 17:08 | #8
S. Ali H.M. (sahm), Member, Chicago
OK,
The system I'm working with is a cluster with 4 compute nodes and 1 head node; each node has 8 processors, with an InfiniBand connection between nodes. I don't know how much RAM each node has, but I haven't had any problem with that.
The problem with this scale-up study is that when I increase the number of processors, the calculation speed decreases. I also tried the case with 3M cells, but the problem persists.
Do you have any idea why this is happening and how I can fix it?

April 23, 2010, 18:30 | #9
Bruno Santos (wyldckat), Super Moderator, Lisbon, Portugal
Greetings SAHM,

It seems you have an "isolate and conquer" problem! With 4 machines of 8 cores each, you should test in stages, branching out from a single node.

In other words:
  1. Run a battery of tests on a single node, ranging from 1 to 8 cores. Analyse the results: if this doesn't scale well, you've got a problem with how the case is divided among the cores.
  2. If it scaled well on a single node, run another battery of tests on two nodes, this time from 2 to 16 cores, adding one core on each node per test; in other words, the first test again, but upping both nodes at the same time.
    Repeat this battery with the other two nodes, to make sure the nodes and interconnects are symmetrical and well balanced. If the times aren't nearly identical, you've got a bottleneck somewhere.
  3. If the pairs are properly balanced, run yet another battery using 3 nodes, from 3 to 24 cores, always adding one core per node per test.
    Again, rerun this battery after swapping one of the used nodes for the one left out.
  4. Finally, run a battery using all 4 nodes, from 4 to 32 cores, always adding one core on each node.
Of course, if you don't feel like running all of these batteries of tests, just do the fourth one!
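The fourth battery above can be sketched as a simple loop. The node names, hostfile, and solver invocation here are assumptions (the actual launch should go through the cluster's queueing system), so the mpirun line is left commented out:

```shell
#!/bin/sh
# Sketch of battery No. 4: all 4 nodes, adding one core per node per
# step, i.e. 4, 8, ..., 32 ranks.  The hostfile name "machines", the
# node set, and the solver are assumptions; adapt to your cluster and
# queueing system before uncommenting the mpirun line.
for PER_NODE in 1 2 3 4 5 6 7 8; do
    NP=$((4 * PER_NODE))
    echo "test: $NP ranks ($PER_NODE per node)"
    # mpirun -np $NP --hostfile machines rhoPisoFoam -parallel > log.$NP 2>&1
done
```

Each iteration keeps the load symmetric across nodes, which is what makes the runs comparable.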

I suggest these tests because, from what I've seen of the first graph, it feels like there is something of an inertia-like problem with the machines! In other words, analysing the first graph, it seems that:
  • 1-2-4 processors: single node
  • 6-8 processors: 3-4 nodes, which indicates lag caused by the interconnect, because the nodes have to stay in sync
  • 8-16-32 processors: always 4 nodes, and this part of the graph is nearly linear!
Now, if this analysis reflects what really happened, then there is nothing wrong with the results! If you always want to use the 4 nodes/32 processors, you should have compared runs that always use 4 nodes. And/or compare:
  • 4 cores: 1 node * 4 cores/node vs 2 nodes * 2 cores/node vs 4 nodes * 1 core/node
  • 8 cores: 1 node * 8 cores/node vs 2 nodes * 4 cores/node vs 4 nodes * 2 cores/node
  • 16 cores: 2 nodes * 8 cores/node vs 4 nodes * 4 cores/node
This way you can account for the intrusive effect the interconnect has on your scalability!

Additionally, don't forget that your InfiniBand interconnect might have options for jumbo packets (useful for very large data transfers between nodes, which would be the case with 50-100 million cells) or very low latency (useful for constant communication between nodes).

Another thing that could hinder scalability is how frequently you save a case snapshot. In other words, does this case store:
  • every iteration of the run;
  • only the last iteration;
  • or does it only read at the start and save nothing during execution, not even at the end?
Actually, if it saves every iteration, that would explain the flat line at 4-6-8 processors!

So... to conclude: be careful what you wish for, you just might get it... or so I would think :P


Edit ----------------------------
I didn't see the other three graphs before I posted, but my train of thought still applies. And I should add another item to the test list:
  • pick the biggest case, compare runs using only the 4 nodes, adding 1 core on every node at the same time (4-8-12-16-20-24-28-32), and run each test at least 10 times. Then do an average and deviation analysis.
For this final test, several outcomes are possible:
  • Scenario 1: the average is linear and the deviation is small.
    Conclusion: what on earth wasn't working before and is working now? This shouldn't have happened...
  • Scenario 2: the average is linear and the deviation is large.
    Conclusion: you've got a serious bottleneck somewhere. It could be storage, the interconnect, network interference, or someone left something running on the nodes, like a file-indexing daemon...
  • Scenario 3: the average is not linear and the deviation is small.
    Conclusion: the problem is reproducible. This makes hunting down the real problem much easier, because when you change something it will be directly reflected in the results!
And yet another thing comes to mind: do the CPUs in the nodes have the "turbo boost" feature of some modern CPUs? For example, with 8 cores in use the CPU runs at 3 GHz; with 4 cores, at 3.3 GHz; with 2 cores, at 3.6 GHz; and with a single core, at 4 GHz! This would seriously affect the results...

Best regards,
Bruno

Last edited by wyldckat; April 23, 2010 at 18:58. Reason: added a missing thought process... and then I saw the other 3 graphs...

April 26, 2010, 15:12 | #10 | Wow, still there is a question.
S. Ali H.M. (sahm), Member, Chicago
Thanks, Bruno. I forgot to remove the first sheet, since that case was not well defined, but the other cases have the same problem. About your suggestion to run them with different types of parallelism, I have a question.

Actually, I don't know how to assign a particular sub-domain (of a decomposed domain) to a particular processor. To run your tests, I need to map a specific sub-domain to a specific processor, or at least define how many cores of each node I'm going to use; for example, how to use 4 processors as 2 cores/node on 2 nodes. Can you tell me how to do this?

Since this cluster is shared among the people in the lab, I need to ask their permission whenever I want to use more than 8 cores, so running your tests might take a long time. Besides, we use software called Lava that queues the jobs on our cluster, and I should use it to define my jobs; otherwise other people might not see that I'm running something. Can you tell me how to define a job on certain cores of different nodes? I'd also appreciate it if you could tell me how to do it with Lava, if you know that software.

I also have an idea I would like to discuss with you. I think that if I assign sub-domains to specific cores, I can minimize the communication between nodes. That is, if the connection between cores within a node is faster than the connection between nodes, assigning neighbouring sub-domains to the cores of a single node might help, since it reduces the data transferred between nodes (through the slower connection). I would like to know your opinion on this idea.

Thanks again for your comment.

April 26, 2010, 17:37 | #11
Bruno Santos (wyldckat), Super Moderator, Lisbon, Portugal
Greetings SAHM,

Quote:
Originally Posted by sahm View Post
For example I have to define how to use 4 processors, 2 cores/node and 2 nodes. Can you tell me how to do this?
Well, I know that with OpenMPI and MPICH2 you can define which machines to use and how many cores per machine. After a little searching, I found a tutorial (MPI Tutorial) with an example of a machine file:
Code:
cat machinefile.morab 
morab001
morab001
morab002
morab002
For OpenFOAM, if I'm not mistaken, you can create a file named "machines" or "hostfile" with a list of the machines (by IP or name) you wish to use, and save that file either in the case folder or in the case's "system" folder. Then use OpenFOAM's foamJob script to launch the case. For example, to launch icoFoam in parallel with output to the screen:
Code:
foamJob -p -s icoFoam
It will create a log file named "log".
For more information about the MPI application you are using, consult its manual. You could also tweak the foamJob script (run "which foamJob" to find out where it is) to better suit your needs!
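As a concrete sketch of sahm's "4 processors as 2 cores/node on 2 nodes" question: an Open MPI hostfile can cap the ranks placed on each host with "slots". The node names below are made up, and the mpirun line is left commented out since the actual launch depends on the cluster setup:

```shell
#!/bin/sh
# Hypothetical sketch: 4 ranks as 2 nodes x 2 cores each, via an
# Open MPI hostfile.  "slots=N" limits how many ranks land on a host.
# Host names node1/node2 are made up for illustration.
cat > machines <<'EOF'
node1 slots=2
node2 slots=2
EOF
cat machines
# mpirun -np 4 --hostfile machines icoFoam -parallel > log 2>&1
```

Repeating node names (as in the morab example above) instead of using "slots" achieves the same placement.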

Quote:
Originally Posted by sahm View Post
Besides, we use a software that enqueues the jobs in my cluster, its called Lava, and I should use it to define my jobs for the cluster, otherwise, other people might not see if I'm running something. Can you tell me how to define a job on certain cores on different nodes, and I'd appreciate it if you tell me how to do it with that Lava, if you know this software.
Quick answer: check with your cluster/system administrator or the Lava manual.
Not-so-quick answer: see this short tutorial and the links it points to. From what I can briefly see, it seems Lava can operate as a wrapper for mpirun, so my guess is that you could make a symbolic link from mpirun to Lava's mpirun and use foamJob as it is now!

Quote:
Originally Posted by sahm View Post
I have an idea that I would like to discuss with you. I think if I define sub-domains for specific cores, I can make the interconnections between nodes minimum. I mean If speed of connection between cores of a node is faster than the connection of nodes, assigning neighbor sub-domains into cores of a single node might help since this reduces the data transferred between nodes ( through a slower connection). I would like to know your idea about this concept.
Yep, you are thinking in the right direction. Assuming the "machines" file is respected, each sub-domain (each processor folder made by decomposePar) should be assigned auto-magically to its respective machine, as long as mpirun follows the order in the machine list. As for defining the dimensions of each sub-domain in decomposeParDict... I have no clue. Unfortunately I have only a little experience with running OpenFOAM in parallel, and I still have a lot to master in OpenFOAM.
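On controlling sub-domain shape: one option (a sketch, not something confirmed in this thread) is the "hierarchical" decomposition method, which, unlike scotch/metis, lets you choose where the cuts fall. Combined with the host order in the machines file, that gives some control over which sub-domains end up as neighbours on the same node. The values below are illustrative only:

```shell
#!/bin/sh
# Sketch: a decomposeParDict fragment for 4 sub-domains using the
# "hierarchical" method.  n = number of pieces in x, y, z; order is
# the axis order in which cuts are made.  All values illustrative.
cat > decomposeParDict.sample <<'EOF'
numberOfSubdomains 4;
method          hierarchical;

hierarchicalCoeffs
{
    n           (2 2 1);
    delta       0.001;
    order       xyz;
}
EOF
grep method decomposeParDict.sample
```

With n = (2 2 1), sub-domains adjacent in x share the largest interfaces, so listing their target hosts consecutively in the machines file keeps those exchanges within a node.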

Best regards,
Bruno
