CFD Online Discussion Forums - Scale-Up Study in Parallel Processing with OpenFoam

CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)

- OpenFOAM (https://www.cfd-online.com/Forums/openfoam/)

- - Scale-Up Study in Parallel Processing with OpenFoam (https://www.cfd-online.com/Forums/openfoam/74799-scale-up-study-parallel-processing-openfoam.html)

Scale-Up Study in Parallel Processing with OpenFoam

Hello everyone, Whats the plan? :D
Recently I have done a case with different numbers of processors to check the parallel processing performance in OpenFoam, But I got a strange result.
I made a simple case and used Scotch method, which is kinda acting like Metis method. Then I used the cluseter in our lab (CMTL in UIC) to solve these cases. The cluster has 4 nodes with 8 processors on each node.
The case is a 3d Backward facing step with 748K cells and Re=500, and the solver is rhoPisoFoam.
I used 2,4,6,8,16,24,32 processors at a time to solve same Case, but after running I saw the 32-processor case was slower than 16-processor case. The attached file shows the speed of solutions ( Time step/Minute) versus number of processors
Can Any one tell me what I did wrong, or if he /she had this problem before?
Also Can anyone tell me about that Time openFoam is reporting in log file?

Hi Sahm,

Your case is too small to stress your cluster, for 32 procs you only have 23k cells per proc. Your cluster is probably spending all its time communicating. Try a larger mesh (or faster interconnect).

Cheers
Andrew

With Metis Method

I changed decomposition method to Metis method, and there was an increase to my solution speeds, The attached file shows the speed-up.

Can you upload it in pdf or open-document format ?? Running linux, so no xlsx for me =(

Sorry for that,
this file is for 2 cases. one is backstep flow, with Solver rhopisoFoam, and 750k and 1.5m cells, the other one, is for cavity case with 1M cells, any comment about this is appreciated.

Is any body Going to Help? Is any body not Going to Help?

What is your question??

I would say you should get better scalability than this but you will have to say more about the system - cores per node, interconnect type etc.

Ok,
The system I`m working with is a cluster with 4 processing Node, and 1 head node, each node has 8 processors, with Infiniband connection between nodes. I dont know about ram of each node. But I didn't have any problem with that.
The problem with this scale up study is that when I increased the number of Processors, the speed of calculation is decreased. I also tried this case with 3M cells, But still this problem exists.
Do you have any Idea why this is happening and how I can make it better?

Greetings SAHM,

It seems you have an "isolate and conquer" problem! 4 machines, 8 cores each, means that you should first test in a ramification method.

In other words:

Do a battery of test on a single node, ranging from 1 to 8 cores. Analyse results - if this doesn't scale well, then you've got a problem on how the case is divided along the cores.
If it scaled well with a single node, do another battery of tests with only two nodes, but this time test 2 to 16 nodes, in equally distributing the scale up... in other words, it's the first test, by upping both nodes at the same time per test.
Repeat this battery of tests with the other two nodes, just to make sure if the nodes and interconnects are purely symmetrical and well balanced. If the times aren't nearly identical, then you've got a bottleneck somewhere :(
If the pairs are properly balanced, then do yet another battery of tests, this time using 3 nodes, ranging 3 to 24 processors, always scaling up 1 core per node per test.
Again, you should run again this battery of tests, by swapping one of the used nodes by the one left out.
And finally, a battery of tests, using all 4 nodes, doing tests from 4 to 32 cores, always scaling 1 core in all nodes.

Of course, if you don't feel like running all of these batteries of tests, then just do the fourth one!

I suggest these tests, because by what I've seen from the first graph, it feels there is somewhat of an inertia like problem with the machines! In other words, by analysing the first graph, it seems that:

1-2-4 processors - single node
6-8 -> 3-4 nodes - indicates a lag made by the interconnect, because the nodes have to be in sync
8-16-32 -> always 4 nodes - and this part of the graph is nearly linear!

Now, if this analysis reflects what really happened, then there is nothing wrong with the results! If you always want to use the 4 nodes/32 processors, then you should have compared always using 4 nodes! And/or compare:

4 cores: 1 node*4 cores/node vs 2 nodes*2 cores/node vs 4 nodes*1 cores/node
8 cores: 1 node*8 cores/node vs 2 nodes*4 cores/node vs 4 nodes*2 cores/node
16 cores: 2 nodes*8 cores/node vs 4 nodes*4 cores/node

This way you can account for the intrusive behaviour the interconnect has on your scalability!

Additionally, don't forget that your infinyband interconnect might have options to configure for jumbo packets (useful for gigantic data transfer between nodes, which would be the case with 50-100 million cells :eek:) or very low latency (useful for constant communication between nodes).

Another thing that could hinder the scalability is how frequently do you save a case snapshot. In other words, does this case store:

every iteration of the run;
only the last iteration;
or only needs to read at the start and doesn't save anything during execution, not even at the end.

Actually, if it saves all iterations, that would explain that flat line of the 4-6-8 processors!

Sooo... to conclude... be careful what you wish for! You just might get it :cool: ... or so I would think so :P

Edit ----------------------------
I didn't see the other three graphs before I posted. But my train of thought still applies. And I should add another thing to the test list:

pick the biggest case, compare only using the 4 nodes, by upping 1 core in all nodes at same time (4-8-12-16-20-24-28-32), and always run at least 10 times each test! Do an average and deviation analysis.

For this final test, many outcomes may apply:

Scenario 1: average is linear and deviation is small.
Conclusion: what the hell :confused: Wasn't working before and now is working? This shouldn't have happened....
Scenario 2: average is linear and deviation is large.
Conclusion: you've got a serious bottleneck somewhere... could be storage, interconnect, network interference, or someone left something running in the nodes, like a file indexing daemon...
Scenario 3: average is not linear and deviation is small.
Conclusion: the problem is reproducible. This means that hunting down the real problem will be easier this way, because when you change something, it will be directly reflected on the results!

And yet another thing comes to mind: Do the CPUs in the nodes have the "turbo boost" feature that some modern CPUs have? For example, when the CPU uses 8 cores, runs at 3GHz; but with 4 cores, runs at 3.3GHz; and with 2 cores runs at 3.6GHz? And with a single core, runs at 4GHz! This would seriously affect results...

Best regards,
Bruno

Wow, Still there is a question.

Thanks Bruno. I forgot to erase the first sheet since that case was not defined well, but the other cases have the same problem. About your comments to run them with different types of parallelism, I have a question.

Actually I don't know, how to assign a certain sub-domain (of a decomposed domain) into a certain processor. I mean to run your tests, I need to define a specific processor for a specific sub-domain, or at least I should define how many cores of a node I`m going to use. For example I have to define how to use 4 processors, 2 cores/node and 2 nodes. Can you tell me how to do this?

Since this cluster is shared between people in the lab, I need to ask for their permission when I'm going to use more than 8 cores. So running your case might take a long time. Besides, we use a software that enqueues the jobs in my cluster, its called Lava, and I should use it to define my jobs for the cluster, otherwise, other people might not see if I'm running something. Can you tell me how to define a job on certain cores on different nodes, and I'd appreciate it if you tell me how to do it with that Lava, if you know this software.

I have an idea that I would like to discuss with you. I think if I define sub-domains for specific cores, I can make the interconnections between nodes minimum. I mean If speed of connection between cores of a node is faster than the connection of nodes, assigning neighbor sub-domains into cores of a single node might help since this reduces the data transferred between nodes ( through a slower connection). I would like to know your idea about this concept.

Thanks again for your comment.

Greetings SAHM,

Quote:

Originally Posted by sahm (Post 256371)

For example I have to define how to use 4 processors, 2 cores/node and 2 nodes. Can you tell me how to do this?

Well, I know that with OpenMPI and MPICH2 you can define which machines to use and how many cores per machine. After a little searching, I found a tutorial (--> MPI Tutorial <--) that has an example of a machine file:

Code:

cat machinefile.morab 

morab001

morab001

morab002

morab002

For OpenFOAM, if I'm not mistaken, you can create a file named "machines" or "hostfile" with a list of machines (by IP or name) you wish to use; then save that file either in the case folder or in the "system" folder of the case. Then use OpenFOAM's foamJob script for launching the case. For example, to launch icoFoam in parallel and with output to screen:

Code:

foamJob -p -s icoFoam

And it will create a log file named "log".
So, for more information about the MPI application you are using, you should consult its manual ;) You could also tweak the foamJob script (run "which foamJob" to know where it is ;)) to better suit your needs!

Quote:

Originally Posted by sahm (Post 256371)

Besides, we use a software that enqueues the jobs in my cluster, its called Lava, and I should use it to define my jobs for the cluster, otherwise, other people might not see if I'm running something. Can you tell me how to define a job on certain cores on different nodes, and I'd appreciate it if you tell me how to do it with that Lava, if you know this software.

Quick answer: check with your cluster/system's administrator or the Lava manual ;)
Not-so-quick answer: See --> this <-- short tutorial and the links it points to. But by what I can briefly see, it seems that it can operate as a wrapper for mpirun. So, my guess is that you could do a symbolic link of mpirun to Lava's mpirun and use foamJob as it is now!

Quote:

Originally Posted by sahm (Post 256371)

I have an idea that I would like to discuss with you. I think if I define sub-domains for specific cores, I can make the interconnections between nodes minimum. I mean If speed of connection between cores of a node is faster than the connection of nodes, assigning neighbor sub-domains into cores of a single node might help since this reduces the data transferred between nodes ( through a slower connection). I would like to know your idea about this concept.

Yep, you are thinking in the right direction :) Assuming that the "machines" file will be respected, you should be able to assign auto-magically each sub-domain (each processor folder made by decomposePar) to each respective machine, as long as mpirun will follow the order in the machine list. As for defining the dimensions of each sub-domain in the decomposeParDict... I have no clue :( Unfortunately I only have a little experience with running OpenFOAM in parallel, and I still have a lot to master in OpenFOAM :p

Best regards,
Bruno