|
Scaling from 1 to 2 nodes shows 122% performance increase |
|
October 26, 2012, 18:41 |
Scaling from 1 to 2 nodes shows 122% performance increase
|
#1 |
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,167
Rep Power: 23 |
Just thought I'd share the somewhat unexpected results of my 2-node "cluster". I'm using two identical i7-3930K machines overclocked to 4.4 GHz, each with 32 GB of 2133 MHz RAM. They are connected over Intel gigabit Ethernet, and I'm running ANSYS CFX v14 with Platform MPI.
The benchmark case has ~4 million nodes: a steady-state thermal problem with multiple domains. Comparing one computer running 4 cores against two computers running 4 cores each, I measured a speedup of 2.22x! So much for linear scaling. Has anyone else seen this? It seems a little odd to me, though I'm definitely happy about it. Something to consider if anyone has been thinking about adding a second node.
EDIT: After running it a few more times I realized that during my single-node simulation I accidentally had the CPU downclocked to 3.8 GHz instead of 4.4 GHz. The ~15.6% overclock gave the extra ~11% speed per node. Running it again with the same 4.4 GHz clock on all nodes, I got 99.5% scaling efficiency. Sorry for the misinformation.
Last edited by evcelica; October 31, 2012 at 10:58. Reason: Mistake in information |
|
October 27, 2012, 18:22 |
|
#2 |
Senior Member
Join Date: Oct 2009
Location: Germany
Posts: 636
Rep Power: 21 |
There are two possible explanations:
Either the method used to measure the performance is flawed in some way; it would be interesting to see how you judged the performance gain. Or the gain comes from the increased aggregate memory bandwidth when running on both nodes. The cell count per core is not that low, so you shouldn't be running into scaling issues from communication latency. Your memory may also be very fast, depending on the timings. Still, I have never seen anything like this: on our Sandy Bridge cluster we see nearly linear scaling down to very low cells/core, and it performs VERY well, but the speedup efficiency was never above 1.
__________________
We do three types of jobs here: GOOD, FAST AND CHEAP You may choose any two! |
|
October 28, 2012, 06:06 |
|
#3 |
Senior Member
Charles
Join Date: Apr 2009
Posts: 185
Rep Power: 18 |
Better than 100% scaling on very small clusters is not so unusual, because you can benefit from the additional extremely fast cache memory that becomes available.
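To put rough numbers on the cache argument, here is a back-of-envelope sketch. The 12 MB L3 figure for the i7-3930K is an assumption, not stated in the thread; the ~4 million mesh nodes are from the first post.

```python
# Back-of-envelope: aggregate L3 cache available per mesh node.
# Assumes a 12 MB L3 per i7-3930K machine (assumption, not from the thread)
# and the ~4M-node benchmark case described in the first post.
L3_PER_MACHINE = 12 * 1024**2   # bytes of L3 per machine (assumed)
MESH_NODES = 4_000_000          # ~4 million mesh nodes (from the post)

for machines in (1, 2):
    cache_per_node = machines * L3_PER_MACHINE / MESH_NODES
    print(f"{machines} machine(s): {cache_per_node:.2f} bytes of L3 per mesh node")
```

Doubling the machines doubles the total L3 while each machine's partition halves, so a larger fraction of the working set stays cache-resident; when that fraction crosses a threshold, per-node throughput can rise and scaling can appear super-linear.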
|
|
October 28, 2012, 11:32 |
|
#4 |
Senior Member
Join Date: Oct 2009
Location: Germany
Posts: 636
Rep Power: 21 |
I agree, CapSizer. I just wondered why I haven't seen this before, since the Sandy Bridge E CPUs in our cluster have 20 MB of cache each, every pair communicates via QPI, and all nodes communicate via InfiniBand.
So I would still say it strongly depends on the case and the hardware used.
__________________
We do three types of jobs here: GOOD, FAST AND CHEAP You may choose any two! |
|
October 28, 2012, 12:09 |
|
#5 |
Senior Member
Charles
Join Date: Apr 2009
Posts: 185
Rep Power: 18 |
I also think the way the solver is parallelized may have an effect. It has been a long time since I used CFX, but from what I remember it was less sensitive to inter-node communication than other solvers I have used, as if it did a lot of work per iteration before communicating between nodes. So perhaps it suffers less from inter-node latency and is more likely to benefit from the extra cache? Just a guess ...
|
|
October 31, 2012, 10:58 |
|
#6 |
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,167
Rep Power: 23 |
EDIT:
After running it a few more times I realized that during my single-node simulation I accidentally had the CPU downclocked to 3.8 GHz instead of 4.4 GHz. The ~15.6% overclock gave the extra ~11% speed per node. Running it again with the same 4.4 GHz clock on all nodes, I got 99.5% scaling efficiency. Sorry for the misinformation. |
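For completeness, the corrected figure implies a near-linear speedup. Only the 99.5% efficiency and the 2-node setup come from the post; the rest is trivial arithmetic.

```python
# Speedup implied by the re-measured 99.5% efficiency on 2 nodes.
nodes = 2
measured_efficiency = 0.995        # re-measured with equal 4.4 GHz clocks
implied_speedup = measured_efficiency * nodes
print(f"implied speedup: {implied_speedup:.2f}x")  # ~1.99x, i.e. near-linear
```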
|
|
|