Scaling from 1 to 2 nodes shows 122% performance increase

Post #1, October 26, 2012, 18:41, evcelica (Erik):
Just thought I'd share the somewhat unexpected results from my 2-node "cluster". I'm using two identical i7-3930K computers overclocked to 4.4 GHz, each with 32 GB of 2133 MHz RAM. They are connected over Intel gigabit Ethernet, and I'm running ANSYS CFX v14 with Platform MPI.

Benchmark case has ~4 million nodes - steady state thermal with multiple domains.

When comparing:
1 computer running 4 cores to
2 computers running 4 cores each

My speedup comes out to 2.22 times faster!
So much for linear scaling. Has anyone else seen this? It seems a little odd to me, though I'm definitely happy about it!
This is something to consider if anyone has been thinking about adding a second node.


EDIT:
After running it a few more times, I realized that during my single-node simulation I had accidentally left the CPU downclocked to 3.8 GHz instead of 4.4 GHz. So the roughly 15.8% higher clock (4.4/3.8) is what gave me the extra 11% speed per node. Running it again at the same 4.4 GHz clock speed on all nodes, I got 99.5% scaling efficiency. Sorry for the misinformation.
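
For readers who want to redo the arithmetic, here is a minimal Python sketch using only the figures quoted in this post (2.22x speedup, 2 machines, 3.8 GHz vs 4.4 GHz, 99.5% efficiency after the fix). The variable names and the "clock sensitivity" ratio are illustrative assumptions, not anything reported by CFX.

```python
def parallel_efficiency(speedup, n_machines):
    """Speedup divided by the factor by which the hardware was scaled up."""
    return speedup / n_machines

# Figures quoted in the post
observed_speedup = 2.22      # 2 machines x 4 cores vs 1 machine x 4 cores
n_machines = 2
slow_ghz = 3.8               # single-machine run (accidentally downclocked)
fast_ghz = 4.4               # two-machine run
corrected_efficiency = 0.995 # re-run with 4.4 GHz everywhere

# Apparent efficiency of the flawed comparison: above 1.0, i.e. "superlinear"
apparent_efficiency = parallel_efficiency(observed_speedup, n_machines)

# Relative clock difference between the two runs
clock_gain = fast_ghz / slow_ghz - 1.0

# Speedup of the fair comparison, and the per-machine gain that the
# flawed comparison mistakenly folded into the scaling figure
true_speedup = corrected_efficiency * n_machines
implied_per_machine_gain = observed_speedup / true_speedup - 1.0

# Assumption: treat the ratio below as a rough "clock sensitivity";
# a value under 1.0 means the solver sped up by less than the clock did
clock_sensitivity = implied_per_machine_gain / clock_gain

print(f"apparent efficiency      : {apparent_efficiency:.3f}")
print(f"clock gain               : {clock_gain:.1%}")
print(f"fair-comparison speedup  : {true_speedup:.2f}")
print(f"implied per-machine gain : {implied_per_machine_gain:.1%}")
print(f"clock sensitivity        : {clock_sensitivity:.2f}")
```

The per-machine gain comes out a little above 11%, in line with the figure in the edit, and it is smaller than the clock difference itself. One possible reading is that this case is not purely clock-bound, but that is an inference from two data points, not a measurement.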

Last edited by evcelica; October 31, 2012 at 11:58. Reason: Mistake in information

Post #2, October 27, 2012, 18:22, abdul099:
There are two possible explanations:

Either the method used to measure the performance fails for some reason. It would be interesting to see how you judged the performance gain.

Or it might be due to the increased memory bandwidth available when running on both nodes. The cell count per core is not that low, so you should not run into scaling issues due to communication latency. And of course, your memory might be very fast, depending on the memory timings.
But I have never seen anything similar. Although we see nearly linear scaling on our Sandy Bridge cluster down to very low cells/core, and it performs VERY well, speedup efficiency has never been above 1.
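
To put a number on "cell count per core" for this benchmark, here is a small Python sketch based on the figures in the first post (~4 million mesh nodes, 4 cores used per machine). The function name is hypothetical, and mesh nodes are used as a stand-in for cells, since only the node count was given.

```python
def mesh_nodes_per_core(mesh_nodes, machines, cores_per_machine):
    """Average number of mesh nodes each solver process has to handle."""
    return mesh_nodes / (machines * cores_per_machine)

MESH_NODES = 4_000_000        # "~4 million nodes" from the first post
CORES_USED_PER_MACHINE = 4    # both runs used 4 cores per machine

single = mesh_nodes_per_core(MESH_NODES, 1, CORES_USED_PER_MACHINE)
dual = mesh_nodes_per_core(MESH_NODES, 2, CORES_USED_PER_MACHINE)

print(f"1 machine : {single:,.0f} mesh nodes per core")
print(f"2 machines: {dual:,.0f} mesh nodes per core")
```

Even on two machines each core still carries about half a million mesh nodes, which is consistent with the point that the partitions are far from small.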
__________________
We do three types of jobs here:
GOOD, FAST AND CHEAP
You may choose any two!

Post #3, October 28, 2012, 07:06, CapSizer (Charles):
Better than 100% scaling on very small clusters is not so unusual, because you can benefit from the additional extremely fast cache memory that becomes available.

Post #4, October 28, 2012, 12:32, abdul099:
I agree, CapSizer. I just wondered why I haven't seen this before, since the Sandy Bridge-E CPUs in our cluster have 20 MB of cache, and each pair can communicate via QPI and, via InfiniBand, with all the others.
Therefore I would still say it strongly depends on the case and the hardware used.

Post #5, October 28, 2012, 13:09, CapSizer (Charles):
I also think that the way the solver is parallelized may have an effect. It has been a long time since I used CFX, but from what I remember it was less sensitive to inter-node communication than other solvers I have used. It is as if it does a lot of work per iteration before communicating between nodes, so perhaps it suffers less from inter-node latency and is more likely to benefit from the extra cache? Just a guess ...

Post #6, October 31, 2012, 11:58, evcelica (Erik):
EDIT:
After running it a few more times, I realized that during my single-node simulation I had accidentally left the CPU downclocked to 3.8 GHz instead of 4.4 GHz. So the roughly 15.8% higher clock (4.4/3.8) is what gave me the extra 11% speed per node. Running it again at the same 4.4 GHz clock speed on all nodes, I got 99.5% scaling efficiency. Sorry for the misinformation.