|
Scaling from 1 to 2 nodes shows 122% performance increase |
|
October 26, 2012, 18:41 |
Scaling from 1 to 2 nodes shows 122% performance increase
|
#1 |
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,167
Rep Power: 23 |
Just thought I'd share the somewhat unexpected results of my 2-node "cluster". I'm using two identical i7-3930K machines overclocked to 4.4 GHz, each with 32 GB of 2133 MHz RAM. They are connected over Intel gigabit Ethernet, and I'm running ANSYS CFX v14 with Platform MPI.
The benchmark case has ~4 million nodes: a steady-state thermal problem with multiple domains. Comparing one computer running 4 cores against two computers running 4 cores each, I measured a speedup of 2.22x! So much for linear scaling. Has anyone else seen this? It seems a little odd to me, though I'm definitely happy about it. Something to consider if anyone has been thinking about adding a second node.
EDIT: After running it a few more times I realized that during my single-node simulation I accidentally had the CPU downclocked to 3.8 GHz instead of 4.4 GHz. The ~15.6% overclock gave the extra ~11% speed per node. Running it again with the same 4.4 GHz clock on all nodes, I got 99.5% scaling efficiency. Sorry for the misinformation.
Last edited by evcelica; October 31, 2012 at 10:58. Reason: Mistake in information |
|
October 27, 2012, 18:22 |
|
#2 |
Senior Member
Join Date: Oct 2009
Location: Germany
Posts: 636
Rep Power: 21 |
There are two possible explanations:
Either the method used to measure the performance is flawed in some way; it would be interesting to see how you judged the performance gain. Or the gain comes from the increased aggregate memory bandwidth when running on both nodes. The cell count per core is not that low, so you shouldn't be running into scaling issues from communication latency. Your memory may also be very fast, depending on the timings. Still, I have never seen anything like this: on our Sandy Bridge cluster we see nearly linear scaling down to very low cells/core, and it performs VERY well, but the speedup efficiency was never above 1.
__________________
We do three types of jobs here: GOOD, FAST AND CHEAP You may choose any two! |
|
October 28, 2012, 06:06 |
|
#3 |
Senior Member
Charles
Join Date: Apr 2009
Posts: 185
Rep Power: 18 |
Better than 100% scaling on very small clusters is not so unusual, because you can benefit from the additional extremely fast cache memory that becomes available.
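To put rough numbers on the cache argument, here is a back-of-envelope sketch. The 12 MB L3 figure for the i7-3930K is an assumption, not stated in the thread; the ~4 million mesh nodes are from the first post.

```python
# Back-of-envelope: aggregate L3 cache available per mesh node.
# Assumes a 12 MB L3 per i7-3930K machine (assumption, not from the thread)
# and the ~4M-node benchmark case described in the first post.
L3_PER_MACHINE = 12 * 1024**2   # bytes of L3 per machine (assumed)
MESH_NODES = 4_000_000          # ~4 million mesh nodes (from the post)

for machines in (1, 2):
    cache_per_node = machines * L3_PER_MACHINE / MESH_NODES
    print(f"{machines} machine(s): {cache_per_node:.2f} bytes of L3 per mesh node")
```

Doubling the machines doubles the total L3 while each machine's partition halves, so a larger fraction of the working set stays cache-resident; when that fraction crosses a threshold, per-node throughput can rise and scaling can appear super-linear.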
|
|
October 28, 2012, 11:32 |
|
#4 |
Senior Member
Join Date: Oct 2009
Location: Germany
Posts: 636
Rep Power: 21 |
I agree, CapSizer. I just wondered why I haven't seen this before, since the Sandy Bridge E CPUs in our cluster have 20 MB of cache each, every pair communicates via QPI, and all nodes communicate via InfiniBand.
So I would still say it strongly depends on the case and the hardware used.
__________________
We do three types of jobs here: GOOD, FAST AND CHEAP You may choose any two! |
|
October 28, 2012, 12:09 |
|
#5 |
Senior Member
Charles
Join Date: Apr 2009
Posts: 185
Rep Power: 18 |
I also think the way the solver is parallelized may have an effect. It has been a long time since I used CFX, but from what I remember it was less sensitive to inter-node communication than other solvers I have used, as if it did a lot of work per iteration before communicating between nodes. So perhaps it suffers less from inter-node latency and is more likely to benefit from the extra cache? Just a guess ...
|
|
October 31, 2012, 10:58 |
|
#6 |
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,167
Rep Power: 23 |
EDIT:
After running it a few more times I realized that during my single-node simulation I accidentally had the CPU downclocked to 3.8 GHz instead of 4.4 GHz. The ~15.6% overclock gave the extra ~11% speed per node. Running it again with the same 4.4 GHz clock on all nodes, I got 99.5% scaling efficiency. Sorry for the misinformation. |
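For completeness, the corrected figure implies a near-linear speedup. Only the 99.5% efficiency and the 2-node setup come from the post; the rest is trivial arithmetic.

```python
# Speedup implied by the re-measured 99.5% efficiency on 2 nodes.
nodes = 2
measured_efficiency = 0.995        # re-measured with equal 4.4 GHz clocks
implied_speedup = measured_efficiency * nodes
print(f"implied speedup: {implied_speedup:.2f}x")  # ~1.99x, i.e. near-linear
```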
|
|
|