www.cfd-online.com
Home > Forums > General Forums > Hardware

FLOP/clock-cycle

July 18, 2013, 16:42   #1
etna (New Member, Join Date: Jul 2013, Posts: 2)
hi there,

reading about the efficiency and performance of CFD simulations, i often find statements like this: "When running a typical CFD simulation on a cluster, the cores spend most of their time waiting for new data to arrive in the caches. This gives low performance from a FLOPs/s point of view, i.e., the realistic FLOPs/clock-cycle is far below the theoretical FLOPs/clock-cycle."

Example from a recent OpenFOAM cluster benchmark: a simulation using AMD Interlagos CPUs (theoretically 8 FLOPs/clock-cycle) is only 10% faster than the same simulation run on AMD Fangio CPUs (same as Interlagos but capped at max 2 FLOPs/clock-cycle). Notice: in theory the simulation on the Interlagos CPUs should be 4 times faster than on the Fangio CPUs!
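to make that example concrete, here is a rough roofline-style estimate in python (the clock, bandwidth and intensity numbers below are made-up assumptions, not measurements) of why the 4x gap in peak FLOPs/clock-cycle can almost vanish for a memory-bound code:

```python
# roofline-style estimate: the attainable FLOP rate per core is the smaller
# of the core's peak rate and (memory bandwidth * arithmetic intensity).
def attainable_gflops(peak_gflops, bw_gb_per_s, flops_per_byte):
    return min(peak_gflops, bw_gb_per_s * flops_per_byte)

clock_ghz = 2.3      # assumed clock frequency
bw_per_core = 3.0    # assumed GB/s per core once all cores share the memory bus
intensity = 0.2      # assumed FLOP/byte, typical of sparse CFD kernels

interlagos = attainable_gflops(8 * clock_ghz, bw_per_core, intensity)
fangio = attainable_gflops(2 * clock_ghz, bw_per_core, intensity)

# both chips hit the same memory-bandwidth ceiling (0.6 GFLOP/s here),
# so the 8-vs-2 FLOPs/cycle peak difference barely shows up in runtime.
print(interlagos, fangio)
```

as soon as both chips are limited by the same bandwidth ceiling, the extra peak FLOPs of the Interlagos buy almost nothing, which would be consistent with the ~10% observed difference.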

Question 1:

are the cores 'waiting' due to:
a) slow core-to-RAM communication?
b) slow communication between different cores (partitions) in a cluster?
c) both, depending on the core loading (nr. of CFD grid cells per core)?

Question 2:

how can i increase the realistic FLOPs/clock-cycle?
- if a), then i want to run my simulation on as many cores as possible (lowering the nr. of cells per core)
- if b), then i want to run my simulation on as few cores as possible (increasing the nr. of cells per core)
- if c), then i want to run on an optimum nr. of cells per core

Question 3:

how do i find an 'optimum nr. of cells per core'?

is this nr. the same for cores with a high theoretical FLOPs/clock-cycle (for example 8) and a low one (for example 2 or even 1)?
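to illustrate what such an 'optimum' could mean, here is a toy strong-scaling model in python (the constants a, b, c are pure inventions for illustration; a real sweet spot has to be measured on the actual cluster):

```python
# toy model of one solver iteration at n cells per core (constants assumed):
#   a * n        : volume work, memory-bound streaming over the local cells
#   b * n**(2/3) : halo exchange, scales with the surface of a cubic partition
#   c            : fixed per-iteration latency (messages, synchronization)
def iter_time(n, a=1e-8, b=2e-7, c=5e-5):
    return a * n + b * n ** (2 / 3) + c

total_cells = 50_000_000
serial = iter_time(total_cells)  # hypothetical single-core time
for cores in (500, 1000, 2000, 4000):
    n = total_cells // cores
    eff = serial / (cores * iter_time(n))
    print(f"{cores:5d} cores, {n:7d} cells/core, parallel efficiency ~ {eff:.0%}")
```

with these made-up constants the efficiency falls steadily as cells/core shrinks; the 'sweet spot' is wherever your own measured curve crosses the efficiency you are still willing to pay for.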

July 18, 2013, 18:36   #2
kyle (Senior Member, Austin, TX, Join Date: Mar 2009, Posts: 160)
The answer is, as you probably expect, "c".

If you have slow RAM, or your mesh is not stored efficiently in memory, then your CPUs will spend a lot of time waiting for data to be transferred from memory.

If your network is slow, your domain is decomposed inefficiently, or your case is split across too many cores, the CPUs will be spending a lot of time waiting for data to cross the network.

The optimum is very hard to define and really depends on your requirements. For many people the cost of hardware is a negligible expense, so they will use twice as many cores for only a 10% speedup. For others, hardware expense is a huge concern. Commercial software often costs more per core than the hardware, which really affects which hardware makes sense.
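To put rough numbers on that licensing point, here is a trivial cost sketch in Python (every price, runtime and job count below is invented for illustration):

```python
# total cost of one job = hardware core-hours + amortized per-core licenses
# (all prices, runtimes and job counts here are invented for illustration)
def job_cost(cores, runtime_hours, core_hour_price=0.05,
             license_per_core=2000.0, jobs_per_license_year=500):
    hardware = cores * runtime_hours * core_hour_price
    licenses = cores * license_per_core / jobs_per_license_year
    return hardware + licenses

# doubling the cores for a 10% faster run:
base = job_cost(cores=256, runtime_hours=10.0)
double = job_cost(cores=512, runtime_hours=10.0 / 1.1)
print(f"256 cores: {base:.0f}, 512 cores: {double:.0f}")
```

With per-core licensing in the mix, doubling the core count for a 10% speedup roughly doubles the cost per job, which is why the license model can dominate the hardware choice.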

My feeling is many people, myself included, spend so much time obsessing over and researching hardware that any benefit in doing so is eaten up by the time it takes!

July 18, 2013, 23:19   #3
evcelica (Erik, Senior Member, Earth (Land portion), Join Date: Feb 2011, Posts: 1,167)
Quote:
Originally Posted by kyle

My feeling is many people, myself included, spend so much time obsessing over and researching hardware that any benefit in doing so is eaten up by the time it takes!
I'm definitely guilty of that as well!

July 21, 2013, 16:18   #4
etna (New Member, Join Date: Jul 2013, Posts: 2)
thank you kyle for your quick response and very clear explanation!

yep, i was expecting the answer to the 1st question to be c).

in principle i also agree with your observation that searching for the cluster 'sweet spot' (optimum nr. of cells per core) is often overrated (a loss of time).

do you think the whole idea of finding the cluster 'sweet spot' is irrelevant even when we are talking about relatively large clusters (> 10,000 cores, where each simulation can be run on 500, 1000, 2000 or even 4000 cores)?

i expect that running loads of simulations a bit more efficiently (close to the cores' sweet spot) can add up to quite a nice saving in time over a year...

what additionally confuses me is the fact that different cores have wildly different theoretical FLOPs/clock-cycle performance...

if i have one 10,000-core cluster consisting of cores with max. 2 FLOPs/clock-cycle and another 10,000-core cluster with max. 8 FLOPs/clock-cycle, how do i choose the simulation strategy for each cluster?

if i want efficiency, should i run all my simulations on the first cluster using 'only' 500 cores, while going for 4000 cores on the second cluster? or vice versa?

if someone could explain it in layman's terms I would be grateful!
