CFD Online Discussion Forums


etna July 18, 2013 16:42

FLOP/clock-cycle
 
hi there,

reading about the efficiency and performance of cfd-simulations i often found sentences like this: ... When running a typical CFD simulation on a cluster, the cores spend most of their time waiting for new data to arrive in the caches, which gives low performance from a FLOPs/s point of view, i.e. the realistic FLOPs/clock-cycle is far below the theoretical FLOPs/clock-cycle.

Example from a recent OpenFOAM cluster benchmark: a simulation using AMD Interlagos CPUs (theoretically 8 FLOPs/clock-cycle) is only 10% faster than the same simulation run on AMD Fangio CPUs (same as Interlagos but capped at max 2 FLOPs/clock-cycle). Notice: in theory the sim. on Interlagos CPUs should be 4 times faster than the sim. on Fangio CPUs!
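A rough way to read those numbers: assume a fraction f of the runtime is limited by the peak FLOP rate and the rest (waiting on memory or network) does not change with the CPU. An Amdahl-style model then gives speedup = 1 / ((1 - f) + f/k) for a k-times higher peak. The sketch below inverts that for the 10% figure quoted above. The model and the function names are illustrative assumptions, not part of the benchmark.

```python
def flop_limited_fraction(speedup, peak_ratio):
    """Invert speedup = 1 / ((1 - f) + f / peak_ratio) for f."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / peak_ratio)


def speedup_from_fraction(f, peak_ratio):
    """Forward model: only the fraction f gets faster by peak_ratio."""
    return 1.0 / ((1.0 - f) + f / peak_ratio)


# Numbers from the example above: 8 vs 2 FLOPs/clock-cycle, 10% faster.
f = flop_limited_fraction(speedup=1.10, peak_ratio=8 / 2)
print(f"FLOP-limited share of runtime: {f:.1%}")  # about 12%
```

Under that assumption only about an eighth of the runtime would be sensitive to the peak FLOP rate at all.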

Question 1:

are cores 'waiting' due to:
a) slow core - RAM communication?
b) slow communication between different cores (partitions) in a cluster?
c) both, depending on the core-loading (nr. of CFD grid cells per core).

Question 2:

how to increase the realistic FLOP/clock-cycle?
- if a) then i want to run my simulation on as many cores as possible (lower the nr. of cells per core)
- if b) then i want to run my simulation on as few cores as possible (increase the nr. of cells per core)
- if c) then i want to run on an optimum nr. of cells per core

Question 3:

how to find an 'optimum nr. of cells per core'?

is this nr. same for cores with high theoretical FLOP/clock-cycle (for example 8) and low theoretical FLOP/clock-cycle (for example 2 or even 1)?

kyle July 18, 2013 18:36

The answer is, as you probably expect, "c".

If you have slow RAM, or your mesh is not stored efficiently in memory, then your CPUs will spend a lot of time waiting for data to be transferred from memory.
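One common way to picture this is a roofline-style bound: the attainable FLOPs per cycle is the smaller of the core's peak and what the memory system can feed it. The sketch below uses made-up numbers (0.25 FLOP per byte of memory traffic, 4 bytes per cycle of bandwidth per core); they are assumptions for illustration only, not measurements of any real CPU or solver.

```python
def attainable_flops_per_cycle(peak, flops_per_byte, bytes_per_cycle):
    """Roofline-style bound: limited by either peak compute or memory feed rate."""
    return min(peak, flops_per_byte * bytes_per_cycle)


# Hypothetical values, chosen only to illustrate a memory-bound case.
intensity = 0.25   # FLOPs performed per byte moved from RAM (assumed)
bandwidth = 4.0    # bytes delivered per clock cycle per core (assumed)

for peak in (2.0, 8.0):
    got = attainable_flops_per_cycle(peak, intensity, bandwidth)
    print(f"peak {peak:.0f} FLOPs/cycle -> attainable {got:.1f} FLOPs/cycle")
```

With these assumed numbers both CPUs end up at the same attainable rate, because memory traffic is the limit and not the peak.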

If your network is slow, your domain is decomposed inefficiently, or your case is split across too many cores, the CPUs will be spending a lot of time waiting for data to cross the network.
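To see how the two effects trade off, here is a toy model of the time per iteration: local work scales with cells per core, halo exchange scales roughly with the surface of each partition, (cells per core)^(2/3) in 3D, plus a fixed latency. All constants and function names are hypothetical; in practice you would fit them to timings from your own case and cluster.

```python
def step_time(cells, cores, t_cell=1e-6, t_face=5e-6, t_latency=1e-4):
    """Toy time per iteration in seconds; all constants are assumed."""
    local = cells / cores
    compute = t_cell * local              # work on the cells this core owns
    if cores == 1:
        return compute                    # no neighbours, no communication
    halo = t_face * local ** (2.0 / 3.0)  # partition surface in 3D
    return compute + halo + t_latency


def efficiency(cells, cores, **kw):
    """Parallel efficiency relative to a single core."""
    return step_time(cells, 1, **kw) / (cores * step_time(cells, cores, **kw))


def largest_efficient_core_count(cells, candidates, min_eff=0.8, **kw):
    """Largest candidate core count that still meets the efficiency target."""
    ok = [p for p in candidates if efficiency(cells, p, **kw) >= min_eff]
    return max(ok) if ok else None


cells = 1e7
for p in (64, 512, 4096):
    print(p, "cores:", round(cells / p), "cells/core,",
          f"efficiency {efficiency(cells, p):.2f}")
print("pick:", largest_efficient_core_count(cells, [64, 512, 4096]))
```

The efficiency target (0.8 here) is itself a choice, which is why the "optimum" depends on what you are optimizing for.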

The optimum is very hard to define and really depends on your requirements. For many people the cost of hardware is a negligible expense, so they will use twice as many cores for only a 10% speedup. For others, hardware expense is a huge concern. Commercial software often costs more per core than the hardware, which can really affect which hardware makes sense.
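The arithmetic behind that trade-off is short: core-hours are cores times wall time, so doubling the cores for a 10% speedup costs 2 / 1.1 times as many core-hours. A minimal sketch (the function names are just for illustration):

```python
def relative_core_hours(core_factor, speedup):
    """Core-hours relative to the baseline run (cores x wall time)."""
    return core_factor / speedup


def wall_time_saved(speedup):
    """Fraction of wall-clock time saved relative to the baseline run."""
    return 1.0 - 1.0 / speedup


cost = relative_core_hours(core_factor=2, speedup=1.10)
saved = wall_time_saved(1.10)
print(f"core-hours: {cost:.2f}x baseline, wall time saved: {saved:.1%}")
```

So the run finishes about 9% sooner but uses roughly 1.8 times the core-hours, and if licenses are charged per core the license bill scales the same way.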

My feeling is many people, myself included, spend so much time obsessing over and researching hardware that any benefit in doing so is eaten up by the time it takes!

evcelica July 18, 2013 23:19

Quote:

Originally Posted by kyle (Post 440672)

My feeling is many people, myself included, spend so much time obsessing over and researching hardware that any benefit in doing so is eaten up by the time it takes!

I'm definitely guilty of that as well!

etna July 21, 2013 16:18

thank you kyle for your quick response and very clear explanation!

yep, i was expecting the answer to the 1st question to be c).

in principle i also agree with your observation that searching for the cluster 'sweet spot' (optimum nr. of cells per core) is often overrated (a loss of time).

do you think the whole idea of finding the cluster 'sweet spot' is also irrelevant when we are talking about rel. large clusters (> 10,000 cores, where each simulation can be run on 500, 1000, 2000 or even 4000 cores)?

i expect that running loads of simulations a bit more efficiently (close to the cores' sweet spot) can add up to quite a nice saving in time over a year...

and what confuses me additionally is the fact that different cores have wildly different theor. FLOPs/clock-cycle performance...

if i have one 10,000-core cluster consisting of cores with max. 2 FLOPs/clock-cycle and another one with 10,000 cores having max. 8 FLOPs/clock-cycle, how do i choose the simulation strategy for each cluster?

if i want efficiency, should i run all my simulations on the first cluster using 'only' 500 cores, while going for 4000 cores on the second cluster? or vice versa?

if someone could explain it in layman's terms I would be grateful!

