Computational cost for floating point multiplication/division/summation/subtraction |
|
March 5, 2016, 11:31 |
Computational cost for floating point multiplication/division/summation/subtraction
|
#1 |
New Member
Donghyuk Shin
Join Date: May 2013
Posts: 1
Rep Power: 0 |
I am curious about how to relate floating point operations (FLOPs) to multiplications, divisions, additions, and subtractions.
For example, the following link says "For example, for a double-precision floating-point division, this method uses 10 multiplies, 9 adds, and 2 shifts." https://en.wikipedia.org/wiki/Divisi...phson_division
Ultimately, I would like to know the computational cost of my algorithm. For example, what would the cost be for a central difference scheme, which can be written as
dudx[i] = (u[i+1] - u[i-1]) / (x[i+1] - x[i-1])
Thank you. |
|
March 6, 2016, 05:03 |
|
#2 | |
Senior Member
Michael Prinkey
Join Date: Mar 2009
Location: Pittsburgh PA
Posts: 363
Rep Power: 25 |
http://www.agner.org/optimize/instruction_tables.pdf As you review that document, you will see the complexity of modern CPU instruction sets. The simple code that you have above would naively translate into two FSUBs and a single FDIV. But there are many (MANY) other factors that will affect the code compilation, optimization, and ultimate performance.
First, if you are computing these values more than once (as is generally the case in CFD), you could reasonably precompute invTwoDx[i] = 1.0/(x[i+1] - x[i-1]) for all i and then recast your code above as:
dudx[i] = (u[i+1] - u[i-1]) * invTwoDx[i]
That reduces every subsequent computation to one subtraction and one multiplication, which will always give you a significant speed improvement.
Second, there are SSE and AVX instructions that can perform multiple ADDs, MULs, and DIVs in a single instruction by grouping several dudx computations for, say, two, four, or eight different values of i at the same time.
One other issue relates to cache and main memory access. In your formula, naively implemented, there are four floating point reads from memory (u[i+1], u[i-1], x[i+1], and x[i-1]) and one write (dudx[i]) balanced against two FSUBs and one FDIV. Depending on the size of the range of i and, even more importantly, on the dimension of the problem (1D vs. 2D or 3D), the span of memory segments being accessed will begin to limit the performance of the code: the floating point units will sit waiting for memory requests to be fulfilled.
So, my final comment is that, while I understand your question, it is important to recognize that the performance of even a simple formula like your example is very complicated due to modern CPU and memory system design. Figuring out how to make these implementations faster requires a more comprehensive understanding of these details and, honestly, becomes a trial-and-error process using a code profiler. |
||
March 6, 2016, 05:27 |
|
#3 |
Senior Member
Filippo Maria Denaro
Join Date: Jul 2010
Posts: 6,764
Rep Power: 71 |
I totally agree.
Furthermore, many compilers offer a pre-processing phase (at high optimization levels) that can drastically transform your original source code, performing the optimizations that Michael described. Many years ago, the KAP precompilers let you inspect the transformed, optimized source, but I have not yet found an option in modern compilers (for example, PGI) to view this code... |
|
March 6, 2016, 10:11 |
|
#4 |
Senior Member
Jonas T. Holdeman, Jr.
Join Date: Mar 2009
Location: Knoxville, Tennessee
Posts: 128
Rep Power: 18 |
Wow! This would have been a hot topic a few decades ago, when processor speeds were a few megahertz or less and memory was limited to 64KB or less. But those days have gone the way of the 8-inch floppy disk (except perhaps for embedded microcontrollers). When computing was very expensive, it was practical to spend a lot of time optimizing code. But today the cost of a programmer's time is much higher than the savings in computer cost that might accumulate over the life of the software.
One thing missing here is the cost of looping and branching, where prefetched and speculatively executed code must be discarded. In an application, the costs of input/output and graphics must also be considered. And computer hardware has been changing so rapidly that what is most efficient today may not be so tomorrow. The simple answer is to code the alternatives and do timing studies. If alternative coding makes little difference overall, then worry more about improving your algorithms. |
|
August 8, 2016, 23:04 |
teraflops / petaflops / exaflops: FSI Problems
|
#5 |
Senior Member
kunar
Join Date: Nov 2011
Posts: 117
Rep Power: 14 |
Dear Friends,
I am trying to estimate the computational time (in days), expressed in FLOPs, for FSI problems (for example, flexible flapping wings). I have seen some literature mentioning that 800 teraflops is sufficient.
Regards, HHH
__________________
kunar |
|
|
|