Computational cost for floating point multiplication/division/summation/subtraction

#1 | March 5, 2016, 11:31
Donghyuk Shin (herofmm), New Member | Join Date: May 2013 | Posts: 1
I am curious how floating-point operation counts (FLOPs) relate to multiplications, divisions, additions, and subtractions.

For example, in the following link, it says "For example, for a double-precision floating-point division, this method uses 10 multiplies, 9 adds, and 2 shifts."

https://en.wikipedia.org/wiki/Divisi...phson_division

Ultimately, I would like to know the computational cost of my algorithm. For example, what would the cost be for a central difference scheme, which can be written as

dudx[i] = (u[i+1] - u[i-1]) / (x[i+1] - x[i-1]).

Thank you.

#2 | March 6, 2016, 05:03
Michael Prinkey (mprinkey), Senior Member | Join Date: Mar 2009 | Location: Pittsburgh, PA | Posts: 363
Quote: Originally posted by herofmm (the question quoted in full above).
On most modern CPUs, multiplication and addition cost about the same...typically one result per cycle or better in terms of throughput. Floating-point division generally takes several times as many cycles. But the exact counts depend on the architecture. Here is a reasonably up-to-date document that lists the cycle counts of various CPU instructions:

http://www.agner.org/optimize/instruction_tables.pdf
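To give a sense of scale (rough, illustrative numbers only; the tables give the real figures per microarchitecture): a double-precision add or multiply typically sustains one or more results per cycle, while a double-precision divide costs on the order of 10-40 cycles and pipelines poorly, so the single divide in your expression can easily dominate everything else in it.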

As you review that document, you will see the complexity of modern CPU instruction sets. The simple code you have above would naively translate into two FSUBs and a single FDIV. But there are many (MANY) other factors that affect compilation, optimization, and ultimate performance. First, if you are computing these values more than once (as is generally the case in CFD), you can reasonably precompute invTwoDx[i] = 1.0/(x[i+1] - x[i-1]) for all i and then recast your code above as:

dudx[i] = (u[i+1] - u[i-1]) * invTwoDx[i]

That reduces every subsequent evaluation to one subtraction and one multiplication per point, which will almost always give you a significant speed improvement.
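To make that concrete, here is a minimal sketch in C (assuming double arrays u, x, dudx, and invTwoDx of length n already exist, and treating only the interior points):

Code:
/* Pay the divides once, up front, for a fixed grid. */
for (int i = 1; i < n - 1; i++)
    invTwoDx[i] = 1.0 / (x[i+1] - x[i-1]);

/* Every later derivative evaluation is then one subtract and one multiply per point. */
for (int i = 1; i < n - 1; i++)
    dudx[i] = (u[i+1] - u[i-1]) * invTwoDx[i];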

Second, there are SSE and AVX instructions that can do multiple ADDs, MULs, and DIVs in a single instruction by grouping the dudx computations for, say, two, four, or eight different values of i at the same time.
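Purely as an illustration (a sketch, not a drop-in routine; the function name is mine, and the leftover iterations when the point count is not a multiple of 4 are assumed to be handled elsewhere), an AVX version of the precomputed-reciprocal loop processes four doubles per instruction:

Code:
#include <immintrin.h>

/* Sketch: central difference for interior points, four at a time. */
void ddx_avx(const double *u, const double *invTwoDx, double *dudx, int n)
{
    for (int i = 1; i + 4 <= n - 1; i += 4) {
        __m256d up  = _mm256_loadu_pd(&u[i + 1]);   /* u[i+1] .. u[i+4] */
        __m256d um  = _mm256_loadu_pd(&u[i - 1]);   /* u[i-1] .. u[i+2] */
        __m256d inv = _mm256_loadu_pd(&invTwoDx[i]);
        _mm256_storeu_pd(&dudx[i],
                         _mm256_mul_pd(_mm256_sub_pd(up, um), inv));
    }
}

In practice, a compiler at a high optimization level with vectorization enabled will usually generate something equivalent from the plain loop, so explicit intrinsics are rarely needed for code this simple.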

One other issue relates to cache and main memory access. In your formula, naively implemented, there are four floating-point reads from memory (u[i+1], u[i-1], x[i+1], and x[i-1]) and one write (dudx[i]) balanced against two FSUBs and one FDIV. Depending on the size of the range of i and, even more importantly, on the dimension of the problem (1D versus 2D or 3D), the span of memory being accessed will begin to limit the performance of the code...the floating-point units end up waiting on memory requests to be fulfilled.
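To put a rough number on that (back-of-the-envelope, assuming 8-byte doubles and no cache reuse): five memory accesses per point is 5 x 8 = 40 bytes of traffic for only 3 floating-point operations, i.e. less than 0.1 flop per byte. Modern CPUs can sustain far more flops per byte of memory bandwidth than that, so the loop becomes memory-bound as soon as the arrays no longer fit in cache.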

So, my final comment is that, while I understand your question, it is important to recognize that the performance of even a simple formula like your example is complicated by modern CPU and memory system design. Figuring out how to make these implementations faster requires a more comprehensive understanding of those details and, honestly, becomes a trial-and-error process using a code profiler.

#3 | March 6, 2016, 05:27
Filippo Maria Denaro (FMDenaro), Senior Member | Join Date: Jul 2010 | Posts: 6,764
I totally agree.

Furthermore, many compilers offer a pre-processing phase (at high optimization levels) that can drastically change your original source code, performing the kinds of optimizations Michael described.
Many years ago, the KAP precompilers let you see the transformed, optimized source, but I have not yet found an option in modern compilers (for example, PGI) that shows this code...
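(For what it is worth, most current compilers will at least report what they did, even if they do not emit the transformed source; for instance, GCC's -fopt-info flags and PGI's -Minfo option print messages about which loops were vectorized or unrolled. Whether that is a real substitute for seeing the rewritten code is another matter.)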

#4 | March 6, 2016, 10:11
Jonas T. Holdeman, Jr. (Jonas Holdeman), Senior Member | Join Date: Mar 2009 | Location: Knoxville, Tennessee | Posts: 128
Wow! This would have been a hot topic a few decades ago, when processor speeds were a few megahertz or less and memory was limited to 64 KB or less. But those days have gone the way of the 8-inch floppy disk (except perhaps for embedded microcontrollers). When computing was very expensive, it was practical to spend a lot of time optimizing code. But today the cost of a programmer's time is much higher than the savings in computer cost that might accumulate over the life of the software.

One thing missing here is the cost of looping and branching, where prefetched data and speculatively executed instructions must be discarded on a mispredicted branch. In an application, the costs of input/output and graphics must also be considered. And computer hardware has been changing so rapidly that what is most efficient today may not be so tomorrow.

The simple answer is to code alternatives and do timing studies. If alternative coding makes little difference overall, then worry more about improving algorithms.
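As a minimal sketch of such a timing study (illustrative only: the array size, repeat count, and use of clock() are arbitrary choices, and a result is printed so the optimizer cannot discard the loops):

Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    1000000
#define REPS 200

int main(void)
{
    double *u    = malloc(N * sizeof *u);
    double *x    = malloc(N * sizeof *x);
    double *dudx = malloc(N * sizeof *dudx);
    double *inv  = malloc(N * sizeof *inv);

    for (int i = 0; i < N; i++) { x[i] = 1e-3 * i; u[i] = x[i] * x[i]; }
    for (int i = 1; i < N - 1; i++) inv[i] = 1.0 / (x[i+1] - x[i-1]);

    /* Variant 1: divide inside the loop. */
    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++)
        for (int i = 1; i < N - 1; i++)
            dudx[i] = (u[i+1] - u[i-1]) / (x[i+1] - x[i-1]);
    double t_div = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Variant 2: multiply by the precomputed reciprocal. */
    t0 = clock();
    for (int r = 0; r < REPS; r++)
        for (int i = 1; i < N - 1; i++)
            dudx[i] = (u[i+1] - u[i-1]) * inv[i];
    double t_mul = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Using the result prevents the compiler from removing the loops. */
    printf("dudx[N/2] = %g, divide: %.3f s, multiply: %.3f s\n",
           dudx[N/2], t_div, t_mul);

    free(u); free(x); free(dudx); free(inv);
    return 0;
}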

#5 | August 8, 2016, 23:04 | teraflops / petaflops / exaflops: FSI problems
kunar (hhh), Senior Member | Join Date: Nov 2011 | Posts: 117
Dear friends,
I am trying to estimate the computational time (in days) for FSI problems (for example, flexible flapping wings) in terms of flops. In some of the literature I have seen, 800 teraflops is said to be sufficient.

  1. Teraflops, petaflops, or exaflops: which is appropriate for this kind of problem?
  2. What is the difference in simulation time (in days) between these flop rates?
  3. How do I estimate the flops (on Windows and Linux)?
Kindly assist me.

Regards,
HHH
