Computational cost for floating point multiplication/division/summation/subtraction |
|
March 5, 2016, 11:31 |
Computational cost for floating point multiplication/division/summation/subtraction
|
#1 |
New Member
Donghyuk Shin
Join Date: May 2013
Posts: 1
Rep Power: 0 |
I am curious about how to relate floating point operations (FLOPs) to multiplications, divisions, additions, and subtractions.
For example, the following link says "For example, for a double-precision floating-point division, this method uses 10 multiplies, 9 adds, and 2 shifts." https://en.wikipedia.org/wiki/Divisi...phson_division
Ultimately, I would like to know the computational cost of my algorithm. For example, what would the cost be for a central difference scheme, which can be written as
dudx[i] = (u[i+1] - u[i-1]) / (x[i+1] - x[i-1])
Thank you. |
|
March 6, 2016, 05:03 |
|
#2 | |
Senior Member
Michael Prinkey
Join Date: Mar 2009
Location: Pittsburgh PA
Posts: 363
Rep Power: 25 |
http://www.agner.org/optimize/instruction_tables.pdf As you review that document, you will see the complexity of modern CPU instruction sets. The simple code that you have above would naively translate into two FSUBs and a single FDIV. But there are many (MANY) other factors that will affect the code compilation, optimization, and ultimate performance.
First, if you are computing these values more than once (as is generally the case in CFD), you could reasonably precompute invTwoDx[i] = 1.0/(x[i+1] - x[i-1]) for all i and then recast your code above as:
dudx[i] = (u[i+1] - u[i-1]) * invTwoDx[i]
That reduces every subsequent computation to one subtraction and one multiplication, which will always give you a significant speed improvement.
Second, there are SSE and AVX instructions that can perform multiple ADDs, MULs, and DIVs in a single instruction by grouping several dudx computations for, say, two, four, or eight different values of i at the same time.
One other issue relates to cache and main memory access. In your formula, naively implemented, there are four floating point reads from memory (u[i+1], u[i-1], x[i+1], and x[i-1]) and one write (dudx[i]) balanced against two FSUBs and one FDIV. Depending on the size of the range of i and, even more importantly, on the dimension of the problem (1D vs. 2D or 3D), the span of memory segments being accessed will begin to limit the performance of the code: the floating point units will sit waiting for memory requests to be fulfilled.
So, my final comment is that, while I understand your question, it is important to recognize that the performance of even a simple formula like your example is very complicated due to modern CPU and memory system design. Figuring out how to make these implementations faster requires a more comprehensive understanding of these details and, honestly, becomes a trial-and-error process using a code profiler. |
||
March 6, 2016, 05:27 |
|
#3 |
Senior Member
Filippo Maria Denaro
Join Date: Jul 2010
Posts: 6,764
Rep Power: 71 |
I totally agree.
Furthermore, many compilers offer a pre-processing phase (at high optimization levels) that can drastically transform your original source code, performing the optimizations that Michael described. Many years ago, the KAP precompilers let you inspect the transformed, optimized source, but I have not yet found an option in modern compilers (for example, PGI) to view this code... |
|
March 6, 2016, 10:11 |
|
#4 |
Senior Member
Jonas T. Holdeman, Jr.
Join Date: Mar 2009
Location: Knoxville, Tennessee
Posts: 128
Rep Power: 18 |
Wow! This would have been a hot topic a few decades ago, when processor speeds were a few megahertz or less and memory was limited to 64KB or less. But those days have gone the way of the 8-inch floppy disk (except perhaps for embedded microcontrollers). When computing was very expensive, it was practical to spend a lot of time optimizing code. But today the cost of a programmer's time is much higher than the savings in computer cost that might accumulate over the life of the software.
One thing missing here is the cost of looping and branching, where prefetched and speculatively executed code must be discarded. In an application, the costs of input/output and graphics must also be considered. And computer hardware has been changing so rapidly that what is most efficient today may not be so tomorrow. The simple answer is to code the alternatives and do timing studies. If alternative coding makes little difference overall, then worry more about improving your algorithms. |
|
August 8, 2016, 23:04 |
teraflops / petaflops / exaflops: FSI Problems
|
#5 |
Senior Member
kunar
Join Date: Nov 2011
Posts: 117
Rep Power: 14 |
Dear Friends,
I am trying to estimate the computational time (in days), expressed in FLOPs, for FSI problems (for example, flexible flapping wings). I have seen some literature mentioning that 800 teraflops is sufficient.
Regards, HHH
__________________
kunar |
|
|
|