Computational cost of floating-point multiplication, division, addition, and subtraction
I am curious how to relate floating-point operations (FLOPs) to multiplications, divisions, additions, and subtractions.
For example, the following link says, "For example, for a double-precision floating-point division, this method uses 10 multiplies, 9 adds, and 2 shifts." https://en.wikipedia.org/wiki/Divisi...phson_division Ultimately, I would like to know the computational cost of my algorithm. For example, how much would a central difference scheme cost, which can be written as dudx[i] = (u[i+1] - u[i-1]) / (x[i+1] - x[i-1])? Thank you. |
http://www.agner.org/optimize/instruction_tables.pdf

As you review that document, you will see the complexity of modern CPU instruction sets. The simple code you have above would naively translate into two FSUBs and a single FDIV, but many (MANY) other factors affect code compilation, optimization, and ultimate performance.

First, if you are computing these values more than once (as is generally the case in CFD), you could reasonably precompute invTwoDx[i] = 1.0/(x[i+1] - x[i-1]) for all i and then recast your code above as dudx[i] = (u[i+1] - u[i-1]) * invTwoDx[i]. That reduces the entire process to one subtraction and one multiplication for every subsequent computation, which will always give you a significant speed improvement.

Second, there are SSE and AVX instructions that can perform multiple ADDs, MULs, and DIVs in a single instruction by grouping the dudx computations for, say, two, four, or eight different values of i at the same time.

One other issue relates to cache and main-memory access. In your formula, naively implemented, there are four floating-point reads from memory (u[i+1], u[i-1], x[i+1], and x[i-1]) and one write (dudx[i]) balanced against two FSUBs and one FDIV. Depending on the size of the range of i and, even more importantly, on the dimension of the problem (1D vs. 2D or 3D), the span of memory being accessed will begin to limit the performance of the code: the floating-point units will be waiting on memory requests to be fulfilled.

So, my final comment is that, while I understand your question, it is important to recognize that the performance of even a simple formula like your example is very complicated to predict because of modern CPU and memory-system design. Figuring out how to make these implementations faster requires a more comprehensive understanding of these details and, honestly, becomes a trial-and-error process using a code profiler. |
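A minimal C sketch of the two variants described in the post above (the function names and the separate precompute pass are illustrative, not from the original):

```c
#include <stddef.h>

/* Naive central difference: per interior point, 2 subtractions and 1 division. */
void dudx_naive(const double *u, const double *x, double *dudx, size_t n)
{
    for (size_t i = 1; i + 1 < n; ++i)
        dudx[i] = (u[i + 1] - u[i - 1]) / (x[i + 1] - x[i - 1]);
}

/* One-time pass: precompute the reciprocal of the grid spacing for all i. */
void precompute_inv_dx(const double *x, double *invTwoDx, size_t n)
{
    for (size_t i = 1; i + 1 < n; ++i)
        invTwoDx[i] = 1.0 / (x[i + 1] - x[i - 1]);
}

/* Every subsequent evaluation: 1 subtraction and 1 multiplication per point. */
void dudx_fast(const double *u, const double *invTwoDx, double *dudx, size_t n)
{
    for (size_t i = 1; i + 1 < n; ++i)
        dudx[i] = (u[i + 1] - u[i - 1]) * invTwoDx[i];
}
```

On a fixed grid, the reciprocal pass runs once, so its divisions are amortized over every later derivative evaluation.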
I totally agree :)
Furthermore, many compilers have a pre-processing phase (at high optimization levels) that can drastically transform your original source code, performing the optimizations that Michael addressed. Many years ago, the KAP precompilers let you see the transcribed, optimized source, but I have never found an option in modern compilers (for example, PGI) to show this code, yet... :confused: |
Wow! This would have been a hot topic a few decades ago, when processor speeds were a few megahertz or less and memory was limited to 64 KB or less. But those days have gone the way of the 8-inch floppy disk (except perhaps for embedded microcontrollers). When computing was very expensive, it was practical to spend a lot of time optimizing code. But today the cost of a programmer's time is much higher than any savings in computer cost that might accumulate over the life of the software.
One thing missing here is the cost of looping and branching, where prefetched and speculatively executed instructions must be discarded on a mispredicted branch. In an application, the costs of input/output and graphics must also be considered. And computer hardware has been changing so rapidly that what is most efficient today may not be so tomorrow. The simple answer is to code the alternatives and do timing studies. If alternative coding makes little difference overall, then worry more about improving the algorithm. |
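The "code the alternatives and time them" advice can be sketched in C like this; the array sizes, repetition counts, and kernel names are placeholders, and clock() is coarse but portable:

```c
#include <stdio.h>
#include <time.h>

enum { N = 1000 };
static double u[N], x[N], dudx[N], invTwoDx[N];

/* Fill a uniform grid with u = 3x, and precompute reciprocal spacings. */
static void setup(void)
{
    for (int i = 0; i < N; ++i) { x[i] = (double)i; u[i] = 3.0 * i; }
    for (int i = 1; i + 1 < N; ++i)
        invTwoDx[i] = 1.0 / (x[i + 1] - x[i - 1]);
}

/* Alternative 1: division in the inner loop. */
static void variant_div(void)
{
    for (int i = 1; i + 1 < N; ++i)
        dudx[i] = (u[i + 1] - u[i - 1]) / (x[i + 1] - x[i - 1]);
}

/* Alternative 2: multiply by the precomputed reciprocal. */
static void variant_mul(void)
{
    for (int i = 1; i + 1 < N; ++i)
        dudx[i] = (u[i + 1] - u[i - 1]) * invTwoDx[i];
}

/* Time a kernel by repeating it many times. */
static double seconds_for(void (*kernel)(void), int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r)
        kernel();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

static void compare_variants(void)
{
    setup();
    printf("division variant:       %.3f s\n", seconds_for(variant_div, 10000));
    printf("multiplication variant: %.3f s\n", seconds_for(variant_mul, 10000));
}
```

Call compare_variants() from main; note that an aggressive optimizer may elide a kernel whose results are never read, so inspect the results (or print a checksum) in real benchmarks.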
teraflops / petaflops / exaflops: FSI Problems
Dear Friends,
I am trying to estimate the computational time (in days) for FSI problems (for example, flexible flapping wings) in terms of FLOPs. In some of the literature I have seen, 800 teraflops is mentioned as sufficient.
Regards, HHH |
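As a back-of-the-envelope sketch (not from the thread): once you have estimated the total operation count of your solver and the FLOP rate your machine actually sustains (typically far below peak), the wall-clock time follows by simple division. Both inputs below are assumptions you must supply yourself:

```c
#include <stdio.h>

/* Rough runtime estimate in days: total_flops is the solver's estimated
   operation count, sustained_flops_per_sec is the rate your hardware
   actually achieves on your code (not its theoretical peak). */
static double estimate_days(double total_flops, double sustained_flops_per_sec)
{
    return total_flops / sustained_flops_per_sec / 86400.0; /* 86400 s per day */
}
```

For example, a run costing 1e18 floating-point operations on hardware sustaining 1e12 flop/s takes 1e6 seconds, roughly 11.6 days.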