Motivation and introduction for learning assembly coding

aerosayan · January 14, 2021, 15:32

I'm really thankful to this forum and its members for helping me out with my hobby project. However, I have seen that many members, even very experienced ones, don't want to get into understanding the machine code (let's say assembly for brevity) generated by the compiler.

I understand why diving deep into the assembly code might not seem very interesting, or even worth the effort, for most. But it is very useful, and people are missing out on the benefits of understanding and analyzing it.

This post is meant to start a discussion, and I'm willing to help anyone interested in learning more.

PART 1 : What are the benefits of understanding assembly code?

Here are a few benefits of understanding assembly code :

- We can estimate quite well how fast a small portion of code will run, since the Intel/AMD CPU manuals tell us the throughput and latency of any particular machine instruction (IMUL, ADD, SUB, etc.). We can also see whether the generated assembly uses the widest SIMD vector registers available: XMM (128-bit, okay), YMM (256-bit, fast, and the most common on the Intel i3-i5-i7 family), or ZMM (512-bit, really fast, but only available on CPUs that support AVX-512).

So, if you see that your generated assembly code uses YMM registers, you're in for a good time. That means your code is using 256-bit AVX/AVX2 vectorization and not the narrower, slower 128-bit SSE vectorization.

- If you know how the assembly code works, you know how your data needs to be laid out in memory in order to gain maximum performance. (Hint : 1D arrays are the best. Linear access is the best.)

You might want to store the conserved variables (CSV) interleaved, cell by cell, as vector = [rho, rho*u, rho*v, rho*E][rho, rho*u, rho*v, rho*E]

You might think that having the data for each cell close together like that will improve your cache performance. You're right. So what's the problem? The problem is that you're mostly stuck with the slower XMM registers. That's what the compiler will generate in order to be safe.
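To make the interleaved layout concrete, here is a minimal C++ sketch of it. The names `Cell` and `scale_rhou_aos`, and the scaling operation itself, are hypothetical and only there for illustration.

```cpp
#include <cstddef>
#include <vector>

// Interleaved layout: the four conserved variables of one cell
// sit next to each other in memory.
struct Cell {
    double rho;   // density
    double rhou;  // x-momentum, rho*u
    double rhov;  // y-momentum, rho*v
    double rhoE;  // total energy, rho*E
};

// Hypothetical kernel: scale the x-momentum of every cell.
// Consecutive rhou values are sizeof(Cell) bytes apart, so one
// contiguous vector load cannot pick up several of them at once.
void scale_rhou_aos(std::vector<Cell>& cells, double factor) {
    for (std::size_t i = 0; i < cells.size(); ++i) {
        cells[i].rhou *= factor;
    }
}
```

The loop touches only every fourth double in memory, which is the strided access pattern that makes the compiler cautious.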

If you want to force the compiler to generate AVX2 vectorized code that uses YMM registers, you can. However, to use data stored in that form (say a CSV to PSV conversion is required), the AVX2 code will need extra machine instructions that interleave, permute, and rotate the data, along with other complex shuffling operations.

This can easily cause your AVX2 "optimized" code to run slower than the code first generated by the compiler.

What's the "right" way in this context?
Store the data as vector = [rho, rho, rho, rho][rho*u, rho*u, rho*u, rho*u][rho*v, rho*v, rho*v, rho*v][rho*E, rho*E, rho*E, rho*E] and access the data as arrays U1[i], U2[i], U3[i], U4[i] etc., using pointers set to the first element of each of those arrays.
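A minimal C++ sketch of this layout, assuming all four arrays live in one contiguous 1D buffer. The `Fields` struct and the `compute_u` kernel are hypothetical names chosen for the example; only U1 to U4 come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Planar layout in one 1D buffer:
// [rho x n][rho*u x n][rho*v x n][rho*E x n]
struct Fields {
    std::vector<double> storage;  // single contiguous allocation
    std::size_t n = 0;            // number of cells
    double* U1 = nullptr;         // rho
    double* U2 = nullptr;         // rho*u
    double* U3 = nullptr;         // rho*v
    double* U4 = nullptr;         // rho*E

    explicit Fields(std::size_t ncells)
        : storage(4 * ncells, 0.0), n(ncells) {
        assert(ncells > 0);
        // Each pointer marks the first element of its own array.
        U1 = storage.data();
        U2 = U1 + n;
        U3 = U2 + n;
        U4 = U3 + n;
    }

    // The raw pointers refer to this object's storage, so forbid copies.
    Fields(const Fields&) = delete;
    Fields& operator=(const Fields&) = delete;
};

// Hypothetical kernel: recover the x-velocity u = (rho*u) / rho.
// Every array is read or written with unit stride.
void compute_u(const Fields& f, double* u) {
    const double* rho = f.U1;
    const double* rhou = f.U2;
    for (std::size_t i = 0; i < f.n; ++i) {
        u[i] = rhou[i] / rho[i];
    }
}
```

With GCC or Clang, flags along the lines of `-O3 -mavx2` are typically needed before a loop like this is compiled to YMM instructions. Whether it actually happens should be checked in the generated assembly.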

I have done this, and every loop is vectorized to use AVX2 instructions by the compiler!

In the ideal case, this is up to 4X faster than the serial code for double precision data and up to 8X faster than the serial code for single precision data.

That's because YMM registers are 256 bits wide and can fit 4 double precision or 8 single precision values. Every vector operation then adds/multiplies/subtracts/divides all of those numbers (4 or 8) at once in a single instruction!
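The counts are just register width divided by element size. A small compile-time check in C++, assuming the usual x86-64 sizes of 8 bytes for double and 4 bytes for float (the helper `lanes` is a made-up name for this example):

```cpp
#include <cstddef>

// Number of values of type T that fit in a register of the given width.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return register_bits / (8 * sizeof(T));
}

static_assert(lanes<double>(128) == 2, "XMM holds 2 doubles");
static_assert(lanes<double>(256) == 4, "YMM holds 4 doubles");
static_assert(lanes<float>(256) == 8, "YMM holds 8 floats");
static_assert(lanes<double>(512) == 8, "ZMM holds 8 doubles");
```

These lane counts are the upper bound on the speedup from vectorization alone. Memory bandwidth and loop overhead usually keep real code below it.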

If your code isn't using SIMD vectorization, you're wasting performance.

Sorry, that's mathematically proven.

PART 2 : How do we compile and study the assembly code... coming soon..
