Motivation and introduction for learning assembly coding
I'm really thankful to this forum and its members for helping me out with my hobby project. However, I've noticed that many members, even super experienced ones, don't want to mess with understanding the machine code (let's say assembly for brevity) generated by the compiler.
I understand why diving deep into the assembly code might not seem very interesting, or even worth the effort, to most. But it is very useful, and people are missing out on the benefits of understanding and analyzing the assembly code. This post is meant to start a discussion, and I'm willing to help anyone interested in learning more.

PART 1 : What are the benefits of understanding assembly code?

Here are a few benefits of understanding assembly code:

- We can estimate how fast a small portion of code will run, since the Intel/AMD CPU manuals tell us how fast (throughput and latency) any particular machine instruction (IMUL, ADD, SUB, ...) is, and we can see whether the generated assembly uses the widest SIMD vector registers: XMM (okay), YMM (fast, and common on the Intel i3/i5/i7 family), ZMM (really fast, but only available on special CPUs). So if you see that your generated assembly uses YMM registers, you're in for a good time: your code is using AVX2 vectorization and not the slower SSE vectorization.

- If you know how the assembly code works, you know exactly how your data needs to be laid out in order to gain maximum performance. (Hint: 1D arrays are the best. Linear access is the best.)

You might want to store the CSV interleaved, as vector = [rho, rho*u, rho*v, rho*E][rho, rho*u, rho*v, rho*E]... You might think that keeping the data close together like that will improve your cache performance. You're right. So what's the problem? The problem is that you're doomed to using only the slower XMM registers; that's what the compiler will generate in order to be safe. You can force the compiler to generate AVX2 vectorized code that uses YMM registers. But the problem is that, in order to use the data in that form (say a CSV-to-PSV conversion is required), the AVX2 vectorized code will need machine instructions that do complex operations: interleave the data, permute the data, rotate the data, and so on.
This will cause your AVX2 "optimized" code to run slower than the code the compiler generated in the first place.

What's the "right" way in this context? Store the data as vector = [rho, rho, rho, rho][rho*u, rho*u, rho*u, rho*u][rho*v, rho*v, rho*v, rho*v][rho*E, rho*E, rho*E, rho*E] and access it as arrays U1[i], U2[i], U3[i], U4[i] etc., where you use pointers to the first element of each of those arrays. I have done this, and every loop is vectorized by the compiler to use AVX2 instructions! This is 4X faster than the serial code for double precision data and 8X faster for single precision data. That's because YMM registers are 256 bits wide and can fit 4 double or 8 single precision values respectively. Every vector operation then adds/multiplies/subtracts/divides those 4 or 8 numbers at once, in a single instruction! If your code isn't using SIMD vectorization, you're leaving that performance on the table.

PART 2 : How do we compile and study the assembly code... coming soon.. |
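To make the two layouts in the post above concrete, here is a small C++ sketch of the interleaved (AoS) layout versus the split-array (SoA) layout. The struct and function names are my own illustration, not the poster's actual solver code:

```cpp
#include <cstddef>
#include <vector>

// Interleaved Array-of-Structures layout from the post:
// [rho, rho*u, rho*v, rho*E][rho, rho*u, rho*v, rho*E]...
struct AoSCell { double rho, rho_u, rho_v, rho_E; };

// Structure-of-Arrays layout: one contiguous array per conserved
// variable, so every loop touches a single stride-1 stream.
struct SoAState {
    std::vector<double> rho, rho_u, rho_v, rho_E;
    explicit SoAState(std::size_t n) : rho(n), rho_u(n), rho_v(n), rho_E(n) {}
};

// A per-variable update over the SoA layout: a plain stride-1 loop,
// the access pattern the compiler can vectorize with YMM registers.
void scale_density(SoAState& s, double f) {
    for (std::size_t i = 0; i < s.rho.size(); ++i)
        s.rho[i] *= f;
}
```

With the SoA layout, each loop body reads and writes one linear array, so no interleave/permute shuffles are needed in the generated code.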
PART 2 : How do we compile and study the assembly code
I hope some of you are interested in diving deep into the generated assembly code and optimizing our operations for maximum performance. Let's look at the operation UN = U that is done at every timestep: we copy over the values of U from the previous timestep and use them to solve for the current timestep. Cool... this is a very important operation, it sits right in the middle of our solver loop, and I don't want it to be slow. Since it is also a very simple operation, you should be able to follow what's going on and focus on the method. Here's the C++ code: Code:
// Store solution from the previous time step for use in mstage iterations

Computers love 1D arrays and linear access. Give them some. :rolleyes:

Now, before we go any further, you'll need some tools. TOOLS required: the g++ compiler, the objdump tool, and the grep tool, running under a GNU/Linux OS. We pass the -g flag to g++ so that our source code and debug symbols are included in the compiled binary. This will let us match the assembly back to the source easily. Compile: Code:
$ g++ -g -O3 -fopenmp -mavx2 my_awesome_code.cpp

-fopenmp and -mavx2 allow OpenMP directives and AVX2 instructions to be used in the C++ code. Since I've used them in my project, I included the flags here; they're not strictly required. Our task is to analyze the machine code generated by the compiler. Since we can't read raw machine code, we use a disassembler to disassemble it into assembly code. objdump is our disassembler of choice. Here's the command: Code:
$ objdump -SC -Mintel a.out | grep 'mark' -A30

-Mintel tells the disassembler to use the Intel assembly syntax. Trust me, it's life saving; you don't want to read AT&T syntax. a.out is our executable. This will print all of the annotated assembly code, which could be roughly anywhere from 9,000 to 9,000,000 lines depending on your project size. So we need something to search through this blob of text and find our region of interest. That's why grep is used: grep 'mark' -A30 searches for the string "mark" (see the comment in the C++ code) and prints the 30 lines of assembly after it. This lets us find the interesting assembly code in an instant. Change "mark" to whatever you like. Here's the assembly code: Code:
objdump -SC -Mintel a.out | grep 'mark' -A30

Notice how we can see YMM registers right after the annotated C++ code. You can now tweak your code to optimize further if you want. (EDIT : You'll notice that it's using VMOVUPS, U for unaligned. I might want it to use VMOVAPS, A for aligned; on older hardware, VMOVAPS is significantly faster.) (EDIT : For clarification, you'll still need to profile your code. Just because it uses YMM registers doesn't mean it will be fast in every case. The case shown above was a sample, so profile your own code. In many cases where you don't have large arrays of data, serial code will be faster, since CPUs can do roughly 10^8 (IIRC) operations in a second; if your arrays are very small, you will want to stick with serial code.) Notice how we can see XMM registers at the bottom. Those are there as a safety net: SIMD vector instructions only work on chunks whose length is a multiple of 4, 8, 16, etc., which means some arrays will have leftover elements. Those elements also need to be computed, and that's what the XMM code handles. (EDIT : The previously shown code took 10.0 seconds for 100,000 iterations in both serial and SIMD vectorized execution. So where did our 4X or 8X boost go? Remember this golden tidbit: "SIMD vectorization is used to do a single operation over a large amount of data." In the code shown, we weren't really doing a single operation, we were doing four. My optimization was to split that up into 4 loops: Code:
// Store solution from the previous time step for use in mstage iterations

Why was this version faster? The cache was going to be hot for this type of linear access, and looking at the assembly we can see that the "serial" code at the end of the loop, for the leftover elements, was completely unrolled in this version. To be honest, I didn't know beforehand that this version would be faster, but now that I know how the assembly looks for fast loops, I know what to look out for in the future. In the previous version, the operations on the leftover elements using XMM registers went through a conditional loop, like what we see at -O2 optimization. The complete loop unrolling in this version is what we see at -O3 optimization. That's how I was able to tell why this version was faster: I knew from previous experience how -O3 vs -O2 optimized code looks and which is faster. Do you now see why I say that learning assembly will help you a lot? Code:
UN4_[i] = U4_[i]; // mark

(The "serial" code was actually SSE vectorized; I just called it "serial" for ease of understanding. AVX2 is 2X faster than SSE.) Now you know how easy it is to analyze your assembly code. There's a lot more to learn, but the intention of this post was to show how easy it is to get started, and to start a discussion. Thanks and regards ~sayan |
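For readers who want to try this themselves, here is a minimal self-contained sketch of the split-into-4-loops version described in the post above. The pointer names follow the `UN4_[i] = U4_[i]` snippet, but the function signature and the element count `n` are my own assumptions:

```cpp
#include <cstddef>

// Copy the previous-timestep solution, one conserved variable per loop.
// Each loop is a plain stride-1 copy, so the compiler can emit AVX2
// (YMM) loads/stores for the bulk and unroll the leftover elements.
void store_solution(double* UN1_, double* UN2_, double* UN3_, double* UN4_,
                    const double* U1_, const double* U2_,
                    const double* U3_, const double* U4_, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) UN1_[i] = U1_[i];
    for (std::size_t i = 0; i < n; ++i) UN2_[i] = U2_[i];
    for (std::size_t i = 0; i < n; ++i) UN3_[i] = U3_[i];
    for (std::size_t i = 0; i < n; ++i) UN4_[i] = U4_[i]; // mark
}
```

Compiled with `g++ -O3 -mavx2`, each of these loops should disassemble to YMM moves, which you can verify with the objdump/grep command above.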
Just a disclaimer for myself... of course, if I could, I would like to be good at everything (yet assembly would probably still not make my top 5). But healthy time and brain capacity are limited, and you typically invest them where better returns are expected.
Of course, in theory, assembly is almost fundamental in programming. But if I barely understand C++, I'm not gonna look at assembly with any great return. In the same time I would spend optimizing a certain loop, I could implement, say, a new turbulence model in a code, and I would perceive that as something with better ROI. You know, we are who we are. However, keep this rolling because, for the exact same reasons above, it is going to help a lot of us |
Quote:
I agree with many of the things you said. However, I respectfully disagree with the statement about implementing a new turbulence model instead of optimizing loops. Maybe that was just a simple example, but I would rather optimize the loop to run as fast as possible and then go for the turbulence model, if my future work depends on that simple loop. Why would I want a shiny new turbulence model when I know that my code will still be slow, and that I need to fix that before it becomes unfixable? :) "Do one thing, but do it well." That's been my motto. The UN=U loop is a very simple example, but a very crucial one: this operation happens every iteration. I'm not sure the solution I provided here is even the best one. I could, in theory, run a single loop from i=0 to i=nrows*ncols*4-1, since all 4 arrays are allocated in a single contiguous memory block. That would be even faster. (EDIT : I did the tests. I converted the whole thing into a single loop. The SSE vectorized code took 15 seconds, while the AVX2 vectorized code took only 6.13 seconds!!! A 2X improvement over SSE, which is what we expected from this operation!) (EDIT : This was done with -O2 optimization, since many people still like to use it. With -O3 optimization, both get compiled to the AVX2 vectorized form and take 6.19 seconds!) The important parts of C++ aren't difficult; the new modern things are, and we shouldn't even be using most of them. If someone knows basic C, they can do just fine. OOP is overrated and slow as heck. :p I'm a full time C++ dev, and I hate overuse of OOP with a burning passion. Look up Orthodox C++. I don't agree with everything in the article (like only using printf for printing), but it and the links it provides tell you what's most important for C++ to run fast. Surprise: it's mostly C-like code. Overuse of modern OOP is a horrible decision. 
https://gist.github.com/bkaradzic/2e39896bc7d8c34e042b OOP is long dead, long live Data Oriented Design: https://youtu.be/yy8jQgmhbAU I'm sure you already know about CPU caches and such, given your experience, but in case someone else on the forum wants to learn, this talk by Scott Meyers is absolutely critical: https://youtu.be/WDIkqP4JbkE Learning assembly coding is actually simple. I know this stuff because I was interested in OS development and made a small kernel. If you want to learn assembly easily, this 16 video playlist by Creel is more than enough to get you started: https://www.youtube.com/watch?v=rxsB...tFT1Rmh3uJp7kA I know many will disagree with worrying this much about the generated assembly code, but as I showed above, learning this gives me immense freedom and confidence to write my code knowing that it will perform well. I still have a lot to learn, but learning assembly has been one of my best decisions. |
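The single-contiguous-loop idea mentioned in the post above (one loop from i=0 to nrows*ncols*4-1) can be sketched like this. The function name and signature are mine, and it assumes the four arrays really do live in one contiguous block:

```cpp
#include <cstddef>

// If all four conserved-variable arrays live in one contiguous block
// (U_ of length total = nrows*ncols*4), the whole UN = U copy collapses
// into a single stride-1 loop, which vectorizes as one long AVX2 copy
// with only one leftover-elements tail instead of four.
void store_solution_flat(double* UN_, const double* U_, std::size_t total) {
    for (std::size_t i = 0; i < total; ++i)
        UN_[i] = U_[i];
}
```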
I don't want to spoil this thread with additional personal considerations (instead, I invite you to keep it rolling with everything you feel is useful for us to understand the matter), but:
Quote:
Still, if I have to make a more specific example in CFD, MPI is something that you should specifically favor over assembly. It makes the difference between running efficiently on any known system and, basically, just on your PC. Quote:
Quote:
|
1 Attachment(s)
PART 3 : How to measure performance?
You'll definitely want to measure the performance of your code. Even if the assembly looks fine, the code may still run slower, as we have seen above; there's nothing better than actually profiling it. You could use many tools for the job. I generally use the following: perf, valgrind, google-benchmark.

perf - A very important tool, used by Linux kernel devs to measure the performance of their low latency code. It's a little difficult to use at first, but it's extremely fast. Look at the attached image: perf report highlights which assembly instruction is consuming what % of our execution time. This is a blessing. It can be hard to interpret sometimes, but if you know perf well, you are going to have a good time. Installing perf can be difficult, since it depends on the kernel; expect a difficulty curve when installing and using it for the first time. Here's a basic intro: https://youtu.be/M6ldFtwWup0 Here's another basic intro by Chandler Carruth: https://youtu.be/nXaxk27zwlk Chandler goes into the nitty gritty details that you might need to understand perf better.

valgrind - Great tool for finding memory leaks, and its cachegrind tool is extremely good for performance analysis. The problem is that valgrind is extremely slow. So, if you want to measure performance quickly with valgrind and cachegrind, you'll have to annotate in the code where you want to measure. That's advanced stuff and I won't cover it for now. To look at the results produced by cachegrind, use the GUI tool kcachegrind; it's really good and makes the performance results easy to understand. kcachegrind: https://kcachegrind.github.io/html/Home.html

google-benchmark - A micro-benchmark library in which you can test how fast certain sections of your code run. 
Say you want to see whether a tree or an array is faster for some operation: you write a simple proof of concept and test it with google-benchmark to see which one wins. Since this is a micro-benchmarking library, you can't test your whole program's performance with it, only small parts. See Chandler's video for the basics of google-benchmark. google-benchmark: https://github.com/google/benchmark |
1 Attachment(s)
PART 4 : ALWAYS profile your code...
Use the perf tool to profile your code when you need to know how much each instruction costs. This gives very good insight into exactly what's making your code slow. Look at the attached image: you can see that VDIVPD, i.e. a division operation, is hogging a major portion of the computation time. However, this test alone is not enough. You'll also want to run your code and see how long it actually takes: make little changes and time how long execution takes after each. This code ran in 69 seconds (average of 4 trials) for 4*100,000*ncells*4 iterations per trial: (<- I don't completely trust this result for now, but it looks promising.) Code:
// Iterate over all cells

Code:
// Iterate over all cells |
Part 5 : Finding where you are in the code
I have previously explained how to find where you are in the code using a marker and the grep tool to search through the objdump output. All of that is correct, but compilers have a habit of reordering operations for better performance, and that makes the assembly a little harder to read. Maybe the section of code you're investigating is so performance critical that you can't afford even a single unnecessary addition inside or outside the loop, so you can't cleanly mark where the loop ends. In those cases, you can use the assembly NOP instruction to orient yourself. NOP essentially tells the CPU to do nothing at all, so it costs almost nothing. In the code given below, we want to find where the loop ends in the assembly. Right after the end of the loop, we will see the NOP machine instruction that we inserted in the C++ code using asm("nop"). Code:
// Setup the CSV

Code:
U1_[i] = UN1_[i]; // xdawg |
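To see the whole trick in one place, here is a self-contained sketch of a loop with the asm("nop") landmark, modeled on the `U1_[i] = UN1_[i]` snippet above (the function wrapper is my own):

```cpp
#include <cstddef>

// Copy loop with a NOP landmark right after it. In the objdump output,
// the lone `nop` pinpoints where the loop's assembly ends, even when
// the compiler has reordered the surrounding instructions. asm() is a
// GCC/Clang extension; MSVC spells it differently.
void copy_back(double* U1_, const double* UN1_, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        U1_[i] = UN1_[i]; // xdawg
    asm("nop"); // landmark: the loop ends here in the disassembly
}
```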
Quote:
my guess is that in the first code, (1/something) is computed first, which may be done by a specialized algorithm. In the second code, dt/cvols is calculated, which is not as fast as 1/something. |
2 Attachment(s)
Part 6 : Always freakin' profile your code for cache misses
Coding in C++ has amazing advantages, except for all of the curses required to make it work. :p Cache misses suck. Instruction cache misses aren't as much of a problem as data cache misses, and among data cache misses, L3 misses cause horrible slowdowns. If you're starting to optimize your code, profiling with a simple profiler like perf is not enough. Perf is a great tool, but when writing numerical codes you often also want a cache performance profile. The best tool for this (in my own experience) is cachegrind, viewed with the kcachegrind GUI tool. In this part, I will share how profiling my code with cachegrind reduced L3 cache misses from 114,075,777 to 868,001. The numbers by themselves don't mean much, everything is relative, except when we see them increase or decrease: if they decrease, our performance is improving. My faulty code was the main solution update loop: Code:
#pragma omp simd linear(i:1)

Unfortunately, this is trash code: utter garbage, beyond any salvage. It survived only because I never profiled the cache misses it produced, especially the L3 cache misses. Here's the result from cachegrind. Code:
==36349==

After noticing this behavior, I used kcachegrind to figure out where the slowdown was. As shown in cache-1.jpg, it's the solution update loop. Doing some manual testing, I figured out that the performance degradation was due to indexing with ncells*4. Apparently, for some reason I don't yet understand, the data wants to be accessed in groups of ncells. This is why we see around 81 million L3 cache misses in that loop. Completely unacceptable. So, I rewrote the code. Code:
#pragma omp simd linear(i:1)

Code:
==36599==

I'm not sure if 90,000 L3 misses is good. I don't think it is, but I will investigate that on a later date. Probably my data alignment is messed up or something; I don't know. |
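The thread doesn't quote the two update loops in full, so the following is only my guess at their shapes: a strided walk across the 4*ncells block versus a linear pass per ncells-sized group, as described in the post above. Both compute the same result; only the memory access order differs:

```cpp
#include <cstddef>

// Guessed shape of the faulty update: for each cell, touch four values
// that sit ncells apart, so consecutive inner iterations jump across
// the whole 4*ncells block (bad for the cache on large ncells).
void update_strided(double* U, const double* R, double dt, std::size_t ncells) {
    for (std::size_t i = 0; i < ncells; ++i)
        for (std::size_t v = 0; v < 4; ++v)
            U[v * ncells + i] -= dt * R[v * ncells + i];
}

// Cache-friendlier shape: one stride-1 pass per ncells-sized group,
// so each inner loop streams linearly through memory.
void update_grouped(double* U, const double* R, double dt, std::size_t ncells) {
    for (std::size_t v = 0; v < 4; ++v)
        for (std::size_t i = 0; i < ncells; ++i)
            U[v * ncells + i] -= dt * R[v * ncells + i];
}
```

Swapping the loop nest changes nothing mathematically, which is why cachegrind (not correctness testing) is what reveals the difference.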
I had a calculation affected by this issue recently, where the 8 processor run did the job in 12 seconds while the 1 processor run took 300 seconds. Found out that one loop had too much going on.
Now slowly breaking down things in smaller groups :-D |
Quote:
Why does that happen? I'd have to look into it, but I don't know why the single threaded long loop performed so poorly. It makes no sense to me currently; I always thought CPUs like big arrays. That's what SIMD tutorials recommend, and it technically makes sense. This is a mystery to me. :confused: EDIT : Are you using the __restrict__ keyword? |
Part 7 : Use the Load, Calculate, Store method to write easily optimizable code
Older compilers sometimes have difficulty optimizing code when an equation is very long and complex. It makes sense to write code that will be optimized easily even by older compilers. The simplest method for me has been the Load, Calculate, Store method. It's as easy as doing your complex equation in multiple steps instead of writing one big expression, like in this piece of code from my solver. We load the data, do some calculations, and store the result back. The compiler loves this!!! Code:
/// Calculate pressure for all cells |
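Since only the first comment line of the pressure loop is quoted above, here is a hedged sketch of what a Load, Calculate, Store version of it might look like. The variable names and the ideal-gas formula are my own illustration, not necessarily the poster's exact solver code:

```cpp
#include <cstddef>

// Load, Calculate, Store: pull each input into a local const, do the
// arithmetic in small named steps, write the result exactly once.
// Short stride-1 steps like these are easy for older compilers to
// auto-vectorize and easy to match against the disassembly.
void calc_pressure(double* p, const double* rho, const double* rho_u,
                   const double* rho_v, const double* rho_E,
                   double gamma, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // Load
        const double r  = rho[i];
        const double ru = rho_u[i];
        const double rv = rho_v[i];
        const double rE = rho_E[i];
        // Calculate (ideal gas: p = (gamma-1) * (rho*E - kinetic energy))
        const double kinetic = 0.5 * (ru * ru + rv * rv) / r;
        const double pres    = (gamma - 1.0) * (rE - kinetic);
        // Store
        p[i] = pres;
    }
}
```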
Quote:
|
Quote:
Not a vague question. My boss wrote his solvers and his whole CFD software to be compiled with GCC 4.8. Apparently, users at NASA and the US Army used older machines which generally didn't have new operating systems or new compilers; 10 years earlier he supported even older versions of GCC. In those older versions (I think < GCC 4.8), you don't have the OpenMP SIMD directives to force the compiler to do aggressive vectorization. In such cases, you have to depend on the compiler's auto-vectorization, so you have to write the code in a way that's easy to auto-vectorize. In game engine development for the Xbox, PS2, PS3, etc., they also used older compilers (sometimes Microsoft compilers, which are garbage), and they definitely don't use OpenMP. So even there, they write code with the Load, Calculate, Store method to enable auto-vectorization. The equation psi[i] = g*rho[i]*... gets vectorized even if I don't break it into parts, because the OpenMP simd directive forces the compiler to do aggressive vectorization on it. But in case I ever need to support older compilers without OpenMP SIMD directives, I'd like it to auto-vectorize. Obviously, this matters more for larger equations that span multiple lines. I know there's some overhead in writing the code out in multiple parts, but I still like to do it: it simplifies the code structure for me and, most importantly, I can understand the disassembly very easily, since every part of the equation is on a separate line. It might not be 100% necessary for Fortran if you use Intel compilers, since they optimize aggressively. However, I don't trust gfortran to always produce optimized code, while GCC's C/C++ side is more mature, so I prefer to use this technique on C/C++ codes. 
I have also used the const keyword to tell the compiler that the values of the different parts won't change once initialized, so it can do further aggressive optimizations. To add some further optimizations, I use the ncxrestrict macro, which is just #define ncxrestrict __restrict__, since C/C++ codes generally suffer from aliasing issues. With that, the C++ code should be as fast as Fortran 77; I need to run tests someday to verify. |
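A minimal sketch of the ncxrestrict macro in use; the axpy function is my own example, not from the poster's solver. __restrict__ is a GCC/Clang extension, which is exactly why hiding it behind a macro (swappable per compiler) makes sense:

```cpp
#include <cstddef>

// Restrict macro in the spirit of the post: it promises the compiler
// that these pointers never alias, so it can skip the defensive
// reload-after-store it would otherwise emit, and vectorize freely.
#define ncxrestrict __restrict__

void axpy(double* ncxrestrict y, const double* ncxrestrict x,
          const double a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

If y and x actually do overlap, the promise is broken and the behavior is undefined, so this is only safe when the arrays are known to be distinct (as with separately allocated solver fields).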
I mentioned previously that writing an equation out in multiple parts makes it easier to understand. Even if you're not extremely skilled at reading assembly code, you can still follow it if the equation is written out in multiple parts.
As shown in this section of disassembled code, the annotated source easily shows you exactly which line produces which assembly instructions. This is very good, since you can understand it knowing just the basics of assembly instructions like ADD, SUB, MUL, DIV, etc. Code:
|
Part 8 : easy method to extract assembly code
This is an important trick; it has saved me HOURS of development time. Let's say I want to see the assembly generated by some arbitrary piece of code. For this example, we'll consider a call to printf. How do we do that? The marker method I showed previously only works on a per line basis; it's not meant for extracting the code of a whole section, and during development I tend to extract whole sections, not individual lines. Here comes the NOP instruction: Code:
// we'll use this

Now, we will use sed to extract all of the code between these two markers: Code:
$ objdump -w -SC --no-show-raw-insn -Mintel myExecutable | sed -n '/NOP/,/NOP/p'

I have previously shown how to use objdump; this is an intuitive extension that gives better results. For ease of use, I have made a bash function out of it. Code:
# objdump code between two markers

Code:
$ obseg NOP

Hopefully this is useful to some. :) |
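The body of the obseg function isn't quoted above, so here is a guessed reconstruction based on the objdump/sed pipeline shown earlier; the argument handling (marker first, executable second, defaulting to a.out) is my own assumption:

```shell
# Guessed reconstruction of the obseg helper: disassemble the binary and
# print everything between the first and second occurrence of a marker
# string, i.e. the region bracketed by the two asm markers in the source.
obseg () {
    # $1 = marker string (e.g. NOP), $2 = executable (defaults to a.out)
    objdump -w -SC --no-show-raw-insn -Mintel "${2:-a.out}" \
        | sed -n "/$1/,/$1/p"
}
```

Usage is then `obseg NOP` or `obseg NOP myExecutable`, matching the command shown in the post.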