Motivation and introduction for learning assembly coding

aerosayan · January 14, 2021, 15:32

I'm really thankful to this forum and its members for helping me out in my hobby project. However I have seen that many members, even super experienced ones don't want to mess with understanding the machine(let's say assembly for brevity) code generated by the compiler.

I understand why diving deep into the assembly code might not seem very interesting or even worth the effort for most. Unfortunately it is very useful, and people are missing out on the benefits of understanding and analyzing the assembly code.

This post is to start a discussion and me willing to help anyone interested in learning more.

PART 1 : What are the benefits of understanding assembly code?

Here are a few benefits of understanding assembly code :

- We know exactly how much fast a small portion of code will run, since we know from Intel/AMD CPU manuals how much fast (throughput and latency) any particular machine instruction like (IMUL, ADD, SUB) are and if the generated assembly uses the highest SIMD vector registers (i.e XMM (okay), YMM (fast and most common on Intel i3-i5-i7 family), ZMM (really fast and available on special CPUs))

So, if you see that your generated assembly code uses YMM registers, you're in for a good time. That means your code is using AVX2 vectorization and not the slow SSE vectorization.

- If you know how the assembly code works, you know exactly how your data needs to be in order to gain maximum performance. (Hint : 1D arrays is the best. Linear access is the best.)

You might want to store the CSV as vector = [rho, rho*u, rho*v, rho*E][rho, rho*u, rho*v, rho*E]

You might think that having the data close to each other like that will improve your cache performance. You're right. So what's the problem? The problem is that you're doomed to only using slow XMM registers. That's what the compiler will generate in order to be safe.

If you want to force the compiler to generate AVX2 vectorized code that uses YMM registers, you can. However the problem is that in order to use the data in that form (say CSV to PSV conversion operation is required), AVX2 vectorized code will use machine instructions that do complex operations like : interleave the data, permutate the data, rotate the data and other very complex operations.

This will cause your AVX2 "optimized" code to run slower than the code first generated by the compiler.

What's the "right" way in this context?
Store the data as vector = [rho, rho, rho, rho][rho*u, rho*u, rho*u, rho*u], [rho*v, rho*v, rho*v][rho*E, rho*E, rho*E, rho*E] and access the data as arrays U1[i], U2[i], U3[i], U4[i] etc. Where you'll use pointers to set the first element of each of those arrays.

I have done this, and every loop is vectorized to use AVX2 instructions by the compiler!

This is 4X faster than the serial code for double precision data and 8X faster than serial code for single precision data.

That's because YMM registers are 256 bits wide and can fit 4 and 8 double and single precision values respectively.And then every vector operation will add/multiply/subtract/divide the numbers (4 or 8) at once in a single instruction!

If your code isn't using SIMD vectorization, you're wasting performance.

Sorry, that's mathematically proven.

PART 2 : How do we compile and study the assembly code... coming soon..

aerosayan · January 14, 2021, 16:10

PART 2 : How do we compile and study the assembly code

I hope some are interested to dive deep into the generated assembly code and optimize our operations for maximum performance.

Let's look at the operation UN = U that is done at every timestep.
We copy over the values inside U from the previous timestep and use it to solve for our current timestep.

Cool... this is a very important operation as it is right in the middle of our solver loop and I don't want it to be slow. Since this is a very simple operation, you would be able to understand what's going on, and able to focus on the method.

Here's the C++ code:

Code:

// Store solution from the previous time step for use in mstage iterations
//
#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{
    UN1_[i] = U1_[i]; // mark <- NOTE this string "mark"
    UN2_[i] = U2_[i]; // mark
    UN3_[i] = U3_[i]; // mark
    UN4_[i] = U4_[i]; // mark
}

Notice how our operation is a simple for loop and our data is laid out into an 1D array. This is very important to notice, since our grid is actually a 2D structured grid as shown in the picture, and the naive method would have been to access the data as a 2D array inside fortran or a std::vector<std::vector<double>> in C++. Both are bad for vectorization, as the compiler warns that you're not doing "sequential access" i.e iterate over an 1D array.

Computers love 1D arrays and linear access. Give her some.

Now, before we go any further .. you'll need some tools.

TOOLS required : g++ compiler, objdump tool, grep tool running under GNU/Linux OS.

We will want to provide the flags -g to g++ to ensure that our source code and debug symbols are also included in the compiled binary. This will allow us to understand the assembly code easily.

Compile:

Code:

$ g++ -g -O3 -fopenmp -mavx2 my_awesome_code.cpp

The -O3 or -O2 optimization is required. Otherwise why would you study unoptimized code? This is strictly required.

-fopenmp -mavx2 are used to allow OpenMP instructions to be used in the C++ code. And since I've used them, I included the flags here. They're not strictly required.

Our task is to analyze the machine code generated by the compiler. However since we can't read machine code, we use a disassembler to disassemble it into assembly code.

objdump is our disassembler of choice.

Here's the command:

Code:

$ objdump -SC -Mintel a.out | grep 'mark' -A30

-S annotates the code with the strings of your source code. This helps you read and understand the assembly code. -C is something I forgot, but it's required for now.

-Mintel tells the disassembler to use the Intel Assembly code syntax. Trust me, it's life saving. You don't want to read AT&T Assembly code syntax.

a.out is our executable

This will print all of the annotated assembly code. That would be roughly around 9000 to 9,000,000 lines (based on your project size).

So, we need something to make it easier for us to search through this blob of text and find our region of interest. That's why grep is used.

grep 'mark' -A30 will search for the string "mark" (see in the C++ code) and print 30 lines of assembly code after it. This lets us find our interesting assembly code in an instant. Change "mark" to whatever you like.

Here's the assembly code:

Code:

 objdump -SC -Mintel a.out | grep 'mark' -A30    
    UN1_[i] = U1_[i]; // mark
    3259:    c4 c1 7d 10 64 05 00     vmovupd ymm4,YMMWORD PTR [r13+rax*1+0x0]
    3260:    c4 c1 78 11 24 02        vmovups XMMWORD PTR [r10+rax*1],xmm4
    3266:    c4 c3 7d 19 64 02 10     vextractf128 XMMWORD PTR [r10+rax*1+0x10],ymm4,0x1
    326d:    01 
    UN2_[i] = U2_[i]; // mark
    326e:    c4 c1 7d 10 14 04        vmovupd ymm2,YMMWORD PTR [r12+rax*1]
    3274:    c4 c1 78 11 14 01        vmovups XMMWORD PTR [r9+rax*1],xmm2
    327a:    c4 c3 7d 19 54 01 10     vextractf128 XMMWORD PTR [r9+rax*1+0x10],ymm2,0x1
    3281:    01 
    UN3_[i] = U3_[i]; // mark
    3282:    c5 fd 10 1c 03           vmovupd ymm3,YMMWORD PTR [rbx+rax*1]
    3287:    c4 c1 78 11 1c 00        vmovups XMMWORD PTR [r8+rax*1],xmm3
    328d:    c4 c3 7d 19 5c 00 10     vextractf128 XMMWORD PTR [r8+rax*1+0x10],ymm3,0x1
    3294:    01 
    UN4_[i] = U4_[i]; // mark
    3295:    c4 c1 7d 10 24 03        vmovupd ymm4,YMMWORD PTR [r11+rax*1]
    329b:    c5 f8 11 24 07           vmovups XMMWORD PTR [rdi+rax*1],xmm4
    32a0:    c4 e3 7d 19 64 07 10     vextractf128 XMMWORD PTR [rdi+rax*1+0x10],ymm4,0x1
    32a7:    01 
    32a8:    48 83 c0 20              add    rax,0x20
    32ac:    48 39 85 68 fd ff ff     cmp    QWORD PTR [rbp-0x298],rax
    32b3:    75 a4                    jne    3259 <main+0xab9>
    32b5:    39 d6                    cmp    esi,edx
    32b7:    74 3a                    je     32f3 <main+0xb53>
    32b9:    48 63 c6                 movsxd rax,esi
    UN1_[i] = U1_[i]; // mark
    32bc:    c4 c1 7b 10 44 c5 00     vmovsd xmm0,QWORD PTR [r13+rax*8+0x0]
    32c3:    c4 c1 7b 11 04 c2        vmovsd QWORD PTR [r10+rax*8],xmm0
    UN2_[i] = U2_[i]; // mark
    32c9:    c4 c1 7b 10 04 c4        vmovsd xmm0,QWORD PTR [r12+rax*8]
    32cf:    c4 c1 7b 11 04 c1        vmovsd QWORD PTR [r9+rax*8],xmm0
    UN3_[i] = U3_[i]; // mark
    32d5:    c5 fb 10 04 c3           vmovsd xmm0,QWORD PTR [rbx+rax*8]
    32da:    c4 c1 7b 11 04 c0        vmovsd QWORD PTR [r8+rax*8],xmm0
    UN4_[i] = U4_[i]; // mark
    32e0:    c4 c1 7b 10 04 c3        vmovsd xmm0,QWORD PTR [r11+rax*8]
    32e6:    c5 fb 11 04 c7           vmovsd QWORD PTR [rdi+rax*8],xmm0
    32eb:    48 83 c0 01              add    rax,0x1
    32ef:    39 c2                    cmp    edx,eax
    32f1:    7f c9                    jg     32bc <main+0xb1c>

Eureka!!!!!

Notice how, we can see YMM registers right after the annotated C++ code.
You can now tweak your code to optimize more if you want.
(EDIT : You'll notice that it's using VMOVUPS U for unalligned. I might want it to use VMOVAPS A for alligned. In older hardware, VMOVAPS is significantly faster.)

(EDIT : For clarification, you'll still need to profile your code. Just because it uses YMM registers doesn't mean that it will be fast in every case. The case shown above was a sample, so you need to profile your code. Most likely, in many cases where you don't have large arrays of data, serial code will be faster since CPUs can do roughly 10^8 (IIRC) operations in a second. If your arrays are very small, you will want to stick with serial code.)

Notice how, we can see XMM registers at the bottom.
Those are there as a safety feature.

SIMD vectors instructions only work on vectors of lengths that are multiple of 4,8,16 etc. This means that in some arrays, there will be left over elements. Those elements are also needed to be calculated, thus the XMM registers are used.

(
EDIT : The previously shown code took 10.0 seconds for 100,000 iterations in both serial and SIMD vectorized execution. So where did our 4X or 8X boost go?

Remember this golden tid-bit : "SIMD vectorizations are used to do a single operation over a large amount of data."

In the previous code shown, we were'nt really doing a single operation, we were doing four operations.

My optimization was to split that up into 4 loops:

Code:

// Store solution from the previous time step for use in mstage iterations
//
#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{
    UN1_[i] = U1_[i]; // mark
}

#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{
    UN2_[i] = U2_[i]; // mark
}

#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{
    UN3_[i] = U3_[i]; // mark
}


#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{
    UN4_[i] = U4_[i]; // mark
}

This caused the code to be further optimized and execution time was only 6.5 seconds!

Why was this version faster?

The cache was going to be hot for this type of linear acces, and looking at the assembly we can see that the "serial" code a the end of the loop for the leftover elements was completely unrolled in this version. To be honest, I didn't know that this version would be faster, but now that I know how the assembly code looks for fast loops, I know what to look out for in the future.

In the previous version, the operations on the leftover elements using XMM registers also used a conditional loop that we see in -O2 optimization. The complete loop unrolling that we see in this version is seen in -O3 optimization.

This is how I was able to tell why this version was faster!!!!!! I knew from my previous experience how -O3 vs -O2 optimized code looks and which is faster.

Do you now see why I say that learning assembly will help you a lot?

Code:

    UN4_[i] = U4_[i]; // mark
    39c2:    c4 c1 7d 10 24 33        vmovupd ymm4,YMMWORD PTR [r11+rsi*1]
    39c8:    c4 c1 78 11 24 37        vmovups XMMWORD PTR [r15+rsi*1],xmm4
    39ce:    c4 c3 7d 19 64 37 10     vextractf128 XMMWORD PTR [r15+rsi*1+0x10],ymm4,0x1
    39d5:    01

But where's 8x or 4x boost? I see only 2X bost!
The "serial" code was actually SSE vectorized. I just said it was "serial" for ease of understanding. AVX2 is 2X faster than SSE.

)

Now, you know how easy it is to analyze your assembly code.

There's a lot more to learn, but the intention of this post was to show how easy it was to analyze your assembly code, and to start a discussion.

Thanks and regards
~sayan

sbaffini · January 15, 2021, 05:06

Just a disclaimer for myself... of course, if I could, I would like to be good at everything (yet, probably, assembly would still not make it into the top 5 in this case). But healthy time and brain occupations are limited, and typically you invest them where better returns are expected.

Of course, in theory, assembly is almost fundamental in programming. But if I barely understand C++, I'm not gonna look at assembly with any great return. In the same time I would spend on it optimizing a certain loop I could implement, say, a new turbulence model in a code, and I would perceive that as something with better ROI. You know, we are who we are.

However, keep rolling this as, for the exact same reasons above, it is going to help a lot of us

aerosayan · January 15, 2021, 05:44

Quote:

Originally Posted by sbaffini

Just a disclaimer for myself... of course, if I could, I would like to be good at everything (yet, probably, assembly would still not make it into the top 5 in this case). But healthy time and brain occupations are limited, and typically you invest them where better returns are expected.

Of course, in theory, assembly is almost fundamental in programming. But if I barely understand C++, I'm not gonna look at assembly with any great return. In the same time I would spend on it optimizing a certain loop I could implement, say, a new turbulence model in a code, and I would perceive that as something with better ROI. You know, we are who we are.

However, keep rolling this as, for the exact same reasons above, it is going to help a lot of us

I agree on many of the things you said.

However I respectfully disagree with the statement of implementing a new turbulence model instead of optimizing loops. Maybe that was just a simple example that you gave, but I would rather optimize the loop to run as fast as possible then go for the turbulence mode, if my future work depends on the simple loop. Why would I want a shiny new turbulence model, when I know that my code will still be slow and I need to fix that before it becomes unfixable ?

"Do one thing, but do it well." That's been my motto.

The UN=U is a very simple example, but a very crucial one. This operation happens every iteration. I'm not sure if the solution I provided here is even the best solution. I could in theory run a loop from i=0 to i=nrows*ncols*4-1 since all of the 4 arrays are allocated in a single contiguous memory block. Now, that would be even faster.

(EDIT : I did the tests. I converted the whole thing as a single loop. The SSE vectorized code took 15 seconds. While the AVX2 vectorized code only took 6.13 seconds!!! 2X improvement from SSE. That's what we expected from the operation!) (EDIT : This was done with -O2 optimization since many people still like to use it. In -O3 optimization, both get compiled to the AVX2 vectorized form and take 6.19 seconds!)

The important parts of C++ isn't difficult. The new modern things in C++ are.
We shouldn't even be using the new modern things. Most of them are bad.
If someone knows basic C, they can do just fine.
OOP is overrated and slow as heck.

I'm a full time C++ dev, and I hate overuse of OOP with a burning passion.

Look up Orthodox C++. I don't like agree with many of the things said in the article (like only using printf for printing), but this and the links provided tells you what are the most important things necessary for C++ to run fast. Surprise, it's mostly C like codes. Overuse of modern OOP is a horrible decision.

https://gist.github.com/bkaradzic/2e39896bc7d8c34e042b

OOP is long dead, long live Data Oriented Design.
https://youtu.be/yy8jQgmhbAU

I'm sure you already know about CPU caches and other things since you have so much experience, but in case someone else in the forum wants to learn about it, this video by Scott Meyers is absolutely critical.
https://youtu.be/WDIkqP4JbkE

Learning Assembly coding is actually simple. I know this stuff because I was interested in OS development and made a small kernel. If you want to learn Assembly coding easily, this 16 video playlist by Creel would be more than enough to get you started.

https://www.youtube.com/watch?v=rxsB...tFT1Rmh3uJp7kA

I know many will disagree with worrying too much about Assembly code being generated, but as I showed above, learning this gives me immense freedom and confidence to write my code knowing that it will perform well.

I still have a lot to learn, but learning assembly has been one of my best decisions.

sbaffini · January 15, 2021, 06:47

I don't want to spoil this thread with additional personal consideration (instead, I invite you to keep it rolling with everything you feel useful to us in order to understand the matter), but:

Quote:

Originally Posted by aerosayan

However I respectfully disagree with the statement of implementing a new turbulence model instead of optimizing loops. Maybe that was just a simple example that you gave, but I would rather optimize the loop to run as fast as possible then go for the turbulence mode, if my future work depends on the simple loop. Why would I want a shiny new turbulence model, when I know that my code will still be slow and I need to fix that before it becomes unfixable ?

Yes, that was pretty much an example to say that, within a given time frame, what you can accomplish by looking at assembly code is not what I can even remotely do. In contrast, there will be other areas where I (or anyone else here) might be more productive than you. This really boils down, also, to the kind of project and all. A personal project, I think, always has (to have) learning, in a form or another. When you develop as work you can still manage to somehow "mistify" learning as something different, but there are much less opportunities. In that case, however, you hopefully have a team with complementary skills, where everyone does his part and transfers key concepts to be kept in mind to the others.

Still, if I have to make a more specific example in CFD, MPI is instead something that you should specifically favor over assembly. It makes the difference between running efficiently over any known system and, basically, just your PC.

Quote:

Originally Posted by aerosayan

The important parts of C++ isn't difficult. The new modern things in C++ are.
We shouldn't even be using the new modern things. Most of them are bad.
If someone knows basic C, they can do just fine.
OOP is overrated and slow as heck.

I'm a full time C++ dev, and I hate overuse of OOP with a burning passion.

Look up Orthodox C++. I don't like agree with many of the things said in the article (like only using printf for printing), but this and the links provided tells you what are the most important things necessary for C++ to run fast. Surprise, it's mostly C like codes. Overuse of modern OOP is a horrible decision.

https://gist.github.com/bkaradzic/2e39896bc7d8c34e042b

OOP is long dead, long live Data Oriented Design.
https://youtu.be/yy8jQgmhbAU

Yes, that kind of was another example. I actually have to work in C++, Qt, VTK, OpenCascade, etc. to design and supervise the development of our GUI. But one thing I learned about myself is that I am not a developer, I am a CFD developer, which is different. Also, that I won't ever be as productive as I am in developing CFD related algorithms in Fortran. Can I still do C++? Yes, but the things I can do in Fortran working solely on the CFD part are from another league and will always be. Of course, while you don't really need learning C++ without OOP if you already know programming, nor OOP is difficult at all in Fortran, that was not the point I was trying to make, which is more about time and productivity in that time.

Quote:

Originally Posted by aerosayan

Learning Assembly coding is actually simple. I know this stuff because I was interested in OS development and made a small kernel.

See, there is indeed a very specific reason for you to be into that. Not that making a good CFD code wouldn't be enough of a reason, but you still have a very specific background that most of us really don't have.

aerosayan · January 15, 2021, 07:20

PART 3 : How to measure performance?

You'll definitely want to measure the performance of your code. Even though the assembly may look fine, the code may still run slower, as we have seen above.

There's nothing better than actually profiling your code.

You could use many tools for the job. I generally use the following : perf, valgrind, google-benchmark

perf - A very important tool used by linux kernel devs to measure the performance of their low latency code. It's a little difficult to use at first, but it's extremely fast. Look at the image attached, you'll see that perf report has highlighted which assembly instruction is consuming how much % of our execution time. This is a blessing. It becomes difficult to understand sometimes, but if you know perf well, you are going to have a good time.

Installing perf can be difficult as it depends on the kernel. Expect a difficulty curve when trying to install it for the first time, and use it.

Here's a basic intro : https://youtu.be/M6ldFtwWup0
Here's another basic intro by Chandler Carruth : https://youtu.be/nXaxk27zwlk

Chandler goes into the nitty gritty details that you might need to understsand perf better.

valgrind - Great tool for finding memory leak bugs, and it's submodule cachegrind is extremely awesome for performance analysis. The problem is that valgrind is extremely slow. So, if you want to measure performance in a fast speed using valgrind and cachegrind, you'll have to write in the code where you want to measure the performance. This is advanced stuff and I'll not be covering it for now.

To look at the results produced by cachegrind, use the GUI tool kcachegrind. It's a really good GUI tool and makes the performance results easy to understand.

kcachegrind : https://kcachegrind.github.io/html/Home.html

google-benchmark - google-benchmark is a micro-benchmark library in which you can test how fast certain sections of your code will run. Say, you want to see whether to use a tree or an array to do an operation, you'll write a simple proof of concept code and test it in google-benchmark to see which one runs faster. Since this is a micro-benchmarking library, you can't test your whole code's performance in it. Only small parts.

See Chandler's video to understand some basics about google benchmark.

google-benchmark : https://github.com/google/benchmark

aerosayan · January 15, 2021, 17:59

PART - 4 : ALWAYS profile your code...

Use the perf tool to profile your code if you need to know how much performance cost each instruction is having. This gives a very good knowledge of exactly what's making your code slow.

Look at the attached image, you can see that VDIVPD i.e a division operation is hogging up a major portion of the computation time.

However this test is not enough. You'll want to run your code and see how much time it's actually running. You'll want to make little changes to the code and see in how much time the code finishes execution.

This code ran in 69 seconds (on average of 4 trials) for 4*100,000*ncells*4 iterations in each trial: (<- I don't completely trust this result for now. But it looks promising.)

Code:

// Iterate over all cells
//
#pragma omp simd linear(i:1)
for(i=0; i<ncells*4; ++i)
{                                                                  // <- loop i
    U_[i] = UN_[i]-(1.0/static_cast<fxp>(mstage-(m-1)))*dt/cvols_[i]*R_[i];
}                                                                  // -> loop i

This code ran in 110 seconds for 4*100,000*ncells*4 iterations in each trial run

Code:

// Iterate over all cells
//
#pragma omp simd linear(i:1)
for(i=0; i<ncells*4; ++i)
{                                                                  // <- loop i
    U_[i] = UN_[i]-dt/cvols_[i]/static_cast<fxp>(mstage-(m-1))*R_[i];
}                                                                  // -> loop i

I don't know why the first code is faster, but I will take it.

aerosayan · April 10, 2021, 23:45

Part 5 : Finding where you are in the code

I have previously explained how to find where you are in the code using a marker and grep tool to search through an objdump output. All of that is correct, but sometimes CPUs have a habit of reorganizing the order of operations to ensure better performance. That makes it a little bit difficult to read the assembly code.

Maybe the section of code you're investigation, is so performance critical that, you can't even do a single unnecessary addition in the loop or out of the loop. So, you can't clearly mark the end of the loop correctly.

In those cases, you can use the assembly NOP instruction to identify where you are. NOP instruction essentially says to the CPU to do nothing at all, so it has almost no performance loss.

Considering the code given below, we want to find out in the assembly code, where the loop ends. Right after the end of the loop, we will see the NOP machine instruction, that we have inserted in the C++ code using asm("nop")

Code:

    // Setup the CSV
    //
    #pragma omp simd linear(i:1)
    for(i=0; i<ncells*4; ++i)
    {
        U1_[i] = UN1_[i]; // xdawg
    }
    asm("nop");

Here's the output from objdump, and searching for marker xdawg.

Code:

        U1_[i] = UN1_[i]; // xdawg
    8488:    c5 f8 10 24 03           vmovups xmm4,XMMWORD PTR [rbx+rax*1]
    848d:    c4 e3 5d 18 44 03 10     vinsertf128 ymm0,ymm4,XMMWORD PTR [rbx+rax*1+0x10],0x1
    8494:    01 
    8495:    c4 c1 78 11 04 04        vmovups XMMWORD PTR [r12+rax*1],xmm0
    849b:    c4 c3 7d 19 44 04 10     vextractf128 XMMWORD PTR [r12+rax*1+0x10],ymm0,0x1
    84a2:    01 
    84a3:    48 83 c0 20              add    rax,0x20
    84a7:    48 39 c2                 cmp    rdx,rax
    84aa:    75 dc                    jne    8488


   <LOTS OF CODE HERE ... WE JUST REMOVED THEM FOR CLARITY>


    8554:    c5 f8 77                 vzeroupper  <- HERE'S OUR END OF LOOP



    asm("nop");
    8557:    90                       nop <- HERE's OUR NOP INSTRUCTION

arjun · April 11, 2021, 07:16

Quote:

Originally Posted by aerosayan

PART - 4 : ALWAYS profile your code...

This code ran in 69 seconds (on average of 4 trials) for 4*100,000*ncells*4 iterations in each trial: (<- I don't completely trust this result for now. But it looks promising.)

Code:

// Iterate over all cells
//
#pragma omp simd linear(i:1)
for(i=0; i<ncells*4; ++i)
{                                                                  // <- loop i
    U_[i] = UN_[i]-(1.0/static_cast<fxp>(mstage-(m-1)))*dt/cvols_[i]*R_[i];
}                                                                  // -> loop i

This code ran in 110 seconds for 4*100,000*ncells*4 iterations in each trial run

Code:

// Iterate over all cells
//
#pragma omp simd linear(i:1)
for(i=0; i<ncells*4; ++i)
{                                                                  // <- loop i
    U_[i] = UN_[i]-dt/cvols_[i]/static_cast<fxp>(mstage-(m-1))*R_[i];
}                                                                  // -> loop i

I don't know why the first code is faster, but I will take it.

my guess is that in first code first (1/ something is computed) which may be by a specialized algorithm. In second code dt/cvols is calculated which is not so fast as 1/something.

aerosayan · May 16, 2021, 17:26

Part 6 : Always freakin' profile your code for cache misses

Coding in C++ has amazing advantages, except for all of the curses required to make it work.

Cache misses suck. Instruction cache misses aren't as much of a problem as data cache misses.

In data cache misses, L3 data cache misses cause horrible slowdowns.

If you're starting to optimize your code, just doing profiling with a simple profiler like perf is not enough. Perf is a great tool, but in many cases of writing numerical codes, you would want to also do a cache performance profile.

Best tool for this (as per my own experience) is cachegrind, and to see the results you will use kcachegrind GUI tool.

In this part, I will share how profiling my code using cachegrind literally reduced L3 cache misses from 114,075,777 to 868,001.

The numbers by themselves don't mean much, as everything is relative, except when we see them increase or decrease. If they decrease, our performance is improving.

My faulty code, was the main solution update loop:

Code:

#pragma omp simd linear(i:1)
for(i=0; i<ncells*4; ++i)
{
        U_[i] = UN_[i] + alpha*rcvols_[i]*R_[i];

}

I have shared this multiple times here, and I thought it was such a smart idea, to iterate over the complete data set using ncells*4. The array U_ contains the solution data's 4 variables, hence ncells*4.

Unfortunately, this is trash code. Utter, garbage, beyond any salvage.

Unfortunately, this code was not removed, because I didn't do a profile for the cache misses produced by this code. Specially the L3 cache misses.

Here's the result from cachegrind.

Code:

==36349== 
==36349== I   refs:      2,097,055,417
==36349== I1  misses:        2,384,958
==36349== LLi misses:            3,215
==36349== I1  miss rate:          0.11%
==36349== LLi miss rate:          0.00%
==36349== 
==36349== D   refs:      1,004,739,022  (685,037,493 rd   + 319,701,529 wr)
==36349== D1  misses:      135,452,262  ( 97,769,886 rd   +  37,682,376 wr)
==36349== LLd misses:      114,075,777  ( 86,001,879 rd   +  28,073,898 wr)
==36349== D1  miss rate:          13.5% (       14.3%     +        11.8%  )
==36349== LLd miss rate:          11.4% (       12.6%     +         8.8%  )
==36349== 
==36349== LL refs:         137,837,220  (100,154,844 rd   +  37,682,376 wr)
==36349== LL misses:       114,078,992  ( 86,005,094 rd   +  28,073,898 wr)
==36349== LL miss rate:            3.7% (        3.1%     +         8.8%  )

We can see the LLd misses and miss rate to be very high. 11.4% miss rate is significantly high. Since, each L3 miss loses about 200 cycles, this is horrible code performance. Horrible.

After noticing this behavior, I used kcachegrind to figure out, where the slowdown was. As shown in cache-1.jpg, it's the solution update loop.

Doing some manual testing, I figured it out, that the performance degradation was due to using ncells*4. Apparently for some unknown reason, the data wants to be accessed in groups of ncells. I don't know why this is the case.

This is why we can see around 81 million L3 cache misses happened in that loop. Completely unacceptable.

So, I rewrote the code.

Code:

#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{                                                                  // <- loop i
    U1_[i] = UN1_[i] + alpha*rcvols_[i]*R1_[i];
}                                                                  // -> loop i

#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{                                                                  // <- loop i
    U2_[i] = UN2_[i] + alpha*rcvols_[i]*R2_[i];
}                                                                  // -> loop i

#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{                                                                  // <- loop i
    U3_[i] = UN3_[i] + alpha*rcvols_[i]*R3_[i];
}                                                                  // -> loop i

#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{                                                                  // <- loop i
    U4_[i] = UN4_[i] + alpha*rcvols_[i]*R4_[i];
}                                                                  // -> loop i

Running cachegrind, then showed us that the L3 miss rate was reduced from 11.4% to 0.1%. This is a MASSIVE improvement. This can be also seen in the significant reduction in the reported L3 misses.

Code:

==36599== 
==36599== I   refs:      2,097,125,455
==36599== I1  misses:        2,384,984
==36599== LLi misses:            3,239
==36599== I1  miss rate:          0.11%
==36599== LLi miss rate:          0.00%
==36599== 
==36599== D   refs:      1,019,785,554  (700,075,010 rd   + 319,710,544 wr)
==36599== D1  misses:      135,472,760  ( 97,781,885 rd   +  37,690,875 wr)
==36599== LLd misses:          868,001  (    516,180 rd   +     351,821 wr)
==36599== D1  miss rate:          13.3% (       14.0%     +        11.8%  )
==36599== LLd miss rate:           0.1% (        0.1%     +         0.1%  )
==36599== 
==36599== LL refs:         137,857,744  (100,166,869 rd   +  37,690,875 wr)
==36599== LL misses:           871,240  (    519,419 rd   +     351,821 wr)
==36599== LL miss rate:            0.0% (        0.0%     +         0.1%  )

When seeing the result in kcachegrind as in cache-2.jpg, it was observed that indeed, each loop only had 90,000 L3 misses.

I'm not sure if 90,000 L3 misses is good. I don't think it is, but I will investigate that on a later date. Probably my data alignment is all messed up or something. I don't know.

arjun · May 16, 2021, 22:47

I had a calculation affected by this issue recently where the 8 processor run did the job in 12 seconds while 1 processor run did it in 300 seconds. Found out that one loop had too much going on.

Now slowly breaking down things in smaller groups :-D

aerosayan · May 17, 2021, 05:11

Quote:

Originally Posted by arjun

Found out that one loop had too much going on.

Why does that happen?
I have to look into it, but I don't know why the single threaded long loop performed poor.
It makes no sense currently to me. I always thought CPUs like big arrays. That's what SIMD tutorials recommend, and technically makes sense.
This is a mystery to me.

EDIT : Are you using the __restrict__ keyword?

aerosayan · June 13, 2021, 15:13

Part 7 : Use the Load, Calculate, Store method to write easily optimizable code

Older compilers sometimes have difficulties to optimize code if some equation is very long, and complex. It makes sense to write code that will be optimized easily even by older compilers.

The simplest method for me, has been to use the Load, Calculate, Store method.

It's as easy as doing your complex equation in multiple steps, instead of writing a single big equation. Like how we did in this piece of code from my solver. In this : we load the data, do some calculations, and store it back. The compiler loves this!!!

Code:

/// Calculate pressure for all cells
/// @param psi   : pressure
/// @param rho   : density
/// @param e     : energy
/// @param u     : x component of velocity
/// @param v     : y component of velocity
/// @param g     : gamma-1
///              > ensure it's gamma-1 not gamma
/// @param n     : total cells
///
template<typename fxp> NCX_FORCE_INLINE
void calc_pressure(fxp * ncxrestrict psi
,                  fxp * ncxrestrict rho, fxp * ncxrestrict e
,                  fxp * ncxrestrict u  , fxp * ncxrestrict v
,                  fxp   g, szt n)
{
    #pragma omp simd
    for(szt i=0; i<n; ++i)
    {
        //
        // > EQN :
        //
        //   psi[i] = g*rho[i]*( e[i] - 0.5*(u[i]*u[i] + v[i]*v[i]) );
        //
        //            ^          ^      ^   ^---------------------^ part_1
        //            ^          ^      ^-------------------------^ part_2
        //            ^          ^--------------------------------^ part_3
        //            ^-------------------------------------------^ part_4
        //

        fxp const ii_u    = u[i];
        fxp const ii_v    = v[i];
        fxp const ii_e    = e[i];
        fxp const ii_rho  = rho[i];

        fxp const uu     = ii_u * ii_u;
        fxp const vv     = ii_v * ii_v;
        fxp const part_1 = uu + vv;
        fxp const part_2 = static_cast<fxp>(0.5) *  part_1;
        fxp const part_3 = ii_e - part_2;
        fxp const part_4 = g * ii_rho * part_3;

        psi[i] = part_4;                                   //> mk-calc-pressure
    }
}

sbaffini · June 14, 2021, 03:57

Quote:

Originally Posted by aerosayan

Part 7 : Use the Load, Calculate, Store method to write easily optimizable code

Older compilers sometimes have difficulties to optimize code if some equation is very long, and complex. It makes sense to write code that will be optimized easily even by older compilers.

The simplest method for me, has been to use the Load, Calculate, Store method.

It's as easy as doing your complex equation in multiple steps, instead of writing a single big equation. Like how we did in this piece of code from my solver. In this : we load the data, do some calculations, and store it back. The compiler loves this!!!

Code:

/// Calculate pressure for all cells
/// @param psi   : pressure
/// @param rho   : density
/// @param e     : energy
/// @param u     : x component of velocity
/// @param v     : y component of velocity
/// @param g     : gamma-1
///              > ensure it's gamma-1 not gamma
/// @param n     : total cells
///
template<typename fxp> NCX_FORCE_INLINE
void calc_pressure(fxp * ncxrestrict psi
,                  fxp * ncxrestrict rho, fxp * ncxrestrict e
,                  fxp * ncxrestrict u  , fxp * ncxrestrict v
,                  fxp   g, szt n)
{
    #pragma omp simd
    for(szt i=0; i<n; ++i)
    {
        //
        // > EQN :
        //
        //   psi[i] = g*rho[i]*( e[i] - 0.5*(u[i]*u[i] + v[i]*v[i]) );
        //
        //            ^          ^      ^   ^---------------------^ part_1
        //            ^          ^      ^-------------------------^ part_2
        //            ^          ^--------------------------------^ part_3
        //            ^-------------------------------------------^ part_4
        //

        fxp const ii_u    = u[i];
        fxp const ii_v    = v[i];
        fxp const ii_e    = e[i];
        fxp const ii_rho  = rho[i];

        fxp const uu     = ii_u * ii_u;
        fxp const vv     = ii_v * ii_v;
        fxp const part_1 = uu + vv;
        fxp const part_2 = static_cast<fxp>(0.5) *  part_1;
        fxp const part_3 = ii_e - part_2;
        fxp const part_4 = g * ii_rho * part_3;

        psi[i] = part_4;                                   //> mk-calc-pressure
    }
}

I know it's kind of a vague question but, do you think this specifically relates to a given compiler and language or is it more general?

aerosayan · June 14, 2021, 07:14

Quote:

Originally Posted by sbaffini

I know it's kind of a vague question but, do you think this specifically relates to a given compiler and language or is it more general?

Not a vague question.

My boss wrote his solvers and his whole CFD software to be compiled with GCC 4.8.

Apparently users in NASA, and US Army, used older machines which generally didn't have new Operating Systems and new compilers.

10 years earlier, he used to support older versions of GCC too.

In those older versions of GCC (I think < GCC 4.8), you don't have the openmp simd directives to force the compiler to do aggressive vectorization. In such cases, you have to depend on the auto-vectorization provided by the compiler. So, you have to write the code in such a way that it's easy for the compiler to auto-vectorize.

In Game Engine development for XBOX, PS2, PS3, etc, they also use older compilers (more specifically they sometimes used Microsoft compilers, which are garbage). They definitely don't use OpenMP. So, even there, they write the code using Load, Calculate, Store, method to allow auto-vectorization.

The equation psi[i] = g*rho[i]*... gets vectorized even if I didn't break it up into parts. Because OpenMP simd directive forces the compiler to do aggresqsive vectorization on that equation. But, in case I ever need to support older compilers that don't have OpenMP SIMD directives, I would like it to be auto-vectorized. Obviously this is more noticeable in larger equations that span multiple lines.

I know there's an overhead of writing out the code in multiple parts, but I still like to do it, as it simplifies the code structure for me, and most importantly, I can understand the disassembly very easily, as every part of the equation is in a separate line.

It might not be 100% necessary for Fortran, if you use Intel compilers, since they do aggressive optimizations. However, I don't trust gfortran to always produce optimized code.

GCC on the other hand is more mature, thus I prefer to use this technique on C/C++ codes. I have also used the const keyword to instruct to the compiler that the values of the different parts won't change once initiated, so it can do further aggressive optimizations.

To add some further optimizations, I use the ncxrestrict macro which is just #define ncxrestrict __restrict__ since C/C++ codes generally suffer from aliasing issues.

Thus the C++ code should be as fast as Fortran-77.

I need to run tests someday to verify.

aerosayan · June 14, 2021, 08:02

I mentioned previously that writing the equation out in multiple parts helps in easier understanding of the equation. Even if you're not extremely skilled in understanding assembly code, you can still understand it if you write out the equation in multiple parts.

As shown in this section of disassembled code, you can see that the annotated source code, easily shows you exactly which line produces which assembly code instructions.

This is very good, since you can easily understand it, if you know just the basics of assembly code instructions like ADD, SUB, MUL, DIV etc.

Code:


        fxp const ii_u    = u[i];
    3c70:    c4 c1 79 10 34 12        vmovupd xmm6,XMMWORD PTR [r10+rdx*1]
        fxp const ii_v    = v[i];
    3c76:    c4 c1 79 10 3c 11        vmovupd xmm7,XMMWORD PTR [r9+rdx*1]
        fxp const ii_u    = u[i];
    3c7c:    c4 c3 4d 18 44 12 10     vinsertf128 ymm0,ymm6,XMMWORD PTR [r10+rdx*1+0x10],0x1
    3c83:    01 
        fxp const ii_v    = v[i];
    3c84:    c4 c3 45 18 54 11 10     vinsertf128 ymm2,ymm7,XMMWORD PTR [r9+rdx*1+0x10],0x1
    3c8b:    01 
        fxp const ii_e    = e[i];
    3c8c:    c4 c1 79 10 2c 10        vmovupd xmm5,XMMWORD PTR [r8+rdx*1]
    3c92:    c4 c3 55 18 4c 10 10     vinsertf128 ymm1,ymm5,XMMWORD PTR [r8+rdx*1+0x10],0x1
    3c99:    01 
        fxp const uu     = ii_u * ii_u;
    3c9a:    c5 fd 59 c0              vmulpd ymm0,ymm0,ymm0
        fxp const ii_rho  = rho[i];
    3c9e:    c5 f9 10 34 16           vmovupd xmm6,XMMWORD PTR [rsi+rdx*1]
        fxp const vv     = ii_v * ii_v;
    3ca3:    c5 ed 59 d2              vmulpd ymm2,ymm2,ymm2
        fxp const part_1 = uu + vv;
    3ca7:    c5 fd 58 c2              vaddpd ymm0,ymm0,ymm2
        fxp const part_2 = half * part_1;
    3cab:    c5 fd 59 c4              vmulpd ymm0,ymm0,ymm4
        fxp const part_3 = ii_e - part_2;
    3caf:    c5 f5 5c c0              vsubpd ymm0,ymm1,ymm0
        fxp const ii_rho  = rho[i];
    3cb3:    c4 e3 4d 18 4c 16 10     vinsertf128 ymm1,ymm6,XMMWORD PTR [rsi+rdx*1+0x10],0x1
    3cba:    01 
        fxp const part_4 = g * ii_rho * part_3;
    3cbb:    c5 f5 59 cb              vmulpd ymm1,ymm1,ymm3
    3cbf:    c5 fd 59 c1              vmulpd ymm0,ymm0,ymm1
        psi[i] = part_4;                                   //> mk-calc-pressure
    3cc3:    c5 f8 11 04 11           vmovups XMMWORD PTR [rcx+rdx*1],xmm0
    3cc8:    c4 e3 7d 19 44 11 10     vextractf128 XMMWORD PTR [rcx+rdx*1+0x10],ymm0,0x1

aerosayan · July 31, 2021, 04:38

Part 8 : easy method to extract assembly code

This is an important trick. It has saved me HOURS of development time.

Let's say I want to see the code generated by some arbitrary piece of code. For this example, we will consider a call to printf. Well, how do we do that, the previous marker method I showed only works on a per line basis, but it's not meant for extracting the code from a whole section. And during development, I tend to extract code from a whole section, not individual lines.

Here comes the NOP instruction

Code:

    // we'll use this

    #define NOP __asm("nop");


    //
    NOP;
    printf("hella fine\n");
    NOP;
    //

See, that we put one NOP instruction before and after the printf statement.

Now, we will use sed to extract all of the code between these two statements

Code:

$ objdump -w -SC --no-show-raw-insn -Mintel myExecutable | sed -n '/NOP/,/NOP/p'
  
    NOP;
    10de:    nop
    10df:    lea    rsi,[rip+0xf95]        # 207b <_IO_stdin_used+0x7b>
    10e6:    mov    edi,0x1
    10eb:    xor    eax,eax
    10ed:    call   1090 <__printf_chk@plt>
    printf("hella fine\n");
    NOP;

AMAZING!!!!

I have previously shown how to use objdump, this is an intuitive method to get better results.

For ease of use, I have made a bash function out of this.

Code:

# objdump code between two markers
# my executable file name is in file run.txt, so I can use this command anywhere
# and I don't have to manually enter my executable file name
#

obseg()
{
    objdump -w -SC --no-show-raw-insn -Mintel `cat run.txt` | sed -n '/'"$1"'/,/'"$1"'/p'
}

Now, I can use it simply as:

Code:

$ obseg NOP

You can use other markers instead of NOP, and it would work similarly.

Hopefully, it was useful to some.

January 14, 2021, 15:32	Motivation and introduction for learning assembly coding	#1
aerosayan Senior Member Sayan Bhattacharjee Join Date: Mar 2020 Posts: 495 Rep Power: 8	I'm really thankful to this forum and its members for helping me out in my hobby project. However I have seen that many members, even super experienced ones don't want to mess with understanding the machine(let's say assembly for brevity) code generated by the compiler. I understand why diving deep into the assembly code might not seem very interesting or even worth the effort for most. Unfortunately it is very useful, and people are missing out on the benefits of understanding and analyzing the assembly code. This post is to start a discussion and me willing to help anyone interested in learning more. PART 1 : What are the benefits of understanding assembly code? Here are a few benefits of understanding assembly code : - We know exactly how much fast a small portion of code will run, since we know from Intel/AMD CPU manuals how much fast (throughput and latency) any particular machine instruction like (IMUL, ADD, SUB) are and if the generated assembly uses the highest SIMD vector registers (i.e XMM (okay), YMM (fast and most common on Intel i3-i5-i7 family), ZMM (really fast and available on special CPUs)) So, if you see that your generated assembly code uses YMM registers, you're in for a good time. That means your code is using AVX2 vectorization and not the slow SSE vectorization. - If you know how the assembly code works, you know exactly how your data needs to be in order to gain maximum performance. (Hint : 1D arrays is the best. Linear access is the best.) You might want to store the CSV as vector = [rho, rhou, rhov, rhoE][rho, rhou, rhov, rhoE] You might think that having the data close to each other like that will improve your cache performance. You're right. So what's the problem? The problem is that you're doomed to only using slow XMM registers. That's what the compiler will generate in order to be safe. If you want to force the compiler to generate AVX2 vectorized code that uses YMM registers, you can. However the problem is that in order to use the data in that form (say CSV to PSV conversion operation is required), AVX2 vectorized code will use machine instructions that do complex operations like : interleave the data, permutate the data, rotate the data and other very complex operations. This will cause your AVX2 "optimized" code to run slower than the code first generated by the compiler. What's the "right" way in this context? Store the data as vector = [rho, rho, rho, rho][rhou, rhou, rhou, rhou], [rhov, rhov, rhov][rhoE, rhoE, rhoE, rhoE] and access the data as arrays U1[i], U2[i], U3[i], U4[i] etc. Where you'll use pointers to set the first element of each of those arrays. I have done this, and every loop is vectorized to use AVX2 instructions by the compiler!* This is 4X faster than the serial code for double precision data and 8X faster than serial code for single precision data. That's because YMM registers are 256 bits wide and can fit 4 and 8 double and single precision values respectively.And then every vector operation will add/multiply/subtract/divide the numbers (4 or 8) at once in a single instruction! If your code isn't using SIMD vectorization, you're wasting performance. Sorry, that's mathematically proven. PART 2 : How do we compile and study the assembly code... coming soon.. sbaffini, ships26 and aero_head like this.

January 14, 2021, 16:10		#2
aerosayan Senior Member Sayan Bhattacharjee Join Date: Mar 2020 Posts: 495 Rep Power: 8	PART 2 : How do we compile and study the assembly code I hope some are interested to dive deep into the generated assembly code and optimize our operations for maximum performance. Let's look at the operation UN = U that is done at every timestep. We copy over the values inside U from the previous timestep and use it to solve for our current timestep. Cool... this is a very important operation as it is right in the middle of our solver loop and I don't want it to be slow. Since this is a very simple operation, you would be able to understand what's going on, and able to focus on the method. Here's the C++ code: Code: // Store solution from the previous time step for use in mstage iterations // #pragma omp simd linear(i:1) for(i=0; i<ncells; ++i) { UN1_[i] = U1_[i]; // mark <- NOTE this string "mark" UN2_[i] = U2_[i]; // mark UN3_[i] = U3_[i]; // mark UN4_[i] = U4_[i]; // mark } Notice how our operation is a simple for loop and our data is laid out into an 1D array. This is very important to notice, since our grid is actually a 2D structured grid as shown in the picture, and the naive method would have been to access the data as a 2D array inside fortran or a std::vector<std::vector<double>> in C++. Both are bad for vectorization, as the compiler warns that you're not doing "sequential access" i.e iterate over an 1D array. Computers love 1D arrays and linear access. Give her some. Now, before we go any further .. you'll need some tools. TOOLS required : g++ compiler, objdump tool, grep tool running under GNU/Linux OS. We will want to provide the flags -g to g++ to ensure that our source code and debug symbols are also included in the compiled binary. This will allow us to understand the assembly code easily. Compile: Code: $ g++ -g -O3 -fopenmp -mavx2 my_awesome_code.cpp The -O3 or -O2 optimization is required. Otherwise why would you study unoptimized code? This is strictly required. -fopenmp -mavx2 are used to allow OpenMP instructions to be used in the C++ code. And since I've used them, I included the flags here. They're not strictly required. Our task is to analyze the machine code generated by the compiler. However since we can't read machine code, we use a disassembler to disassemble it into assembly code. objdump is our disassembler of choice. Here's the command: Code: $ objdump -SC -Mintel a.out \| grep 'mark' -A30 -S annotates the code with the strings of your source code. This helps you read and understand the assembly code. -C is something I forgot, but it's required for now. -Mintel tells the disassembler to use the Intel Assembly code syntax. Trust me, it's life saving. You don't want to read AT&T Assembly code syntax. a.out is our executable This will print all of the annotated assembly code. That would be roughly around 9000 to 9,000,000 lines (based on your project size). So, we need something to make it easier for us to search through this blob of text and find our region of interest. That's why grep is used. grep 'mark' -A30 will search for the string "mark" (see in the C++ code) and print 30 lines of assembly code after it. This lets us find our interesting assembly code in an instant. Change "mark" to whatever you like. Here's the assembly code: Code: objdump -SC -Mintel a.out \| grep 'mark' -A30 UN1_[i] = U1_[i]; // mark 3259: c4 c1 7d 10 64 05 00 vmovupd ymm4,YMMWORD PTR [r13+rax1+0x0] 3260: c4 c1 78 11 24 02 vmovups XMMWORD PTR [r10+rax1],xmm4 3266: c4 c3 7d 19 64 02 10 vextractf128 XMMWORD PTR [r10+rax1+0x10],ymm4,0x1 326d: 01 UN2_[i] = U2_[i]; // mark 326e: c4 c1 7d 10 14 04 vmovupd ymm2,YMMWORD PTR [r12+rax1] 3274: c4 c1 78 11 14 01 vmovups XMMWORD PTR [r9+rax1],xmm2 327a: c4 c3 7d 19 54 01 10 vextractf128 XMMWORD PTR [r9+rax1+0x10],ymm2,0x1 3281: 01 UN3_[i] = U3_[i]; // mark 3282: c5 fd 10 1c 03 vmovupd ymm3,YMMWORD PTR [rbx+rax1] 3287: c4 c1 78 11 1c 00 vmovups XMMWORD PTR [r8+rax1],xmm3 328d: c4 c3 7d 19 5c 00 10 vextractf128 XMMWORD PTR [r8+rax1+0x10],ymm3,0x1 3294: 01 UN4_[i] = U4_[i]; // mark 3295: c4 c1 7d 10 24 03 vmovupd ymm4,YMMWORD PTR [r11+rax1] 329b: c5 f8 11 24 07 vmovups XMMWORD PTR [rdi+rax1],xmm4 32a0: c4 e3 7d 19 64 07 10 vextractf128 XMMWORD PTR [rdi+rax1+0x10],ymm4,0x1 32a7: 01 32a8: 48 83 c0 20 add rax,0x20 32ac: 48 39 85 68 fd ff ff cmp QWORD PTR [rbp-0x298],rax 32b3: 75 a4 jne 3259 <main+0xab9> 32b5: 39 d6 cmp esi,edx 32b7: 74 3a je 32f3 <main+0xb53> 32b9: 48 63 c6 movsxd rax,esi UN1_[i] = U1_[i]; // mark 32bc: c4 c1 7b 10 44 c5 00 vmovsd xmm0,QWORD PTR [r13+rax8+0x0] 32c3: c4 c1 7b 11 04 c2 vmovsd QWORD PTR [r10+rax8],xmm0 UN2_[i] = U2_[i]; // mark 32c9: c4 c1 7b 10 04 c4 vmovsd xmm0,QWORD PTR [r12+rax8] 32cf: c4 c1 7b 11 04 c1 vmovsd QWORD PTR [r9+rax8],xmm0 UN3_[i] = U3_[i]; // mark 32d5: c5 fb 10 04 c3 vmovsd xmm0,QWORD PTR [rbx+rax8] 32da: c4 c1 7b 11 04 c0 vmovsd QWORD PTR [r8+rax8],xmm0 UN4_[i] = U4_[i]; // mark 32e0: c4 c1 7b 10 04 c3 vmovsd xmm0,QWORD PTR [r11+rax8] 32e6: c5 fb 11 04 c7 vmovsd QWORD PTR [rdi+rax8],xmm0 32eb: 48 83 c0 01 add rax,0x1 32ef: 39 c2 cmp edx,eax 32f1: 7f c9 jg 32bc <main+0xb1c> Eureka!!!!! Notice how, we can see YMM registers right after the annotated C++ code. You can now tweak your code to optimize more if you want. (EDIT : You'll notice that it's using VMOVUPS U for unalligned. I might want it to use VMOVAPS A for alligned. In older hardware, VMOVAPS is significantly faster.) (EDIT : For clarification, you'll still need to profile your code. Just because it uses YMM registers doesn't mean that it will be fast in every case. The case shown above was a sample, so you need to profile your code. Most likely, in many cases where you don't have large arrays of data, serial code will be faster since CPUs can do roughly 10^8 (IIRC) operations in a second. If your arrays are very small, you will want to stick with serial code.) Notice how, we can see XMM registers at the bottom. Those are there as a safety feature. SIMD vectors instructions only work on vectors of lengths that are multiple of 4,8,16 etc. This means that in some arrays, there will be left over elements. Those elements are also needed to be calculated, thus the XMM registers are used. ( EDIT : The previously shown code took 10.0 seconds for 100,000 iterations in both serial and SIMD vectorized execution. So where did our 4X or 8X boost go? Remember this golden tid-bit : "SIMD vectorizations are used to do a single operation over a large amount of data." In the previous code shown, we were'nt really doing a single operation, we were doing four operations. My optimization was to split that up into 4 loops: Code: // Store solution from the previous time step for use in mstage iterations // #pragma omp simd linear(i:1) for(i=0; i<ncells; ++i) { UN1_[i] = U1_[i]; // mark } #pragma omp simd linear(i:1) for(i=0; i<ncells; ++i) { UN2_[i] = U2_[i]; // mark } #pragma omp simd linear(i:1) for(i=0; i<ncells; ++i) { UN3_[i] = U3_[i]; // mark } #pragma omp simd linear(i:1) for(i=0; i<ncells; ++i) { UN4_[i] = U4_[i]; // mark } This caused the code to be further optimized and execution time was only 6.5 seconds! Why was this version faster? The cache was going to be hot for this type of linear acces, and looking at the assembly we can see that the "serial" code a the end of the loop for the leftover elements was completely unrolled in this version. To be honest, I didn't know that this version would be faster, but now that I know how the assembly code looks for fast loops, I know what to look out for in the future. In the previous version, the operations on the leftover elements using XMM registers also used a conditional loop that we see in -O2 optimization. The complete loop unrolling that we see in this version is seen in -O3 optimization. This is how I was able to tell why this version was faster!!!!!! I knew from my previous experience how -O3 vs -O2 optimized code looks and which is faster. Do you now see why I say that learning assembly will help you a lot? Code: UN4_[i] = U4_[i]; // mark 39c2: c4 c1 7d 10 24 33 vmovupd ymm4,YMMWORD PTR [r11+rsi1] 39c8: c4 c1 78 11 24 37 vmovups XMMWORD PTR [r15+rsi1],xmm4 39ce: c4 c3 7d 19 64 37 10 vextractf128 XMMWORD PTR [r15+rsi1+0x10],ymm4,0x1 39d5: 01 But where's 8x or 4x boost? I see only 2X bost!* The "serial" code was actually SSE vectorized. I just said it was "serial" for ease of understanding. AVX2 is 2X faster than SSE. ) Now, you know how easy it is to analyze your assembly code. There's a lot more to learn, but the intention of this post was to show how easy it was to analyze your assembly code, and to start a discussion. Thanks and regards ~sayan sbaffini, ships26 and aero_head like this. Last edited by aerosayan; January 14, 2021 at 18:49. Reason: added optimization tips

January 15, 2021, 05:06		#3
sbaffini Senior Member Paolo Lampitella Join Date: Mar 2009 Location: Italy Posts: 2,168 Blog Entries: 29 Rep Power: 39	Just a disclaimer for myself... of course, if I could, I would like to be good at everything (yet, probably, assembly would still not make it into the top 5 in this case). But healthy time and brain occupations are limited, and typically you invest them where better returns are expected. Of course, in theory, assembly is almost fundamental in programming. But if I barely understand C++, I'm not gonna look at assembly with any great return. In the same time I would spend on it optimizing a certain loop I could implement, say, a new turbulence model in a code, and I would perceive that as something with better ROI. You know, we are who we are. However, keep rolling this as, for the exact same reasons above, it is going to help a lot of us aero_head likes this.

April 10, 2021, 23:45		#8
aerosayan Senior Member Sayan Bhattacharjee Join Date: Mar 2020 Posts: 495 Rep Power: 8	Part 5 : Finding where you are in the code I have previously explained how to find where you are in the code using a marker and grep tool to search through an objdump output. All of that is correct, but sometimes CPUs have a habit of reorganizing the order of operations to ensure better performance. That makes it a little bit difficult to read the assembly code. Maybe the section of code you're investigation, is so performance critical that, you can't even do a single unnecessary addition in the loop or out of the loop. So, you can't clearly mark the end of the loop correctly. In those cases, you can use the assembly NOP instruction to identify where you are. NOP instruction essentially says to the CPU to do nothing at all, so it has almost no performance loss. Considering the code given below, we want to find out in the assembly code, where the loop ends. Right after the end of the loop, we will see the NOP machine instruction, that we have inserted in the C++ code using asm("nop") Code: // Setup the CSV // #pragma omp simd linear(i:1) for(i=0; i<ncells4; ++i) { U1_[i] = UN1_[i]; // xdawg } asm("nop"); Here's the output from objdump, and searching for marker xdawg. Code: U1_[i] = UN1_[i]; // xdawg 8488: c5 f8 10 24 03 vmovups xmm4,XMMWORD PTR [rbx+rax1] 848d: c4 e3 5d 18 44 03 10 vinsertf128 ymm0,ymm4,XMMWORD PTR [rbx+rax1+0x10],0x1 8494: 01 8495: c4 c1 78 11 04 04 vmovups XMMWORD PTR [r12+rax1],xmm0 849b: c4 c3 7d 19 44 04 10 vextractf128 XMMWORD PTR [r12+rax*1+0x10],ymm0,0x1 84a2: 01 84a3: 48 83 c0 20 add rax,0x20 84a7: 48 39 c2 cmp rdx,rax 84aa: 75 dc jne 8488 <LOTS OF CODE HERE ... WE JUST REMOVED THEM FOR CLARITY> 8554: c5 f8 77 vzeroupper <- HERE'S OUR END OF LOOP asm("nop"); 8557: 90 nop <- HERE's OUR NOP INSTRUCTION aero_head likes this.

May 16, 2021, 22:47		#11
arjun Senior Member Arjun Join Date: Mar 2009 Location: Nurenberg, Germany Posts: 1,278 Rep Power: 34	I had a calculation affected by this issue recently where the 8 processor run did the job in 12 seconds while 1 processor run did it in 300 seconds. Found out that one loop had too much going on. Now slowly breaking down things in smaller groups :-D aerosayan and aero_head like this.

July 31, 2021, 04:38		#17
aerosayan Senior Member Sayan Bhattacharjee Join Date: Mar 2020 Posts: 495 Rep Power: 8	Part 8 : easy method to extract assembly code This is an important trick. It has saved me HOURS of development time. Let's say I want to see the code generated by some arbitrary piece of code. For this example, we will consider a call to printf. Well, how do we do that, the previous marker method I showed only works on a per line basis, but it's not meant for extracting the code from a whole section. And during development, I tend to extract code from a whole section, not individual lines. Here comes the NOP instruction Code: // we'll use this #define NOP __asm("nop"); // NOP; printf("hella fine\n"); NOP; // See, that we put one NOP instruction before and after the printf statement. Now, we will use sed to extract all of the code between these two statements Code: $ objdump -w -SC --no-show-raw-insn -Mintel myExecutable \| sed -n '/NOP/,/NOP/p' NOP; 10de: nop 10df: lea rsi,[rip+0xf95] # 207b <_IO_stdin_used+0x7b> 10e6: mov edi,0x1 10eb: xor eax,eax 10ed: call 1090 <__printf_chk@plt> printf("hella fine\n"); NOP; AMAZING!!!! I have previously shown how to use objdump, this is an intuitive method to get better results. For ease of use, I have made a bash function out of this. Code: # objdump code between two markers # my executable file name is in file run.txt, so I can use this command anywhere # and I don't have to manually enter my executable file name # obseg() { objdump -w -SC --no-show-raw-insn -Mintel `cat run.txt` \| sed -n '/'"$1"'/,/'"$1"'/p' } Now, I can use it simply as: Code: $ obseg NOP You can use other markers instead of NOP, and it would work similarly. Hopefully, it was useful to some. arjun likes this.