Fortran->Assembly : How to remove redundant non-vectorized code?

aerosayan · November 17, 2020, 09:47

SUMMARY :

When gfortran vectorizes an equation such as x = y*z , the compiler will only use vector registers (such as YMM) and vector opcodes, if and only if the compiler knows that all of the data (x,y,z) can be loaded on said vector registers. However if the compiler doesn't know whether all of the data can be loaded onto the registers, it will generate redundant code such that the remaining data can be operated upon.

This redundant code will generally not use the highest available SIMD vector registers and will bloat the executable causing ICACHE misses or preventing function/subroutine inlining optimizations.

I would like to remove this redundant code since I know that all of my data (x,y,z) will fit perfectly on the YMM vector registers.

DETAILS :

Code compiled and run on Intel i3 CPU with AVX2 SIMD support.

I observed that for the sample operation x = y*z, with array lengths of 8 each, the compiler is able to generate only vectorized code for the equation x = y*z. This happens since 8 integers can be perfectly fit onto a AVX2 YMM register.

CODE :

Code:

! COMPILE : gfortran -g -O3 -mavx2 -march=skylake fortran_vectorization_1a.dis.F90
! DISASM  : objdump -d -SC -Mintel a.out

program main
implicit none
integer *4, dimension(8) :: x, y, z

! Initialize
x = 0
y = (/10, 11, 12, 13, 14, 15, 16, 17/)
z = (/1 ,  2,  3,  4,  5,  6,  7,  8/)

! Vector multiplication
x(:) = y*z

! Prevent dead code elimination
print *, x

end program

GODBOLT LINK : https://gcc.godbolt.org/z/GW9E5W

ASSEMBLY : Slightly cleaned up

Code:

00000000000011b0 <MAIN__>:
program main
    11b0:    55                       push   rbp
program main
    11b8:    48 89 e5                 mov    rbp,rsp
    11bb:    41 54                    push   r12
    11bd:    48 83 e4 e0              and    rsp,0xffffffffffffffe0
    11c1:    48 81 ec c0 02 00 00     sub    rsp,0x2c0
y = (/10, 11, 12, 13, 14, 15, 16, 17/)
    11c8:    c5 f9 6f 0d 90 0e 00     vmovdqa xmm1,XMMWORD PTR [rip+0xe90]        # 2060 <A.0.3879>
    11cf:    00 
    11d0:    c5 f9 6f 15 98 0e 00     vmovdqa xmm2,XMMWORD PTR [rip+0xe98]        # 2070 <A.0.3879+0x10>
    11d7:    00 
z = (/1 ,  2,  3,  4,  5,  6,  7,  8/)
    11d8:    c5 f9 6f 1d 60 0e 00     vmovdqa xmm3,XMMWORD PTR [rip+0xe60]        # 2040 <A.1.3881>
    11df:    00 
    11e0:    c5 f9 6f 25 68 0e 00     vmovdqa xmm4,XMMWORD PTR [rip+0xe68]        # 2050 <A.1.3881+0x10>
    11e7:    00 
y = (/10, 11, 12, 13, 14, 15, 16, 17/)
    11e8:    c5 f8 29 4c 24 20        vmovaps XMMWORD PTR [rsp+0x20],xmm1
    11ee:    c5 f8 29 54 24 30        vmovaps XMMWORD PTR [rsp+0x30],xmm2
x(:) = y*z
    11f4:    c5 fd 6f 6c 24 20        vmovdqa ymm5,YMMWORD PTR [rsp+0x20]
z = (/1 ,  2,  3,  4,  5,  6,  7,  8/)
    120a:    c5 f8 29 5c 24 40        vmovaps XMMWORD PTR [rsp+0x40],xmm3
    1210:    c5 f8 29 64 24 50        vmovaps XMMWORD PTR [rsp+0x50],xmm4
x(:) = y*z
    1223:    c4 e2 55 40 44 24 40     vpmulld ymm0,ymm5,YMMWORD PTR [rsp+0x40]
x(:) = y*z
    123d:    c5 fd 7f 04 24           vmovdqa YMMWORD PTR [rsp],ymm0

We can clearly see above that only the vector registers and vector operators were used for the x=y*z operation.

However when we increase the length of the arrays to 9, one integer can't be fit on a YMM vector register. So, the compiler has to generate extra non-vectorized code to take care of the last integer.

CODE :

Code:

! COMPILE : gfortran -g -O3 -mavx2 -march=skylake fortran_vectorization_1b.dis.F90
! DISASM  : objdump -d -SC -Mintel a.out

program main
implicit none
integer *4, dimension(9) :: x, y, z

! Initialize
x = 0
y = (/10, 11, 12, 13, 14, 15, 16, 17, 18/)
z = (/1 ,  2,  3,  4,  5,  6,  7,  8,  9/)

! Vector multiplication
x(:) = y*z

! Prevent dead code elimination
print *, x

end program

GODOT LINK : https://gcc.godbolt.org/z/bdrsa7

ASSEMBLY : Slightly cleaned up...

Code:

00000000000011b0 <MAIN__>:
program main
    11b0:    55                       push   rbp
    11b1:    48 89 e5                 mov    rbp,rsp
    11b4:    41 54                    push   r12
    11b6:    48 83 e4 e0              and    rsp,0xffffffffffffffe0
    11ba:    48 81 ec 00 03 00 00     sub    rsp,0x300
y = (/10, 11, 12, 13, 14, 15, 16, 17, 18/)
    11c1:    8b 15 d9 0e 00 00        mov    edx,DWORD PTR [rip+0xed9]        # 20a0 <A.0.3879+0x20>
z = (/1 ,  2,  3,  4,  5,  6,  7,  8,  9/)
    11c7:    8b 05 93 0e 00 00        mov    eax,DWORD PTR [rip+0xe93]        # 2060 <A.1.3881+0x20>
y = (/10, 11, 12, 13, 14, 15, 16, 17, 18/)
    11cd:    c5 f9 6f 0d ab 0e 00     vmovdqa xmm1,XMMWORD PTR [rip+0xeab]        # 2080 <A.0.3879>
    11d4:    00 
z = (/1 ,  2,  3,  4,  5,  6,  7,  8,  9/)
    11d5:    89 84 24 a0 00 00 00     mov    DWORD PTR [rsp+0xa0],eax
y = (/10, 11, 12, 13, 14, 15, 16, 17, 18/)
    11dc:    c5 f9 6f 15 ac 0e 00     vmovdqa xmm2,XMMWORD PTR [rip+0xeac]        # 2090 <A.0.3879+0x10>
    11e3:    00 
x(:) = y*z
    11e4:    0f af c2                 imul   eax,edx
z = (/1 ,  2,  3,  4,  5,  6,  7,  8,  9/)
    11e7:    c5 f9 6f 1d 51 0e 00     vmovdqa xmm3,XMMWORD PTR [rip+0xe51]        # 2040 <A.1.3881>
    11ee:    00 
    11ef:    c5 f9 6f 25 59 0e 00     vmovdqa xmm4,XMMWORD PTR [rip+0xe59]        # 2050 <A.1.3881+0x10>
    11f6:    00 
y = (/10, 11, 12, 13, 14, 15, 16, 17, 18/)
    11f7:    c5 f8 29 4c 24 40        vmovaps XMMWORD PTR [rsp+0x40],xmm1
    11fd:    c5 f8 29 54 24 50        vmovaps XMMWORD PTR [rsp+0x50],xmm2
x(:) = y*z
    1203:    89 44 24 20              mov    DWORD PTR [rsp+0x20],eax
    1207:    c5 fd 6f 6c 24 40        vmovdqa ymm5,YMMWORD PTR [rsp+0x40]
z = (/1 ,  2,  3,  4,  5,  6,  7,  8,  9/)
    1224:    c5 f8 29 9c 24 80 00     vmovaps XMMWORD PTR [rsp+0x80],xmm3
    122b:    00 00 
    122d:    c5 f8 29 a4 24 90 00     vmovaps XMMWORD PTR [rsp+0x90],xmm4
    1234:    00 00 
x(:) = y*z
    1243:    c4 e2 55 40 84 24 80     vpmulld ymm0,ymm5,YMMWORD PTR [rsp+0x80]
    124a:    00 00 00 
y = (/10, 11, 12, 13, 14, 15, 16, 17, 18/)
    124d:    89 54 24 60              mov    DWORD PTR [rsp+0x60],edx
x(:) = y*z
    1264:    c5 fd 7f 04 24           vmovdqa YMMWORD PTR [rsp],ymm0

We can clearly see above that the compiler had to use imul to take care of the integer multiplication that couldn't be fit onto the vector registers.

This is a good feature. However it becomes a problem when we use allocatable arrays. The compiler doesn't know the size of the array so it HAS TO generate both the vectorized and non-vectorized versions of the code. This can be seen in the next example.

CODE :

Code:

! COMPILE : gfortran -g -O3 -mavx2 -march=skylake fortran_vectorization_1c.dis.F90
! DISASM  : objdump -d -SC -Mintel --visualize-jumps a.out

program main
implicit none
integer *4, dimension(:), allocatable :: x, y, z
integer *4 :: n

read *, n

! Allocate
allocate(x(n))
allocate(y(n))
allocate(z(n))

! Initialize
x = 0
y(n/2) = n
z(n/2) = n/2

! Vector multiplication
x(:) = y*z

! Prevent dead code elimination
print *, x
print *, y
print *, z

! Scrub clean
deallocate(x,y,z)
end program

GODOT LINK : https://gcc.godbolt.org/z/1fdzK8

ASSEMBLY : The generated code is too large. I'm only posting the x=y*z part.

Code:

x(:) = y*z
    1498:    c4 c1 7e 6f 0c 07        vmovdqu ymm1,YMMWORD PTR [r15+rax*1]
    149e:    c4 c2 75 40 04 04        vpmulld ymm0,ymm1,YMMWORD PTR [r12+rax*1]
    14a4:    c4 c1 7e 7f 04 06        vmovdqu YMMWORD PTR [r14+rax*1],ymm0
    14aa:    48 83 c0 20              add    rax,0x20
    14ae:    48 39 c2                 cmp    rdx,rax
    14b1:    75 e5                    jne    1498 <MAIN__+0x278>
    14b3:    48 89 ca                 mov    rdx,rcx
    14b6:    48 83 e2 f8              and    rdx,0xfffffffffffffff8
    14ba:    48 8d 42 01              lea    rax,[rdx+0x1]
    14be:    48 39 ca                 cmp    rdx,rcx
    14c1:    0f 84 aa 01 00 00        je     1671 <MAIN__+0x451>
    14c7:    c5 f8 77                 vzeroupper 
    14ca:    48 8d 50 ff              lea    rdx,[rax-0x1]
    14ce:    41 8b 34 97              mov    esi,DWORD PTR [r15+rdx*4]
    14d2:    41 0f af 34 94           imul   esi,DWORD PTR [r12+rdx*4]
    14d7:    41 89 34 96              mov    DWORD PTR [r14+rdx*4],esi
    14db:    48 8d 50 01              lea    rdx,[rax+0x1]
    14df:    48 39 d1                 cmp    rcx,rdx
    14e2:    7c 7b                    jl     155f <MAIN__+0x33f>
    14e4:    41 8b 34 87              mov    esi,DWORD PTR [r15+rax*4]
    14e8:    41 0f af 34 84           imul   esi,DWORD PTR [r12+rax*4]
    14ed:    41 89 34 86              mov    DWORD PTR [r14+rax*4],esi
    14f1:    48 8d 70 02              lea    rsi,[rax+0x2]
    14f5:    48 39 f1                 cmp    rcx,rsi
    14f8:    7c 65                    jl     155f <MAIN__+0x33f>
    14fa:    41 8b 3c 97              mov    edi,DWORD PTR [r15+rdx*4]
    14fe:    41 0f af 3c 94           imul   edi,DWORD PTR [r12+rdx*4]
    1503:    41 89 3c 96              mov    DWORD PTR [r14+rdx*4],edi
    1507:    48 8d 50 03              lea    rdx,[rax+0x3]
    150b:    48 39 d1                 cmp    rcx,rdx
    150e:    7c 4f                    jl     155f <MAIN__+0x33f>
    1510:    41 8b 3c b7              mov    edi,DWORD PTR [r15+rsi*4]
    1514:    41 0f af 3c b4           imul   edi,DWORD PTR [r12+rsi*4]
    1519:    41 89 3c b6              mov    DWORD PTR [r14+rsi*4],edi
    151d:    48 8d 70 04              lea    rsi,[rax+0x4]
    1521:    48 39 f1                 cmp    rcx,rsi
    1524:    7c 39                    jl     155f <MAIN__+0x33f>
    1526:    41 8b 3c 97              mov    edi,DWORD PTR [r15+rdx*4]
    152a:    41 0f af 3c 94           imul   edi,DWORD PTR [r12+rdx*4]
    152f:    41 89 3c 96              mov    DWORD PTR [r14+rdx*4],edi
    1533:    48 8d 50 05              lea    rdx,[rax+0x5]
    1537:    48 39 d1                 cmp    rcx,rdx
    153a:    7c 23                    jl     155f <MAIN__+0x33f>
    153c:    41 8b 3c b7              mov    edi,DWORD PTR [r15+rsi*4]
    1540:    48 83 c0 06              add    rax,0x6
    1544:    41 0f af 3c b4           imul   edi,DWORD PTR [r12+rsi*4]
    1549:    41 89 3c b6              mov    DWORD PTR [r14+rsi*4],edi
    154d:    48 39 c1                 cmp    rcx,rax
    1550:    7c 0d                    jl     155f <MAIN__+0x33f>
    1552:    41 8b 04 94              mov    eax,DWORD PTR [r12+rdx*4]
    1556:    41 0f af 04 97           imul   eax,DWORD PTR [r15+rdx*4]
    155b:    41 89 04 96              mov    DWORD PTR [r14+rdx*4],eax

We can clearly see that at the top, the vector registers and operations were used. However, since the compiler HAS TO be sure that it will work on arrays of all lengths, the compiler will generate 7 more IMUL operations to take care of 7 possible integers that couldn't be fit onto a AVX2 register.

We can clearly see how much bloat is generated. We know that we will only pass in an array of integers which has length of 8 or multiples of 8 to this vector operation. So, we don't actually need the rest of the 7 redundant IMUL operations. However since the arrays x,y,z are allocatalbe, the compiler doesn't know this.

Is there anyway to tell the compiler to only generate vectorized code like we see in the first case?

CONCLUSION:

The three examples given above show that the compiler generates bloat when it doesn't know for sure that all of the data can be fit perfectly on AVX2 vector registers.

This seems like a good opportunity for optimization.

I would even call it necessary. Just look at the bloat generated for the following equation :

Code:

! Find the area of a triangle by using cross product
 vols = 0.5*sqrt((x1*(y2-y3))**2.0 + (x2*(y3-y1))**2.0 + (x3*(y1-y2))**2.0)

GODOT LINK : https://gcc.godbolt.org/z/oxcWzd

Thanks

aerosayan · November 17, 2020, 14:26

So I was able to use GODBOLT to try my code in Intel's Fortran compiler (x86-64 IFORT 21.1.9) and was able to generate satisfactory assembly code.

I don't have access to an IFORT compiler, so I can't use it, I'm still looking for how I could achieve such results in GFORTRAN compiler.

Here's what I did :

Looked up this Intel article "Data Allignment to Assist Vectorization" by Rakesh Krishnaiyer.

URL : https://software.intel.com/content/www/us/en/develop/articles/data-alignment-to-assist-vectorization.html

ARCHIVE : https://archive.vn/RHEIe

I saw that we could give the compiler some information about the value N, so that it would be able to optimize the code based on it. As shown in the first example, we should get purely vectorized code when N%8==0. This is due to the YMM vector registers being able to hold 8 integers.

So, I used !DIR$ ASSUME_ALIGNED X: 32 to tell the compiler that we have X aligned on 32.

Then I used !DIR$ ASSUME (MOD(N,8) .EQ. 0) to tell the compiler that N will be a multiple of 8.

I kept rest of the code same.

Code:

program main
implicitnone
integer *4 :: n, i
integer *4, dimension(:), allocatable :: x, y, z

read *, n

! Allocate
allocate(x(n))
allocate(y(n))
allocate(z(n))

! Initialize
x = 0
y(n/2) = n
z(n/2) = n/2

! Vector multiplication
!DIR$ ASSUME_ALIGNED X: 32
!DIR$ ASSUME (MOD(N,8) .EQ. 0)
do i=1,n
    x(i) = y(i)*z(i)
enddo

! Prevent dead code elimination
print *, x
print *, y
print *, z

! Scrub clean
deallocate(x,y,z)
endprogram

The generated assembly code for the loop of x(i) = y(i)*z(i) was :

Code:

..B1.15:
vmovdqu  ymm0, YMMWORD PTR [r13+r12*4]
vpmulld  ymm1, ymm0, YMMWORD PTR [r14+r12*4]
vmovdqu  YMMWORD PTR [rax+r12*4], ymm1
add      r12, 8
cmp      r12, rcx
jb       ..B1.15

Here's the live GODBOLT link for the code shown above : https://gcc.godbolt.org/z/PPf1zE
Here's the live GODBOLT link for the code to calculate the volume of triangles using cross product : https://gcc.godbolt.org/z/Y5orh1

Now that I was able to optimize the code in IFORT, I know that it is a viable optimization. Unfortunately, I still don't know how to do it in GFORTRAN. Hopefully, someone will be able to help if they know more about GFORTRAN's and GCC's compiler directives.

I will keep looking too.

aerosayan · November 18, 2020, 02:26

I optimized the solution for GFORTRAN compiler, but I'm not too happy with it.

The solution is to let the compiler know that the loop ends at a multiple of 8. Here's how I did it.

Code:

! Vector multiplication
do i=1,AND(n,-8)
    x(i) = y(i)*z(i)
end do

The equation AND(N,-8) is mathematically equivalent to N - MOD(N,8).

This lets the compiler know that it doesn't need to generate the redundant non-vectorized code. However I'm not too happy with it since gfortran should have compiler directives to make this easier.

Anyway, here's the full code.

Code:

! COMPILE : gfortran -g -O3 -mavx2 -march=skylake fortran_vectorization_1g.dis.F90
! DISASM  : objdump -d -SC -Mintel --visualize-jumps a.out

program main
implicit none
integer *4, dimension(:), allocatable :: x, y, z
integer *4 :: n, i

read *, n

! Allocate
allocate(x(n))
allocate(y(n))
allocate(z(n))

! Initialize
do i=1,n
    x(i) = 0
    y(i) = i
    z(i) = 2*i
end do

! Loop truncation math
print *, "AND(N,-8)    :", AND(N,-8)
print *, "N - MOD(N,8) :", N - MOD(N,8)

! Vector multiplication
do i=1,AND(n,-8)
    x(i) = y(i)*z(i)
end do


! Prevent dead code elimination
print *, "RESULT       :"
do i=1,n
    print *, y(i), "         x", z(i) ,"=", x(i)
end do

! Scrub clean
deallocate(x,y,z)
end program

Here's the live GODBOLT link : https://gcc.godbolt.org/z/MW7cPG

Here's the assembly code of the required operation x(i) = y(i)*z(i) :

Code:

    x(i) = y(i)*z(i)
    15a8:    48 8b 4c 24 18           mov    rcx,QWORD PTR [rsp+0x18]
    15ad:    c5 fe 6f 24 01           vmovdqu ymm4,YMMWORD PTR [rcx+rax*1]
    15b2:    48 8b 4c 24 10           mov    rcx,QWORD PTR [rsp+0x10]
    15b7:    c5 fd 7f 64 24 20        vmovdqa YMMWORD PTR [rsp+0x20],ymm4
    15bd:    c4 e2 5d 40 04 01        vpmulld ymm0,ymm4,YMMWORD PTR [rcx+rax*1]
    15c3:    48 8b 4c 24 08           mov    rcx,QWORD PTR [rsp+0x8]
    15c8:    c5 fe 7f 04 01           vmovdqu YMMWORD PTR [rcx+rax*1],ymm0
    15cd:    48 83 c0 20              add    rax,0x20
    15d1:    48 39 d0                 cmp    rax,rdx
    15d4:    75 d2                    jne    15a8 <MAIN__+0x378>

We can clearly see that only vector operations were used. However I'm not too happy with the MOV operations in there. I'll look at it later.

Running the executable and entering N=8 gives the result:

Code:

8
 AND(N,-8)    :           8
 N - MOD(N,8) :           8
 RESULT       :
           1          x           2 =           2
           2          x           4 =           8
           3          x           6 =          18
           4          x           8 =          32
           5          x          10 =          50
           6          x          12 =          72
           7          x          14 =          98
           8          x          16 =         128

Running the executable and entering N=31 gives the result :

Code:

31
 AND(N,-8)    :          24
 N - MOD(N,8) :          24
 RESULT       :
           1          x           2 =           2
           2          x           4 =           8
           3          x           6 =          18
           4          x           8 =          32
           5          x          10 =          50
           6          x          12 =          72
           7          x          14 =          98
           8          x          16 =         128
           9          x          18 =         162
          10          x          20 =         200
          11          x          22 =         242
          12          x          24 =         288
          13          x          26 =         338
          14          x          28 =         392
          15          x          30 =         450
          16          x          32 =         512
          17          x          34 =         578
          18          x          36 =         648
          19          x          38 =         722
          20          x          40 =         800
          21          x          42 =         882
          22          x          44 =         968
          23          x          46 =        1058
          24          x          48 =        1152
          25          x          50 =           0
          26          x          52 =           0
          27          x          54 =           0
          28          x          56 =           0
          29          x          58 =           0
          30          x          60 =           0
          31          x          62 =           0

We can clearly see that AND(N,-8) == N - MOD(N,8). And we can clearly see that the vector operation happened only till the I==24.

This shows that the optimization was a success.

Also observed HUGE optimization for the code that calculates volume of triangles : https://gcc.godbolt.org/z/1or59d

However, I'm looking for input from other Fortran experts who know if we can achieve this result in an easier way as done in IFORT.

sbaffini · November 18, 2020, 09:22

You are certainly learning something, and I am certainly not an expert, but never were any of my codes so good that the next thing to optimize was at this level.

May I ask what sort of improvements are you achieving for your test code on triangle areas and what, if any, in your actual code? Is it actually worth the effort?

aerosayan · November 18, 2020, 10:34

Quote:

Originally Posted by sbaffini

You are certainly learning something, and I am certainly not an expert, but never were any of my codes so good that the next thing to optimize was at this level.

May I ask what sort of improvements are you achieving for your test code on triangle areas and what, if any, in your actual code? Is it actually worth the effort?

I was just playing around when I observed this code bloat. I was implementing the triangle volume calculation for my CFD solver when I observed that the actual AVX2 vectorized assembly code (which uses YMM registers) was only around 14 lines while the generated SSE vectorized redundant code (which uses XMM registers) was around 7*14 lines. And the volume calculation equation wasn't even that complex.

Imagine how much amount of bloat would be present in a generic number crunching function/subroutine.

The method described here has tremendous advantages and applications:

- Almost all of my vectorized operations can now be supercharged by removing the redundant code. All I need to do, is to ensure that I fill up the YMM registers completely. For example : If there are 31 cells, I will add an extra dummy cell to make the cell list have 32 elements.

- This would allow less Instruction Cache misses and would allow the compiler to basically inline a lot of functions/subroutines, thus removing any cost of function/subroutine call. It would make sense to use such an optimization when we know that everything else will be accelerated.

- I just used the volume calculation as an example, since it was sufficiently complicated and it was the first equation that I tried to optimize. Since ALL our cell data (eg, rho, u, v, E etc) is organized in 1D arrays, we can SIMD vectorize most of our CFD solver code (I agree it is tough to do and I need more experience with it).

- The optimization is surprisingly portable. A YMM register would fit four 64 bit doubles. To get the same optimization, I just need to use AND(N,-4). For 32 bit floats, I would need AND(N,-8) since eight floats can be fit on a YMM register. When I need to run my code on a machine that supports AVX512, I would need to simply change the -8 and -4 part. This can be done using a #define macro. If the machine doesn't support AVX2, and only supports SSE, the code will still compile and run if we define the optimization as AND(N, MACRO_SIMD_MAX_PACKED_ELEMENTS_BLAH_BLAH).

aerosayan · November 19, 2020, 12:24

I did some testing to see if I could get the same optimization while continuing to use Fortran's amazing array slice operations.

Don't wanna write ugly code with xyz(i) 's everywhere.

Here's how to do it : We just need to explicitly define the range of operation i.e 1,AND(N,-8) and the compiler understands that it can optimize.

Code:

! Vector operartion
vols(1:AND(N,-8)) = 0.5*sqrt((x1*(y2-y3))**2.0 + (x2*(y3-y1))**2.0 + (x3*(y1-y2))**2.0)

GODBOLT LINK : https://gcc.godbolt.org/z/7PoKzf

Assembly code generated :

Code:

        and     ebx, -8
        jle     .L6
        movsx   rbx, ebx
        vmovaps ymm4, YMMWORD PTR .LC2[rip]
        sal     rbx, 2
        xor     eax, eax
.L9:
        vmovups ymm3, YMMWORD PTR [r9+rax]
        vmovups ymm1, YMMWORD PTR [r11+rax]
        vmovups ymm0, YMMWORD PTR [r15+rax]
        vsubps  ymm2, ymm3, ymm1
        vsubps  ymm1, ymm1, ymm0
        vsubps  ymm0, ymm0, ymm3
        vmulps  ymm2, ymm2, YMMWORD PTR [r13+0+rax]
        vmulps  ymm1, ymm1, YMMWORD PTR [r14+rax]
        vmulps  ymm0, ymm0, YMMWORD PTR [r10+rax]
        vmulps  ymm1, ymm1, ymm1
        vfmadd231ps     ymm1, ymm2, ymm2
        vfmadd132ps     ymm0, ymm1, ymm0
        vsqrtps ymm0, ymm0
        vmulps  ymm0, ymm0, ymm4
        vmovups YMMWORD PTR [r12+rax], ymm0
        add     rax, 32
        cmp     rax, rbx
        jne     .L9

pcosta · November 20, 2020, 05:37

https://fortran-lang.discourse.group/

this forum may be better suited for a detailed discussion about the Fortran programming language.

November 17, 2020, 14:26		#2
aerosayan Senior Member Sayan Bhattacharjee Join Date: Mar 2020 Posts: 495 Rep Power: 8	So I was able to use GODBOLT to try my code in Intel's Fortran compiler (x86-64 IFORT 21.1.9) and was able to generate satisfactory assembly code. I don't have access to an IFORT compiler, so I can't use it, I'm still looking for how I could achieve such results in GFORTRAN compiler. Here's what I did : Looked up this Intel article "Data Allignment to Assist Vectorization" by Rakesh Krishnaiyer. URL : https://software.intel.com/content/www/us/en/develop/articles/data-alignment-to-assist-vectorization.html ARCHIVE : https://archive.vn/RHEIe I saw that we could give the compiler some information about the value N, so that it would be able to optimize the code based on it. As shown in the first example, we should get purely vectorized code when N%8==0. This is due to the YMM vector registers being able to hold 8 integers. So, I used !DIR$ ASSUME_ALIGNED X: 32 to tell the compiler that we have X aligned on 32. Then I used !DIR$ ASSUME (MOD(N,8) .EQ. 0) to tell the compiler that N will be a multiple of 8. I kept rest of the code same. Code: program main implicitnone integer 4 :: n, i integer 4, dimension(:), allocatable :: x, y, z read , n ! Allocate allocate(x(n)) allocate(y(n)) allocate(z(n)) ! Initialize x = 0 y(n/2) = n z(n/2) = n/2 ! Vector multiplication !DIR$ ASSUME_ALIGNED X: 32 !DIR$ ASSUME (MOD(N,8) .EQ. 0) do i=1,n x(i) = y(i)z(i) enddo ! Prevent dead code elimination print , x print , y print , z ! Scrub clean deallocate(x,y,z) endprogram The generated assembly code for the loop of x(i) = y(i)z(i) was : Code: ..B1.15: vmovdqu ymm0, YMMWORD PTR [r13+r124] vpmulld ymm1, ymm0, YMMWORD PTR [r14+r124] vmovdqu YMMWORD PTR [rax+r124], ymm1 add r12, 8 cmp r12, rcx jb ..B1.15 Here's the live GODBOLT link for the code shown above : https://gcc.godbolt.org/z/PPf1zE Here's the live GODBOLT link for the code to calculate the volume of triangles using cross product : https://gcc.godbolt.org/z/Y5orh1 Now that I was able to optimize the code in IFORT, I know that it is a viable optimization. Unfortunately, I still don't know how to do it in GFORTRAN. Hopefully, someone will be able to help if they know more about GFORTRAN's and GCC's compiler directives. I will keep looking too. Last edited by aerosayan; November 17, 2020 at 14:46. Reason: Added godbolt link*

November 20, 2020, 05:37		#7
pcosta New Member Pedro Costa Join Date: Jul 2017 Posts: 9 Rep Power: 9	https://fortran-lang.discourse.group/ this forum may be better suited for a detailed discussion about the Fortran programming language. sbaffini likes this.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Patankar CFD FORTRAN 90 Code FVM	siddiquesil	Main CFD Forum	19	October 25, 2022 05:21
Parallel Fortran 95 code using Openmp	Shiv1510	Main CFD Forum	5	February 14, 2020 11:36
CGNS fortran code compiling with intel visual fortran xe	dokeun	Main CFD Forum	1	April 5, 2012 21:01
Comparison between C/C++ and Fortran?	rick	Main CFD Forum	45	September 6, 2011 00:52
CFD Code with Fortran	murat	Main CFD Forum	0	October 11, 2009 17:05

November 18, 2020, 09:22		#4
sbaffini Senior Member Paolo Lampitella Join Date: Mar 2009 Location: Italy Posts: 2,168 Blog Entries: 29 Rep Power: 39	You are certainly learning something, and I am certainly not an expert, but never were any of my codes so good that the next thing to optimize was at this level. May I ask what sort of improvements are you achieving for your test code on triangle areas and what, if any, in your actual code? Is it actually worth the effort?