# Assembly 3: Optimizations and assembly

## Compiler optimizations

### Argument elision

A compiler may decide to elide (or remove) certain operations setting up function call arguments, if it can decide that the registers containing these arguments will hold the correct value before the function call takes place. Let's see an example of a function disassembled function f in f31.s:

subq    $8, %rsp call _Z1gi@PLT addq$8, %rsp
addl    $1, %eax ret This function calls another function g, adds 1 to g's return value, and returns that value. It is possible that the function has the following definition in C++: int f() { return 1 + g(); }  However, the actual definition of f in f31.cc is: int f(x) { return 1 + g(x); }  The compiler realizes that the argument to function g, which is passed via register %rdi, already has the right value when g is called, so it doesn't bother doing anything about it. This is one example of numerous optimizations a compiler can perform to reduce the size of generated code. ### Inlining A compiler may also copy the body of function to its call site, instead of doing an explicit function call, when it decides that the overhead of performing a function call outweights the overhead of doing this copy. For example, if we have a function g defined as g(x) = 2 + x, and f is defined as f(x) = 1 + g(x), then the compiler may actually generate f(x) as simply 3 + x, without inserting any call instructions. In assembly terms, function g will look like leal 2(%rdi), %eax ret and f will simply be leal 3(%rdi), %eax ret ### Tail call elimination Let's look at another example in f32.s: addl$1, %edi
jmp _Z1gi@PLT

This function doesn't even contain a ret instruction! What is going on? Let's take a look at the actual definition of f, in f32.cc:

int f(int x) {
return g(x + 1);
}


Note that the call to function g is the last operation in function f, and the return value of f is just the return value of the invocation of g. In this case the compiler can perform a tail call elimination: instead of calling g explicitly, it can simply jump to g and have g return to the same address that f would have returned to.

A tail call elimination may occur if a function (caller) ends with another function call (callee) and performs no cleanup once the callee returns. In this case the caller and simply jump to the callee, instead of doing an explicit call.

### Loop unrolling

Before we jump into loop unrolling, let's take a small excursion into an aspect of calling conventions called caller/callee-saved registers. This will help us under the sample program in f33.s better.

#### Calling conventions: caller/callee-saved registers

Let's look at the function definition in f33.s:

    pushq   %r12
pushq   %rbp
pushq   %rbx
testl   %edi, %edi
je  .L4
movl    %edi, %r12d
movl    $0, %ebx movl$0, %ebp
.L3:
movl    %ebx, %edi
call    _Z1gj@PLT
addl    %eax, %ebp
addl    $1, %ebx cmpl %ebx, %r12d jne .L3 .L1: movl %ebp, %eax popq %rbx popq %rbp popq %r12 ret .L4: movl %edi, %ebp jmp .L1 From the assembly we can tell that the backwards jump to .L3 is likely a loop. The loop index is in %ebx and the loop bound is in %r12d. Note that upon entry to the function we first moved the value %rdi to %r12d. This is necessary because in the loop f calls g, and %rdi is used to pass arguments to g, so we must move its value to a different register to used it as the loop bound (this case %r12). But there is more to this: the compiler also needs to ensure that this register's value is preserved across function calls. Calling conventions dictate that certain registers always exhibit this property, and they are called callee-saved registers. If a register is callee-saved, then the caller doesn't have to save its value before entering a function call. We note that upon entry to the function, f saved a bunch of registers by pushing them to the stack: %r12, %rbp, %rbx. It is because all these registers are callee-saved registers, and f uses them during the function call. In general, the following registers in x86_64 are callee-saved: %rbx, %r12-%r15, %rbp, %rsp (%rip) All the other registers are caller-saved, which means the callee doesn't have to preserve their values. If the caller wants to reuse values in these registers across function calls, it will have to explicitly save and restore these registers. In general, the following registers in x86_64 are caller-saved: %rax, %rcx, %rdx, %r8-%r11 Now let's get back to loop unrolling. Let us a look at the program in f34.s:  testl %edi, %edi je .L7 leal -1(%rdi), %eax cmpl$7, %eax
jbe .L8
pxor    %xmm0, %xmm0
movl    %edi, %edx
xorl    %eax, %eax
movdqa  .LC0(%rip), %xmm1
shrl    $2, %edx movdqa .LC1(%rip), %xmm2 .L5: addl$1, %eax
paddd   %xmm1, %xmm0
paddd   %xmm2, %xmm1
cmpl    %edx, %eax
jb  .L5
movdqa  %xmm0, %xmm1
movl    %edi, %edx
andl    $-4, %edx psrldq$8, %xmm1
paddd   %xmm1, %xmm0
movdqa  %xmm0, %xmm1
cmpl    %edx, %edi
psrldq  $4, %xmm1 paddd %xmm1, %xmm0 movd %xmm0, %eax je .L10 .L3: leal 1(%rdx), %ecx addl %edx, %eax cmpl %ecx, %edi je .L1 addl %ecx, %eax leal 2(%rdx), %ecx cmpl %ecx, %edi je .L1 addl %ecx, %eax leal 3(%rdx), %ecx cmpl %ecx, %edi je .L1 addl %ecx, %eax leal 4(%rdx), %ecx cmpl %ecx, %edi je .L1 addl %ecx, %eax leal 5(%rdx), %ecx cmpl %ecx, %edi je .L1 addl %ecx, %eax leal 6(%rdx), %ecx cmpl %ecx, %edi je .L1 addl %ecx, %eax addl$7, %edx
leal    (%rax,%rdx), %ecx
cmpl    %edx, %edi
cmovne  %ecx, %eax
ret
.L7:
xorl    %eax, %eax
.L1:
rep ret
.L10:
rep ret
.L8:
xorl    %edx, %edx
xorl    %eax, %eax
jmp .L3

Wow this looks long and repetitive! Especially the section under label .L3! If we take a look at the original function definition in f34.cc, we will find that it's almost the same as f33.cc, except that in f34.cc we know the definition of g as well. With knowledge of what g does the compiler's optimizer decides that unrolling the loop into 7-increment batches results in faster code.

Code like this can become difficult to understand, especially when the compiler begins to use more advanced registers reserved for vector operations. We can fine-tune the optimizer to disable certain optimizations. For example, we can use the -mno-sse -fno-unroll-loops compiler options to disable the use of SSE registers and loop unrolling. The resulting code, in f35.s, for the same function definitions in f34.cc, becomes much easier to understand:

    testl   %edi, %edi
je  .L4
xorl    %edx, %edx
xorl    %eax, %eax
.L3:
addl    %edx, %eax
addl    $1, %edx cmpl %edx, %edi jne .L3 rep ret .L4: xorl %eax, %eax ret Note that the compiler still performed inlining to eliminate function g. ### Optimizing recursive functions Let's look at the following recursive function in f36.cc: int f(int x) { if (x > 0) { return x * f(x - 1); } else { return 0; } }  At the first glance it may seem that the function returns factorial of x. But it actually returns 0. Despite it doing a series of multiplications, in the end it always multiplies the whole result with 0, which produces 0. When we compile this function to assembly without much optimization, we see the expensive computation occurring:  movl$0, %eax
testl   %edi, %edi
jg  .L8
rep ret
.L8:
pushq   %rbx
movl    %edi, %ebx
leal    -1(%rdi), %edi
call    _Z1fi
imull   %ebx, %eax
popq    %rbx
ret

In f37.cc there is an actual factorial function:

int f(int x) {
if (x > 0) {
return x * f(x - 1);
} else {
return 1;
}
}


If we compile this function using level-2 optimization (-O2), we get the following assembly:

    testl   %edi, %edi
movl    $1, %eax jle .L4 .L3: imull %edi, %eax subl$1, %edi
jne .L3
rep ret
.L4:
rep ret

There is no call instructions again! The compiler has transformed the recursive function into a loop.

If we revisit our "fake" factorial function that always returns 0, and compile it with -O2, we see yet more evidence of compiler's deep understanding of our program:

xorl    %eax, %eax
ret

### Optimizing arithmetic operations

The assembly code in f39.s looks like this:

leal    (%rdi,%rdi,2), %eax
leal    (%rdi,%rax,4), %eax
ret

It looks like some rather complex address computations! The first leal instruction basically loads %eax with value 3*%rdi (or %rdi + 2*%rdi). The second leal multiplies the previous result by another 4, and adds another %rdi to it. So what it actually does is 3*%rdi*4 + %rdi, or simply 13*%rdi. This is also revealed in the function name in f39.s.

The compiler choose to use leal instructions instead of an explicit multiply because the two leal instructions actually take less space.