Assembly 3: Optimizations and assembly

Compiler optimizations

Argument elision

A compiler may decide to elide (or remove) certain operations setting up function call arguments, if it can decide that the registers containing these arguments will hold the correct value before the function call takes place. Let's see an example of a function disassembled function f in f31.s:

subq    $8, %rsp
call    _Z1gi@PLT
addq    $8, %rsp
addl    $1, %eax
ret

This function calls another function g, adds 1 to g's return value, and returns that value.

It is possible that the function has the following definition in C++:

int f() {
    return 1 + g();
}

However, the actual definition of f in f31.cc is:

int f(x) {
    return 1 + g(x);
}

The compiler realizes that the argument to function g, which is passed via register %rdi, already has the right value when g is called, so it doesn't bother doing anything about it. This is one example of numerous optimizations a compiler can perform to reduce the size of generated code.

Inlining

A compiler may also copy the body of function to its call site, instead of doing an explicit function call, when it decides that the overhead of performing a function call outweights the overhead of doing this copy. For example, if we have a function g defined as g(x) = 2 + x, and f is defined as f(x) = 1 + g(x), then the compiler may actually generate f(x) as simply 3 + x, without inserting any call instructions. In assembly terms, function g will look like

leal    2(%rdi), %eax
ret

and f will simply be

leal    3(%rdi), %eax
ret

Tail call elimination

Let's look at another example in f32.s:

addl    $1, %edi
jmp _Z1gi@PLT

This function doesn't even contain a ret instruction! What is going on? Let's take a look at the actual definition of f, in f32.cc:

int f(int x) {
    return g(x + 1);
}

Note that the call to function g is the last operation in function f, and the return value of f is just the return value of the invocation of g. In this case the compiler can perform a tail call elimination: instead of calling g explicitly, it can simply jump to g and have g return to the same address that f would have returned to.

A tail call elimination may occur if a function (caller) ends with another function call (callee) and performs no cleanup once the callee returns. In this case the caller and simply jump to the callee, instead of doing an explicit call.

Loop unrolling

Before we jump into loop unrolling, let's take a small excursion into an aspect of calling conventions called caller/callee-saved registers. This will help us under the sample program in f33.s better.

Calling conventions: caller/callee-saved registers

Let's look at the function definition in f33.s:

    pushq   %r12
    pushq   %rbp
    pushq   %rbx
    testl   %edi, %edi
    je  .L4
    movl    %edi, %r12d
    movl    $0, %ebx
    movl    $0, %ebp
.L3:
    movl    %ebx, %edi
    call    _Z1gj@PLT
    addl    %eax, %ebp
    addl    $1, %ebx
    cmpl    %ebx, %r12d
    jne .L3
.L1:
    movl    %ebp, %eax
    popq    %rbx
    popq    %rbp
    popq    %r12
    ret
.L4:
    movl    %edi, %ebp
    jmp .L1

From the assembly we can tell that the backwards jump to .L3 is likely a loop. The loop index is in %ebx and the loop bound is in %r12d. Note that upon entry to the function we first moved the value %rdi to %r12d. This is necessary because in the loop f calls g, and %rdi is used to pass arguments to g, so we must move its value to a different register to used it as the loop bound (this case %r12). But there is more to this: the compiler also needs to ensure that this register's value is preserved across function calls. Calling conventions dictate that certain registers always exhibit this property, and they are called callee-saved registers. If a register is callee-saved, then the caller doesn't have to save its value before entering a function call.

We note that upon entry to the function, f saved a bunch of registers by pushing them to the stack: %r12, %rbp, %rbx. It is because all these registers are callee-saved registers, and f uses them during the function call. In general, the following registers in x86_64 are callee-saved:
%rbx, %r12-%r15, %rbp, %rsp (%rip)

All the other registers are caller-saved, which means the callee doesn't have to preserve their values. If the caller wants to reuse values in these registers across function calls, it will have to explicitly save and restore these registers. In general, the following registers in x86_64 are caller-saved:
%rax, %rcx, %rdx, %r8-%r11

Now let's get back to loop unrolling. Let us a look at the program in f34.s:

    testl   %edi, %edi
    je  .L7
    leal    -1(%rdi), %eax
    cmpl    $7, %eax
    jbe .L8
    pxor    %xmm0, %xmm0
    movl    %edi, %edx
    xorl    %eax, %eax
    movdqa  .LC0(%rip), %xmm1
    shrl    $2, %edx
    movdqa  .LC1(%rip), %xmm2
.L5:
    addl    $1, %eax
    paddd   %xmm1, %xmm0
    paddd   %xmm2, %xmm1
    cmpl    %edx, %eax
    jb  .L5
    movdqa  %xmm0, %xmm1
    movl    %edi, %edx
    andl    $-4, %edx
    psrldq  $8, %xmm1
    paddd   %xmm1, %xmm0
    movdqa  %xmm0, %xmm1
    cmpl    %edx, %edi
    psrldq  $4, %xmm1
    paddd   %xmm1, %xmm0
    movd    %xmm0, %eax
    je  .L10
.L3:
    leal    1(%rdx), %ecx
    addl    %edx, %eax
    cmpl    %ecx, %edi
    je  .L1
    addl    %ecx, %eax
    leal    2(%rdx), %ecx
    cmpl    %ecx, %edi
    je  .L1
    addl    %ecx, %eax
    leal    3(%rdx), %ecx
    cmpl    %ecx, %edi
    je  .L1
    addl    %ecx, %eax
    leal    4(%rdx), %ecx
    cmpl    %ecx, %edi
    je  .L1
    addl    %ecx, %eax
    leal    5(%rdx), %ecx
    cmpl    %ecx, %edi
    je  .L1
    addl    %ecx, %eax
    leal    6(%rdx), %ecx
    cmpl    %ecx, %edi
    je  .L1
    addl    %ecx, %eax
    addl    $7, %edx
    leal    (%rax,%rdx), %ecx
    cmpl    %edx, %edi
    cmovne  %ecx, %eax
    ret
.L7:
    xorl    %eax, %eax
.L1:
    rep ret
.L10:
    rep ret
.L8:
    xorl    %edx, %edx
    xorl    %eax, %eax
    jmp .L3

Wow this looks long and repetitive! Especially the section under label .L3! If we take a look at the original function definition in f34.cc, we will find that it's almost the same as f33.cc, except that in f34.cc we know the definition of g as well. With knowledge of what g does the compiler's optimizer decides that unrolling the loop into 7-increment batches results in faster code.

Code like this can become difficult to understand, especially when the compiler begins to use more advanced registers reserved for vector operations. We can fine-tune the optimizer to disable certain optimizations. For example, we can use the -mno-sse -fno-unroll-loops compiler options to disable the use of SSE registers and loop unrolling. The resulting code, in f35.s, for the same function definitions in f34.cc, becomes much easier to understand:

    testl   %edi, %edi
    je  .L4
    xorl    %edx, %edx
    xorl    %eax, %eax
.L3:
    addl    %edx, %eax
    addl    $1, %edx
    cmpl    %edx, %edi
    jne .L3
    rep ret
.L4:
    xorl    %eax, %eax
    ret

Note that the compiler still performed inlining to eliminate function g.

Optimizing recursive functions

Let's look at the following recursive function in f36.cc:

int f(int x) {
    if (x > 0) {
        return x * f(x - 1);
    } else {
        return 0;
    }
}

At the first glance it may seem that the function returns factorial of x. But it actually returns 0. Despite it doing a series of multiplications, in the end it always multiplies the whole result with 0, which produces 0.

When we compile this function to assembly without much optimization, we see the expensive computation occurring:

    movl    $0, %eax
    testl   %edi, %edi
    jg  .L8
    rep ret
.L8:
    pushq   %rbx
    movl    %edi, %ebx
    leal    -1(%rdi), %edi
    call    _Z1fi
    imull   %ebx, %eax
    popq    %rbx
    ret

In f37.cc there is an actual factorial function:

int f(int x) {
    if (x > 0) {
        return x * f(x - 1);
    } else {
        return 1;
    }
}

If we compile this function using level-2 optimization (-O2), we get the following assembly:

    testl   %edi, %edi
    movl    $1, %eax
    jle .L4
.L3:
    imull   %edi, %eax
    subl    $1, %edi
    jne .L3
    rep ret
.L4:
    rep ret

There is no call instructions again! The compiler has transformed the recursive function into a loop.

If we revisit our "fake" factorial function that always returns 0, and compile it with -O2, we see yet more evidence of compiler's deep understanding of our program:

xorl    %eax, %eax
ret

Optimizing arithmetic operations

The assembly code in f39.s looks like this:

leal    (%rdi,%rdi,2), %eax
leal    (%rdi,%rax,4), %eax
ret

It looks like some rather complex address computations! The first leal instruction basically loads %eax with value 3*%rdi (or %rdi + 2*%rdi). The second leal multiplies the previous result by another 4, and adds another %rdi to it. So what it actually does is 3*%rdi*4 + %rdi, or simply 13*%rdi. This is also revealed in the function name in f39.s.

The compiler choose to use leal instructions instead of an explicit multiply because the two leal instructions actually take less space.