Compiler optimizations
Argument elision
A compiler may decide to elide (or remove) certain operations setting up
function call arguments, if it can decide that the registers containing these
arguments will hold the correct value before the function call takes place.
Let's see an example of a function disassembled function f
in f31.s
:
subq $8, %rsp
call _Z1gi@PLT
addq $8, %rsp
addl $1, %eax
ret
This function calls another function g
, adds 1 to g
's return value, and
returns that value.
It is possible that the function has the following definition in C++:
int f() {
return 1 + g();
}
However, the actual definition of f
in f31.cc
is:
int f(x) {
return 1 + g(x);
}
The compiler realizes that the argument to function g
, which is passed via
register %rdi
, already has the right value when g
is called, so it doesn't
bother doing anything about it. This is one example of numerous optimizations
a compiler can perform to reduce the size of generated code.
Inlining
A compiler may also copy the body of function to its call site, instead of
doing an explicit function call, when it decides that the overhead of
performing a function call outweights the overhead of doing this copy. For
example, if we have a function g
defined as g(x) = 2 + x
, and f
is
defined as f(x) = 1 + g(x)
, then the compiler may actually generate f(x)
as simply 3 + x
, without inserting any call
instructions. In assembly
terms, function g
will look like
leal 2(%rdi), %eax
ret
and f
will simply be
leal 3(%rdi), %eax
ret
Tail call elimination
Let's look at another example in f32.s
:
addl $1, %edi
jmp _Z1gi@PLT
This function doesn't even contain a ret
instruction! What is going on?
Let's take a look at the actual definition of f
, in f32.cc
:
int f(int x) {
return g(x + 1);
}
Note that the call to function g
is the last operation in function f
, and
the return value of f
is just the return value of the invocation of g
.
In this case the compiler can perform a tail call elimination: instead of
calling g
explicitly, it can simply jump to g
and have g
return to the
same address that f
would have returned to.
A tail call elimination may occur if a function (caller) ends with another function call (callee) and performs no cleanup once the callee returns. In this case the caller and simply jump to the callee, instead of doing an explicit call.
Loop unrolling
Before we jump into loop unrolling, let's take a small excursion into an
aspect of calling conventions called caller/callee-saved registers. This will
help us under the sample program in f33.s
better.
Calling conventions: caller/callee-saved registers
Let's look at the function definition in f33.s
:
pushq %r12
pushq %rbp
pushq %rbx
testl %edi, %edi
je .L4
movl %edi, %r12d
movl $0, %ebx
movl $0, %ebp
.L3:
movl %ebx, %edi
call _Z1gj@PLT
addl %eax, %ebp
addl $1, %ebx
cmpl %ebx, %r12d
jne .L3
.L1:
movl %ebp, %eax
popq %rbx
popq %rbp
popq %r12
ret
.L4:
movl %edi, %ebp
jmp .L1
From the assembly we can tell that the backwards jump to .L3
is likely a
loop. The loop index is in %ebx
and the loop bound is in %r12d
. Note that
upon entry to the function we first moved the value %rdi
to %r12d
. This is
necessary because in the loop f
calls g
, and %rdi
is used to pass
arguments to g
, so we must move its value to a different register to used it
as the loop bound (this case %r12
). But there is more to this: the compiler
also needs to ensure that this register's value is preserved across function
calls. Calling conventions dictate that certain registers always exhibit this
property, and they are called callee-saved registers. If a register is
callee-saved, then the caller doesn't have to save its value before entering a
function call.
We note that upon entry to the function, f
saved a bunch of registers by
pushing them to the stack: %r12
, %rbp
, %rbx
. It is because all these
registers are callee-saved registers, and f
uses them during the function
call. In general, the following registers in x86_64 are callee-saved:
%rbx
, %r12-%r15
, %rbp
, %rsp
(%rip
)
All the other registers are caller-saved, which means the callee doesn't
have to preserve their values. If the caller wants to reuse values in these
registers across function calls, it will have to explicitly save and restore
these registers. In general, the following registers in x86_64 are caller-saved:
%rax
, %rcx
, %rdx
, %r8-%r11
Now let's get back to loop unrolling. Let us a look at the program in f34.s
:
testl %edi, %edi
je .L7
leal -1(%rdi), %eax
cmpl $7, %eax
jbe .L8
pxor %xmm0, %xmm0
movl %edi, %edx
xorl %eax, %eax
movdqa .LC0(%rip), %xmm1
shrl $2, %edx
movdqa .LC1(%rip), %xmm2
.L5:
addl $1, %eax
paddd %xmm1, %xmm0
paddd %xmm2, %xmm1
cmpl %edx, %eax
jb .L5
movdqa %xmm0, %xmm1
movl %edi, %edx
andl $-4, %edx
psrldq $8, %xmm1
paddd %xmm1, %xmm0
movdqa %xmm0, %xmm1
cmpl %edx, %edi
psrldq $4, %xmm1
paddd %xmm1, %xmm0
movd %xmm0, %eax
je .L10
.L3:
leal 1(%rdx), %ecx
addl %edx, %eax
cmpl %ecx, %edi
je .L1
addl %ecx, %eax
leal 2(%rdx), %ecx
cmpl %ecx, %edi
je .L1
addl %ecx, %eax
leal 3(%rdx), %ecx
cmpl %ecx, %edi
je .L1
addl %ecx, %eax
leal 4(%rdx), %ecx
cmpl %ecx, %edi
je .L1
addl %ecx, %eax
leal 5(%rdx), %ecx
cmpl %ecx, %edi
je .L1
addl %ecx, %eax
leal 6(%rdx), %ecx
cmpl %ecx, %edi
je .L1
addl %ecx, %eax
addl $7, %edx
leal (%rax,%rdx), %ecx
cmpl %edx, %edi
cmovne %ecx, %eax
ret
.L7:
xorl %eax, %eax
.L1:
rep ret
.L10:
rep ret
.L8:
xorl %edx, %edx
xorl %eax, %eax
jmp .L3
Wow this looks long and repetitive! Especially the section under label .L3
!
If we take a look at the original function definition in f34.cc
, we will
find that it's almost the same as f33.cc
, except that in f34.cc
we know
the definition of g
as well. With knowledge of what g
does the compiler's
optimizer decides that unrolling the loop into 7-increment batches results in
faster code.
Code like this can become difficult to understand, especially when the compiler
begins to use more advanced registers reserved for vector operations. We can
fine-tune the optimizer to disable certain optimizations. For example, we can
use the -mno-sse -fno-unroll-loops
compiler options to disable the use of SSE
registers and loop unrolling. The resulting code, in f35.s
, for the same function
definitions in f34.cc
, becomes much easier to understand:
testl %edi, %edi
je .L4
xorl %edx, %edx
xorl %eax, %eax
.L3:
addl %edx, %eax
addl $1, %edx
cmpl %edx, %edi
jne .L3
rep ret
.L4:
xorl %eax, %eax
ret
Note that the compiler still performed inlining to eliminate function g
.
Optimizing recursive functions
Let's look at the following recursive function in f36.cc
:
int f(int x) {
if (x > 0) {
return x * f(x - 1);
} else {
return 0;
}
}
At the first glance it may seem that the function returns factorial of x
.
But it actually returns 0. Despite it doing a series of multiplications, in the end
it always multiplies the whole result with 0, which produces 0.
When we compile this function to assembly without much optimization, we see the expensive computation occurring:
movl $0, %eax
testl %edi, %edi
jg .L8
rep ret
.L8:
pushq %rbx
movl %edi, %ebx
leal -1(%rdi), %edi
call _Z1fi
imull %ebx, %eax
popq %rbx
ret
In f37.cc
there is an actual factorial function:
int f(int x) {
if (x > 0) {
return x * f(x - 1);
} else {
return 1;
}
}
If we compile this function using level-2 optimization (-O2
), we get the
following assembly:
testl %edi, %edi
movl $1, %eax
jle .L4
.L3:
imull %edi, %eax
subl $1, %edi
jne .L3
rep ret
.L4:
rep ret
There is no call
instructions again! The compiler has transformed the
recursive function into a loop.
If we revisit our "fake" factorial function that always returns 0, and compile
it with -O2
, we see yet more evidence of compiler's deep understanding of
our program:
xorl %eax, %eax
ret
Optimizing arithmetic operations
The assembly code in f39.s
looks like this:
leal (%rdi,%rdi,2), %eax
leal (%rdi,%rax,4), %eax
ret
It looks like some rather complex address computations! The first leal
instruction basically loads %eax
with value 3*%rdi
(or %rdi + 2*%rdi
).
The second leal
multiplies the previous result by another 4, and adds
another %rdi
to it. So what it actually does is 3*%rdi*4 + %rdi
, or
simply 13*%rdi
. This is also revealed in the function name in f39.s
.
The compiler choose to use leal
instructions instead of an explicit
multiply because the two leal
instructions actually take less space.