Signed integer undefined behavior
File `cs61-lectures/datarep5/ubexplore2.cc` contains the following program.
```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

int main(int argc, const char *argv[]) {
    assert(argc >= 3);
    int n1 = strtol(argv[1], nullptr, 0);
    int n2 = strtol(argv[2], nullptr, 0);
    for (int i = n1; i <= n2; ++i) {
        printf("%d\n", i);
    }
}
```
What will be printed if we run the program with `./ubexplore2 0x7ffffffe 0x7fffffff`?

`0x7fffffff` is the largest positive value that can be represented by type `int`. Adding one to this value yields `0x80000000`. In two's complement representation this is the smallest negative number representable by type `int`. Assuming that the program behaves this way, the loop exit condition `i > n2` can never be met, and the program should run (and print out numbers) forever.
However, if we run the optimized version of the program, it prints only two numbers and exits:

```
2147483646
2147483647
```

The unoptimized program indeed prints forever and never exits.
What’s going on here? We need to look at the compiled assembly of the program with and without optimization (via `objdump -S`). (NB: The description that follows is more correct than the description from lecture!)
The unoptimized version basically looks like this:

1. compare `i` and `n2`... (`mov -0x1c(%rbp),%eax; cmp -0x18(%rbp),%eax`)
2. and exit if `i` is greater (`jg <end of function>`)
3. otherwise, print `i` (`... callq ...`)
4. increment `i` (`mov -0x1c(%rbp),%eax; add $0x1,%eax; mov %eax,-0x1c(%rbp)`)
5. and go back to step 1 (`jmp <step 1>`)

This is a pretty direct translation of the loop.
The optimized version, though, does it differently. As always, the optimizer has its own ideas. (Your compiler may produce different results!)

1. compare `i` and `n2`... (`cmp %r14d,%ebx`)
2. and exit if `i` is greater (`jg <end of function>`)
3. otherwise, set `tmp = n2 + 1` (`lea 0x1(%rax),%ebp`)
4. print `i` (`... callq ...`)
5. increment `i` (`add $0x1,%ebx`)
6. compare `i` and `tmp`... (`cmp %ebp,%ebx`)
7. and go to step 4 if unequal (`jne <step 4>`)
The compiler changed the source’s less-than-or-equal-to comparison, `i <= n2`, into a not-equal-to comparison in the executable, `i != n2 + 1` (in both cases using signed computer arithmetic, i.e., modulo 2^32)! The comparison `i <= n2` will always return true when `n2 == 0x7FFFFFFF`, the maximum signed integer, so the loop goes on forever. But the `i != n2 + 1` comparison does not always return true when `n2 == 0x7FFFFFFF`: when `i` wraps around to `0x80000000` (the smallest negative integer), then `i` equals `n2 + 1` (which also wrapped), and the loop stops.
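Expressed back in C++, the transformed loop behaves roughly like the sketch below. (This is for illustration only; the compiler transforms its internal representation, not source code, and the function name `print_range` is made up.)

```cpp
#include <cstdio>

// A sketch of the optimizer's rewritten loop, in source form.
void print_range(int n1, int n2) {
    if (n1 <= n2) {
        int tmp = n2 + 1;        // undefined behavior when n2 == INT_MAX!
        int i = n1;
        do {
            printf("%d\n", i);
            ++i;
        } while (i != tmp);      // `!=` comparison instead of `<=`
    }
}
```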
Why did the compiler make this transformation? In the original loop, the step-5 jump is immediately followed by another comparison and jump in steps 1 and 2. The processor jumps all over the place, which can confuse its prediction circuitry and slow down performance. In the transformed loop, the step-7 jump is never followed by a comparison and jump; instead, step 7 goes back to step 4, which always prints the current number. This more streamlined control flow is easier for the processor to make fast.
But the streamlined control flow is only a valid substitution under the assumption that the addition `n2 + 1` never overflows. Luckily (sort of), signed arithmetic overflow causes undefined behavior, so the compiler is totally justified in making that assumption!
Programs based on `ubexplore2` have demonstrated undefined behavior differences for years, even as the precise reasons why have changed. In some earlier compilers, we found that the optimizer just upgraded the `int`s to `long`s: arithmetic on `long`s is just as fast on x86-64 as arithmetic on `int`s, since x86-64 is a 64-bit architecture, and sometimes using `long`s for everything lets the compiler avoid conversions back and forth. The `ubexplore2l` program demonstrates this form of transformation: since the loop variable is added to a `long` counter, the compiler opportunistically upgrades `i` to `long` as well. This transformation is also only valid under the assumption that `i + 1` will not overflow, which it can’t, because of undefined behavior.
Using an unsigned type prevents all this undefined behavior, because arithmetic overflow on unsigned integers is well defined in C/C++. The `ubexplore2u.cc` file uses an unsigned loop index and comparison, and `./ubexplore2u` and `./ubexplore2u.noopt` behave exactly the same (though you have to give arguments like `./ubexplore2u 0xfffffffe 0xffffffff` to see the overflow).
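For reference, an unsigned variant of the loop might look like the sketch below (the actual `ubexplore2u.cc` may differ in details). Since unsigned arithmetic wraps modulo 2^32 by definition, the compiler cannot assume `n2 + 1 > n2`, so the optimized and unoptimized builds must agree.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

int main(int argc, const char *argv[]) {
    assert(argc >= 3);
    unsigned n1 = strtoul(argv[1], nullptr, 0);
    unsigned n2 = strtoul(argv[2], nullptr, 0);
    // When n2 == 0xFFFFFFFF, `i` wraps to 0 after printing n2, and
    // `i <= n2` is still true, so the loop runs forever -- in both
    // builds, because the wraparound is well defined.
    for (unsigned i = n1; i <= n2; ++i) {
        printf("%u\n", i);
    }
}
```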
Computer arithmetic and bitwise operations
Basic bitwise operators
Computers offer not only the usual arithmetic operators like `+` and `-`, but also a set of bitwise operators. The basic ones are `&` (and), `|` (or), `^` (xor/exclusive or), and the unary operator `~` (complement). In truth table form:
| & (and) | 0 | 1 |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 1 |

| \| (or) | 0 | 1 |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 1 |

| ^ (xor) | 0 | 1 |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |

| ~ (complement) | |
|---|---|
| 0 | 1 |
| 1 | 0 |
In C or C++, these operators work on integers. But they work bitwise: the result of an operation is determined by applying the operation independently at each bit position. Here’s how to compute `12 & 4` in 4-bit unsigned arithmetic:

```
   12 == 0b 1 1 0 0
&   4 == 0b 0 1 0 0
-------------------
         0b 0 1 0 0 == 4
```
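A quick way to check these results is to compute them directly (a standalone sketch; the `0xF` mask keeps just the low 4 bits, mimicking 4-bit arithmetic):

```cpp
#include <cstdio>

int main() {
    unsigned x = 12, y = 4;                   // 0b1100 and 0b0100
    printf("x & y == %u\n", x & y);           // 0b0100 == 4
    printf("x | y == %u\n", x | y);           // 0b1100 == 12
    printf("x ^ y == %u\n", x ^ y);           // 0b1000 == 8
    printf("~x & 0xF == %u\n", ~x & 0xFu);    // 0b0011 == 3 (within 4 bits)
}
```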
These basic bitwise operators simplify certain important computations. For example, `(x & (x - 1)) == 0` tests whether `x` is zero or a power of 2.
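To see why this works: subtracting 1 from `x` flips `x`'s lowest set bit and all the zero bits below it, so `x & (x - 1)` clears the lowest set bit. The result is zero exactly when `x` had at most one set bit. A small check (the helper name `zero_or_pow2` is made up for illustration):

```cpp
#include <cstdio>

bool zero_or_pow2(unsigned x) {
    // `x & (x - 1)` clears the lowest set bit of x.
    return (x & (x - 1)) == 0;
}

int main() {
    for (unsigned x : {0u, 1u, 2u, 3u, 4u, 12u, 64u}) {
        printf("%u -> %s\n", x, zero_or_pow2(x) ? "yes" : "no");
    }
}
```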
Negation of signed integers can also be expressed using a bitwise operator: `-x == ~x + 1`. This is in fact how we define two's complement representation. We can verify that `x` and `-x` do add up to zero under this representation:

```
x + (-x) == (x + ~x) + 1
         == 0b 1111... + 1
         == 0
```
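A short assertion-based check of the identity (avoiding `INT_MIN`, whose negation overflows and is undefined):

```cpp
#include <cassert>

int main() {
    for (int x : {0, 1, -5, 42, 12345, -100000}) {
        assert(-x == ~x + 1);   // negation == bitwise complement, plus one
    }
}
```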
Bitwise "and" (&
) can help with modular arithmetic. For example, x % 32 ==
(x & 31)
. We essentially "mask off", or clear, higher order bits to do
modulo-powers-of-2 arithmetics. This works in any base. For example, in
decimal, the fastest way to compute x % 100
is to take just the two least
significant digits of x
.
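A quick check of the masking trick for unsigned values (for negative signed values, `%` and `&` disagree, since `%` can produce a negative result):

```cpp
#include <cassert>

int main() {
    for (unsigned x : {0u, 31u, 32u, 33u, 100u, 1000u, 4294967295u}) {
        assert(x % 32 == (x & 31));   // 31 == 0b11111 masks the low 5 bits
    }
}
```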
Bitwise shifts of unsigned integers
`x << i` appends `i` zero bits starting at the least significant bit of `x`. High-order bits that don't fit in the integer are thrown out. For example, assuming 4-bit unsigned integers:

```
0b 1101 << 2 == 0b 0100
```

Similarly, `x >> i` appends `i` zero bits at the most significant end of `x`. Lower-order bits are thrown out:

```
0b 1101 >> 2 == 0b 0011
```
Bitwise shifts help with division and multiplication by powers of 2. For example, for unsigned `x`:

```
x / 64 == x >> 6
x * 64 == x << 6
```

A modern compiler can optimize `y = x * 66` into `y = (x << 6) + (x << 1)`.
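These identities are easy to verify for unsigned values (for negative signed values, `>>` rounds differently from `/`):

```cpp
#include <cassert>

int main() {
    unsigned x = 1000;
    assert(x / 64 == x >> 6);
    assert(x * 64 == x << 6);
    assert(x * 66 == (x << 6) + (x << 1));   // 66 == 64 + 2
}
```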
Bitwise operations also allow us to treat bits within an integer separately. This can be useful for "options". For example, when we call a function to open a file, we have a lot of true/false options:

- Open for reading?
- Open for writing?
- Read from the end?
- Optimize for writing?

One bad way to implement this is to have the function take a bunch of arguments, one for each option. This makes the function call look like this:

```cpp
open_file(..., true, false, ...)
```

The long list of arguments slows down the function call, and one can also easily lose track of the meaning of the individual true/false values passed in.
A cheaper way to achieve this is to use a single integer to represent all the options. Define each option as a power of 2, and simply `|` (or) them together and pass them as a single integer.

```cpp
#define O_READ 1
#define O_WRITE 2
open_file(..., O_READ | O_WRITE); // setting both O_READ and O_WRITE flags
```

Flags are usually defined as powers of 2 so that each flag sets a single bit. It is less common but still possible to define a combination flag that is not a power of 2, so that it sets multiple bits in one go.
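On the callee's side, individual options can then be tested with `&`. A sketch (this `open_file` and its flag values are illustrative, not a real API):

```cpp
#define O_READ 1    // 0b01
#define O_WRITE 2   // 0b10

void open_file(const char* name, int flags) {
    if (flags & O_READ) {
        // ... set up for reading ...
    }
    if (flags & O_WRITE) {
        // ... set up for writing ...
    }
}
```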
Arena allocation
File `cs61-lectures/datarep5/membench.cc` contains a memory allocation benchmark. The core of the benchmark looks like this:
```cpp
void benchmark() {
    // Allocate a new memory arena for this thread.
    // An "arena" is an object that encapsulates a set of memory allocations.
    // Arenas can capture allocation statistics and improve speed.
    memnode_arena* arena = memnode_arena_new();

    // Allocate 4096 memnodes.
    memnode* m[4096];
    for (int i = 0; i != 4096; ++i) {
        m[i] = memnode_alloc(arena);
    }

    // `noperations` times, free a memnode and then allocate another one.
    for (unsigned i = 0; i != noperations; ++i) {
        unsigned pos = i % 4096;
        memnode_free(arena, m[pos]);
        m[pos] = memnode_alloc(arena);
    }

    // Free the remaining memnodes and the arena.
    for (int i = 0; i != 4096; ++i) {
        memnode_free(arena, m[i]);
    }
    memnode_arena_free(arena);
}
```
The benchmark tests the performance of the `memnode_alloc()` and `memnode_free()` allocator functions. It allocates 4096 `memnode` objects, then frees and reallocates them `noperations` times, and finally frees all of them.
We notice that by the end of the function, all dynamically allocated data are freed. Can we take advantage of this property to speed up allocation/deallocation?
We only allocate `memnode`s, and all `memnode`s are of the same size, so we don't need metadata that keeps track of the size of each allocation. Furthermore, since all dynamically allocated data are freed at the end of the function, for each individual `memnode_free()` call we don't really need to return memory to the system allocator. We can simply reuse this memory during the function and return all memory to the system at once when the function exits.
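One way to exploit this property (a minimal sketch of the idea, not necessarily the implementation we reveal next lecture): keep freed `memnode`s on a singly-linked free list inside the arena, and reuse them before asking the system allocator for more. A real version would also allocate `memnode`s in large chunks, so the arena can hand everything back to the system at once when it is destroyed.

```cpp
#include <cstdlib>

struct memnode {
    // ... payload fields ...
    memnode* next_free;   // valid only while the node is on the free list
};

struct memnode_arena {
    memnode* free_list = nullptr;
};

memnode* memnode_alloc(memnode_arena* arena) {
    if (arena->free_list) {
        memnode* m = arena->free_list;    // reuse a previously freed node
        arena->free_list = m->next_free;
        return m;
    }
    return (memnode*) malloc(sizeof(memnode));
}

void memnode_free(memnode_arena* arena, memnode* m) {
    m->next_free = arena->free_list;      // don't return memory to the system;
    arena->free_list = m;                 // just push it onto the free list
}
```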
If we run the benchmark with 100000000 allocations, and use the system `malloc()` and `free()` functions to implement the `memnode` allocator, the benchmark finishes in 0.908 seconds. Our alternative implementation of the allocator can finish in 0.355 seconds, beating the heavily optimized system allocator by a factor of about 2.5. We will reveal how we achieved this in the next lecture.