Data representation 5: Undefined behavior, bitwise operations, arena allocation

Signed integer undefined behavior

File cs61-lectures/datarep5/ubexplore2.cc contains the following program.

int main(int argc, const char *argv[]) {
    assert(argc >= 3);
    int n1 = strtol(argv[1], nullptr, 0);
    int n2 = strtol(argv[2], nullptr, 0);

    for (int i = n1; i <= n2; ++i) {
        printf("%d\n", i);
    }
}

What will be printed if we run the program with ./ubexplore2 0x7ffffffe 0x7fffffff?

0x7fffffff is the largest positive value can be represented by type int. Adding one to this value yields 0x80000000. In two's complement representation this is the smallest negative number represented by type int.

Assuming that the program behaves this way, then the loop exit condition i > n2 can never be met, and the program should run (and print out numbers) forever.

However, if we run the optimized version of the program, it prints only two numbers and exits:

2147483646
2147483647

The unoptimized program does print forever and never exits.

What’s going on here? We need to look at the compiled assembly of the program with and without optimization (via objdump -S). (NB: The description that follows is more correct than the description from lecture!)

The unoptimized version basically looks like this:

1. compare i and n2...            (mov -0x1c(%rbp),%eax; cmp -0x18(%rbp),%eax)
2.   and exit if i is greater     (jg <end of function>)
3. otherwise, print i             (... callq ...)
4. increment i                    (mov -0x1c(%rbp),%eax; add $0x1,%eax;
                                   mov %eax,-0x1c(%rbp))
5. and go back to step 1          (jmp <step 1>)

This is a pretty direct translation of the loop.

The optimized version, though, does it differently. As always, the optimizer has its own ideas. (Your compiler may produce different results!)

1. compare i and n2...            (cmp %r14d,%ebx)
2.   and exit if i is greater     (jg <end of function>)
3. otherwise, set tmp = n2 + 1    (lea 0x1(%rax),%ebp)
4. print i                        (... callq ...)
5. increment i                    (add $0x1,%ebx)
6. compare i and tmp...           (cmp %ebp,%ebx)
7.   and go to step 4 if unequal  (jne <step 4>)

The compiler changed the source’s less than or equal to comparison, i <= n2, into a not equal to comparison in the executable, i != n2 + 1 (in both cases using signed computer arithmetic, i.e., modulo 2³²)! The comparison i <= n2 will always return true when n2 == 0x7FFFFFFF, the maximum signed integer, so the loop goes on forever. But the i != n2 + 1 comparison does not always return true when n2 == 0x7FFFFFFF: when i wraps around to 0x80000000 (the smallest negative integer), then i equals n2 + 1 (which also wrapped), and the loop stops.

Why did the compiler make this transformation? In the original loop, the step-6 jump is immediately followed by another comparison and jump in steps 1 and 2. The processor jumps all over the place, which can confuse its prediction circuitry and slow down performance. In the transformed loop, the step-7 jump is never followed by a comparison and jump; instead, step 7 goes back to step 4, which always prints the current number. This more streamlined control flow is easier for the processor to make fast.

But the streamlined control flow is only a valid substitution under the assumption that the addition n2 + 1 never overflows. Luckily (sort of), signed arithmetic overflow causes undefined behavior, so the compiler is totally justified in making that assumption!

Programs based on ubexplore2 have demonstrated undefined behavior differences for years, even as the precise reasons why have changed. In some earlier compilers, we found that the optimizer just upgraded the ints to longs—arithmetic on longs is just as fast on x86-64 as arithmetic on ints, since x86-64 is a 64-bit architecture, and sometimes using longs for everything lets the compiler avoid conversions back and forth. The ubexplore2l program demonstrates this form of transformation: since the loop variable is added to a long counter, the compiler opportunistically upgrades i to long as well. This transformation is also only valid under the assumption that i + 1 will not overflow—which it can’t, because of undefined behavior.

Using unsigned type prevents all this undefined behavior, because arithmetic overflow on unsigned integers is well defined in C/C++. The ubexplore2u.cc file uses an unsigned loop index and comparison, and ./ubexplore2u and ./ubexplore2u.noopt behave exactly the same (though you have to give arguments like ./ubexplore2u 0xfffffffe 0xffffffff to see the overflow).

Computer arithmetic and bitwise operations

Basic bitwise operators

Computers offer not only the usual arithmetic operators like + and -, but also a set of bitwise operators. The basic ones are & (and), | (or), ^ (xor/exclusive or), and the unary operator ~ (complement). In truth table form:

& (and)	0	1
0	0	0
1	0	1

\| (or)	0	1
0	0	1
1	1	1

^ (xor)	0	1
0	0	1
1	1	0

~ (complement)
0	1
1	0

In C or C++, these operators work on integers. But they work bitwise: the result of an operation is determined by applying the operation independently at each bit position. Here’s how to compute 12 & 4 in 4-bit unsigned arithmetic:

   12 == 0b 1 1 0 0
^   4 == 0b 0 1 0 0
-------------------
         0b 0 1 0 0 == 4

These basic bitwise operators simplify certain important arithmetics. For example, (x & (x - 1)) == 0 tests whether x is zero or a power of 2.

Negation of signed integers can also be expressed using a bitwise operator: -x == ~x + 1. This is in fact how we define two's complement representation. We can verify that x and (-x) does add up to zero under this representation:

x + (-x) == (x + ~x) + 1
         == 0b 1111... + 1
         == 0

Bitwise "and" (&) can help with modular arithmetic. For example, x % 32 == (x & 31). We essentially "mask off", or clear, higher order bits to do modulo-powers-of-2 arithmetics. This works in any base. For example, in decimal, the fastest way to compute x % 100 is to take just the two least significant digits of x.

Bitwise shift of unsigned integer

x << i appends i zero bits starting at the least significant bit of x. High order bits that don't fit in the integer are thrown out. For example, assuming 4-bit unsigned integers

0b 1101 << 2 == 0b 0100

Similarly, x >> i appends i zero bits at the most significant end of x. Lower bits are thrown out.

0b 1101 >> 2 == 0b 0011

Bitwise shift helps with division and multiplication. For example:

x / 64 == x >> 6

x * 64 == x << 6

A modern compiler can optimize y = x * 66 into y = (x << 6) + (x << 1).

Bitwise operations also allows us to treat bits within an integer separately. This can be useful for "options".

For example, when we call a function to open a file, we have a lot of options:

Open for reading?
Open for writing?
Read from the end?
Optimize for writing?

We have a lot of true/false options.

One bad way to implement this is to have this function take a bunch of arguments -- one argument for each option. This makes the function call look like this:

open_file(..., true, false, ...)

The long list of arguments slows down the function call, and one can also easily lose track of the meaning of the individual true/false values passed in.

A cheaper way to achieve this is to use a single integer to represent all the options. Have each option defined as a power of 2, and simply | (or) them together and pass them as a single integer.

#define O_READ 1
#define O_WRITE 2

open_file(..., O_READ | O_WRITE); // setting both O_READ and O_WRITE flags

Flags are usually defined as powers of 2 so we set one bit at a time for each flag. It is less common but still possible to define a combination flag that is not a power of 2, so that it sets multiple bits in one go.

Arena allocation

File cs61-lectures/datarep5/membench.cc contains a memory allocation benchmark. The core of the benchmark looks like this

void benchmark() {
    // allocate a new memory arena for this thread.
    // An "arena" is an object that encapsulates a set of memory allocations.
    // Arenas can capture allocation statistics and improve speed.
    memnode_arena* arena = memnode_arena_new();

    // Allocate 4096 memnodes.
    memnode* m[4096];
    for (int i = 0; i != 4096; ++i) {
        m[i] = memnode_alloc(arena);
    }

    // `noperations` times, free a memnode and then allocate another one.
    for (unsigned i = 0; i != noperations; ++i) {
        unsigned pos = i % 4096;
        memnode_free(arena, m[pos]);
        m[pos] = memnode_alloc(arena);
    }

    // Free the remaining memnodes and the arena.
    for (int i = 0; i != 4096; ++i) {
        memnode_free(arena, m[i]);
    }
    memnode_arena_free(arena);
}

The benchmark tests the performance of memnode_alloc() and memnode_free() allocator functions. It allocates 4096 memnode objects, then free-and-then-allocates them for noperations times, and then frees all of them.

We notice that by the end of the function, all dynamically allocated data are freed. Can we take advantage of this property to speed up allocation/deallocation?

We only allocate memnodes, and all memnodes are of the same size, so we don't need metadata that keeps track of the size of each allocation. Furthermore, since all dynamically allocated data are freed at the end of the function, for each individual memnode_free() call we don't really need to return memory to the system allocator. We can simply reuse these memory during the function and returns all memory to the system at once when the function exits.

If we run the benchmark with 100000000 allocation, and use the system malloc(), free() functions to implement the memnode allocator, the benchmark finishes in 0.908 seconds.

Our alternative implementation of the allocator can finish in 0.355 seconds, beating the heavily optimized system allocator by a factor of 3. We will reveal how we achieved this in the next lecture.