Data representation 4: Pointers and undefined behavior

Arrays and pointers

One of C’s more interesting choices is that it explicitly relates pointers and arrays. Although arrays are laid out in memory in a specific way, they generally behave like pointers when they are used. This property probably arose from C’s desire to explicitly model memory as an array of bytes, and it has beautiful and confounding effects.

We’ve already seen one of these effects. The hexdump function has this signature (arguments and return type):

void hexdump(const void* ptr, size_t size);

But we can just pass an array as argument to hexdump:

char c[10];
hexdump(c, sizeof(c));

When used in an expression like this—here, as an argument—the array magically changes into a pointer to its first element. The above call has the same meaning as this:

hexdump(&c[0], 10 * sizeof(c[0]));

C programmers transition between arrays and pointers very naturally.

A confounding effect is that unlike all other types, in C arrays are passed to and returned from functions by reference rather than by value. C is a call-by-value language except for arrays. This means that all function arguments and return values are copied, so that parameter modifications inside a function do not affect the objects passed by the caller—except for arrays. For instance:
void f(int a[2]) {
    a[0] = 1;
}
int main() {
    int x[2] = {100, 101};
    f(x);
    printf("%d\n", x[0]); // prints 1!
}
If you don’t like this behavior, you can get around it by using a struct or a C++ std::array.
#include <array>
struct array1 { int a[2]; };
void f1(array1 arg) {
    arg.a[0] = 1;
}
void f2(std::array<int, 2> a) {
    a[0] = 1;
}
int main() {
    array1 x = {{100, 101}};
    f1(x);
    printf("%d\n", x.a[0]); // prints 100
    std::array<int, 2> x2 = {100, 101};
    f2(x2);
    printf("%d\n", x2[0]); // prints 100
}

Pointer arithmetic

C++ extends the logic of this array–pointer correspondence to support arithmetic on pointers as well.

Pointer arithmetic rule. In the C abstract machine, arithmetic on pointers produces the same result as arithmetic on the corresponding array indexes.

Specifically, consider an array T a[n] and pointers T* p1 = &a[i] and T* p2 = &a[j]. Then:

Equality: p1 == p2 if and only if (iff) p1 and p2 point to the same address, which happens iff i == j.
Inequality: Similarly, p1 != p2 iff i != j.
Less-than: p1 < p2 iff i < j.
Also, p1 <= p2 iff i <= j; and p1 > p2 iff i > j; and p1 >= p2 iff i >= j.
Pointer difference: What should p1 - p2 mean? Using array indexes as the basis, p1 - p2 == i - j. (But the type of the difference is always ptrdiff_t, which on x86-64 is long, the signed version of size_t.)
Addition: p1 + k (where k is an integer type) equals the pointer &a[i + k]. (k + p1 returns the same thing.)
Subtraction: p1 - k equals &a[i - k].
Increment and decrement: ++p1 means p1 = p1 + 1, which means p1 = &a[i + 1]. Similarly, --p1 means p1 = &a[i - 1]. (There are also postfix versions, p1++ and p1--, but C++ style prefers the prefix versions.)

No other arithmetic operations on pointers are allowed. You can’t multiply pointers, for example. (You can multiply addresses by casting the pointers to the address type, uintptr_t—so (uintptr_t) p1 * (uintptr_t) p2—but why would you?)

From pointers to iterators

Let’s write a function that can sum all the integers in an array.

int sum(int a[], int size) {
    int sum = 0;
    for (int i = 0; i != size; ++i) {
        sum += a[i];
    }
    return sum;
}

This function can compute the sum of the elements of any int array. But because of the pointer–array relationship, its a argument is really a pointer. That allows us to call it with subarrays as well as with whole arrays. For instance:

int a[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
int s1 = sum(a, 10);       // 45
int s2 = sum(&a[0], 10);   // same as s1
int s3 = sum(&a[1], 5);    // sums s[1]...s[5], computing 15
int s4 = sum(a + 1, 5);    // same as s3

This way of thinking about arrays naturally leads to a style that avoids sizes entirely, using instead a sentinel or boundary argument that defines the end of the interesting part of the array.

int sum(int* first, int* last) {
    int sum = 0;
    while (first != last) {
        sum += *first;
        ++first;
    }
    return sum;
}

These expressions compute the same sums as the above:

int s1 = sum(a, a + 10);
int s2 = sum(&a[0], &a[0] + 10);
int s3 = sum(&a[1], &a[1] + 5);
int s4 = sum(a + 1, a + 6);

Note that the data from first to last forms a half-open range. iIn mathematical notation, we care about elements in the range [first, last): the element pointed to by first is included (if it exists), but the element pointed to by last is not. Half-open ranges give us a simple and clear way to describe empty ranges, such as zero-element arrays: if first == last, then the range is empty.

Note that given a ten-element array a, the pointer a + 10 can be formed and compared, but must not be dereferenced—the element a[10] does not exist. The C/C++ abstract machines allow users to form pointers to the “one-past-the-end” boundary elements of arrays, but users must not dereference such pointers.

So in C, two pointers naturally express a range of an array. The C++ standard template library, or STL, brilliantly abstracts this pointer notion to allow two iterators, which are pointer-like objects, to express a range of any standard data structure—an array, a vector, a hash table, a balanced tree, whatever. This version of sum works for any container of ints; notice how little it changed:

template <typename It>
int sum(It first, It last) {
    int sum = 0;
    while (first != last) {
        sum += *first;
        ++first;
    }
    return sum;
}

Some example uses:

std::set<int> set_of_ints;
int s1 = sum(set_of_ints.begin(), set_of_ints.end());
std::list<int> linked_list_of_ints;
int s2 = sum(linked_list_of_ints.begin(), linked_list_of_ints.end());

Addresses vs. pointers

What’s the difference between these expressions? (Again, a is an array of type T, and p1 == &a[i] and p2 == &a[j].)

ptrdiff_t d1 = p1 - p2;
ptrdiff_t d2 = (uintptr_t) p1 - (uintptr_t) p2;

The first expression is defined analogously to index arithmetic, so d1 == i - j. But the second expression performs the arithmetic on the addresses corresponding to those pointers. We will expect d2 to equal sizeof(T) * d1. Always be aware of which kind of arithmetic you’re using. Generally arithmetic on pointers should not involve sizeof, since the sizeof is included automatically according to the abstract machine; but arithmetic on addresses almost always should involve sizeof.

Undefined behavior

Although C++ is a low-level language, the abstract machine is surprisingly strict about which pointers may be formed and how they can be used. Violate the rules and you’re in hell because you have invoked the dreaded undefined behavior.

Given an array a[N] of N elements of type T:

Forming a pointer &a[i] (or a + i) with 0 ≤ i ≤ N is safe.
Forming a pointer &a[i] with i < 0 or i > N causes undefined behavior.
Dereferencing a pointer &a[i] with 0 ≤ i < N is safe.
Dereferencing a pointer &a[i] with i < 0 or i ≥ N causes undefined behavior.

(For the purposes of these rules, objects that are not arrays count as single-element arrays. So given T x, we can safely form &x and &x + 1 and dereference &x.)

What “undefined behavior” means is horrible. A program that executes undefined behavior is erroneous. But the compiler need not catch the error. In fact, the abstract machine says anything goes: undefined behavior is “behavior … for which this International Standard imposes no requirements.” “Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).” Other possible behaviors include allowing hackers from the moon to steal all of a program’s data, take it over, and force it to delete the hard drive on which it is running. Once undefined behavior executes, a program may do anything, including making demons fly out of the programmer’s nose.

Pointer arithmetic, and even pointer comparisons, are also affected by undefined behavior. It’s undefined to go beyond and array’s bounds using pointer arithmetic. And pointers may be compared for equality or inequality even if they point to different arrays or objects, but if you try to compare different arrays via less-than, like this:

int a[10];
int b[10];
if (a < b + 10) ...

that causes undefined behavior.

If you really want to compare pointers that might be to different arrays—for instance, you’re writing a hash function for arbitrary pointers—cast them to uintptr_t first.

Undefined behavior and optimization

A program that causes undefined behavior is not a C++ program. The abstract machine says that a C++ program, by definition, is a program whose behavior is always defined. The C++ compiler is allowed to assume that its input is a C++ program. (Obviously!) So the compiler can assume that its input program will never cause undefined behavior. Thus, since undefined behavior is “impossible,” if the compiler can prove that a condition would cause undefined behavior later, it can assume that condition will never occur.

Consider this program:

char* x = /* some value */;
assert(x + 1 > x);
printf("x = %p, x + 1 = %p\n", x, x + 1);

If we supply a value equal to (char*) -1, we’re likely to see output like this:

x = 0xffffffffffffffff, x + 1 = 0

with no assertion failure! But that’s an apparently impossible result. The printout can only happen if x + 1 > x (otherwise, the assertion will fail and stop the printout). But x + 1, which equals 0, is less than x, which is the largest 8-byte value!

The impossible happens because of undefined behavior reasoning. When the compiler sees an expression like x + 1 > x (with x a pointer), it can reason this way:

“Ah, x + 1. This must be a pointer into the same array as x (or it might be a boundary pointer just past that array, or just past the non-array object x). This must be so because forming any other pointer would cause undefined behavior.
“The pointer comparison is the same as an index comparison. x + 1 > x means the same thing as &x[1] > &x[0]. But that holds iff 1 > 0.
“In my infinite wisdom, I know that 1 > 0. Thus x + 1 > x always holds, and the assertion will never fail.
“My job is to make this code run fast. The fastest code is code that’s not there. This assertion will never fail—might as well remove it!”

Integer undefined behavior

Arithmetic on signed integers also has important undefined behaviors. Signed integer arithmetic must never overflow. That is, the compiler may assume that the mathematical result of any signed arithmetic operation, such as x + y (with x and y both int), can be represented inside the relevant type. It causes undefined behavior, therefore, to add 1 to the maximum positive integer. (The ubexplore.cc program demonstrates how this can produce impossible results, as with pointers.)

Arithmetic on unsigned integers is much safer with respect to undefined behavior. Unsigned integers are defined to perform arithmetic modulo their size. This means that if you add 1 to the maximum positive unsigned integer, the result will always be zero.

Dividing an integer by zero causes undefined behavior whether or not the integer is signed.

Sanitizers

Sanitizers, which in our makefiles are turned on by supplying SAN=1, can catch many undefined behaviors as soon as they happen. Sanitizers are built in to the compiler itself; a sanitizer involves cooperation between the compiler and the language runtime. This has the major performance advantage that the compiler introduces exactly the required checks, and the optimizer can then use its normal analyses to remove redundant checks.

That said, undefined behavior checking can still be slow. Undefined behavior allows compilers to make assumptions about input values, and those assumptions can directly translate to faster code. Turning on undefined behavior checking can make some benchmark programs run 30% slower [link].