Data representation 2

Material we didn’t cover in lecture is in yellow boxes.

Abstract machine and hardware

Every C program runs on an abstract machine. The behavior of this machine is defined by the C standard, a technical document.

The abstract machine defines what a C program means. But usually a C program also runs on hardware, and the hardware determines what behavior we see. How do abstract machines and hardware relate?

Mapping abstract machine behavior to instructions on real hardware is the task of the C compiler (and the standard library and operating system). A C compiler is correct if and only if it translates each correct program to instructions that simulate the expected behavior of the abstract machine.

But what about incorrect programs?

Undefined behavior

Sometimes a C program is incorrect, but there are different kinds of incorrectness. For example:

int main(void) {
    const char* a = "Hello";
    const char* b = ", world\n";
    printf(a + b);
}

This program is erroneous: you can’t add two values of type const char*. The addition violates a language constraint, so every conforming C implementation must diagnose it, and in practice compilers refuse to compile the program.

But then there’s this.

int main(void) {
    const char* a = "Hello, world\n";
    int b = 100;
    printf(a + b);
}

This program is also erroneous, but in a subtler and more dangerous way. It is OK to add values of type const char* and int. For instance, "ABC" + 1 evaluates to the string "BC". But going beyond the string’s bounds, as we did above, is illegal and invokes undefined behavior.

A good compiler will complain about that program, but what about this one?

int main(int argc, char** argv) {
    const char* a = "Hello, world\n";
    int b = strtol(argv[1], NULL, 0);
    printf(a + b);
}

This program might or might not invoke undefined behavior: it depends on the inputs.
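
One way to make the input-dependent program safe is to validate the offset before doing the pointer addition. The sketch below is only an illustration: checked_suffix is a hypothetical helper, not part of the lecture code, and it assumes its argument is a valid null-terminated string.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper: returns s + offset only when the result stays inside
// the string (the terminating null byte counts as inside). Returns NULL
// otherwise, so an out-of-bounds pointer is never formed.
const char* checked_suffix(const char* s, long offset) {
    size_t len = strlen(s);
    if (offset < 0 || (unsigned long) offset > len) {
        return NULL;
    }
    return s + offset;
}
```

A caller would test the result before printing, for example `const char* s = checked_suffix(a, b); if (s != NULL) printf("%s", s);`.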

A program that executes undefined behavior is erroneous. But the compiler need not catch the error. In fact, the abstract machine says anything goes: undefined behavior is “behavior … for which this International Standard imposes no requirements.” “Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).”

Once undefined behavior executes, a program may do anything. And the compiler can assume that undefined behavior is impossible—so if the compiler can prove that a condition would cause undefined behavior later, it can assume that condition will never occur.

This is a bad situation. Undefined behavior might work on your machine and compiler right now, but a compiler update or machine change could destroy the meaning of your program. Luckily, good hygiene and use of sanitizers can catch many undefined behaviors (but by no means all).

Two common sources of undefined behavior are integer overflow and memory errors. The problem set checks your understanding of both these issues.
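
Signed integer overflow can be avoided by testing the operands before adding them. The following is a minimal sketch; checked_add is a hypothetical name chosen for illustration, and the test never performs an addition that could itself overflow.

```c
#include <limits.h>
#include <stdbool.h>

// Hypothetical helper: stores a + b in *result and returns true when the sum
// fits in an int. Returns false, leaving *result untouched, when computing
// a + b would overflow.
bool checked_add(int a, int b, int* result) {
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *result = a + b;
    return true;
}
```

The comparisons are rearranged so that every intermediate value (INT_MAX - b when b is positive, INT_MIN - b when b is negative) is itself representable.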

Objects

The C abstract machine concerns the construction and modification of objects. An object is a region of memory that contains a value, such as the integer 12. (Specifically, “a region of data storage in the execution environment, the contents of which can represent values”.) Consider:

int global = 1;
int f(void) {
    int local = global + 1;
    int* ptr;
    ptr = (int*) malloc(sizeof(int));
    *ptr = local + 1;
}

There are four objects here:

global
local
ptr
the anonymous memory allocated by malloc(sizeof(int)) and accessed by *ptr

Objects never overlap: the C abstract machine requires that each of these objects occupies distinct memory.

Each object has a lifetime, which is called storage duration by the standard. There are three different kinds of lifetime.

static lifetime: The object lasts as long as the program runs. (Example: global)
automatic lifetime: The compiler allocates and destroys the object automatically. (local, ptr)
dynamic lifetime: The programmer allocates and destroys the object explicitly. (*ptr)

An object can have many names. For example, here, local and *ptr refer to the same object:

int f(void) {
    int local = 1;
    int* ptr = &local;
}

The different names for an object are sometimes called aliases.
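
A small sketch of what aliasing means in practice: a write through one name is visible through the other. The function name write_through_alias is made up for this example.

```c
// Hypothetical example function: modifies an object through an alias.
int write_through_alias(void) {
    int local = 1;
    int* ptr = &local;   // *ptr and local now name the same object
    *ptr = 2;            // a write through one name...
    return local;        // ...is visible through the other: returns 2
}
```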

What happens when an object is uninitialized? The answer depends on its lifetime.

static lifetime (e.g., int global; at file scope): The object is initialized to 0.
automatic or dynamic lifetime (e.g., int local; in a function, or int* ptr = (int*) malloc(sizeof(int))): The object is uninitialized and reading the object’s value before it is assigned causes undefined behavior.
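
The two cases can be sketched as follows. The names zeroed_global, read_zeroed_global, and make_initialized_int are hypothetical; the point is that the static object may be read immediately, while the dynamic object is assigned before anyone reads it.

```c
#include <stdlib.h>

int zeroed_global;   // static lifetime, no initializer: starts out as 0

// Reading the static object is fine even though nothing assigned it.
int read_zeroed_global(void) {
    return zeroed_global;
}

// Allocates an int with dynamic lifetime and assigns it before returning,
// so callers never read an uninitialized object. Returns NULL if malloc
// fails. The caller is responsible for calling free.
int* make_initialized_int(int value) {
    int* ptr = (int*) malloc(sizeof(int));
    if (ptr != NULL) {
        *ptr = value;
    }
    return ptr;
}
```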

Objects with dynamic lifetime aren’t easy to use correctly. Dynamic lifetime causes many serious problems in C programs, including memory leaks, use-after-free, double-free, and so forth (more on these Thursday). Those serious problems cause undefined behavior and play a “disastrously central role” in “our ongoing computer security nightmare”.

But dynamic lifetime is critically important. Only with dynamic lifetime can you construct an object whose size isn’t known at compile time, or construct an object that outlives its creating function.
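
Both properties show up in the sketch below: the array's size is chosen at run time, and the array is still alive after the function that created it returns. make_filled is a hypothetical helper written for this note.

```c
#include <stdint.h>
#include <stdlib.h>

// Hypothetical helper: allocates an array of n ints, each set to value.
// The array has dynamic lifetime, so it outlives this function; the caller
// must eventually free it. Returns NULL if the allocation fails or if the
// byte count n * sizeof(int) would not fit in a size_t.
int* make_filled(size_t n, int value) {
    if (n > SIZE_MAX / sizeof(int)) {
        return NULL;
    }
    int* array = (int*) malloc(n * sizeof(int));
    if (array == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < n; ++i) {
        array[i] = value;
    }
    return array;
}
```

A local array declared inside make_filled could not be returned this way, because its automatic lifetime ends when the function returns.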

Memory layout

How does a hardware machine implement the three kinds of lifetime? We can use a program to find out. (See cs61-lectures/l02/mexplore.c)

Hardware implements C objects using memory (so called because it remembers object values). At a high level, a memory is a modifiable array of 2^W bytes, where a byte is a number between 0 and 255 inclusive. That means that, for any number a between 0 and 2^W–1, we can:

Write a byte at address a.
Read the byte at address a (obtaining the most-recently-written value).

The number a is called an address, and since every memory address corresponds to a byte, this is a byte-addressable memory.

On old machines, such as old Macintoshes (pre-OS X), C programs worked directly with this kind of memory. It was a disaster: an incorrect program could overwrite memory belonging to any other running program. Modern machines avoid this problem; we'll see how in unit 4.

The compiler and operating system work together to put objects at different addresses. A program’s address space (which is the range of addresses accessible to a program) divides into regions called segments. The most important ones are:

Code (aka text). Contains instructions and constant static objects; unmodifiable; static lifetime.
Data. Modifiable; static lifetime.
Heap. Modifiable; dynamic lifetime.
Stack. Modifiable; automatic lifetime.
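
To see the segments on a real machine, a program can record the addresses of objects with each kind of lifetime. The sketch below is in the spirit of mexplore.c but is not that file; every name in it is invented for illustration, and the actual addresses vary by machine, compiler, and run.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

const int const_global = 1;   // constant static object
int data_global = 2;          // modifiable static object

struct sample_addresses {
    uintptr_t constant, data, heap, stack;
};

// Records the address of one object of each kind. The heap address is 0
// if malloc fails.
struct sample_addresses collect_addresses(void) {
    int local = 3;                               // automatic lifetime
    int* dynamic = (int*) malloc(sizeof(int));   // dynamic lifetime
    struct sample_addresses s = {
        (uintptr_t) &const_global,
        (uintptr_t) &data_global,
        (uintptr_t) dynamic,
        (uintptr_t) &local
    };
    free(dynamic);
    return s;
}

// Prints the recorded addresses in hexadecimal.
void print_addresses(void) {
    struct sample_addresses s = collect_addresses();
    printf("constant static: 0x%" PRIxPTR "\n", s.constant);
    printf("modifiable static: 0x%" PRIxPTR "\n", s.data);
    printf("heap: 0x%" PRIxPTR "\n", s.heap);
    printf("stack: 0x%" PRIxPTR "\n", s.stack);
}
```

On typical systems the four addresses fall into visibly different ranges, which is how the segments reveal themselves.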

Data layout

Memory stores bytes, but the C abstract machine refers to values of many types, some of which don’t fit in a single byte. The compiler, hardware, and standard together define how objects map to bytes. Each object uses a contiguous range of addresses (and thus bytes).

Since C is designed to help software interface with hardware devices, the C standard is pretty transparent about how objects are stored. A C program can ask how big an object is using the sizeof keyword. sizeof(T) returns the number of bytes in the representation of an object of type T, and sizeof(x) returns the size of object x. The result of sizeof is a value of type size_t, which is an unsigned integer type large enough to hold any representable size. (Question: What's the relationship between size_t and W, the size of the address space?)

Objects also have alignment. This restricts where they can be stored in memory. For instance, on x86-64 machines, int has alignment 4. This means that the address of any int in the program is a multiple of 4. You can query the alignment of a type using the C11 _Alignof operator (spelled alignof if you include <stdalign.h>); GCC and Clang also provide the __alignof__ extension, which accepts objects as well as types. An object's size is always a multiple of its alignment.
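
The alignment rules can be checked directly. This sketch uses the C11 _Alignof operator, the standard counterpart of __alignof__; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// True if the int that p points to sits at a multiple of int's alignment.
bool int_is_aligned(const int* p) {
    return (uintptr_t) p % _Alignof(int) == 0;
}

// True if the sizes of a few common types are multiples of their alignments.
bool sizes_are_multiples_of_alignment(void) {
    return sizeof(int) % _Alignof(int) == 0
        && sizeof(double) % _Alignof(double) == 0
        && sizeof(void*) % _Alignof(void*) == 0;
}
```

Both functions should return true for any int object the compiler lays out itself, whether it is a local variable or an array element.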

Derived types

As in many languages, C users can define new types by combining existing ones. C supports four kinds of derived type:

Pointers: The type T* represents the address of an object of type T.
Homogeneous collections: Every element of the collection has the same type. Implemented by arrays, like int x[20].
Heterogeneous collections: Different elements of the collection can have different types. Implemented by structs.
Overlapping collections: Different elements of the collection occupy the same memory. Implemented by unions.

Elements of homogeneous and heterogeneous collections are laid out contiguously in memory, subject to alignment requirements. Arrays are laid out contiguously, with no gaps (why?); structs are laid out contiguously, but with possible gaps (why?).
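
These layout rules can be measured with sizeof and offsetof. The sketch below uses made-up types (struct mixed, union overlay) and hypothetical function names. The number of gap bytes depends on the platform: with the usual x86-64 layout it is 3, but the code does not rely on that.

```c
#include <stdbool.h>
#include <stddef.h>

struct mixed { char c; int i; };                              // may contain a gap
union overlay { int i; unsigned char bytes[sizeof(int)]; };   // members overlap

// Adjacent array elements are exactly sizeof(int) bytes apart.
bool array_has_no_gaps(void) {
    int x[4] = {0, 0, 0, 0};
    return (size_t) ((char*) &x[1] - (char*) &x[0]) == sizeof(int);
}

// Number of bytes in struct mixed that belong to no member.
size_t struct_gap_bytes(void) {
    return sizeof(struct mixed) - (sizeof(char) + sizeof(int));
}

// Every member of a union starts at offset 0.
bool union_members_overlap(void) {
    return offsetof(union overlay, i) == 0
        && offsetof(union overlay, bytes) == 0;
}
```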

Pointer values

Pointer values are implemented as addresses, which are just numbers—indexes into memory. The close relationship between pointers and addresses is a source of C's efficiency and a source of problems.

The & address-of operator obtains a pointer to an object. (You can't take the address of a value—&3 and &(x + 1) are erroneous.) The * dereference operator returns the object a pointer points to.

Pointers and integers are fungible: you can turn a pointer into an integer and vice versa using casts. The uintptr_t type is an integer type large enough to hold any address (pointer value).

Where there are casts, undefined behavior is close behind. It is undefined behavior to access a value through a pointer of the wrong type. But there is an exception for pointers of type char* and unsigned char* to allow access to raw memory: you can examine any object using those pointer types, and you can safely copy objects from place to place using those pointer types (or functions like memmove that use those pointer types).
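
The character-pointer exception can be sketched as below. copy_bytes and float_bits are hypothetical helpers. float_bits assumes a 32-bit float, which the static assertion checks at compile time, and the bit pattern it returns depends on the platform's floating-point format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Copies n bytes from src to dst through unsigned char pointers, which may
// legally examine any object. The two regions must not overlap.
void copy_bytes(void* dst, const void* src, size_t n) {
    unsigned char* d = (unsigned char*) dst;
    const unsigned char* s = (const unsigned char*) src;
    for (size_t i = 0; i < n; ++i) {
        d[i] = s[i];
    }
}

_Static_assert(sizeof(float) == sizeof(uint32_t),
               "this sketch assumes a 32-bit float");

// Returns the bytes of f reinterpreted as an integer. Copying with memcpy is
// defined behavior, whereas dereferencing (uint32_t*) &f would not be.
uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

On machines that use IEEE 754 single precision, float_bits(1.0f) is 0x3F800000.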

Pointer arithmetic

A minor glory of C is its support for pointer arithmetic, which relates arrays and pointers. Given an array T x[N], the following is always true (the abstract machine requires it): If 0≤i≤N, then &x[i] == x + i. And given two indexes 0≤i,j≤N,

&x[i] - &x[j] == (x + i) - (x + j) == (ptrdiff_t) (i - j)

(where ptrdiff_t is the signed equivalent of size_t). It’s cool that pointer arithmetic obeys the usual arithmetic laws! And we can compare pointers into the same object. Given a = &x[i] and b = &x[j], we have a < b if and only if i < j.

Pointer arithmetic is cool because it lets us write tight, efficient loops. You will often see loops that modify pointers rather than indexes. But there are dangers too. It is illegal to form a pointer that points outside an object; the only exception is the one-past-the-end pointer x+N, which may be formed and compared but not dereferenced. So given an array T x[N], the pointer value x+N+1 causes undefined behavior. And it is illegal to order (with <, <=, >, or >=) or subtract pointers that point into distinct objects, although testing them with == or != is fine. (You can safely perform the ordering comparison by converting to uintptr_t first.)
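
A typical pointer-based loop looks like the sketch below. The names sum_range and index_of are hypothetical. The loop only ever forms pointers between first and last, where last may be the one-past-the-end pointer, and it never dereferences last.

```c
#include <stddef.h>

// Sums the ints in [first, last). Both pointers must point into the same
// array; last may be the one-past-the-end pointer.
long sum_range(const int* first, const int* last) {
    long total = 0;
    for (const int* p = first; p != last; ++p) {
        total += *p;
    }
    return total;
}

// Recovers an element's index by pointer subtraction. Both pointers must
// point into the same array.
ptrdiff_t index_of(const int* base, const int* element) {
    return element - base;
}
```

For an array int x[4], the whole array is summed with sum_range(x, x + 4).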

Note that pointer arithmetic is different from address arithmetic. (uintptr_t) (p + 1) does not always produce the same value as ((uintptr_t) p) + 1. (Why?)
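
The difference can be measured. The sketch below uses hypothetical function names and assumes a mainstream platform where converting a pointer to uintptr_t yields its byte address; the C standard leaves that conversion implementation-defined.

```c
#include <stdint.h>

// How far the address moves when 1 is added to the pointer.
uintptr_t pointer_step(const int* p) {
    return (uintptr_t) (p + 1) - (uintptr_t) p;
}

// How far the address moves when 1 is added to the address itself.
uintptr_t address_step(const int* p) {
    return ((uintptr_t) p + 1) - (uintptr_t) p;
}
```

Under that assumption, pointer_step returns sizeof(int) while address_step returns 1, so the two only agree for types whose size is 1.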