Lecture 3
Segments
Review: What data is stored where? The addresses are taken from
executions of `./mexplore`.

| Object declaration | Lifetime | Segment | Example address range |
|---|---|---|---|
| Constant global | Static | Code (Text) | 0x400000 (≈1 × 2^22) |
| Global | Static | Data | 0x600000 (≈1.5 × 2^22) |
| Local | Automatic | Stack | 0x7fff448d0000 (≈2^25 × 2^22) |
| Anonymous, returned by `malloc` | Dynamic | Heap | 0x1a00000 (≈8 × 2^22) |
Constant global data and global data have the same lifetime, but are stored in different segments. The operating system uses different segments so it can prevent the program from modifying constants. It marks the code segment, which contains functions (instructions) and constant global data, as read-only, and any attempt to modify code-segment memory causes a crash (a “Segmentation violation”).
Automatic-lifetime and dynamic-lifetime data is allocated and freed as
the program runs. The compiler manages automatic-lifetime data using
the stack (we’ll see how soon); the application programmer manages
dynamic-lifetime data by calling `malloc` and `free`.
Static-lifetime data and compiled instructions exist for the whole program runtime. Before the program runs, the compiler writes the initial values for this data into object files, which the linker combines into an executable. The operating system loads those values into memory before it starts executing the program’s instructions. This means that an executable is normally at least as big as the static-lifetime data. There is an exception, however: a special, separate segment, the “bss” segment, is used to hold static-lifetime data with initial value zero. Such data is common; all static-lifetime data is initialized to zero unless otherwise specified in the program text. Rather than storing a bunch of zeros in the object files and executable, the compiler and linker simply track the location and size of all zero-initialized global data. The operating system sets this memory to zero during the program load process. Since clearing memory is faster than loading data from disk, this optimization saves both time (the program loads faster) and space (the executable is smaller).
Integer representation
A bit is the fundamental unit of digital information: it’s either 0 or 1.

C manages memory in units of bytes, which, on all modern machines,
contain 8 bits. (But not all machines!) C’s unit for a byte is `char`:
the abstract machine says a byte is stored in `char`. That means an
`unsigned char` holds values [0, 255] inclusive.

Other integer types can hold more values. On x86-64, an `unsigned short`
can hold values [0, 65535] inclusive. An `unsigned int` can hold values
[0, 4294967295] inclusive. And an `unsigned long` can hold values
[0, 18446744073709551615] inclusive.
The abstract machine doesn’t specify how large integers are stored in memory—it’s the compiler and hardware’s job to make that choice. But modern computers make one of two basic choices, called little endian and big endian. Here’s how it works.
- Write the large integer in hexadecimal format, including all leading
  zeros. For example, the `unsigned int` value 65534 would be written
  `0x0000FFFE`. There will be twice as many hexadecimal digits as
  `sizeof(TYPE)`.
- Both little-endian and big-endian modes divide the integer into its
  component bytes, which are its digits in base 256. In our example,
  they are, from most to least significant, 0x00, 0x00, 0xFF, and 0xFE.
- In little-endian representation, the bytes are stored in memory from
  least to most significant. If our example was stored at address 0x30,
  we would have:

  0x30: 0xFE   0x31: 0xFF   0x32: 0x00   0x33: 0x00

- In big-endian representation, the bytes are stored in memory from most
  to least significant, which is the reverse order.

  0x30: 0x00   0x31: 0x00   0x32: 0xFF   0x33: 0xFE
x86-64 uses little endian order. The Internet’s fundamental protocols, such as IP and TCP, use big endian order, which is therefore also called “network” byte order.
Alignment
Repeated executions of programs like `./mexplore` show that the C
compiler and library restrict the addresses at which some kinds of data
appear. In particular, the address of every `int` value is always a
multiple of 4, whether it’s located on the stack (automatic lifetime),
the data segment (static lifetime), or the heap (dynamic lifetime).
A bunch of observations will show you these rules:
| Type | Size | Address restriction |
|---|---|---|
| `char` (`signed char`, `unsigned char`) | 1 | No restriction |
| `short` (`unsigned short`) | 2 | Multiple of 2 |
| `int` (`unsigned int`) | 4 | Multiple of 4 |
| `long` (`unsigned long`) | 8 | Multiple of 8 |
| `float` | 4 | Multiple of 4 |
| `double` | 8 | Multiple of 8 |
| `T*` | 8 | Multiple of 8 |
These are the alignment restrictions for an x86-64 machine.
Alignment restrictions can make hardware simpler, and therefore faster. For instance, consider cache blocks, which we discussed in the first lecture. CPUs access memory through a transparent hardware cache. Data moves from primary memory, or RAM (which is large—a couple gigabytes on most laptops—and uses cheaper, slower technology) to the cache in units of 64 or 128 bytes. Those units are always aligned: on a machine with 128-byte cache blocks, the bytes with memory addresses [127, 128, 129, 130] live in two different cache blocks (with addresses [0, 127] and [128, 255]). But the 4 bytes with addresses [4n, 4n+1, 4n+2, 4n+3] always live in the same cache block. (This is true for any small power of two: the 8 bytes with addresses [8n,…,8n+7] always live in the same cache block.) In general, it’s often possible to make a system faster by leveraging restrictions—and here, the CPU hardware can load data faster when it can assume that the data lives in exactly one cache line.
The compiler, library, and operating system all work together to enforce alignment restrictions. In abstract machine terms, accessing data through an unaligned pointer causes undefined behavior. Note that this is undefined behavior even though alignment is optional on x86-64 hardware: the hardware is perfectly capable of dereferencing an unaligned pointer.
The compiler’s `__alignof__` special form returns a type or object’s
alignment. On x86-64 Linux, `__alignof__(T) == sizeof(T)` for all
fundamental types (the types built in to C: integer types, floating
point types, and pointers). But this isn’t always true; on x86-32
Linux, `double` has size 8 but alignment 4.
User-defined types: rules
The sizes and alignments for user-defined types—arrays, structs, and unions—are derived from a couple simple rules or principles. Here they are. The first rule applies to all types.
1. First-member rule. The address of the first member of a collection equals the address of the collection.
Thus, the address of an array is the same as the address of its first element. The address of a struct is the same as the address of the first member of the struct.
The next three rules depend on the class of user-defined type. Every C abstract machine enforces these rules.
2. Array rule. The address of the `i`th element of an array of type `T`
is `ADDRESSOF(array) + i * sizeof(T)`.
3. Struct rule. The second and subsequent members of a struct are laid out in order, with no overlap, subject to alignment constraints.
4. Union rule. All members of a union share the address of the union.
Every C abstract machine also enforces:

5. Malloc rule. Any non-NULL pointer returned by `malloc` has alignment
appropriate for any type. In other words, assuming the allocated size is
adequate, the pointer returned from `malloc` can safely be cast to `T*`
for any `T`.
Oddly, this holds even for small allocations. The C standard (the
abstract machine) requires that `malloc(1)` return a pointer whose
alignment is appropriate for any type, including types that don’t fit.

The malloc rule implies that every C implementation has a maximum
alignment. The GCC and clang compilers allow you to query that alignment
with this expression: `__alignof__(struct __attribute__((aligned)) {})`.
On x86-64 Linux, this returns 16. (This is the size and alignment of
`long double`, as well as some special types associated with “SIMD”
instructions that we won’t discuss in this course.) So every x86-64
Linux `malloc` implementation should always return pointers that are
16-byte aligned. The maximum alignment is a multiple of every other
alignment; on our machine (and most others), this works out simply
because every alignment is a power of two.
The last rule is not required by the abstract machine, but it’s how sizes and alignments on our machines work:
6. Minimum rule. The sizes and alignments of user-defined types, and the offsets of struct members, are minimized within the constraints of the other rules.
The minimum rule, and the sizes and alignments of basic types, are defined by the x86-64 Linux “ABI”—its Application Binary Interface. This specification standardizes how x86-64 Linux C compilers should behave, and lets users mix and match compilers without problems.
Consequences of the size and alignment rules
From these rules we can derive some interesting consequences.
For example, the offset of any struct member—which is the difference between the address of the member and the address of the containing struct—is a multiple of the member’s alignment.
To see why, consider a struct `s` with member `m` at offset `o`. The
malloc rule says that any pointer returned from `malloc` is correctly
aligned for `s`. Every pointer returned from `malloc` is maximally
aligned, equaling `16*x` for some integer `x`. The struct rule says
that the address of `m`, which is `16*x + o`, is correctly aligned.
That means that `16*x + o = __alignof__(m)*y` for some integer `y`.
Divide both sides by `a = __alignof__(m)` and you see that
`16*x/a + o/a = y`. But `16/a` is an integer (the maximum alignment is
a multiple of every alignment), so `16*x/a` is an integer. We can
conclude that `o/a` must also be an integer!
For another, the size of every type is a multiple of its alignment.

To see why, consider an array with two elements. By the array rule,
these elements have addresses `a` and `a + sizeof(T)`, where `a` is the
address of the array. Both of these addresses contain a `T`, so they
are both a multiple of `__alignof__(T)`. That means `sizeof(T)` is also
a multiple of `__alignof__(T)`.
Finally, we can also derive the necessity for padding at the end of structs. (How?)
Pointer representation
We distinguish pointers, which are concepts in the C abstract machine, from addresses, which are hardware concepts. A pointer combines an address and a type.
The memory representation of a pointer is the same as the memory representation of its address, so a pointer with address 0x1347810A is stored the same way as the integer with the same value.
The C abstract machine defines an unsigned integer type `uintptr_t`
that can hold any address. (You have to `#include <inttypes.h>` to get
the definition.) On most machines, including x86-64, `uintptr_t` is the
same as `unsigned long`. Casts between pointer types and `uintptr_t`
are information preserving, so this assertion will never fail:

```c
void* ptr = malloc(...);
uintptr_t addr = (uintptr_t) ptr;
void* ptr2 = (void*) addr;
assert(ptr == ptr2);
```
Since it is a 64-bit architecture, the size of an x86-64 address is 64 bits (8 bytes). That’s also the size of x86-64 pointers.
Pointer arithmetic
Since addresses are integers, we can clearly do all kinds of arithmetic on them, including useless arithmetic (multiplying two addresses together is unlikely to produce a useful result).
But the C language also lets us do arithmetic and comparisons on pointers. Here’s the rule:
Pointer arithmetic rule. In the C abstract machine, arithmetic on pointers produces the same result as arithmetic on the corresponding array indexes.
Specifically, consider an array `T a[n]` and a pointer `T* p = &a[i]`,
where `0 <= i <= n`. Then:

- For an integer `x` with `0 <= i + x <= n`, we have
  `p + x == &a[i + x]` and `x + p == &a[x + i]`.
- For an integer `x` with `0 <= i - x <= n`, we have
  `p - x == &a[i - x]`.
- For a pointer `T* q = &a[j]` with `0 <= j <= n`, we have
  `p - q == (ptrdiff_t) (i - j)` and `q - p == (ptrdiff_t) (j - i)`.
- For a pointer `T* q = &a[j]` with `0 <= j <= n`, we have `p < q` iff
  `i < j` (and similarly for `>`, `<=`, and `>=`).
(`ptrdiff_t` is a signed type large enough to hold the difference
between any two pointers. It’s defined by `#include <stddef.h>`; on
x86-64, it’s the same as `long`.)
The pointer arithmetic design is one of the basic building blocks of C,
and it has fundamental consequences for C’s expressivity. A library
design that takes advantage of pointer arithmetic makes it easy to
apply functions to subarrays as well as arrays. (Thanks to pointer
arithmetic and the properties of addition and subtraction, a pointer to
an array’s `i`th element behaves like the subarray starting at index
`i`.) It is therefore natural for C functions to take pointers where,
in other languages, you might expect to see arrays. The C++ language’s
iterator design takes the underlying abstract idea of pointer
arithmetic to an impressive and beautiful place, making it possible to
build very fast data types and generic algorithms.
But there are some gotchas having to do with undefined behavior.
- It is undefined behavior to form a pointer before the beginning of an
  array or after the end. That is, if `i + x < 0` or `i + x > n`, where
  `n` is the size of the underlying array, then `&p[i] + x` causes
  undefined behavior.
- It is undefined behavior to subtract or compare pointers that are not
  to the same array. (Exception: `==` and `!=` comparisons are OK.)

If you want to subtract or compare pointers that might not be to the
same array, do the subtraction or comparison using addresses by casting
the pointers to `uintptr_t` first.
The pointer arithmetic rule means that computations on pointers produce
different numeric results than computations on addresses. Consider
`m + x` where `x` is an integer. If `m` is an address type like
`uintptr_t`, then `m + x` is computed in modular arithmetic, using the
machine’s add instruction. But if `m` is a pointer type, then the
compiler must scale `x` by the size of the underlying type. For any
type `T`, we have:

```c
T* a = ...;
int i;
T* b = a + i; // assume defined behavior
assert((uintptr_t) b == (uintptr_t) a + i * sizeof(T));
assert(b - a == i);
assert((uintptr_t) b - (uintptr_t) a == i * sizeof(T));
```