Data representation 3: Layout

Last lecture discussed how objects are represented in memory, for a couple important kinds of objects (integers of various sizes, signed and unsigned). This lecture concerns layout: the ways that compilers and operating systems place multiple objects in relationship to one another. The abstract machine defines certain aspects of layout; others are left up to the compiler and runtime and operating system.

Segments

One aspect of layout is that objects are segregated into different address ranges based on lifetime. These ranges are called segments. The compiler decides on a segment for each object based on its lifetime. The linker then groups all the program’s objects by segment (so, for instance, global variables from different compiler runs are grouped together into a single segment). Finally, when a program runs, the operating system loads the segments into memory. (The stack and heap segments grow on demand.)

Object declaration
(C program text)

Lifetime
(abstract machine)

Segment
(executable location in Linux)

Example address range
(runtime location in x86-64 Linux)

Constant global

Static

Code (aka Text)

0x400000 (≈1 × 222)

Global

Static

Data

0x600000 (≈1.5 × 222)

Local

Automatic

Stack

0x7fff448d0000 (≈225 × 222)

Anonymous, returned by new

Dynamic

Heap

0x1a00000 (≈8 × 222)

Constant global data and global data have the same lifetime, but are stored in different segments. The operating system uses different segments so it can prevent the program from modifying constants. It marks the code segment, which contains functions (instructions) and constant global data, as read-only, and any attempt to modify code-segment memory causes a crash (a “Segmentation violation”).

An executable is normally at least as big as the static-lifetime data (the code and data segments together). Since all that data must be in memory for the entire lifetime of the program, it’s written to disk and then loaded by the OS before the program starts running. There is an exception, however: the “bss” segment is used to hold modifiable static-lifetime data with initial value zero. Such data is common, since all static-lifetime data is initialized to zero unless otherwise specified in the program text. Rather than storing a bunch of zeros in the object files and executable, the compiler and linker simply track the location and size of all zero-initialized global data. The operating system sets this memory to zero during the program load process. Clearing memory is faster than loading data from disk, so this optimization saves both time (the program loads faster) and space (the executable is smaller).

Compiler layout

The compiler has complete freedom to pick locations for objects, subject to the abstract machine’s constraints—most importantly, that each object occupies disjoint memory from any other object that’s active at the same time. For instance, consider this program:

void f() {
    int i1 = 0;
    int i2 = 1;
    int i3 = 2;
    char c1 = 3;
    char c2 = 4;
    char c3 = 5;
    ...
}

On Linux, GCC will put all these variables into the stack segment, which we can see using hexdump. But it can put them in the stack segment in any order, as we can see by reordering the declarations (try declaration order i1, c1, i2, c2, c3), by changing optimization levels, or by adding different scopes (braces). The abstract machine gives the programmer no guarantees about how object addresses relate. In fact, the compiler may move objects around during execution, as long as it ensures that the program behaves according to the abstract machine. Modern optimizing compilers often do this, particularly for automatic objects.

But what order does the compiler choose? With optimization disabled, the compiler appears to lay out objects in decreasing order by declaration, so the first declared variable in the function has the highest address. With optimization enabled, the compiler follows roughly the same guideline, but it also rearranges objects by type—for instance, it tends to group chars together—and it can reuse space if different variables in the same function have disjoint lifetimes. The optimizing compiler tends to use less space for the same set of variables.

Collections: Abstract machine layout

The C++ programming language offers several collection mechanisms for grouping subobjects together into new kinds of object. The collections are structs, arrays, and unions. (Classes are a kind of struct.) Although the compiler can lay out different objects however it likes relative to one another, the abstract machine defines how subobjects are laid out inside a collection. This is important, because it lets C/C++ programs exchange messages with hardware and even with programs written in other languages. (Messages can be exchanged only when both parties agree on layout. C/C++’s rules let a C program match any known layout.)

The sizes and alignments for user-defined types—arrays, structs, and unions—are derived from a couple simple rules or principles. Here they are. The first rule applies to all types.

1. First-member rule. The address of the first member of a collection equals the address of the collection.

Thus, the address of an array is the same as the address of its first element. The address of a struct is the same as the address of the first member of the struct.

The next three rules depend on the class of collection. Every C abstract machine enforces these rules.

2. Struct rule. The second and subsequent members of a struct are laid out in order, with no overlap, subject to alignment constraints.

3. Array rule. The address of the ith element of an array of type T is ADDRESSOF(array) + i * sizeof(T).

4. Union rule. All members of a union share the address of the union.

In C, every struct follows the struct rule, but in C++, only simple structs follow the rule. Complicated structs, such as structs with some public and some private members, or structs with virtual functions, can be laid out however the compiler chooses. The typical situation is that C++ compilers for a machine architecture (e.g., “Linux x86-64”) will all agree on a layout procedure for complicated structs. This allows code compiled by different compilers to interoperate.

Alignment

Repeated executions of programs like ./mexplore show that the C compiler and library restricts the addresses at which some kinds of data appear. In particular, the address of every int value is always a multiple of 4, whether it’s located on the stack (automatic lifetime), the data segment (static lifetime), or the heap (dynamic lifetime).

A bunch of observations will show you these rules:

Type Size Address restriction
char (signed char, unsigned char) 1 No restriction
short (unsigned short) 2 Multiple of 2
int (unsigned int) 4 Multiple of 4
long (unsigned long) 8 Multiple of 8
float 4 Multiple of 4
double 8 Multiple of 8
T* 8 Multiple of 8

These are the alignment restrictions for an x86-64 Linux machine.

These restrictions hold for most x86-64 operating systems, except that on Windows, the long type has size and alignment 4. (The long long type has size and alignment 8 on all x86-64 operating systems.)

Just like every type has a size, every type has an alignment. The alignment of a type T is a number a≥1 such that the address of every object of type T must be a multiple of a. Every object with type T has size sizeof(T)—it occupies sizeof(T) contiguous bytes of memory; and has alignment alignof(T)—the address of its first byte is a multiple of alignof(T). You can also say sizeof(x) and alignof(x) where x is the name of an object or another expression.

Alignment restrictions can make hardware simpler, and therefore faster. For instance, consider cache blocks. CPUs access memory through a transparent hardware cache. Data moves from primary memory, or RAM (which is large—a couple gigabytes on most laptops—and uses cheaper, slower technology) to the cache in units of 64 or 128 bytes. Those units are always aligned: on a machine with 128-byte cache blocks, the bytes with memory addresses [127, 128, 129, 130] live in two different cache blocks (with addresses [0, 127] and [128, 255]). But the 4 bytes with addresses [4n, 4n+1, 4n+2, 4n+3] always live in the same cache block. (This is true for any small power of two: the 8 bytes with addresses [8n,…,8n+7] always live in the same cache block.) In general, it’s often possible to make a system faster by leveraging restrictions—and here, the CPU hardware can load data faster when it can assume that the data lives in exactly one cache line.

The compiler, library, and operating system all work together to enforce alignment restrictions.

On x86-64 Linux, alignof(T) == sizeof(T) for all fundamental types (the types built in to C: integer types, floating point types, and pointers). But this isn’t always true; on x86-32 Linux, double has size 8 but alignment 4.

It’s possible to construct user-defined types of arbitrary size, but the largest alignment required by a machine is fixed for that machine. C++ lets you find the maximum alignment for a machine with alignof(std::max_align_t); on x86-64, this is 16, the alignment of the type long double (and the alignment of some less-commonly-used SIMD “vector” types).

Alignment rules

Two more rules and we can reason about how collection sizes and alignments interact.

Every C++ abstract machine enforces

5. Malloc rule. Any non-null pointer returned by malloc has alignment appropriate for any type. In other words, assuming the allocated size is adequate, the pointer returned from malloc can safely be cast to T* for any T.

Oddly, this holds even for small allocations. The C++ standard (the abstract machine) requires that malloc(1) return a pointer whose alignment is appropriate for any type, including types that don’t fit.

The last rule is not required by the abstract machine, but it’s how sizes and alignments on our machines work:

6. Minimum rule. The sizes and alignments of user-defined types, and the offsets of struct members, are minimized within the constraints of the other rules.

The minimum rule, and the sizes and alignments of basic types, are defined by the x86-64 Linux “ABI”—its Application Binary Interface. This specification standardizes how x86-64 Linux C compilers should behave, and lets users mix and match compilers without problems.

Consequences of the size and alignment rules

From these rules we can derive some interesting consequences.

First, the size of every type is a multiple of its alignment.

To see why, consider an array with two elements. By the array rule, these elements have addresses a and a+sizeof(T), where a is the address of the array. Both of these addresses contain a T, so they are both a multiple of alignof(T). That means sizeof(T) is also a multiple of alignof(T).

We can also characterize the sizes and alignments of different collections.

Thus, the alignment of every collection equals the maximum of the alignments of its components.

It’s also true that the alignment equals the least common multiple of the alignments of its components. You might have thought lcm was a better answer, but the max is the same as the lcm for every architecture that matters, because all fundamental alignments are powers of two.

The size of a struct might be larger than the sum of the sizes of its components, because of alignment constraints. Since the compiler must lay out struct components in order, and it must obey the components’ alignment constraints, and it must ensure different components occupy disjoint addresses, it must sometimes introduce extra space in structs. Here’s an example: the struct will have 3 bytes of padding after char c, to ensure that int i2 has the correct alignment.

struct twelve_bytes {
    int i1;
    char c;
    int i2;
};

Thanks to padding, reordering struct components can sometimes reduce the total size of a struct.

The rules also imply that the offset of any struct member—which is the difference between the address of the member and the address of the containing struct—is a multiple of the member’s alignment.

To see why, consider a struct s with member m at offset o. The malloc rule says that any pointer returned from malloc is correctly aligned for s. Every pointer returned from malloc is maximally aligned, equalling 16*x for some integer x. The struct rule says that the address of m, which is 16*x + o, is correctly aligned. That means that 16*x + o = alignof(m)*y for some integer y. Divide both sides by a = alignof(m) and you see that 16*x/a + o/a = y. But 16/a is an integer—the maximum alignment is a multiple of every alignment—so 16*x/a is an integer. We can conclude that o/a must also be an integer!

Finally, we can also derive the necessity for padding at the end of structs. (How?)

Pointer representation

We distinguish pointers, which are concepts in the C abstract machine, from addresses, which are hardware concepts. A pointer combines an address and a type.

The memory representation of a pointer is the same as the memory representation of its address, so a pointer with address 0x1347810A is stored the same way as the integer with the same value.

The C abstract machine defines an unsigned integer type uintptr_t that can hold any address. (You have to #include <inttypes.h> or <cinttypes> to get the definition.) On most machines, including x86-64, uintptr_t is the same as unsigned long. Casts between pointer types and uintptr_t are information preserving, so this assertion will never fail:

void* ptr = malloc(...);
uintptr_t addr = (uintptr_t) ptr;
void* ptr2 = (void*) addr;
assert(ptr == ptr2);

Since it is a 64-bit architecture, the size of an x86-64 address is 64 bits (8 bytes). That’s also the size of x86-64 pointers.

Compiler hijinks

In C++, most dynamic memory allocation uses special language operators, new and delete, rather than library functions.

Though this seems more complex than the library-function style, it has advantages. A C compiler cannot tell what malloc and free do (especially when they are redefined to debugging versions, as in the problem set), so a C compiler cannot necessarily optimize calls to malloc and free away. But the C++ compiler may assume that all uses of new and delete follow the rules laid down by the abstract machine. That means that if the compiler can prove that an allocation is unnecessary or unused, it is free to remove that allocation!

For example, we compiled this program in the problem set environment (based on test003.cc):

int main() {
    char* ptrs[10];
    for (int i = 0; i < 10; ++i) {
        ptrs[i] = new char[i + 1];
    }
    for (int i = 0; i < 5; ++i) {
        delete[] ptrs[i];
    }
    m61_printstatistics();
}

The optimizing C++ compiler removes all calls to new and delete, leaving only the call to m61_printstatistics()! (For instance, try objdump -d testXXX to look at the compiled x86-64 instructions.) This is valid because the compiler is explicitly allowed to eliminate unused allocations, and here, since the ptrs variable is local and doesn’t escape main, all allocations are unused. The C compiler cannot perform this useful transformation. (But the C compiler can do other cool things, such as unroll the loops.)