Last lecture discussed how objects are represented in memory, for a couple important kinds of objects (integers of various sizes, signed and unsigned). This lecture concerns layout: the ways that compilers and operating systems place multiple objects in relationship to one another. The abstract machine defines certain aspects of layout; others are left up to the compiler and runtime and operating system.
Segments
One aspect of layout is that objects are segregated into different address ranges based on lifetime. These ranges are called segments. The compiler decides on a segment for each object based on its lifetime. The linker then groups all the program’s objects by segment (so, for instance, global variables from different compiler runs are grouped together into a single segment). Finally, when a program runs, the operating system loads the segments into memory. (The stack and heap segments grow on demand.)
Object declaration | Lifetime | Segment | Example address range
---|---|---|---
Constant global | Static | Code (aka Text) | `0x400000` (≈1 × 2^22)
Global | Static | Data | `0x600000` (≈1.5 × 2^22)
Local | Automatic | Stack | `0x7fff448d0000` (≈2^25 × 2^22)
Anonymous, returned by `new` | Dynamic | Heap | `0x1a00000` (≈8 × 2^22)
Constant global data and global data have the same lifetime, but are stored in different segments. The operating system uses different segments so it can prevent the program from modifying constants. It marks the code segment, which contains functions (instructions) and constant global data, as read-only, and any attempt to modify code-segment memory causes a crash (a “Segmentation violation”).
An executable is normally at least as big as the static-lifetime data (the code and data segments together). Since all that data must be in memory for the entire lifetime of the program, it’s written to disk and then loaded by the OS before the program starts running. There is an exception, however: the “bss” segment is used to hold modifiable static-lifetime data with initial value zero. Such data is common, since all static-lifetime data is initialized to zero unless otherwise specified in the program text. Rather than storing a bunch of zeros in the object files and executable, the compiler and linker simply track the location and size of all zero-initialized global data. The operating system sets this memory to zero during the program load process. Clearing memory is faster than loading data from disk, so this optimization saves both time (the program loads faster) and space (the executable is smaller).
Compiler layout
The compiler has complete freedom to pick locations for objects, subject to the abstract machine’s constraints—most importantly, that each object occupies disjoint memory from any other object that’s active at the same time. For instance, consider this program:
```cpp
void f() {
    int i1 = 0;
    int i2 = 1;
    int i3 = 2;
    char c1 = 3;
    char c2 = 4;
    char c3 = 5;
    ...
}
```
On Linux, GCC will put all these variables into the stack segment, which we can see using `hexdump`. But it can put them in the stack segment in any order, as we can see by reordering the declarations (try declaration order `i1`, `c1`, `i2`, `c2`, `c3`), by changing optimization levels, or by adding different scopes (braces). The abstract machine gives the programmer no guarantees about how object addresses relate. In fact, the compiler may move objects around during execution, as long as it ensures that the program behaves according to the abstract machine. Modern optimizing compilers often do this, particularly for automatic objects.

But what order does the compiler choose? With optimization disabled, the compiler appears to lay out objects in decreasing order by declaration, so the first declared variable in the function has the highest address. With optimization enabled, the compiler follows roughly the same guideline, but it also rearranges objects by type—for instance, it tends to group `char`s together—and it can reuse space if different variables in the same function have disjoint lifetimes. The optimizing compiler tends to use less space for the same set of variables.
Collections: Abstract machine layout
The C++ programming language offers several collection mechanisms for grouping subobjects together into new kinds of object. The collections are structs, arrays, and unions. (Classes are a kind of struct.) Although the compiler can lay out different objects however it likes relative to one another, the abstract machine defines how subobjects are laid out inside a collection. This is important, because it lets C/C++ programs exchange messages with hardware and even with programs written in other languages. (Messages can be exchanged only when both parties agree on layout. C/C++’s rules let a C program match any known layout.)
The sizes and alignments for user-defined types—arrays, structs, and unions—are derived from a few simple rules. The first rule applies to all types.
1. First-member rule. The address of the first member of a collection equals the address of the collection.
Thus, the address of an array is the same as the address of its first element. The address of a struct is the same as the address of the first member of the struct.
The next three rules depend on the class of collection. Every C abstract machine enforces these rules.
2. Struct rule. The second and subsequent members of a struct are laid out in order, with no overlap, subject to alignment constraints.
3. Array rule. The address of the `i`th element of an array of type `T` is `ADDRESSOF(array) + i * sizeof(T)`.
4. Union rule. All members of a union share the address of the union.
In C, every struct follows the struct rule, but in C++, only simple structs follow the rule. Complicated structs, such as structs with some `public` and some `private` members, or structs with `virtual` functions, can be laid out however the compiler chooses. The typical situation is that C++ compilers for a machine architecture (e.g., “Linux x86-64”) will all agree on a layout procedure for complicated structs. This allows code compiled by different compilers to interoperate.
Alignment
Repeated executions of programs like `./mexplore` show that the C compiler and library restrict the addresses at which some kinds of data appear. In particular, the address of every `int` value is always a multiple of 4, whether it's located on the stack (automatic lifetime), the data segment (static lifetime), or the heap (dynamic lifetime). A bunch of observations will show you these rules:
Type | Size | Address restriction
---|---|---
`char` (`signed char`, `unsigned char`) | 1 | No restriction
`short` (`unsigned short`) | 2 | Multiple of 2
`int` (`unsigned int`) | 4 | Multiple of 4
`long` (`unsigned long`) | 8 | Multiple of 8
`float` | 4 | Multiple of 4
`double` | 8 | Multiple of 8
`T*` | 8 | Multiple of 8
These are the alignment restrictions for an x86-64 Linux machine. They hold for most x86-64 operating systems, except that on Windows, the `long` type has size and alignment 4. (The `long long` type has size and alignment 8 on all x86-64 operating systems.)
Just like every type has a size, every type has an alignment. The alignment of a type `T` is a number `a` ≥ 1 such that the address of every object of type `T` must be a multiple of `a`. Every object with type `T` has size `sizeof(T)`—it occupies `sizeof(T)` contiguous bytes of memory—and has alignment `alignof(T)`—the address of its first byte is a multiple of `alignof(T)`.

You can also say `sizeof(x)` and `alignof(x)` where `x` is the name of an object or another expression.
Alignment restrictions can make hardware simpler, and therefore faster. For instance, consider cache blocks. CPUs access memory through a transparent hardware cache. Data moves from primary memory, or RAM (which is large—a couple gigabytes on most laptops—and uses cheaper, slower technology) to the cache in units of 64 or 128 bytes. Those units are always aligned: on a machine with 128-byte cache blocks, the bytes with memory addresses [127, 128, 129, 130] live in two different cache blocks (with addresses [0, 127] and [128, 255]). But the 4 bytes with addresses [4n, 4n+1, 4n+2, 4n+3] always live in the same cache block. (This is true for any small power of two: the 8 bytes with addresses [8n,…,8n+7] always live in the same cache block.) In general, it’s often possible to make a system faster by leveraging restrictions—and here, the CPU hardware can load data faster when it can assume that the data lives in exactly one cache line.
The compiler, library, and operating system all work together to enforce alignment restrictions.
On x86-64 Linux, `alignof(T) == sizeof(T)` for all fundamental types (the types built in to C: integer types, floating point types, and pointers). But this isn't always true; on x86-32 Linux, `double` has size 8 but alignment 4.

It's possible to construct user-defined types of arbitrary size, but the largest alignment required by a machine is fixed for that machine. C++ lets you find the maximum alignment for a machine with `alignof(std::max_align_t)`; on x86-64, this is 16, the alignment of the type `long double` (and the alignment of some less-commonly-used SIMD “vector” types).
Alignment rules
Two more rules and we can reason about how collection sizes and alignments interact.
Every C++ abstract machine enforces:

5. Malloc rule. Any non-null pointer returned by `malloc` has alignment appropriate for any type. In other words, assuming the allocated size is adequate, the pointer returned from `malloc` can safely be cast to `T*` for any `T`.

Oddly, this holds even for small allocations. The C++ standard (the abstract machine) requires that `malloc(1)` return a pointer whose alignment is appropriate for any type, including types that don't fit.
The last rule is not required by the abstract machine, but it’s how sizes and alignments on our machines work:
6. Minimum rule. The sizes and alignments of user-defined types, and the offsets of struct members, are minimized within the constraints of the other rules.
The minimum rule, and the sizes and alignments of basic types, are defined by the x86-64 Linux “ABI”—its Application Binary Interface. This specification standardizes how x86-64 Linux C compilers should behave, and lets users mix and match compilers without problems.
Consequences of the size and alignment rules
From these rules we can derive some interesting consequences.
First, the size of every type is a multiple of its alignment.

To see why, consider an array with two elements. By the array rule, these elements have addresses `a` and `a+sizeof(T)`, where `a` is the address of the array. Both of these addresses contain a `T`, so they are both multiples of `alignof(T)`. Their difference, `sizeof(T)`, must therefore also be a multiple of `alignof(T)`.
We can also characterize the sizes and alignments of different collections.
- The size of an array of `N` elements of type `T` is `N * sizeof(T)`: the sum of the sizes of its elements. The alignment of the array is `alignof(T)`.
- The size of a union is the maximum of the sizes of its components (because the union can only hold one component at a time). Its alignment is also the maximum of the alignments of its components.
- The size of a struct is at least as big as the sum of the sizes of its components. Its alignment is the maximum of the alignments of its components.
Thus, the alignment of every collection equals the maximum of the alignments of its components.
It’s also true that the alignment equals the least common multiple of the alignments of its components. You might have thought lcm was a better answer, but the max is the same as the lcm for every architecture that matters, because all fundamental alignments are powers of two.
The size of a struct might be larger than the sum of the sizes of its components, because of alignment constraints. Since the compiler must lay out struct components in order, and it must obey the components' alignment constraints, and it must ensure different components occupy disjoint addresses, it must sometimes introduce extra space in structs. Here's an example: the struct will have 3 bytes of padding after `char c`, to ensure that `int i2` has the correct alignment.
```cpp
struct twelve_bytes {
    int i1;
    char c;
    int i2;
};
```
Thanks to padding, reordering struct components can sometimes reduce the total size of a struct.
The rules also imply that the offset of any struct member—which is the difference between the address of the member and the address of the containing struct—is a multiple of the member’s alignment.
To see why, consider a struct `s` with member `m` at offset `o`. The malloc rule says that any pointer returned from `malloc` is correctly aligned for `s`. Every pointer returned from `malloc` is maximally aligned, equalling `16*x` for some integer `x`. The struct rule says that the address of `m`, which is `16*x + o`, is correctly aligned. That means that `16*x + o = alignof(m)*y` for some integer `y`. Divide both sides by `a = alignof(m)` and you see that `16*x/a + o/a = y`. But `16/a` is an integer—the maximum alignment is a multiple of every alignment—so `16*x/a` is an integer. We can conclude that `o/a` must also be an integer!
Finally, we can also derive the necessity for padding at the end of structs. (How?)
Pointer representation
We distinguish pointers, which are concepts in the C abstract machine, from addresses, which are hardware concepts. A pointer combines an address and a type.
The memory representation of a pointer is the same as the memory representation of its address, so a pointer with address 0x1347810A is stored the same way as the integer with the same value.
The C abstract machine defines an unsigned integer type `uintptr_t` that can hold any address. (You have to `#include <inttypes.h>` or `<cinttypes>` to get the definition.) On most machines, including x86-64, `uintptr_t` is the same as `unsigned long`. Casts between pointer types and `uintptr_t` are information preserving, so this assertion will never fail:

```cpp
void* ptr = malloc(...);
uintptr_t addr = (uintptr_t) ptr;
void* ptr2 = (void*) addr;
assert(ptr == ptr2);
```
Since x86-64 is a 64-bit architecture, an x86-64 address is 64 bits (8 bytes) long. That's also the size of x86-64 pointers.
Compiler hijinks
In C++, most dynamic memory allocation uses special language operators, `new` and `delete`, rather than library functions.

Though this seems more complex than the library-function style, it has advantages. A C compiler cannot tell what `malloc` and `free` do (especially when they are redefined to debugging versions, as in the problem set), so a C compiler cannot necessarily optimize calls to `malloc` and `free` away. But the C++ compiler may assume that all uses of `new` and `delete` follow the rules laid down by the abstract machine. That means that if the compiler can prove that an allocation is unnecessary or unused, it is free to remove that allocation!
For example, we compiled this program in the problem set environment (based on `test003.cc`):

```cpp
int main() {
    char* ptrs[10];
    for (int i = 0; i < 10; ++i) {
        ptrs[i] = new char[i + 1];
    }
    for (int i = 0; i < 5; ++i) {
        delete[] ptrs[i];
    }
    m61_printstatistics();
}
```
The optimizing C++ compiler removes all calls to `new` and `delete`, leaving only the call to `m61_printstatistics()`! (For instance, try `objdump -d testXXX` to look at the compiled x86-64 instructions.) This is valid because the compiler is explicitly allowed to eliminate unused allocations, and here, since the `ptrs` variable is local and doesn't escape `main`, all allocations are unused. The C compiler cannot perform this useful transformation. (But the C compiler can do other cool things, such as unroll the loops.)