Data representation 3: Layout and allocators

Overview

Last time, we examined the way C++ compilers assign sizes to objects, and the way they lay out compound objects in memory. We ended on a cliffhanger about alignment. This time we explain alignment in depth, and then describe an arena allocator, a purpose-built dynamic memory allocator that improves performance when many small objects are allocated and freed.

Mysterious struct sizes

datarep3/structs.cc

#include <cstdio>
#include "hexdump.hh"

int main() {
    struct {
        int a;
        int b;
        char c;
        char d;
    } s1 = {61, 62, 63, 64};

    struct {
        int a;
        char b;
        int c;
        char d;
    } s2 = {61, 62, 63, 64};

    hexdump_object(s1);
    hexdump_object(s2);
}

Alignments and compound types

Hardware (and the compiler) impose alignment restrictions on primitive types
- Whenever an object of type T is stored in memory, its address must be a multiple of alignof(T)
- Standard says alignments must be powers of two
- Therefore every alignment is an integral multiple of all smaller alignments
The members of a compound object must obey alignment restrictions
This imposes alignment requirements on compound types
- Array: alignof(T[N]) = alignof(T)
- Union: alignof(union { T₁ m₁; …; T_N m_N; }) = max_I alignof(T_I)
- Struct: alignof(struct { T₁ m₁; …; T_N m_N; }) = max_I alignof(T_I)
For structs, alignment can require adding padding between some members
- You can change a struct’s size by reordering its members!

Alignment and size

For an array T a[N]:
- Array layout rule: addressof(a[I]) = addressof(a) + I×sizeof(T)
- Array alignment rule: alignof(a) = alignof(T)
What can you deduce about the relationship between alignof(T) and sizeof(T)?

addressof(a[1]) = addressof(a[0]) + sizeof(T)
addressof(a[0]) = addressof(a) is a multiple of alignof(a) = alignof(T)
addressof(a[1]) must be a multiple of alignof(T)
So sizeof(T) must be a multiple of alignof(T)!

Struct rule

Struct members are laid out sequentially in memory
- In the order they are declared
- With minimum padding required to obey alignment restrictions

Struct example 1

struct s {     // <- Alignment = ?, Size = ?
               // <- Padding = ?
    int a;     // <- Alignment = 4, Offset = ?
               // <- Padding = ?
    char b;    // <- Alignment = 1, Offset = ?
               // <- Padding = ?
    int c;     // <- Alignment = 4, Offset = ?
               // <- Padding = ?
    char d;    // <- Alignment = 1, Offset = ?
               // <- Padding = ?
};

Struct example 1, completed

struct s {     // <- Alignment = 4, Size = 16
               // <- Padding = 0 (always)
    int a;     // <- Alignment = 4, Offset = 0 (always)
               // <- Padding = 0 (`b` is less aligned than `a`)
    char b;    // <- Alignment = 1, Offset = 4
               // <- Padding = 3 (recover alignment for `c`)
    int c;     // <- Alignment = 4, Offset = 8
               // <- Padding = 0
    char d;    // <- Alignment = 1, Offset = 12
               // <- Padding = 3 (recover alignment for `sizeof(s)`)
};

Struct rule, mathematically

Given struct { T₁ m₁; T₂ m₂; … T_N m_N; } s:
alignof(s) = max_I alignof(T_I)
A helper function, offsetof, computes the offset of each member relative to the address of the struct. The members are laid out in order, starting at the beginning of the struct, with minimal padding added between members to respect member alignments.

offsetof(s, m₁) = 0

For I>1, offsetof(s, m_I) =
⎡
⎢
⎢

offsetof(s, m_I-1) + sizeof(T_I-1)

alignof(T_I)

⎤
⎥
⎥
× alignof(T_I)

(The expression \lceil X / Y \rceil \times Y rounds up X to the nearest multiple of Y.) We can see that offsetof(s, m_I) is always a multiple of alignof(T_I), as we require.
Finally, we can compute the size of the whole structure:

sizeof(s) =

⎡
⎢
⎢

offsetof(s, m_N) + sizeof(T_N)

alignof(s)

⎤
⎥
⎥

× alignof(s)

Alignment and `malloc`

Malloc rule: Any non-null pointer returned by malloc and friends has the maximum alignment of any type (alignof(std::max_align_t))
- On x86-64, this is 16
- C++ new need not follow this rule; new T returns a pointer aligned for T
Standard vs. ABI
- All C++ implementations on all machines lay out struct members sequentially (standard requirement)
- Padding minimization is not required by the standard, but is required on x86-64 Linux (application-binary interface)

Performance and layout

datarep3/accessor.cc

// access 10M integers in up, down, or random order
unsigned long sum = 0;
unsigned long rand_sum = 0;

for (int i = 0; i != size; ++i) {
    int r = rand() % size;

    int idx;
    if (style == access_up) {
        idx = i;
    } else if (style == access_down) {
        idx = size - i - 1;
    } else if (style == access_random) {
        idx = r;
    }

    sum += v[idx];
    rand_sum += r;
}

Loop accesses a 100,000,000—element array of ints
In ascending, descending, or random order
What’s different between the orders? Will performance change?

Sequential faster than random

Many aspects of computer hardware and software are built to optimize for compact and sequential memory access
- Processor cache, prefetching algorithms, etc.—subjects of Unit 3
Consequences for data structure layout
- Vectors often faster than linked lists
- Fewer allocations often faster than more allocations
Be aware of these characteristics, but don’t over-optimize prematurely

Performance consequences of memory allocation

datarep3/mb0.cc

// allocate and free nodes `noperations` times
long counter = 0;
for (unsigned i = 0; i != noperations; ++i) {
    int pos = random_node_index(randomness);
    if (n[pos] == nullptr) {
        n[pos] = new node;
        n[pos]->contents = counter;
        ++counter;
    } else {
        delete n[pos];
        n[pos] = nullptr;
    }
}

Loop allocates & deallocates “nodes” 20,000,000 times
- Up to 8,192 nodes allocated at a time
- Models tasks like tree building & freeing, etc.
mb0.cc uses a general-purpose memory allocator
- Can we speed it up?

Memory arenas

An arena allocator is an object whose purpose is to handle the allocation and deallocation of other objects
It can be faster than a general-purpose allocator because it can make assumptions about the objects it allocates, such as that they are all the same size, or that they will all be freed at the same time