Data representation 2: Sizes and layout

Learning more after lecture

Lecture notes (on the course site’s Lectures menu)
Textbook readings (see the course site’s Textbook page)
Lecture code at https://github.com/cs61/cs61-lectures/

Last time

We saw how some C++ objects are represented as bytes in computer memory. On an x86-64 machine, int objects take 4 bytes to represent. These bytes are stored in memory, which is an array of bytes, each of which has an integer address. Integers’ bytes are stored in 4 consecutive addresses, with the lowest-valued place (the ones place) in the lowest-addressed byte, and the highest-valued place (the 2²⁴s place) in the highest-addressed byte. We also learned that pointers on x86-64 machines take 8 bytes to represent, and are stored similarly. We saw that dynamically allocated memory and local variables are given qualitatively different kinds of address when a program runs; the local variables have high addresses (like 0x7fff'ffff'f2da, near 2⁴⁷ = 0x8000'0000'0000), but dynamically allocated memory has lower addresses (like 0x5020'0000'0010 or 0x21e'f2b0, depending on sanitizer and Docker settings). We saw that computer arithmetic can overflow, because 4 bytes (32 bits) cannot represent all of the infinitely many integers, and got a hint of how overflow can make life complicated for library designers. We saw that arrays of integers are laid out contiguously in memory, with no gaps. And we got our first hints of undefined behavior, when we used print_bytes to print more memory than our program was allowed to access, and when we caused integer overflow using signed arithmetic. Our programs are compiled by default with a sanitizer that can catch many such errors, but we can turn the sanitizer off (with make SAN=0) to live crazy. We learned what assertions are. Finally, we showed a representation of the machine code that the processor actually runs. Different C++ source code can generate identical machine code, and functions with identical machine code behave identically; in fact, the processor can run machine code from anywhere, including images of Hello Kitty.

This time

We investigate the lifetime and layout of C++ objects, including how that impacts performance.

Abstract machine

C++ programs are written for an abstract machine defined by the C++ technical standard
The abstract machine says what C++ programs mean
- It also says which programs have no meaning!
But the abstract machine doesn’t exist in the world
A C++ compiler is a program that translates source code to machine instructions that run on a processor
- The output of a correct compiler is a program that has the same observable effects as the abstract machine
- Example observable effect: bytes printed by printf

Wiggle room

Sometimes the abstract machine is strict, sometimes loose
- Example strict requirement: sizeof(char) == 1
- Example strict requirement: After int x = 2;, the value of x is 2
- Example loose requirement: The numeric value of a pointer isn’t defined
- Example loose requirement: The as-if rule
- The compiler can transform its generated code however it wants, as long as the resulting program is no different in observable effect!

Rules for memory layout

The memory representation of an object x comprises sizeof(x) contiguous bytes starting at the address &x
All objects of the same type have the same size and memory layout
Every access made by a running program must be to a live object, meaning an object within its lifetime (having storage that has been allocated and has not yet been released)
Distinct objects that are live at the same time must occupy disjoint addresses
- The compiler and the operating system work together to enforce this

Disjoint objects and finite memory

How to implement the disjoint address requirement?
One way to solve this: give every object a new address!
…

Example programs

datarep2/locals.cc
datarep2/functionlocals.cc
datarep2/strings.cc
datarep2/stringify.cc
datarep2/std-stringify.cc

Results

Function local variables are stored at high addresses, in a region of memory called the stack. The portion of the stack allocated to a given function execution is called its stack frame. If two function executions in the same program thread are live at the same time, then they have disjoint stack frames, and the stack frame of the caller (the function that started executing first) has larger addresses than the stack frame of the callee. This is because stacks grow down on x86-64. All stack space allocated to a function is reclaimed when the function returns. Since their space is reclaimed automatically, local objects are said to have automatic storage duration. Referring to an object with automatic storage duration after its function returns is undefined behavior; it may crash your program, but it is not guaranteed to. A function can return at most one object, and that object's size is fixed by its type; if a function wants to return data of variable size, it must use dynamically allocated memory. The standard C++ library has many datatypes that use dynamic memory as part of their implementation.