Data representation 2: Object representation

Abstract machine and hardware

Programming involves turning an idea into hardware instructions. This transformation happens in multiple steps, some you control and some controlled by other programs!

First you have an idea, like “I want to make a bouncy ball iPhone game.” The computer can’t (yet) understand that idea. So you transform the idea into a program, written in some programming language. This process is called programming.

A C++ program actually runs on an abstract machine. The behavior of this machine is defined by the C++ standard, a technical document. This document is supposed to be so precisely written as to have an exact mathematical meaning, defining exactly how every C++ program behaves. But the document can’t run programs!

C++ programs are meant to run on hardware, and the hardware determines what behavior we see. Mapping abstract machine behavior to instructions on real hardware is the task of the C++ compiler (and the standard library and operating system). A C++ compiler is correct if and only if it translates each correct program to instructions that simulate the expected behavior of the abstract machine.

This same rough series of transformations happens for any programming language, although some languages use interpreters rather than compilers.

Objects

The C++ abstract machine concerns the construction and modification of objects. An object is a region of memory that contains a value, such as the integer 12. (Specifically, “a region of data storage in the execution environment, the contents of which can represent values”.) Consider:

char global_ch = 65;
const char const_global_ch = 66;

void f() {
    char local_ch = 67;
    char* allocated_ch = new char(68)
    // C-style: `allocated_ch = (char*) malloc(sizeof(char)); *allocated_ch = 68;`
}

There are five objects here:

Objects never overlap: the C abstract machine requires that each of these objects occupies distinct memory.

Each object has a lifetime, which is called storage duration by the standard. There are three different kinds of lifetime.

An object can have many names. For example, here, local and *ptr refer to the same object:

void f() {
    int local = 1;
    int* ptr = &local;
}

The different names for an object are sometimes called aliases.

What happens when an object is uninitialized? The answer depends on its lifetime.

Objects with dynamic lifetime aren’t easy to use correctly. Dynamic lifetime causes many serious problems in C programs, including memory leaks, use-after-free, double-free, and so forth (more on these Thursday). Those serious problems cause undefined behavior and play a “disastrously central role” in “our ongoing computer security nightmare”.

But dynamic lifetime is critically important. Only with dynamic lifetime can you construct an object whose size isn’t known at compile time, or construct an object that outlives its creating function.

Memory layout

How does a hardware machine implement the three kinds of lifetime? We can use a program to find out. (See cs61-lectures/datarep2/mexplore.cc.)

Hardware implements C objects using memory (so called because it remembers object values). At a high level, a memory is a modifiable array of M bytes, where a byte is a number between 0 and 255 inclusive. That means that, for any number a between 0 and M–1, we can:

The number a is called an address, and since every memory address corresponds to a byte, this is a byte-addressable memory.

On old machines, such as old Macintoshes (pre-OS X), C programs worked directly with this kind of memory. It was a disaster: an incorrect program could overwrite memory belonging to any other running program. Modern machines avoid this problem; we'll see how in unit 4.

The compiler and operating system work together to put objects at different addresses. A program’s address space (which is the range of addresses accessible to a program) divides into regions called segments. The most important ones are:

Data layout

Memory stores bytes, but the C abstract machine refers to values of many types, some of which don’t fit in a single byte. The compiler, hardware, and standard together define how objects map to bytes. Each object uses a contiguous range of addresses (and thus bytes).

Since C is designed to help software interface with hardware devices, the C standard is transparent about how objects are stored. A C program can ask how big an object is using the sizeof keyword. sizeof(T) returns the number of bytes in the representation of an object of type T, and sizeof(x) returns the size of object x. The result of sizeof is a value of type size_t, which is an unsigned integer type large enough to hold any representable size. On 64-bit architectures, such as x86-64 (our focus in this course), size_t can hold numbers between 0 and 264–1.

Unsigned integer representation

A bit is the fundamental unit of digital information: it’s either 0 or 1.

C++ manages memory in units of bytes—8 contiguous bits, that together can represent numbers between 0 and 255. C’s unit for a byte is char: the abstract machine says a byte is stored in char. That means an unsigned char holds values [0, 255] inclusive.

The C++ standard actually doesn’t require that a byte hold 8 bits, and on some crazy machines from decades ago, bytes could hold nine bits! (!?)

Other integer types can hold more values. On x86-64, an unsigned short can hold values [0, 65535] inclusive. An unsigned int can hold values [0, 4294967295] inclusive. And an unsigned long can hold values [0, 18446744073709551615] inclusive.

The abstract machine doesn’t specify how large integers are stored in memory—it’s the compiler and hardware’s job to make that choice. But modern computers make one of two basic choices, called little endian and big endian. Here’s how it works.

x86-64 uses little endian order. The Internet’s fundamental protocols, such as IP and TCP, use big endian order, which is therefore also called “network” byte order.

Signed integer representation

But how are signed integers represented in memory? What bit patterns should represent negative numbers?

The best representation for signed integers (and the choice made by x86-64) is two’s complement. Two’s complement representation is based on this principle: Addition and subtraction of signed integers shall use the same instructions as addition and subtraction of unsigned integers.

To see what this means, let’s think about what -x should mean when x is an unsigned integer. Wait, negative unsigned?! This isn’t an oxymoron because C++ uses modular arithmetic for unsigned integers: the result of an arithmetic operation on unsigned values is always taken modulo 2B, where B is the number of bits in the unsigned value type. Thus, on x86-64,

unsigned a = 0xFFFFFFFFU; // = 2^32 - 1
unsigned b = 0x00000002U;
assert(a + b == 0x00000001U); // True because 2^32 - 1 + 2 = 1 (mod 2^32)!

-x is simply the number that, when added to x, yields 0 (mod 2B). For example, when unsigned x = 0xFFFFFFFFU, then -x == 1U, since x + -x equals zero (mod 232).

To obtain -x, we flip all the bits in x (an operation written ~x) and then add 1. To see why, consider the bit representations. What is x + (~x + 1)? Well, (~x)i is 1 whenever xi is 0, and vice versa. That means that every bit of x + ~x is 1 (there are no carries), and x + ~x is the largest unsigned integer, with value 2B-1. If we add 1 to this, we get 2B. Which is 0 (mod 2B)! The highest “carry” bit is dropped, leaving zero.

Two’s complement arithmetic uses half of the unsigned integer representations for negative numbers. A two’s-complement signed integer with B bits has the following values:

The most significant bit is also called the “sign bit.”

Another way to think about two’s-complement is that, for B-bit integers, the most-significant bit has place value 2B–1 in unsigned arithmetic and negative 2B–1 in signed arithmetic. All other bits have the same place values in both kinds of arithmetic.

The two’s-complement bit pattern for x + y is the same whether x and y are considered as signed or unsigned values. For example, in 4-bit arithmetic, 5 has representation 0b0101, while the representation 0b1100 represents 12 if unsigned and –4 if signed (~0b1100 + 1 = 0b0011 + 1 == 4). Let’s add those bit patterns and see what we get:

  0b0101
+ 0b1100
--------
 0b10001 == 0b0001 (mod 2^4)

Note that this is the right answer for both signed and unsigned arithmetic: 5 + 12 = 17 = 1 (mod 16), and 5 + -4 = 1.

Subtraction and multiplication also produce the same results for unsigned arithmetic and signed two’s-complement arithmetic. (For instance, 5 * 12 = 60 = 12 (mod 16), and 5 * -4 = -20 = -4 (mod 16).) This is not true of division. (Consider dividing the 4-bit representation 0b1110 by 2. In signed arithmetic, 0b1110 represents -2, so 0b1110/2 == 0b1111 (-1); but in unsigned arithmetic, 0b1110 is 14, so 0b1110/2 == 0b0111 (7).) And, of course, it is not true of comparison. In signed 4-bit arithmetic, 0b1110 < 0, but in unsigned 4-bit arithmetic, 0b1110 > 0. This means that a C compiler for a two’s-complement machine can use a single add instruction for either signed or unsigned numbers, but it must generate different instruction patterns for signed and unsigned division (or less-than, or greater-than).

There are a couple quirks with C signed arithmetic. First, in two’s complement, there are more negative numbers than positive numbers. A representation with sign bit is 1, but every other bit 0, has no positive counterpart at the same bit width: for this number, -x == x. (In 4-bit arithmetic, -0b1000 == ~0b1000 + 1 == 0b0111 + 1 == 0b1000.) Second, and far worse: arithmetic overflow on signed integers is undefined behavior.