Data representation 2: Object representation

Abstract machine and hardware

Programming involves turning an idea into hardware instructions. This transformation happens in multiple steps, some you control and some controlled by other programs!

First you have an idea, like “I want to make a bouncy ball iPhone game.” The computer can’t (yet) understand that idea. So you transform the idea into a program, written in some programming language. This process is called programming.

A C++ program actually runs on an abstract machine. The behavior of this machine is defined by the C++ standard, a technical document. This document is supposed to be so precisely written as to have an exact mathematical meaning, defining exactly how every C++ program behaves. But the document can’t run programs!

C++ programs are meant to run on hardware, and the hardware determines what behavior we see. Mapping abstract machine behavior to instructions on real hardware is the task of the C++ compiler (and the standard library and operating system). A C++ compiler is correct if and only if it translates each correct program to instructions that simulate the expected behavior of the abstract machine.

This same rough series of transformations happens for any programming language, although some languages use interpreters rather than compilers.

Objects

The C++ abstract machine concerns the construction and modification of objects. An object is a region of memory that contains a value, such as the integer 12. (Specifically, “a region of data storage in the execution environment, the contents of which can represent values”.) Consider:

char global_ch = 65;
const char const_global_ch = 66;

void f() {
    char local_ch = 67;
    char* allocated_ch = new char(68)
    // C-style: `allocated_ch = (char*) malloc(sizeof(char)); *allocated_ch = 68;`
}

There are five objects here:

global_ch
const_global_ch
local_ch
allocated_ch
the anonymous memory allocated by new char and accessed by *allocated_ch

Objects never overlap: the C abstract machine requires that each of these objects occupies distinct memory.

Each object has a lifetime, which is called storage duration by the standard. There are three different kinds of lifetime.

static lifetime: The object lasts as long as the program runs. (Example: global_ch, const_global_ch)
automatic lifetime: The compiler allocates and destroys the object automatically. (local_ch, allocated_ch)
dynamic lifetime: The programmer allocates and destroys the object explicitly. (*allocated_ch)

An object can have many names. For example, here, local and *ptr refer to the same object:

void f() {
    int local = 1;
    int* ptr = &local;
}

The different names for an object are sometimes called aliases.

What happens when an object is uninitialized? The answer depends on its lifetime.

static lifetime (e.g., int global; at file scope): The object is initialized to 0.
automatic or dynamic lifetime (e.g., int local; in a function, or int* ptr = new int): The object is uninitialized and reading the object’s value before it is assigned causes undefined behavior.

Objects with dynamic lifetime aren’t easy to use correctly. Dynamic lifetime causes many serious problems in C programs, including memory leaks, use-after-free, double-free, and so forth (more on these Thursday). Those serious problems cause undefined behavior and play a “disastrously central role” in “our ongoing computer security nightmare”.

But dynamic lifetime is critically important. Only with dynamic lifetime can you construct an object whose size isn’t known at compile time, or construct an object that outlives its creating function.

Memory layout

How does a hardware machine implement the three kinds of lifetime? We can use a program to find out. (See cs61-lectures/datarep2/mexplore.cc.)

Hardware implements C objects using memory (so called because it remembers object values). At a high level, a memory is a modifiable array of M bytes, where a byte is a number between 0 and 255 inclusive. That means that, for any number a between 0 and M–1, we can:

Write a byte at address a.
Read the byte at address a (obtaining the most-recently-written value).

The number a is called an address, and since every memory address corresponds to a byte, this is a byte-addressable memory.

On old machines, such as old Macintoshes (pre-OS X), C programs worked directly with this kind of memory. It was a disaster: an incorrect program could overwrite memory belonging to any other running program. Modern machines avoid this problem; we'll see how in unit 4.

The compiler and operating system work together to put objects at different addresses. A program’s address space (which is the range of addresses accessible to a program) divides into regions called segments. The most important ones are:

Code (aka text). Contains instructions and constant static objects; unmodifiable; static lifetime.
Data. Modifiable; static lifetime.
Heap. Modifiable; dynamic lifetime.
Stack. Modifiable; automatic lifetime.

Data layout

Memory stores bytes, but the C abstract machine refers to values of many types, some of which don’t fit in a single byte. The compiler, hardware, and standard together define how objects map to bytes. Each object uses a contiguous range of addresses (and thus bytes).

Since C is designed to help software interface with hardware devices, the C standard is transparent about how objects are stored. A C program can ask how big an object is using the sizeof keyword. sizeof(T) returns the number of bytes in the representation of an object of type T, and sizeof(x) returns the size of object x. The result of sizeof is a value of type size_t, which is an unsigned integer type large enough to hold any representable size. On 64-bit architectures, such as x86-64 (our focus in this course), size_t can hold numbers between 0 and 2⁶⁴–1.

Unsigned integer representation

A bit is the fundamental unit of digital information: it’s either 0 or 1.

C++ manages memory in units of bytes—8 contiguous bits, that together can represent numbers between 0 and 255. C’s unit for a byte is char: the abstract machine says a byte is stored in char. That means an unsigned char holds values [0, 255] inclusive.

The C++ standard actually doesn’t require that a byte hold 8 bits, and on some crazy machines from decades ago, bytes could hold nine bits! (!?)

Other integer types can hold more values. On x86-64, an unsigned short can hold values [0, 65535] inclusive. An unsigned int can hold values [0, 4294967295] inclusive. And an unsigned long can hold values [0, 18446744073709551615] inclusive.

The abstract machine doesn’t specify how large integers are stored in memory—it’s the compiler and hardware’s job to make that choice. But modern computers make one of two basic choices, called little endian and big endian. Here’s how it works.

Write the large integer in hexadecimal format, including all leading zeros. For example, the unsigned int value 65534 would be written 0x0000FFFE. There will be twice as many hexadecimal digits as sizeof(TYPE).
Both little-endian and big-endian mode divide the integer into its component bytes, which are its digits in base 256. In our example, they are, from most to least significant, 0x00, 0x00, 0xFF, and 0xFE.
In little endian representation, the bytes are stored in memory from least to most significant. If our example was stored at address 0x30, we would have:
```
0x30: 0xFE    0x31: 0xFF    0x32: 0x00    0x33: 0x00
```
In big endian representation, the bytes are stored in memory from most to least significant, which is the reverse order.
```
0x30: 0x00    0x31: 0x00    0x32: 0xFF    0x33: 0xFE
```

x86-64 uses little endian order. The Internet’s fundamental protocols, such as IP and TCP, use big endian order, which is therefore also called “network” byte order.

Signed integer representation

But how are signed integers represented in memory? What bit patterns should represent negative numbers?

The best representation for signed integers (and the choice made by x86-64) is two’s complement. Two’s complement representation is based on this principle: Addition and subtraction of signed integers shall use the same instructions as addition and subtraction of unsigned integers.

To see what this means, let’s think about what -x should mean when x is an unsigned integer. Wait, negative unsigned?! This isn’t an oxymoron because C++ uses modular arithmetic for unsigned integers: the result of an arithmetic operation on unsigned values is always taken modulo 2^B, where B is the number of bits in the unsigned value type. Thus, on x86-64,

unsigned a = 0xFFFFFFFFU; // = 2^32 - 1
unsigned b = 0x00000002U;
assert(a + b == 0x00000001U); // True because 2^32 - 1 + 2 = 1 (mod 2^32)!

-x is simply the number that, when added to x, yields 0 (mod 2^B). For example, when unsigned x = 0xFFFFFFFFU, then -x == 1U, since x + -x equals zero (mod 2³²).

To obtain -x, we flip all the bits in x (an operation written ~x) and then add 1. To see why, consider the bit representations. What is x + (~x + 1)? Well, (~x)_i is 1 whenever x_i is 0, and vice versa. That means that every bit of x + ~x is 1 (there are no carries), and x + ~x is the largest unsigned integer, with value 2^B-1. If we add 1 to this, we get 2^B. Which is 0 (mod 2^B)! The highest “carry” bit is dropped, leaving zero.

Two’s complement arithmetic uses half of the unsigned integer representations for negative numbers. A two’s-complement signed integer with B bits has the following values:

If the most-significant bit is 1, the represented number is negative. Specifically, the represented number is –(~x + 1), where the outer negative sign is mathematical negation (not computer arithmetic).
If every bit is 0, the represented number is 0.
If the most-significant but is 0 but some other bit is 1, the represented number is positive.

The most significant bit is also called the “sign bit.”

Another way to think about two’s-complement is that, for B-bit integers, the most-significant bit has place value 2^B–1 in unsigned arithmetic and negative 2^B–1 in signed arithmetic. All other bits have the same place values in both kinds of arithmetic.

The two’s-complement bit pattern for x + y is the same whether x and y are considered as signed or unsigned values. For example, in 4-bit arithmetic, 5 has representation 0b0101, while the representation 0b1100 represents 12 if unsigned and –4 if signed (~0b1100 + 1 = 0b0011 + 1 == 4). Let’s add those bit patterns and see what we get:

  0b0101
+ 0b1100
--------
 0b10001 == 0b0001 (mod 2^4)

Note that this is the right answer for both signed and unsigned arithmetic: 5 + 12 = 17 = 1 (mod 16), and 5 + -4 = 1.

Subtraction and multiplication also produce the same results for unsigned arithmetic and signed two’s-complement arithmetic. (For instance, 5 * 12 = 60 = 12 (mod 16), and 5 * -4 = -20 = -4 (mod 16).) This is not true of division. (Consider dividing the 4-bit representation 0b1110 by 2. In signed arithmetic, 0b1110 represents -2, so 0b1110/2 == 0b1111 (-1); but in unsigned arithmetic, 0b1110 is 14, so 0b1110/2 == 0b0111 (7).) And, of course, it is not true of comparison. In signed 4-bit arithmetic, 0b1110 < 0, but in unsigned 4-bit arithmetic, 0b1110 > 0. This means that a C compiler for a two’s-complement machine can use a single add instruction for either signed or unsigned numbers, but it must generate different instruction patterns for signed and unsigned division (or less-than, or greater-than).

There are a couple quirks with C signed arithmetic. First, in two’s complement, there are more negative numbers than positive numbers. A representation with sign bit is 1, but every other bit 0, has no positive counterpart at the same bit width: for this number, -x == x. (In 4-bit arithmetic, -0b1000 == ~0b1000 + 1 == 0b0111 + 1 == 0b1000.) Second, and far worse: arithmetic overflow on signed integers is undefined behavior.