Abstract machine and hardware
Programming involves turning an idea into hardware instructions. This transformation happens in multiple steps, some of which you control and some of which are controlled by other programs!
First you have an idea, like “I want to make a bouncy ball iPhone game.” The computer can’t (yet) understand that idea. So you transform the idea into a program, written in some programming language. This process is called programming.
A C++ program actually runs on an abstract machine. The behavior of this machine is defined by the C++ standard, a technical document. This document is supposed to be so precisely written as to have an exact mathematical meaning, defining exactly how every C++ program behaves. But the document can’t run programs!
C++ programs are meant to run on hardware, and the hardware determines what behavior we see. Mapping abstract machine behavior to instructions on real hardware is the task of the C++ compiler (and the standard library and operating system). A C++ compiler is correct if and only if it translates each correct program to instructions that simulate the expected behavior of the abstract machine.
This same rough series of transformations happens for any programming language, although some languages use interpreters rather than compilers.
Objects
The C++ abstract machine concerns the construction and modification of objects. An object is a region of memory that contains a value, such as the integer 12. (Specifically, “a region of data storage in the execution environment, the contents of which can represent values”.) Consider:
char global_ch = 65;
const char const_global_ch = 66;
void f() {
char local_ch = 67;
char* allocated_ch = new char(68);
// C-style: `allocated_ch = (char*) malloc(sizeof(char)); *allocated_ch = 68;`
}
There are five objects here:

- global_ch
- const_global_ch
- local_ch
- allocated_ch
- the anonymous memory allocated by new char and accessed by *allocated_ch
Objects never overlap: the C++ abstract machine requires that each of these objects occupies distinct memory.
Each object has a lifetime, which is called storage duration by the standard. There are three different kinds of lifetime.
- static lifetime: The object lasts as long as the program runs. (Examples: global_ch, const_global_ch)
- automatic lifetime: The compiler allocates and destroys the object automatically. (Examples: local_ch, allocated_ch)
- dynamic lifetime: The programmer allocates and destroys the object explicitly. (Example: *allocated_ch)
An object can have many names. For example, here, local and *ptr refer to the same object:
void f() {
int local = 1;
int* ptr = &local;
}
The different names for an object are sometimes called aliases.
What happens when an object is uninitialized? The answer depends on its lifetime.
- static lifetime (e.g., int global; at file scope): The object is initialized to 0.
- automatic or dynamic lifetime (e.g., int local; in a function, or int* ptr = new int): The object is uninitialized, and reading the object’s value before it is assigned causes undefined behavior.
Objects with dynamic lifetime aren’t easy to use correctly. Dynamic lifetime causes many serious problems in C programs, including memory leaks, use-after-free, double-free, and so forth (more on these Thursday). Those serious problems cause undefined behavior and play a “disastrously central role” in “our ongoing computer security nightmare”.
But dynamic lifetime is critically important. Only with dynamic lifetime can you construct an object whose size isn’t known at compile time, or construct an object that outlives its creating function.
Memory layout
How does a hardware machine implement the three kinds of lifetime? We can use a program to find out. (See cs61-lectures/datarep2/mexplore.cc.)
Hardware implements C objects using memory (so called because it remembers object values). At a high level, a memory is a modifiable array of M bytes, where a byte is a number between 0 and 255 inclusive. That means that, for any number a between 0 and M–1, we can:
- Write a byte at address a.
- Read the byte at address a (obtaining the most-recently-written value).
The number a is called an address, and since every memory address corresponds to a byte, this is a byte-addressable memory.
On old machines, such as old Macintoshes (pre-OS X), C programs worked directly with this kind of memory. It was a disaster: an incorrect program could overwrite memory belonging to any other running program. Modern machines avoid this problem; we'll see how in unit 4.
The compiler and operating system work together to put objects at different addresses. A program’s address space (which is the range of addresses accessible to a program) divides into regions called segments. The most important ones are:
- Code (aka text). Contains instructions and constant static objects; unmodifiable; static lifetime.
- Data. Modifiable; static lifetime.
- Heap. Modifiable; dynamic lifetime.
- Stack. Modifiable; automatic lifetime.
Data layout
Memory stores bytes, but the C abstract machine refers to values of many types, some of which don’t fit in a single byte. The compiler, hardware, and standard together define how objects map to bytes. Each object uses a contiguous range of addresses (and thus bytes).
Since C is designed to help software interface with hardware devices, the C standard is transparent about how objects are stored. A C program can ask how big an object is using the sizeof keyword. sizeof(T) returns the number of bytes in the representation of an object of type T, and sizeof(x) returns the size of object x. The result of sizeof is a value of type size_t, which is an unsigned integer type large enough to hold any representable size. On 64-bit architectures, such as x86-64 (our focus in this course), size_t can hold numbers between 0 and 2^64–1.
Unsigned integer representation
A bit is the fundamental unit of digital information: it’s either 0 or 1. C++ manages memory in units of bytes, which are 8 contiguous bits that together can represent numbers between 0 and 255. C’s unit for a byte is char: the abstract machine says a byte is stored in char. That means an unsigned char holds values [0, 255] inclusive.
The C++ standard actually doesn’t require that a byte hold 8 bits, and on some crazy machines from decades ago, bytes could hold nine bits! (!?)
Other integer types can hold more values. On x86-64, an unsigned short can hold values [0, 65535] inclusive. An unsigned int can hold values [0, 4294967295] inclusive. And an unsigned long can hold values [0, 18446744073709551615] inclusive.
The abstract machine doesn’t specify how large integers are stored in memory—it’s the compiler and hardware’s job to make that choice. But modern computers make one of two basic choices, called little endian and big endian. Here’s how it works.
- Write the large integer in hexadecimal format, including all leading zeros. For example, the unsigned int value 65534 would be written 0x0000FFFE. There will be twice as many hexadecimal digits as sizeof(TYPE).
- Both little-endian and big-endian mode divide the integer into its component bytes, which are its digits in base 256. In our example, they are, from most to least significant, 0x00, 0x00, 0xFF, and 0xFE.
In little endian representation, the bytes are stored in memory from least to most significant. If our example was stored at address 0x30, we would have:
0x30: 0xFE
0x31: 0xFF
0x32: 0x00
0x33: 0x00
In big endian representation, the bytes are stored in memory from most to least significant, which is the reverse order.
0x30: 0x00
0x31: 0x00
0x32: 0xFF
0x33: 0xFE
x86-64 uses little endian order. The Internet’s fundamental protocols, such as IP and TCP, use big endian order, which is therefore also called “network” byte order.
Signed integer representation
But how are signed integers represented in memory? What bit patterns should represent negative numbers?
The best representation for signed integers (and the choice made by x86-64) is two’s complement. Two’s complement representation is based on this principle: Addition and subtraction of signed integers shall use the same instructions as addition and subtraction of unsigned integers.
To see what this means, let’s think about what -x should mean when x is an unsigned integer. Wait, negative unsigned?! This isn’t an oxymoron because C++ uses modular arithmetic for unsigned integers: the result of an arithmetic operation on unsigned values is always taken modulo 2^B, where B is the number of bits in the unsigned value type. Thus, on x86-64,
unsigned a = 0xFFFFFFFFU; // = 2^32 - 1
unsigned b = 0x00000002U;
assert(a + b == 0x00000001U); // True because 2^32 - 1 + 2 = 1 (mod 2^32)!
-x is simply the number that, when added to x, yields 0 (mod 2^B). For example, when unsigned x = 0xFFFFFFFFU, then -x == 1U, since x + -x equals zero (mod 2^32).
To obtain -x, we flip all the bits in x (an operation written ~x) and then add 1. To see why, consider the bit representations. What is x + (~x + 1)? Well, bit i of ~x is 1 whenever bit i of x is 0, and vice versa. That means that every bit of x + ~x is 1 (there are no carries), and x + ~x is the largest unsigned integer, with value 2^B–1. If we add 1 to this, we get 2^B, which is 0 (mod 2^B)! The highest “carry” bit is dropped, leaving zero.
Two’s complement arithmetic uses half of the unsigned integer representations for negative numbers. A two’s-complement signed integer with B bits has the following values:
- If the most-significant bit is 1, the represented number is negative. Specifically, the represented number is –(~x + 1), where the outer negative sign is mathematical negation (not computer arithmetic).
- If every bit is 0, the represented number is 0.
- If the most-significant bit is 0 but some other bit is 1, the represented number is positive.
The most significant bit is also called the “sign bit.”
Another way to think about two’s-complement is that, for B-bit integers, the most-significant bit has place value 2^(B–1) in unsigned arithmetic and negative 2^(B–1) in signed arithmetic. All other bits have the same place values in both kinds of arithmetic.
The two’s-complement bit pattern for x + y is the same whether x and y are considered as signed or unsigned values. For example, in 4-bit arithmetic, 5 has representation 0b0101, while the representation 0b1100 represents 12 if unsigned and –4 if signed (~0b1100 + 1 = 0b0011 + 1 == 4). Let’s add those bit patterns and see what we get:
0b0101
+ 0b1100
--------
0b10001 == 0b0001 (mod 2^4)
Note that this is the right answer for both signed and unsigned arithmetic: 5 + 12 = 17 = 1 (mod 16), and 5 + -4 = 1.
Subtraction and multiplication also produce the same results for unsigned arithmetic and signed two’s-complement arithmetic. (For instance, 5 * 12 = 60 = 12 (mod 16), and 5 * -4 = -20 = -4 (mod 16).) This is not true of division. (Consider dividing the 4-bit representation 0b1110 by 2. In signed arithmetic, 0b1110 represents -2, so 0b1110/2 == 0b1111 (-1); but in unsigned arithmetic, 0b1110 is 14, so 0b1110/2 == 0b0111 (7).) And, of course, it is not true of comparison. In signed 4-bit arithmetic, 0b1110 < 0, but in unsigned 4-bit arithmetic, 0b1110 > 0. This means that a C compiler for a two’s-complement machine can use a single add instruction for either signed or unsigned numbers, but it must generate different instruction patterns for signed and unsigned division (or less-than, or greater-than).
There are a couple of quirks in C signed arithmetic. First, in two’s complement, there are more negative numbers than positive numbers. A representation whose sign bit is 1, but every other bit 0, has no positive counterpart at the same bit width: for this number, -x == x. (In 4-bit arithmetic, -0b1000 == ~0b1000 + 1 == 0b0111 + 1 == 0b1000.) Second, and far worse: arithmetic overflow on signed integers is undefined behavior.