## Abstract machine and hardware

Programming involves turning an idea into hardware instructions. This transformation happens in multiple steps, some of which you control and some of which are controlled by other programs!

First you have an *idea*, like “I want to make a bouncy ball iPhone game.” The
computer can’t (yet) understand that idea. So you transform the idea into a
*program*, written in some *programming language*. This process is called
*programming.*

A C++ program actually runs on an *abstract machine*. The behavior of this
machine is defined by the C++
standard,
a technical document. This document is supposed to be so precisely written
as to have an exact mathematical meaning, defining exactly how every C++
program behaves. But the document can’t run programs!

C++ programs are meant to run on *hardware*, and the hardware determines what
behavior we see. Mapping abstract machine behavior to instructions on real
hardware is the task of the C++ **compiler** (and the standard library and
operating system). A C++ compiler is correct if and only if it translates each
correct program to instructions that simulate the expected behavior of the
abstract machine.

This same rough series of transformations happens for any programming
language, although some languages use *interpreters* rather than compilers.

## Objects

The C++ abstract machine concerns the construction and modification of
*objects*. An object is a region of memory that contains a value, such
as the integer 12. (Specifically, “a region of data storage in the
execution environment, the contents of which can represent values”.)
Consider:

```
char global_ch = 65;
const char const_global_ch = 66;
void f() {
    char local_ch = 67;
    char* allocated_ch = new char(68);
    // C-style: `allocated_ch = (char*) malloc(sizeof(char)); *allocated_ch = 68;`
}
```

There are five objects here:

- `global_ch`
- `const_global_ch`
- `local_ch`
- `allocated_ch`
- the anonymous memory allocated by `new char` and accessed by `*allocated_ch`

Objects never overlap: the C++ abstract machine requires that each of these objects occupies distinct memory.

Each object has a *lifetime*, which is called *storage duration* by the
standard. There are three different kinds of lifetime.

- **static** lifetime: The object lasts as long as the program runs. (Examples: `global_ch`, `const_global_ch`)
- **automatic** lifetime: The compiler allocates and destroys the object automatically. (Examples: `local_ch`, `allocated_ch`)
- **dynamic** lifetime: The programmer allocates and destroys the object explicitly. (Example: `*allocated_ch`)

An object can have many names. For example, here, `local` and `*ptr` refer to the same object:

```
void f() {
    int local = 1;
    int* ptr = &local;
}
```

The different names for an object are sometimes called *aliases*.

What happens when an object is uninitialized? The answer depends on its lifetime.

- static lifetime (e.g., `int global;` at file scope): The object is initialized to 0.
- automatic or dynamic lifetime (e.g., `int local;` in a function, or `int* ptr = new int`): The object is *uninitialized*, and reading the object’s value before it is assigned causes undefined behavior.

Objects with dynamic lifetime aren’t easy to use correctly. Dynamic lifetime causes many serious problems in C programs, including memory leaks, use-after-free, double-free, and so forth (more on these Thursday). Those serious problems cause undefined behavior and play a “disastrously central role” in “our ongoing computer security nightmare”.

But dynamic lifetime is critically important. Only with dynamic lifetime can you construct an object whose size isn’t known at compile time, or construct an object that outlives its creating function.

## Memory layout

How does a hardware machine implement the three kinds of lifetime? We can use a program to find out. (See cs61-lectures/datarep2/mexplore.cc.)

Hardware implements C objects using *memory* (so called because it remembers
object values). At a high level, a memory is a modifiable array of *M*
**bytes**, where a byte is a number between 0 and 255 inclusive. That means
that, for any number *a* between 0 and *M*–1, we can:

- Write a byte at address *a*.
- Read the byte at address *a* (obtaining the most-recently-written value).

The number *a* is called an **address**, and since every memory address
corresponds to a byte, this is a *byte-addressable memory.*

On old machines, such as old Macintoshes (pre-OS X), C programs worked directly with this kind of memory. It was a disaster: an incorrect program could overwrite memory belonging to any other running program. Modern machines avoid this problem; we'll see how in unit 4.

The compiler and operating system work together to put objects at
different addresses. A program’s *address space* (which is the range of
addresses accessible to a program) divides into regions called
**segments**. The most important ones are:

- Code (aka text). Contains instructions and constant static objects; unmodifiable; static lifetime.
- Data. Modifiable; static lifetime.
- Heap. Modifiable; dynamic lifetime.
- Stack. Modifiable; automatic lifetime.

## Data layout

Memory stores bytes, but the C abstract machine refers to values of many types, some of which don’t fit in a single byte. The compiler, hardware, and standard together define how objects map to bytes. Each object uses a contiguous range of addresses (and thus bytes).

Since C is designed to help software interface with hardware devices,
the C standard is transparent about how objects are stored. A C
program can ask how big an object is using the `sizeof` keyword.
`sizeof(T)` returns the number of bytes in the representation of an
object of type `T`, and `sizeof(x)` returns the size of object `x`.
The result of `sizeof` is a value of type `size_t`, which is
an unsigned integer type large enough to hold any representable size.
On 64-bit architectures, such as x86-64 (our focus in this course),
`size_t` can hold numbers between 0 and 2^{64}–1.

## Unsigned integer representation

A bit is the fundamental unit of digital information: it’s either 0 or 1.

C++ manages memory in units of bytes—8 contiguous bits that together can
represent numbers between 0 and 255. C’s unit for a byte is `char`: the
abstract machine says a byte is stored in `char`. That means an
`unsigned char` holds values [0, 255] inclusive.

The C++ standard actually doesn’t *require* that a byte hold 8 bits, and on
some crazy machines from decades ago, bytes could hold *nine* bits! (!?)

Other integer types can hold more values. On x86-64, an
`unsigned short` can hold values [0, 65535] inclusive. An
`unsigned int` can hold values [0, 4294967295] inclusive. And an
`unsigned long` can hold values [0, 18446744073709551615] inclusive.

The abstract machine doesn’t specify how large integers are stored in
memory—it’s the compiler and hardware’s job to make that choice. But
modern computers make one of two basic choices, called **little endian**
and **big endian**. Here’s how it works.

- Write the large integer in hexadecimal format, including all leading
  zeros. For example, the `unsigned int` value 65534 would be written
  `0x0000FFFE`. There will be twice as many hexadecimal digits as
  `sizeof(TYPE)`.
- Both little-endian and big-endian mode divide the integer into its component bytes, which are its digits in base 256. In our example, they are, from most to least significant, 0x00, 0x00, 0xFF, and 0xFE.
- In *little endian* representation, the bytes are stored in memory from *least to most* significant. If our example was stored at address 0x30, we would have:

  ```
  0x30: 0xFE
  0x31: 0xFF
  0x32: 0x00
  0x33: 0x00
  ```

- In *big endian* representation, the bytes are stored in memory from *most to least* significant, which is the reverse order.

  ```
  0x30: 0x00
  0x31: 0x00
  0x32: 0xFF
  0x33: 0xFE
  ```

x86-64 uses little endian order. The Internet’s fundamental protocols, such as IP and TCP, use big endian order, which is therefore also called “network” byte order.

## Signed integer representation

But how are signed integers represented in memory? What bit patterns should represent negative numbers?

The best representation for signed integers (and the choice made by
x86-64) is **two’s complement**. Two’s complement representation is based
on this principle: *Addition and subtraction of signed integers shall
use the same instructions as addition and subtraction of unsigned
integers.*

To see what this means, let’s think about what `-x` should mean when
`x` is an unsigned integer. Wait, negative unsigned?! This isn’t an
oxymoron because C++ uses *modular arithmetic* for unsigned integers: the result
of an arithmetic operation on unsigned values is always taken modulo
2^{B}, where *B* is the number of bits in the unsigned value
type. Thus, on x86-64,
```
unsigned a = 0xFFFFFFFFU; // = 2^32 - 1
unsigned b = 0x00000002U;
assert(a + b == 0x00000001U); // True because 2^32 - 1 + 2 = 1 (mod 2^32)!
```

`-x` is simply the number that, when added to `x`, yields 0 (mod 2^{B}). For
example, when `unsigned x = 0xFFFFFFFFU`, then `-x == 1U`, since `x + -x`
equals zero (mod 2^{32}).

To obtain `-x`, we flip all the bits in `x` (an operation written `~x`) and
then add 1. To see why, consider the bit representations. What is
`x + (~x + 1)`? Well, `(~x)`_{i} is 1 whenever `x`_{i} is 0, and vice
versa. That means that *every bit* of `x + ~x` is 1 (there are no carries),
and `x + ~x` is the largest unsigned integer, with value 2^{B}–1. If
we add 1 to this, we get 2^{B}. Which is 0 (mod 2^{B})!
The highest “carry” bit is dropped, leaving zero.

Two’s complement arithmetic uses half of the unsigned integer
representations for negative numbers. A two’s-complement signed integer
with *B* bits has the following values:

- If the most-significant bit is 1, the represented number is
  negative. Specifically, the represented number is –`(~x + 1)`, where the
  outer negative sign is mathematical negation (not computer arithmetic).
- If every bit is 0, the represented number is 0.
- If the most-significant bit is 0 but some other bit is 1, the represented number is positive.

The most significant bit is also called the “sign bit.”

Another way to think about two’s-complement is that, for *B*-bit
integers, the most-significant bit has place value 2^{B–1} in
unsigned arithmetic and **negative** 2^{B–1} in signed
arithmetic. All other bits have the same place values in both kinds of
arithmetic.

The two’s-complement bit pattern for `x + y` is the same whether `x`
and `y` are considered as signed or unsigned values. For example, in
4-bit arithmetic, 5 has representation `0b0101`, while the
representation `0b1100` represents 12 if unsigned and –4 if signed
(`~0b1100 + 1` == `0b0011 + 1` == `0b0100` == 4). Let’s add those bit patterns
and see what we get:

```
0b0101
+ 0b1100
--------
0b10001 == 0b0001 (mod 2^4)
```

Note that this is the right answer for *both signed and unsigned
arithmetic*: 5 + 12 = 17 = 1 (mod 16), and 5 + -4 = 1.

Subtraction and multiplication also produce the same results for
unsigned arithmetic and signed two’s-complement arithmetic. (For
instance, 5 * 12 = 60 = 12 (mod 16), and 5 * -4 = -20 = -4 (mod 16).)
This is *not* true of division. (Consider dividing the 4-bit
representation `0b1110` by 2. In signed arithmetic, `0b1110`
represents -2, so `0b1110/2 == 0b1111` (-1); but in unsigned
arithmetic, `0b1110` is 14, so `0b1110/2 == 0b0111` (7).) And, of
course, it is *not* true of comparison. In signed 4-bit arithmetic,
`0b1110 < 0`, but in unsigned 4-bit arithmetic, `0b1110 > 0`. This
means that a C compiler for a two’s-complement machine can use a single
`add` instruction for either signed or unsigned numbers, but it must
generate different instruction patterns for signed and unsigned division
(or less-than, or greater-than).

There are a couple quirks with C signed arithmetic. First, in two’s
complement, *there are more negative numbers than positive numbers.* A
representation whose sign bit is 1, but every other bit 0, has no
positive counterpart at the same bit width: for this number,
`-x == x`. (In 4-bit arithmetic,
`-0b1000 == ~0b1000 + 1 == 0b0111 + 1 == 0b1000`.) Second, and far
worse: *arithmetic overflow on signed integers is undefined behavior.*