This course is about learning how computers work, from the perspective of systems software: what makes programs work fast or slow, and how properties of the machines we program impact the programs we write. We want to communicate ideas, tools, and an experimental approach.
The course divides into six units:
- Data representation
- Assembly & machine programming
- Storage & caching
- Kernel programming
- Process management
- Concurrency
The first unit, data representation, begins today. It’s all about how different forms of data can be represented in terms the computer can understand. But this is a teaser lecture too, so we’ll touch on a bunch of concepts from later units.
Lite Brite memory
Lite Brite is a pretty fun old-school toy. It’s a big black backlit pegboard coupled with a supply of colored pegs, in a limited set of colors. You can plug in the pegs to make all kinds of designs.
Computer memory is a lot like a Lite Brite. A computer’s memory is like a vast pegboard where each slot holds one of 256 different colors. The colors are numbered 0 through 255, so each slot holds one byte. (A byte is a number between 0 and 255, inclusive.)
A slot of computer memory is identified by its address. On a computer with M bytes of memory, and therefore M slots, you can think of the address as a number between 0 and M−1. My laptop has 16 gibibytes of memory, so \(M = 16\times2^{30} = 2^{34} = 17{,}179{,}869{,}184\) = 0x4'0000'0000—a very large number!
0 1 2 3 4 2^34 - 1 <- addresses
+-----+-----+-----+-----+-----+-- --+-----+-----+-----+
| | | | | | ... | | | | <- values
+-----+-----+-----+-----+-----+-- --+-----+-----+-----+
The problem of data representation is the problem of representing all the concepts we might want to use in programming—integers, fractions, real numbers, sets, pictures, texts, buildings, animal species, relationships—using the limited medium of addresses and bytes.
Powers of ten and powers of two. Digital computers love the number two and all powers of two. The electronics of digital computers are based on the bit, the smallest unit of storage, which is a base-two digit: either 0 or 1. More complicated objects are represented by collections of bits. This choice has many scale and error-correction advantages. It also refracts upwards to larger choices, and even into terminology. Memory chips, for example, have capacities based on large powers of two, such as \(2^{30}\) bytes. Since \(2^{10} = 1024\) is pretty close to 1,000, \(2^{20} = 1{,}048{,}576\) is pretty close to a million, and \(2^{30} = 1{,}073{,}741{,}824\) is pretty close to a billion, it’s common to refer to \(2^{30}\) bytes of memory as “a gigabyte,” even though that term actually means \(10^9 = 1{,}000{,}000{,}000\) bytes. But when trying to be more precise, it’s better to use terms that explicitly signal the use of powers of two, such as gibibyte: the “-bi-” component means “binary.”
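Here’s a quick C++ check of the arithmetic (a throwaway sketch, not part of any course code):

```cpp
#include <cstdio>

int main() {
    unsigned long gigabyte = 1'000'000'000UL;  // 10^9 bytes: the SI prefix "giga"
    unsigned long gibibyte = 1UL << 30;        // 2^30 bytes: the binary prefix "gibi"
    printf("gigabyte:   %lu\n", gigabyte);     // 1000000000
    printf("gibibyte:   %lu\n", gibibyte);     // 1073741824
    printf("difference: %lu bytes (about %.1f%%)\n", gibibyte - gigabyte,
           100.0 * (gibibyte - gigabyte) / gigabyte);
}
```

The gap is about 7.4% at the “giga” scale, and it widens as the prefixes grow.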
Virtual memory. Modern computers actually abstract their memory spaces using a technique called virtual memory. The lowest-level kind of address, called a physical address, really does take on values between 0 and M−1. However, even on a 16GiB machine like my laptop, the addresses we see in programs can take on values like 0x7ffe'ea2c'aa67 that are much larger than M−1 = 0x3'ffff'ffff. The addresses used in programs are called virtual addresses. They’re incredibly useful for protection: since different running programs have logically independent address spaces, it’s much less likely that a bug in one program will crash the whole machine. We’ll learn about virtual memory in much more depth in Unit 4; the distinction between virtual and physical addresses is not as critical for data representation.
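You can observe virtual addresses directly by printing pointers. Here’s a minimal sketch (the variable names are mine, and the exact values you see will differ from machine to machine and run to run):

```cpp
#include <cstdio>

int global_variable;                 // static (global) storage

int main() {
    int local_variable;              // stack storage
    // %p prints a pointer value, i.e., a virtual address. On a typical
    // 16GiB x86-64 machine these addresses are far larger than the
    // largest physical address, and they usually change from run to run
    // thanks to address space layout randomization.
    printf("&global_variable == %p\n", (void*) &global_variable);
    printf("&local_variable  == %p\n", (void*) &local_variable);
}
```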
Representing integers
How would you store the following integers in a computer’s memory, starting at address 1024?
- 0
- 1
- 2
- 257
0, 1, and 2 are easy enough:
0
===
0 1 2 1024 1025 1026
+-----+-----+-----+-- --+-----+-----+-----+--
| | | | ... | 0 | | | ...
+-----+-----+-----+-- --+-----+-----+-----+--
1
===
0 1 2 1024 1025 1026
+-----+-----+-----+-- --+-----+-----+-----+--
| | | | ... | 1 | | | ...
+-----+-----+-----+-- --+-----+-----+-----+--
2
===
0 1 2 1024 1025 1026
+-----+-----+-----+-- --+-----+-----+-----+--
| | | | ... | 2 | | | ...
+-----+-----+-----+-- --+-----+-----+-----+--
But 257 just doesn’t fit. A single byte takes on a value between 0 and 255: there’s no such color as 257 on this Lite Brite. To represent complex values in memory, we must use multiple bytes. But how?
A natural way involves place notation, just like we see in decimal numbers. In decimal, the number 257 is written using three digits, the meaning of each determined both by the digit itself and by its place in the overall number:
\[ 257 = 2\times10^2 + 5\times10^1 + 7\times10^0 \]
The computer can use the same trick, but with base 256 instead of base ten.
257
===
0 1 2 1024 1025 1026
+-----+-----+-----+-- --+-----+-----+-----+--
| | | | ... | 1 | 1 | | ...
+-----+-----+-----+-- --+-----+-----+-----+--
\[ 257 = 1\times256^1 + 1\times256^0 \]
Two adjacent bytes can represent numbers between 0 and \(255\times256+255 = 65535 = 2^{16}-1\), inclusive. A number larger than this would take three or more bytes.
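The same arithmetic is easy to carry out in code. This sketch (mine, not the course’s) splits 257 into its base-256 digits and reassembles it:

```cpp
#include <cstdio>

int main() {
    unsigned value = 257;
    // Base-256 place notation: the ones place is value % 256, the
    // 256s place is (value / 256) % 256, and so on.
    unsigned char bytes[2];
    bytes[0] = value % 256;          // 257 % 256 == 1
    bytes[1] = (value / 256) % 256;  // 257 / 256 == 1
    printf("bytes[0] == %d, bytes[1] == %d\n", bytes[0], bytes[1]);

    // Reassembling from the place values recovers the original number.
    unsigned reassembled = bytes[0] + bytes[1] * 256u;
    printf("reassembled == %u\n", reassembled);  // 257
}
```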
In practice, computers are often fastest at dealing with fixed-length numbers, rather than variable-length numbers, and processor internals are organized around a fixed word size. A word is the natural unit of data used by a processor design. In most modern processors, this natural unit is 8 bytes or 64 bits, because that is the smallest power-of-two number of bytes big enough to hold those processors’ memory addresses. Many older processors could access less memory and had correspondingly smaller word sizes, such as 4 bytes (32 bits). If we reserved eight bytes to hold 257, it would look like this:
257
===
0 1 2 1024 1025 1026 1027 1028 1029 1030 1031
+-----+-----+-----+-- --+-----+-----+-----+-----+-----+-----+-----+-----+--
| | | | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ...
+-----+-----+-----+-- --+-----+-----+-----+-----+-----+-----+-----+-----+--
Eight contiguous bytes used in place-value notation can represent any number between 0 and 18,446,744,073,709,551,615 (that is, \(2^{64}-1\)), inclusive.
Foreshadowing. We put the ones place, the least significant byte, on the left. This is the opposite of how decimal numbers are written; they put the most significant digit on the left. Some computers actually store multi-byte integers similarly, with the most significant byte stored in the smallest address! We’ll discuss this more soon.
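You can check which convention your own machine uses with a short experiment (a sketch; the output depends on your hardware):

```cpp
#include <cstdio>
#include <cstddef>
#include <cstdint>

int main() {
    uint64_t x = 257;
    // Treat the 8-byte integer as 8 separate bytes and print them in
    // order of increasing address.
    const unsigned char* p = (const unsigned char*) &x;
    for (size_t i = 0; i != sizeof(x); ++i) {
        printf("%d ", p[i]);
    }
    printf("\n");
    // A little-endian machine (such as x86-64) prints "1 1 0 0 0 0 0 0",
    // matching the diagram above: least significant byte at the lowest
    // address. A big-endian machine would print "0 0 0 0 0 0 1 1".
}
```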
Representing sequences of integers
OK, how should we represent a sequence of integers? One natural way is to put the integers next to each other—to allocate them in memory sequentially, in an array. Here, we put the integers 0, 1, and 257 next to each other, starting at address 1008:
1008 1016 1024
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+--
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+--
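A C++ array makes this layout concrete. This sketch (which assumes sizeof(long) == 8, true on most 64-bit Unix-like systems) prints the three elements’ addresses:

```cpp
#include <cstdio>

int main() {
    // Three integers allocated sequentially, as in the diagram (though
    // at whatever addresses the compiler chooses, not 1008).
    long a[3] = {0, 1, 257};
    for (int i = 0; i != 3; ++i) {
        printf("&a[%d] == %p, a[%d] == %ld\n", i, (void*) &a[i], i, a[i]);
    }
    // Successive addresses differ by exactly sizeof(long) == 8 bytes:
    // element i + 1 begins where element i ends.
}
```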
Access order
Say that you have an array of N integers, and you access each of those integers in order, accessing each integer exactly once. Does the order matter?
Computer memory is random-access memory (RAM), which means that a program can access any byte of memory in any order—it’s not, for example, required to read memory in ascending order by address. But if we run experiments, we can see that even in RAM, different access orders have very different performance characteristics.
Our testaccess program sums up all the integers in an array of N integers, using an access order based on its arguments, and prints the resulting delay. Here are the results of a few experiments on accessing 10,000,000 items in three orders: “up” order (sequential: elements 0, 1, 2, 3, …), “down” order (reverse sequential: elements N−1, N−2, N−3, …), and “random” order (as it sounds).
| order | trial 1 | trial 2 | trial 3 |
|---|---|---|---|
| `-u`, up | 8.9ms | 7.9ms | 7.4ms |
| `-d`, down | 9.2ms | 8.9ms | 10.6ms |
| `-r`, random | 316.8ms | 352.0ms | 360.8ms |
Wow! Down order is just a bit slower than up, but random order seems about 40 times slower. Why?
Random order is defeating many of the internal architectural optimizations that make memory access fast on modern machines. Sequential order, since it’s more predictable, is much easier to optimize.
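The course’s testaccess program isn’t reproduced here, but a benchmark in the same spirit might look like the sketch below (the names and structure are my own, not the actual program’s):

```cpp
#include <cstdio>
#include <cstddef>
#include <algorithm>
#include <chrono>
#include <numeric>
#include <random>
#include <vector>

// Sum the elements of `data`, visiting them in the order given by `order`.
long sum_in_order(const std::vector<long>& data,
                  const std::vector<size_t>& order) {
    long sum = 0;
    for (size_t i : order) {
        sum += data[i];
    }
    return sum;
}

int main() {
    const size_t n = 10'000'000;
    std::vector<long> data(n, 1);

    // "Up" order visits indexes 0, 1, 2, ...; uncomment one line below
    // to test "down" or "random" order instead.
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), size_t(0));
    // std::reverse(order.begin(), order.end());
    // std::shuffle(order.begin(), order.end(), std::mt19937(61));

    auto start = std::chrono::steady_clock::now();
    long sum = sum_in_order(data, order);
    std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    printf("sum %ld in %.1fms\n", sum, elapsed.count());
}
```

Note that the summation loop is identical in all three cases; only the contents of `order` change, so any timing difference comes from the memory system, not the computation.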
Foreshadowing. This part of the lecture is a teaser for Unit 3, Storage, where we cover access patterns and caching, including the processor caches that explain this phenomenon, in much more depth.
Linked list representation
TBA
Linked list access order
TBA
Add
TBA