Ancient CS 61 Content Warning!!!!!1!!!
This is not the current version of the class.
This site was automatically translated from a wiki. The translation may have introduced mistakes (and the content might have been wrong to begin with).

Storage 2: Cache concepts

A cache is a small amount of fast storage used to optimize access to a larger amount of slower storage.

Caching is one of the most fundamental optimization strategies that exists for systems. Many, many system optimizations boil down to “add a cache” or “improve a cache” or “optimize your use of a cache.”

Here are some examples of concrete caches.

The processor cache: a small amount of fast memory on the processor chip, which caches blocks of slower primary memory.
The buffer cache: primary memory, managed by the operating system, which caches blocks of much slower stable storage, such as disks and flash drives.
The stdio cache: memory inside a process, managed by the stdio library, which caches file data so that not every read and write requires a system call.

We use these terms to refer to the contents of a cache. A slot is a region of the cache that can hold data; a block is the corresponding unit of data in the underlying slower storage, identified by its address.

We say that a slot “contains” a block (or, equivalently, that block’s address) if the slot contains a version of the corresponding data. We say a slot is empty if it contains no block, and full if it contains some block. Furthermore we say a full slot is clean if the slot’s data has not been modified in the cache since the block was read, and dirty if it has been modified since it was read.

The special term cache line is used to refer to slots in a processor cache, but you will often hear it used (ambiguously) to refer to blocks. In common usage, “cache line” means “processor cache block or slot.” We will say “processor cache block” (or “slot”) when we need to be unambiguous.

We use these terms to refer to accesses to a cache. A hit is an access the cache can satisfy by itself, quickly; a miss is an access that requires consulting the slower underlying storage.

The fraction of accesses that are hits is called the hit rate. High hit rates are good.

Cache memories are described in CS:APP3e §6.4 and nearby sections, but that presentation is specific to processor caches. Our presentation attempts to be more general.

Cache optimizations

Caches support many conceptually different optimizations. Three of the most basic are eliminating redundant writes, batching, and prefetching.

Eliminating redundant writes

In this strategy, the cache combines multiple writes to a block of storage. Rather than changing the underlying slower storage medium on each write to a cached block, the cache coalesces writes into a single access to slower storage (or fewer accesses to slower storage).

Consider the following C code:

extern int x;
for (x = 0; x < 10000; ++x) {
    /* ... loop body; assume it doesn't access x ... */
}

Without a processor cache, each write to x would cause an expensive write access to primary memory. The processor cache makes this much faster by eliminating these redundant writes; it can make this code behave something like the following:

extern int x;
int register_x;
for (register_x = 0; register_x < 10000; ++register_x) {
    /* ... same loop body ... */
}
x = register_x;

Now all the redundant writes are coalesced into one.

The buffer cache performs a similar function for disk writes. Hard disk and flash technologies have minimum write sizes: disks are physically unable to write one byte at a time. (For instance, hard disks’ reads and writes occur in units called sectors, which are aligned units of 512 bytes each, though the operating system or file system may impose even greater minimum write requirements depending on configuration.) If a program writes one byte at a time, and requires that each write reach the disk (using, for instance, the O_DIRECT flag), the operating system must turn each one-byte write into at least a 512-byte write including all other relevant data in the corresponding sector. This can lead to a write amplification of 512x or more, since each one-byte write expands to at least 512 bytes sent to disk. When allowed to do its job, the buffer cache will absorb many one-byte writes before contacting the slower storage medium (the disk), eliminating redundant writes altogether.
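
Here's a minimal sketch (not from the original notes; the filename is made up) of what this looks like from a program's point of view. With a normal open() (no O_DIRECT), each one-byte write() just updates the buffer cache in memory; the kernel later writes whole sectors back to disk, so the disk sees far fewer, larger writes than the program issued.

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0) {
        return 1;
    }
    for (int i = 0; i != 10000; ++i) {
        char byte = 'A';
        (void) write(fd, &byte, 1);   // lands in the buffer cache, not on the disk
    }
    close(fd);                        // the kernel flushes the dirty cached sectors later
}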

Batching

Batching is an optimization for both writes and reads. It improves throughput, but not latency, by grouping many requests into one: for instance, combining many writes into a single write system call, or combining many reads into a single read system call. This requires that the underlying, slower medium support different sizes of request.

Batching works because the overhead of processing a request often has two components: a cost per request and a cost per unit of data (e.g., per byte). Batching pays the per-request cost less often, so it can optimize a request type that has a high cost per request but a low cost per unit of data. For instance, the read system call has a high cost per request (entering the kernel is expensive) but a very low cost per byte, especially when the requested data is already sitting in the buffer cache.
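
As a concrete sketch (assuming POSIX system calls and a made-up file name), here is the same 4096 bytes read both unbatched and batched. Both versions transfer the same data; only the number of read requests, and therefore the per-request overhead, changes.

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    char buf[4096];
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) {
        return 1;
    }

    // Unbatched: one system call per byte, so 4096 requests' worth of overhead.
    for (int i = 0; i != (int) sizeof(buf); ++i) {
        (void) read(fd, &buf[i], 1);
    }

    // Batched: one system call transfers the whole buffer.
    lseek(fd, 0, SEEK_SET);
    (void) read(fd, buf, sizeof(buf));

    close(fd);
}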

Prefetching

Prefetching is an optimization for reads. It uses past accesses to predict future accesses. Prefetching fills the cache with data that has not been requested yet, but that the cache believes will be accessed soon. If the cache predicts accesses effectively, then most read requests will “hit” the cache and be serviced quickly.
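
On POSIX systems, one way to see prefetching in action is posix_fadvise, which lets a program hint the buffer cache about its future access pattern (the filename below is made up). Many kernels also detect sequential reads on their own and prefetch ahead automatically, which is one reason sequential file reads usually hit the buffer cache.

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) {
        return 1;
    }

    // Hint: this file will be read sequentially, so prefetching ahead is worthwhile.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* ... process the n bytes in buf; later reads often hit prefetched data ... */
    }
    close(fd);
}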

Costs

Each of these strategies has costs as well as benefits.

Eliminating redundant writes can have a correctness cost (or consequence): data in the cache may be more up to date (contain later values) than the underlying slower storage. This matters for volatile caches running on top of stable storage, such as the buffer cache and stdio caches. Primary memory is volatile; if some of your writes are hanging out in volatile caches, they can be lost if the computer loses power at an inopportune time.
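
Here's a sketch of the usual countermeasure (the filename is made up): fflush pushes data out of the stdio cache into the buffer cache, and fsync asks the kernel to write the dirty buffer-cache blocks to the disk itself. The price is giving up some write coalescing, which is why programs reserve these calls for data they really can't afford to lose.

#include <stdio.h>
#include <unistd.h>

int main(void) {
    FILE* f = fopen("log.txt", "w");
    if (!f) {
        return 1;
    }
    fprintf(f, "important record\n");   // sits in the stdio cache (in-process memory)
    fflush(f);                          // now in the buffer cache (kernel memory)
    fsync(fileno(f));                   // now on the disk, safe from power loss
    fclose(f);
}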

Prefetching has an opportunity cost. Prefetching data that isn’t actually needed later can be expensive on its own (it wastes time and energy), and it can crowd out more useful data from the cache.

Coherence

A cache is called coherent if the cache contents always contain the most up-to-date version of the underlying block. To be more precise, a coherent cache is never less up-to-date than the underlying slower storage (it might be more up-to-date, if the cache is eliminating redundant writes).

A cache is called incoherent if the cache contents might be less up-to-date than the underlying slower storage.

The processor cache and the buffer cache are coherent; stdio caches are not. You can kind of see your local git repositories as caches of a master repository hosted elsewhere: considered that way, your local git repositories are incoherent caches.
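
Here's a small sketch (the filename is made up) that makes stdio's incoherence visible: data written through a stdio stream, but not yet flushed, is invisible to a reader that goes through the buffer cache directly.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    FILE* f = fopen("note.txt", "w");
    if (!f) {
        return 1;
    }
    fputs("hello", f);                  // lives in the stdio cache, not yet in the file

    char buf[16];
    int fd = open("note.txt", O_RDONLY);
    if (fd < 0) {
        return 1;
    }
    ssize_t n = read(fd, buf, sizeof(buf));
    printf("before fflush: read %zd bytes\n", n);   // typically 0: the reader sees a stale file

    fflush(f);                          // push the stdio cache into the buffer cache
    lseek(fd, 0, SEEK_SET);
    n = read(fd, buf, sizeof(buf));
    printf("after fflush: read %zd bytes\n", n);    // now 5, the "hello" just written

    close(fd);
    fclose(f);
}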

Eviction

Eviction is the process of dropping some cached blocks from the cache to make room for newly-required blocks. Caches have finite space by definition (“a small amount of faster storage…”).

A generic algorithm for loading data into a cache might look like this:

cache_slot_t accesscache(address_t addr) {
    assert(addr is valid);
    if (addr is already present in some cache slot, i) {
        return slot i;
    } else if (there exists an empty slot i) {
        load the block containing addr into slot i;
        return slot i;
    } else {
        i = evictionpolicy();
        if (slot i is dirty) {
            write contents of slot i to underlying storage at the corresponding address;
        }
        load the block containing addr into slot i;
        return slot i;
    }
}

The goal of evictionpolicy() is to pick the “best” data to evict, meaning the data whose absence from the cache will cause the fewest problems in the future. This is difficult, or even impossible, to do perfectly without knowing what specific accesses will happen! A bad eviction policy can cause real problems; for instance, it can cause a problem called “thrashing,” where the cache continually evicts data that is needed again soon afterward, and so must be loaded right back. (For example, repeatedly looping over N+1 distinct blocks with an N-slot LRU cache makes every access a miss.) Thrashing access patterns have low hit rates and enormously bad performance.

Here are some interesting possible eviction policies (implementations of the evictionpolicy() function):

FIFO (first-in first-out): evict the slot whose block was loaded longest ago, regardless of how recently it was used.
LRU (least recently used): evict the slot whose block was accessed longest ago.
Bélády's optimal policy: evict the slot whose block will next be needed furthest in the future. This requires knowing the future, so it can't actually be implemented; more on it below.
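
As a minimal sketch (not the notes' code), here is what an LRU evictionpolicy() might look like for the pseudocode above, assuming the cache records a logical timestamp for each slot every time that slot is accessed. Replacing the last-access time with the time the slot was loaded turns the same code into FIFO.

#define NSLOTS 4

unsigned long last_access[NSLOTS];   // the cache sets last_access[i] = ++now on every access to slot i
unsigned long now;                   // logical clock (maintained by the cache code, not shown)

int evictionpolicy(void) {
    // Pick the slot whose block was accessed longest ago.
    int victim = 0;
    for (int i = 1; i != NSLOTS; ++i) {
        if (last_access[i] < last_access[victim]) {
            victim = i;
        }
    }
    return victim;
}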

Examples

Consider the following reference string (a sequence of addresses that represent accesses): 1 2 3 4 1 2 5 1 2 3 4 5.

How do the FIFO and LRU eviction policies work on this reference string, in a cache with 3 slots? We'll write this in the following tabular style, where rows represent slots and columns represent accesses. Misses are marked with an asterisk (*), hits are written without one, and a dot means that access left the slot's contents unchanged.

                  1    2    3    4    1    2    5    1    2    3    4    5
FIFO    slot 1    1*   .    .    4*   .    .    5*   .    .    .    .    5
        slot 2    .    2*   .    .    1*   .    .    1    .    3*   .    .
        slot 3    .    .    3*   .    .    2*   .    .    2    .    4*   .

                  1    2    3    4    1    2    5    1    2    3    4    5
LRU     slot 1    1*   .    .    4*   .    .    5*   .    .    3*   .    .
        slot 2    .    2*   .    .    1*   .    .    1    .    .    4*   .
        slot 3    .    .    3*   .    .    2*   .    .    2    .    .    5*

The hit rates are 3/12 for FIFO and 2/12 for LRU. Neither is doing that great, but this is a short reference string, starting from a cold cache (a cache whose contents aren’t useful—here, the contents are all empty).

If we make the cache bigger, we should expect the hit rates to go up. Do they?

                  1    2    3    4    1    2    5    1    2    3    4    5
FIFO    slot 1    1*   .    .    .    1    .    5*   .    .    .    4*   .
        slot 2    .    2*   .    .    .    2    .    1*   .    .    .    5*
        slot 3    .    .    3*   .    .    .    .    .    2*   .    .    .
        slot 4    .    .    .    4*   .    .    .    .    .    3*   .    .

                  1    2    3    4    1    2    5    1    2    3    4    5
LRU     slot 1    1*   .    .    .    1    .    .    1    .    .    .    5*
        slot 2    .    2*   .    .    .    2    .    .    2    .    .    .
        slot 3    .    .    3*   .    .    .    5*   .    .    .    4*   .
        slot 4    .    .    .    4*   .    .    .    .    .    3*   .    .

The LRU hit rate did indeed go up, to 4/12, but the FIFO hit rate went down, to 2/12! More cache space, worse performance.

[Photo: László Bélády]

This is called Bélády's anomaly: for some eviction policies and reference strings, increasing the cache size can actually lower the hit rate.

Not all eviction policies suffer from Bélády's anomaly. LRU does not, for example, and the reason why is not too complicated. If two empty LRU caches with N and N+1 slots process the same reference string, we can show by induction that at every step, every block in the N-slot cache is also in the (N+1)-slot cache! Here's the key step: if at some point the N-slot cache evicts a block b, then b is the least recently used block in that cache; at the same point, the (N+1)-slot cache either evicts b too or evicts an even less recently used block, and that block cannot be present in the N-slot cache, because every other block in the N-slot cache was used more recently than b.

This doesn’t mean that LRU is always best, though. There are some access patterns for which LRU doesn’t make immediate sense—and common ones! Think sequential access. If a file is read sequentially, then the most recently accessed data (the block you just read) is a good choice to evict, because it won’t be read again.

There does exist an optimal eviction policy, too, also thanks to our friend Bélády. Unfortunately, it's unimplementable: the provably optimal policy is to evict the block that will next be needed furthest in the future, which requires knowing the future. Here, for example, is Bélády's optimal eviction policy running on our reference string and a 4-slot cache:

                  1    2    3    4    1    2    5    1    2    3    4    5
Optimal slot 1    1*   .    .    .    1    .    .    1    .    .    4*   .
        slot 2    .    2*   .    .    .    2    .    .    2    .    .    .
        slot 3    .    .    3*   .    .    .    .    .    .    3    .    .
        slot 4    .    .    .    4*   .    .    5*   .    .    .    .    5

6/12 hit rate!
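
Since the whole reference string is known in advance here, the optimal policy can actually be simulated. Here's a minimal sketch (not from the notes) that evicts the cached block whose next use is furthest in the future; it reproduces the 6/12 hit rate above.

#include <stdio.h>

#define NSLOTS 4
#define NREFS 12

static const int refs[NREFS] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};

// Return the index of the slot to evict when handling refs[pos].
int optimal_victim(const int slots[NSLOTS], int pos) {
    int victim = 0, victim_next = -1;
    for (int i = 0; i != NSLOTS; ++i) {
        int next = NREFS;                  // NREFS means "never used again"
        for (int j = pos + 1; j != NREFS; ++j) {
            if (refs[j] == slots[i]) {
                next = j;
                break;
            }
        }
        if (next > victim_next) {          // furthest next use wins
            victim = i;
            victim_next = next;
        }
    }
    return victim;
}

int main(void) {
    int slots[NSLOTS];
    int used = 0, hits = 0;
    for (int pos = 0; pos != NREFS; ++pos) {
        int found = 0;
        for (int i = 0; i != used; ++i) {
            if (slots[i] == refs[pos]) {
                found = 1;
            }
        }
        if (found) {
            ++hits;                        // hit: block already cached
        } else if (used < NSLOTS) {
            slots[used++] = refs[pos];     // miss with an empty slot available
        } else {
            slots[optimal_victim(slots, pos)] = refs[pos];   // miss: evict optimally
        }
    }
    printf("%d/%d hits\n", hits, NREFS);   // prints "6/12 hits" for this reference string
}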