Storage 2: Cache concepts
A cache is a small amount of fast storage used to optimize access to a larger amount of slower storage.
Caching is one of the most fundamental optimization strategies that exists for systems. Many, many system optimizations boil down to “add a cache” or “improve a cache” or “optimize your use of a cache.”
Here are some examples of concrete caches.
- Buffer cache: The cache used by the operating system to optimize access to stable storage media, such as hard disks and solid-state drives. The faster storage medium is primary memory; the slower storage medium is the disk. The operating system services application system calls to file systems, such as `read` and `write`, using the buffer cache. Some specialized file accesses, such as `O_SYNC` writes and `O_DIRECT` reads, bypass the buffer cache (they force the access to go to the disk each time). This cache is coherent.
- Processor cache: The cache used by the processor to optimize access to primary memory. The faster storage media are the processor caches (level-1 cache, level-2 cache, level-3 cache); the slower storage medium is primary memory. This cache is coherent.
- Stdio cache: A cache used by application libraries, such as the C standard library, to optimize access to the buffer cache. The faster storage medium is one or more blocks of primary memory in the application; the slower storage medium is the buffer cache. Although the buffer cache is also in primary memory, accessing the buffer cache requires system calls, which are relatively slow, so stdio caches speed up many data access patterns (see the sketch below). This kind of cache is incoherent.
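To see why the stdio cache matters, consider this sketch (the file name is made up): both loops read a file a byte at a time, but the `fgetc` loop goes through the stdio cache, which turns thousands of byte reads into a few block-sized `read` system calls, while the second loop pays for a system call on every byte.

```c
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(void) {
    // Byte-at-a-time reads through stdio: the stdio cache fills itself
    // with a block per `read` system call, so most `fgetc` calls are
    // cache hits that never enter the kernel.
    FILE* f = fopen("data.bin", "r");     // "data.bin" is a made-up name
    size_t nstdio = 0;
    while (fgetc(f) != EOF) {
        ++nstdio;
    }
    fclose(f);

    // Byte-at-a-time reads with the `read` system call: every byte
    // costs a full system call, so this loop is far slower.
    int fd = open("data.bin", O_RDONLY);
    char c;
    size_t nsys = 0;
    while (read(fd, &c, 1) == 1) {
        ++nsys;
    }
    close(fd);

    printf("read %zu bytes via stdio, %zu bytes via read()\n", nstdio, nsys);
}
```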
We use these terms to refer to the contents of a cache.
- A slot is a space in the cache that can hold data.
- A block is a unit of data on slow storage. Each block has an address on the slow storage.
We say that a slot “contains” a block (or, equivalently, that block’s address) if the slot contains a version of the corresponding data. We say a slot is empty if it contains no block, and full if it contains some block. Furthermore we say a full slot is clean if the slot’s data has not been modified in the cache since the block was read, and dirty if it has been modified since it was read.
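A cache implementation might represent this bookkeeping with a structure like the following sketch (the field names and the 4096-byte block size are illustrative, not taken from any particular cache):

```c
#include <stdbool.h>
#include <stddef.h>

#define BLOCK_SIZE 4096          // illustrative block size

// One cache slot: either empty, or holding a version of the block
// whose slow-storage address is `addr`.
typedef struct {
    bool full;                   // does this slot contain a block?
    bool dirty;                  // modified since the block was read?
    size_t addr;                 // address of the block on slow storage
    unsigned char data[BLOCK_SIZE];  // cached copy of the block's data
} cache_slot_t;
```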
The special term cache line is used to refer to slots in a processor cache, but you will often hear it used (ambiguously) to refer to blocks. In common usage, “cache line” means “processor cache block or slot.” We will say “processor cache block” (or “slot”) when we need to be unambiguous.
We use these terms to refer to accesses to a cache.
- A hit is an access to an address that some slot contains. Hits are good!
- A miss is an access to an address that no slot contains. Misses cause expensive accesses to the underlying slower storage, so they are bad!
The fraction of accesses that are hits is called the hit rate. High hit rates are good.
Cache memories are described in CS:APP3e §6.4 et al., but that presentation is specific to processor caches. Our presentation attempts to be more general.
Cache optimizations
Caches support many conceptually different optimizations. Three of the most basic are eliminating redundant writes, batching, and prefetching.
Eliminating redundant writes
In this strategy, the cache combines multiple writes to a block of storage. Rather than changing the underlying slower storage medium on each write to a cached block, the cache coalesces writes into a single access to slower storage (or fewer accesses to slower storage).
Consider the following C code:
```c
extern int x;
for (x = 0; x < 10000; ++x) {
    ... [assume this doesn’t access x] ...
}
```
Without a processor cache, each write to `x` would cause an expensive write access to primary memory. The processor cache makes this much faster by eliminating these redundant writes; it can make this code behave something like the following:
```c
extern int x;
int register_x;
for (register_x = 0; register_x < 10000; ++register_x) {
    ...
}
x = register_x;
```
Now all the redundant writes are coalesced into one.
The buffer cache performs a similar function for disk writes. Hard disk
and flash technologies have minimum write sizes: disks are physically
unable to write one byte at a time. (For instance, hard disks’ reads and
writes occur in units called sectors, which are aligned units of 512
bytes each, though the operating system or file system may impose even
greater minimum write requirements depending on configuration.) If a
program writes one byte at a time, and requires that each write reach
the disk (using, for instance, the `O_DIRECT` flag), the operating
system must turn each one-byte write into at least a 512-byte write
including all other relevant data in the corresponding sector. This can
lead to a write amplification of 512x or more, since each one-byte
write expands to at least 512 bytes sent to disk. When allowed to do its
job, the buffer cache will absorb many one-byte writes before contacting
the slower storage medium (the disk), eliminating redundant writes
altogether.
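To see the buffer cache doing this job, compare the following two loops (a rough sketch; the file names are made up, and it uses `fsync` rather than `O_DIRECT`, since `O_DIRECT` imposes alignment requirements). The first loop lets the buffer cache absorb 10,000 one-byte writes; the second forces each byte to the device, paying the write amplification described above.

```c
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    // One-byte writes absorbed by the buffer cache: the kernel coalesces
    // them and writes whole blocks to disk later, in the background.
    int fd1 = open("cached.bin", O_WRONLY | O_CREAT | O_TRUNC, 0666);
    for (int i = 0; i < 10000; ++i) {
        char c = 'A';
        write(fd1, &c, 1);
    }
    close(fd1);

    // One-byte writes forced to disk: fsync after each write makes the
    // kernel push the dirty block to the device every time, so each byte
    // costs at least a full sector's worth of disk traffic.
    int fd2 = open("synced.bin", O_WRONLY | O_CREAT | O_TRUNC, 0666);
    for (int i = 0; i < 10000; ++i) {
        char c = 'A';
        write(fd2, &c, 1);
        fsync(fd2);
    }
    close(fd2);
}
```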
Batching
Batching is an optimization for both writes and reads. It improves throughput, but not latency, by grouping many requests into one: for instance, combining many writes into a single write system call, or combining many reads into a single read system call. This requires that the underlying, slower medium support different sizes of request.
Batching works because often the overhead of processing a request has
two components: a cost per request and a cost per unit of data
(e.g., per byte). Batching can optimize a request type when that request
type has high cost per request but low cost per unit of data. For
instance, the `read`
system call has high cost per request
but—especially when the buffer cache is full—very low cost per byte.
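As a sketch of what batching buys you (the buffer size and file names are arbitrary choices, not anything mandated), the following code writes the same 10,000 bytes twice: once with one `write` system call per byte, and once by accumulating bytes in a memory buffer and issuing a few large `write` calls.

```c
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    // Unbatched: 10,000 write system calls, each transferring one byte.
    int fd1 = open("unbatched.bin", O_WRONLY | O_CREAT | O_TRUNC, 0666);
    for (int i = 0; i < 10000; ++i) {
        char c = 'x';
        write(fd1, &c, 1);
    }
    close(fd1);

    // Batched: collect the same bytes in a memory buffer and transfer
    // them with a handful of larger write system calls.
    int fd2 = open("batched.bin", O_WRONLY | O_CREAT | O_TRUNC, 0666);
    char buf[4096];
    size_t used = 0;
    for (int i = 0; i < 10000; ++i) {
        buf[used++] = 'x';
        if (used == sizeof buf) {       // buffer full: flush one batch
            write(fd2, buf, used);
            used = 0;
        }
    }
    if (used > 0) {                     // flush the final partial batch
        write(fd2, buf, used);
    }
    close(fd2);
}
```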
Prefetching
Prefetching is an optimization for reads. It uses past accesses to predict future accesses. Prefetching fills the cache with data that has not been requested yet, but that the cache believes will be accessed soon. If the cache predicts accesses effectively, then most read requests will “hit” the cache and be serviced quickly.
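The buffer cache prefetches on its own when it detects sequential reads, and a program can also request prefetching explicitly. Here is a minimal sketch (the file name is made up) using `posix_fadvise` to hint that a file will be read sequentially, which encourages the kernel to read ahead more aggressively.

```c
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("big.log", O_RDONLY);   // "big.log" is a made-up name

    // Hint that we will read the whole file sequentially; the buffer
    // cache responds with more aggressive read-ahead (prefetching).
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    // Many of these reads now hit blocks the kernel prefetched before
    // they were requested.
    char buf[4096];
    while (read(fd, buf, sizeof buf) > 0) {
    }
    close(fd);
}
```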
Costs
Each of these strategies has costs as well as benefits.
Eliminating redundant writes can have a correctness cost (or consequence): data in the cache may be more up to date (contain later values) than the underlying slower storage. This matters for volatile caches running on top of stable storage, such as the buffer cache and stdio caches. Primary memory is volatile; if some of your writes are hanging out in volatile caches, they can be lost if the computer loses power at an inopportune time.
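When a write must actually reach stable storage, the program has to flush the relevant caches explicitly. A minimal sketch (the file name is made up): `fflush` pushes the data from the stdio cache into the buffer cache, and `fsync` pushes it from the buffer cache to the disk.

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    FILE* f = fopen("journal.txt", "a");   // made-up file name
    fprintf(f, "important record\n");      // sits in the stdio cache

    fflush(f);                 // stdio cache -> buffer cache
    fsync(fileno(f));          // buffer cache -> stable storage (disk)

    fclose(f);
}
```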
Prefetching has an opportunity cost. Prefetching data that isn’t actually needed later can be expensive on its own (it wastes time and energy), and it can crowd out more useful data from the cache.
Coherence
A cache is called coherent if the cache contents always contain the most up-to-date version of the underlying block. To be more precise, a coherent cache is never less up-to-date than the underlying slower storage (it might be more up-to-date, if the cache is eliminating redundant writes).
A cache is called incoherent if the cache contents might be less up-to-date than the underlying slower storage.
The processor cache and the buffer cache are coherent; stdio caches are not. You can kind of see your local git repositories as caches of a master repository hosted elsewhere: considered that way, your local git repositories are incoherent caches.
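Here is a sketch of stdio incoherence in action (the file name and contents are made up, and the exact behavior depends on the stdio buffer size). Reading one byte through a `FILE*` loads a whole block into the stdio cache; a later change made through a separate file descriptor is then invisible through the `FILE*`, whose cache is now less up-to-date than the underlying file.

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    // Create a small file with known contents.
    FILE* w = fopen("coherence.txt", "w");
    fputs("AB", w);
    fclose(w);

    // Reading one byte through stdio pulls a whole block of the file
    // (both 'A' and 'B') into the stdio cache.
    FILE* r = fopen("coherence.txt", "r");
    int first = fgetc(r);                  // 'A'; fills the stdio cache

    // Modify the second byte directly, via a separate file descriptor.
    int fd = open("coherence.txt", O_WRONLY);
    pwrite(fd, "Z", 1, 1);                 // file now contains "AZ"
    close(fd);

    // The stdio cache still holds the old block, so we read the stale
    // 'B', not the 'Z' that is actually in the underlying file.
    int second = fgetc(r);
    printf("%c%c\n", first, second);       // typically prints "AB"
    fclose(r);
}
```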
Eviction
Eviction is the process of dropping some cached blocks from the cache to make room for newly-required blocks. Caches have finite space by definition (“a small amount of faster storage…”).
A generic algorithm for loading data into a cache might look like this:
```
cache_slot_t accesscache(address_t addr) {
    assert(addr is valid);
    if (addr is already present in some cache slot, i) {
        return slot i;
    } else if (there exists an empty slot i) {
        load the block containing addr into slot i;
        return slot i;
    } else {
        i = evictionpolicy();
        if (slot i is dirty) {
            write contents of slot i to underlying storage at the corresponding address;
        }
        load the block containing addr into slot i;
        return slot i;
    }
}
```
The goal of the `evictionpolicy`
is to pick the “best” data to evict,
meaning the data whose absence from the cache will cause the least
problems in the future. This is difficult or even impossible without
knowledge of what specific accesses will happen! A bad eviction policy
can cause real problems; for instance, it can cause a problem called
“thrashing,” where the cache continually evicts data that is required
quickly thereafter, and so loaded back. Thrashing access patterns have
low hit rates, and cause enormously bad performance.
Here are some interesting possible eviction policies (implementations of
the `evictionpolicy()`
function):
- LRU (Least Recently Used). Selects the slot whose data was accessed furthest in the past. To implement this policy, each slot needs to remember not only its contents (its block) and its address, but also the most recent time the slot was accessed (or at least an approximation thereof). We would change the `accesscache` function to update this time on every access. (A sketch of this policy appears after this list.)
- LFU (Least Frequently Used). Selects the slot whose data was accessed least often in the past. To implement this policy, each slot needs an access counter. The access counter would generally apply to the slot, not the block: when a block is loaded into the cache, its access counter needs to be reset. Pure LFU is not a good policy, since the most recently loaded block has a small access count, but combinations of LFU and LRU can be useful.
- Random. Selects a random slot to evict. This policy isn’t terrible, but variants can be far better, such as “power of two choices”: pick two random slots and select the worst of them (e.g., the least recently used).
- FIFO (First In First Out). Selects the slot whose data was loaded furthest in the past. To implement this policy, you can either track slot load times or simply choose slots in round-robin order.
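Here is a sketch of how the LRU and FIFO policies could be implemented to plug into the `evictionpolicy()` call in the generic algorithm above. The metadata arrays (`last_access`, `load_time`) and the cache size `NSLOTS` are illustrative names; a real cache would keep this information alongside each slot, updated by `accesscache`.

```c
#include <stddef.h>

#define NSLOTS 64                 // illustrative cache size

// Illustrative per-slot metadata: `last_access[i]` is the logical time
// slot i was last accessed, `load_time[i]` is the logical time slot i's
// block was loaded.
static unsigned long last_access[NSLOTS];
static unsigned long load_time[NSLOTS];

// LRU: evict the slot accessed furthest in the past.
size_t evictionpolicy_lru(void) {
    size_t victim = 0;
    for (size_t i = 1; i < NSLOTS; ++i) {
        if (last_access[i] < last_access[victim]) {
            victim = i;
        }
    }
    return victim;
}

// FIFO: evict the slot whose block was loaded furthest in the past.
// (Equivalently, hand out slots in round-robin order.)
size_t evictionpolicy_fifo(void) {
    size_t victim = 0;
    for (size_t i = 1; i < NSLOTS; ++i) {
        if (load_time[i] < load_time[victim]) {
            victim = i;
        }
    }
    return victim;
}
```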
Examples
Consider the following reference string (a sequence of addresses that represent accesses): 1 2 3 4 1 2 5 1 2 3 4 5.
How do the FIFO and LRU eviction policies work on this reference string, in a cache with 3 slots? We’ll write this in the following tabular style, where rows represent slots and columns represent accesses. I’ll mark misses with an asterisk (e.g., 1*) and write hits in normal type.
|  | 1 | 2 | 3 | 4 | 1 | 2 | 5 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FIFO (slot 1) | 1* |  |  | 4* |  |  | 5* |  |  |  |  | 5 |
| (slot 2) |  | 2* |  |  | 1* |  |  | 1 |  | 3* |  |  |
| (slot 3) |  |  | 3* |  |  | 2* |  |  | 2 |  | 4* |  |
|  | 1 | 2 | 3 | 4 | 1 | 2 | 5 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LRU (slot 1) | 1* |  |  | 4* |  |  | 5* |  |  | 3* |  |  |
| (slot 2) |  | 2* |  |  | 1* |  |  | 1 |  |  | 4* |  |
| (slot 3) |  |  | 3* |  |  | 2* |  |  | 2 |  |  | 5* |
The hit rates are 3/12 for FIFO and 2/12 for LRU. Neither is doing that great, but this is a short reference string, starting from a cold cache (a cache whose contents aren’t useful—here, the contents are all empty).
If we make the cache bigger, we should expect the hit rates to go up. Do they?
|  | 1 | 2 | 3 | 4 | 1 | 2 | 5 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FIFO (slot 1) | 1* |  |  |  | 1 |  | 5* |  |  |  | 4* |  |
| (slot 2) |  | 2* |  |  |  | 2 |  | 1* |  |  |  | 5* |
| (slot 3) |  |  | 3* |  |  |  |  |  | 2* |  |  |  |
| (slot 4) |  |  |  | 4* |  |  |  |  |  | 3* |  |  |
|  | 1 | 2 | 3 | 4 | 1 | 2 | 5 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LRU (slot 1) | 1* |  |  |  | 1 |  |  | 1 |  |  |  | 5* |
| (slot 2) |  | 2* |  |  |  | 2 |  |  | 2 |  |  |  |
| (slot 3) |  |  | 3* |  |  |  | 5* |  |  |  | 4* |  |
| (slot 4) |  |  |  | 4* |  |  |  |  |  | 3* |  |  |
The LRU hit rate did indeed go up, to 4/12, but the FIFO hit rate went down, to 2/12! More cache space, worse performance.
(Photo: László Bélády)
This is called Bélády’s anomaly: there are some cache eviction policies, reference strings, and cache sizes where increasing the cache size can lower the hit rate.
Not all eviction policies suffer from Bélády’s anomaly. LRU does not, for example, and the reason why is not too complicated. If two empty LRU caches with sizes N and N+1 process the same reference string, we can show by induction that at every step, every block in the N-slot cache is also in the N+1-slot cache. The inductive step hinges on eviction: if at some step the N-slot cache evicts a block b, then the N+1-slot cache either has no need to evict (it has a spare slot), or evicts b itself, or evicts an even less recently used block; and that less recently used block cannot be present in the N-slot cache, because every block the N-slot cache holds was used more recently than b (which is why it chose to evict b).
This doesn’t mean that LRU is always best, though. There are some access patterns for which LRU doesn’t make immediate sense—and common ones! Think sequential access. If a file is read sequentially, then the most recently accessed data (the block you just read) is a good choice to evict, because it won’t be read again.
There does exist an optimal eviction policy, too, also thanks to our friend Bélády. Unfortunately, it’s unimplementable: the provably-optimal policy is to evict the block that will be needed furthest in the future, which requires knowing the future. Here, for example, is Bélády’s optimal eviction policy running on our reference string and a 4-slot cache:
|  | 1 | 2 | 3 | 4 | 1 | 2 | 5 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Optimal (slot 1) | 1* |  |  |  | 1 |  |  | 1 |  |  | 4* |  |
| (slot 2) |  | 2* |  |  |  | 2 |  |  | 2 |  |  |  |
| (slot 3) |  |  | 3* |  |  |  |  |  |  | 3 |  |  |
| (slot 4) |  |  |  | 4* |  |  | 5* |  |  |  |  | 5 |
6/12 hit rate!
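If you want to check these traces, or try other reference strings and cache sizes, here is a small simulator sketch for the FIFO and LRU policies (written for this handout, so take it as an illustration rather than a reference implementation):

```c
#include <stdio.h>
#include <stdbool.h>

// Simulate a fully-associative cache with `nslots` slots (up to 16) on
// `refs[0..n-1]` and return the number of hits. If `lru` is true, evict
// the least recently used block; otherwise evict first-in first-out.
static int simulate(const int* refs, int n, int nslots, bool lru) {
    int block[16], stamp[16];   // block in each slot, plus its
                                // last-access (LRU) or load (FIFO) time
    for (int i = 0; i < nslots; ++i) {
        block[i] = -1;          // -1 marks an empty slot
    }
    int hits = 0;
    for (int t = 0; t < n; ++t) {
        int victim = 0;
        bool hit = false;
        for (int i = 0; i < nslots; ++i) {
            if (block[i] == refs[t]) {  // hit: block already cached
                hit = true;
                victim = i;
                break;
            }
            if (block[i] == -1) {       // prefer an empty slot
                victim = i;
                break;
            }
            if (stamp[i] < stamp[victim]) {
                victim = i;             // otherwise track the oldest stamp
            }
        }
        if (hit) {
            ++hits;
            if (lru) {
                stamp[victim] = t;      // LRU refreshes the stamp on every access
            }
        } else {
            block[victim] = refs[t];    // miss: load (evicting if full)
            stamp[victim] = t;          // record load/access time
        }
    }
    return hits;
}

int main(void) {
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    int n = 12;
    for (int nslots = 3; nslots <= 4; ++nslots) {
        printf("%d slots: FIFO %d/12 hits, LRU %d/12 hits\n", nslots,
               simulate(refs, n, nslots, false),
               simulate(refs, n, nslots, true));
    }
}
```

Run on the reference string above, this should reproduce the hit rates in the tables, including Bélády’s anomaly for FIFO.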