Storage 3: Stdio cache

Request costs

A simple conceptual formula lets us reason about the cost of an I/O request. We write:

C = NU + R

where

C is the overall cost of the request;
R is per-request cost: the cost of any request, no matter how much data is involved;
U is per-unit cost: the cost per unit of data requested (for example, the cost per byte read or written); and
N is the number of units in the request.

Different storage technologies have different balances between per-request and per-unit costs. The costs can be measured in seconds, to evaluate a request’s latency, or money, to evaluate its expense.

High-throughput, high-latency requests, such as 101Net or Sneakernet, have high per-request costs but very low per-unit costs.

Caches are useful when they lower the request costs seen by their users.

Cache benefits: How caches improve performance

We can break down the ways I/O caches improve application performance further into some basic components. Caches support many conceptually different optimizations. The basic ones are prefetching, batching, and eliminating redundant writes.

Prefetching

Prefetching is the primary optimization caches offer read requests.

To prefetch, a cache loads data from slower storage before the application needs it. Prefetching fills the cache with data that has not been requested yet, but that the cache believes will be accessed soon. If the cache predicts accesses effectively, then most read requests will hit and be serviced quickly.

For example, when an application opens a disk file for reading, the operating system might prefetch the file, or part of the file, into the buffer cache. This essentially predicts that the application will soon read the file—a pretty good assumption!

A cache that cannot prefetch offers little benefit for read requests. If prefetching doesn’t work, then every access requires the cache to load data from slower storage; this can only be slower than having the application access slower storage directly, bypassing the cache.

Batching

Batching is an optimization for both writes and reads. Batching improves throughput, but not latency.

To batch, a cache changes a large number of small requests into a smaller number of larger requests. That is, the cache absorbs many requests with small N (small numbers of units per request), and emits fewer requests with larger N. For example, a stdio cache might combine many small write requests into a single write system call with more bytes.

Batching is most effective when per-request cost R is high, because it makes fewer requests. (If you’re driving down the 101, you might as well take two disks.) It requires that the slower storage medium supports different sizes of request.

Write coalescing

Write coalescing, also called eliminating redundant writes, is an optimization for write requests.

If the same address is written multiple times, only the last-written value counts. To coalesce writes, the cache delays sending a write request to slower storage on the assumption that the data will be overwritten again.

Write coalescing is related to batching, but it works even if per-request cost is low, because it also reduces per-unit cost by writing fewer units.

Write coalescing is extremely useful for processor caches. Consider the following C code:

extern int x;
for (x = 0; x < 10000; ++x) {
    ... [assume this doesn’t access x] ...
}

Without a processor cache, each write to x would cause an expensive write access to primary memory. The processor cache makes this much faster by eliminating these redundant writes; it can make this code behave something like the following:

extern int x;
int register_x;
for (register_x = 0; register_x < 10000; ++register_x) {
    ...
}
x = register_x;

Now all the redundant writes are coalesced into one.

Write coalescing is also extremely useful for disks because, as we saw in Storage 1, disks’ minimum write sizes can cause redundant writes. The operating system’s buffer cache eliminates most of these redundant writes.

Costs

Each of these strategies has costs as well as benefits.

Write coalescing can have a correctness cost (or consequence): data in the cache may be more up to date (contain later values) than the underlying slower storage. This matters for volatile caches running on top of stable storage, such as the buffer cache and stdio caches. Primary memory is volatile; if some of your writes are hanging out in volatile caches, they can be lost if the computer loses power at an inopportune time.

Prefetching has an opportunity cost. Prefetching data that isn’t actually needed later can be expensive on its own (it wastes time and energy), and it can crowd out more useful data from the cache.

Cache coherence

Caches aim to give their users better performance without changing their users’ view of underlying storage. Caches aren’t allowed to make up data: a read cache should never return bogus data not present in the underlying storage, and a write cache should never scribble garbage into the underlying storage. But some caches do change what their users can see. This involves a property called coherence.

A coherent cache always provides its read users with the most up-to-date view of underlying storage. A coherent cache provides the same semantics as accessing the underlying storage directly.

An incoherent cache does not provide its read users with the most up-to-date view of underlying storage. In particular, an incoherent cache can return stale data that a concurrent write has changed.

The coherence.cc program in cs61-lectures/storage3 demonstrates that the stdio cache is incoherent: stdio does not update its cache when the underlying file is changed.

The buffer cache is coherent, however. Every access to the disk is made through the operating system kernel, through system calls, and an important part of the kernel’s job is to ensure that the buffer cache is kept coherent—that write system calls made by one application are visible to all other applications after they are made.

The processor cache is also coherent (up to the limitations of the machine’s memory model, an advanced topic).

The stdio application interface does give applications some control over its coherence. Applications can request that stdio not cache a given file with setvbuf(3). Applications can also mark a file’s cache as invalid with fflush(3), causing written data to be flushed to the operating system with a system call and causing prefetched data to be read again.

Random-access files, streams, and file positions

One of the Unix operating system’s big ideas was the unification of two different kinds of I/O into a single “file descriptor” abstraction. These kinds of I/O are called random-access files and streams. Metaphorically, a random-access file is like a book and a stream is like time itself.

A random-access file has a finite length, called its size. It is possible to efficiently skip around inside the file to read or write at any position, so random-access files are seekable. It is also often possible to map the file into memory (see below). A random-access file is like an array of bytes.

A stream is a possibly-infinite sequence of bytes. It is not possible to skip around inside a stream. A stream is more like a queue of bytes: a stream writer can only add more bytes to the end of the stream, and a stream reader can only read and remove bytes from the beginning of the stream. Streams are not seekable or mappable. The output of a program is generally a stream. (We used yes as an example.)

Some previous operating systems offered fundamentally different abstractions for different kinds of file and stream. Unix (like some other OSes) observed that many operations on random-access files and streams can have essentially similar semantics if random-access files are given an explicit feature called the file position. This is a file offset, maintained by the kernel, that defines the next byte to be read or written.

When a random-access file is opened, its file position is set to 0, the beginning of a file. A read system call that reads N bytes advances the file position by N, and similarly for write. An explicit seek system call is used to change the file position. When the file position reaches the end of the file, read will return 0.

Streams don’t have file positions—the seek system call will return -1 on streams. Instead, a read system call consumes the next N bytes from the stream, and a write system call adds N bytes to the end of the stream. When the read system call reaches the end of the stream (which only happens after the write end is closed), it will return 0.

What’s important here is that a sequence of read system calls will return the same results given a random-access file or a stream with the same contents. And similarly, a sequence of write system calls will produce the same results whether they are directed to a random-access file or a stream. Programs that simply read their inputs sequentially and write their outputs sequentially—which are most programs!—work identically on files and streams. This makes it easy to build very flexible programs that work in all kinds of situations.

(But special system calls that work on specific file offsets are useful for random-access files, and now they exist: see pread(2) and pwrite(2).)

File descriptor notes

Read caches

The r* programs in cs61-lectures/storage3 demonstrate different mechanisms for reading files. We looked at several:

r01-directsector reads the file (1) 512 bytes at a time (this unit is called a sector), (2) using system calls (read), (3) synchronously from the disk. The O_DIRECT flag disables the buffer cache, forcing synchronous reads; when O_DIRECT is on, the application can only read in units of 512 bytes.

On my laptop, r01-directsector can read about 7,000,000 bytes per second.
r02-sector reads the file (1) 512 bytes at a time, (2) using system calls, (3) asynchronously (allowing the operating system to use the buffer cache).

On my laptop, r02-sector can read somewhere around 360,000,000 bytes per second (with a lot of variability).
r07-stdioblock reads the file (1) 512 bytes at a time, (2) using stdio, (3) asynchronously.

On my laptop, r07-stdioblock can read somewhere around 480,000,000 bytes per second (with a lot of variability).
r03-byte reads the file (1) one byte at a time, (2) using system calls, (3) asynchronously.

On my laptop, r03-byte can read around 2,400,000 bytes per second.
r05-stdiobyte reads the file (1) one byte at a time, (2) using stdio, (3) asynchronously.

On my laptop, r05-stdiobyte can read around 260,000,000 bytes per second.

These numbers give us a lot of information about relative costs of different operations. Reading direct from disk is clearly much faster than writing direct to disk. (Of course this might also be impacted by the virtual machine on which we run.) And this also gives us some feeling for the cost of system calls: reading a byte at a time is about 150x slower than reading 512 bytes at a time. Assuming we knew the cost of the rest of the r* programs (i.e., the costs of running the loop and occasionally printing statistics, which are about the same for all programs), then this information would let us deduce the R and U components of system call costs.

Reverse reading

Some r* programs feature different access patterns. We looked at:

r08-revbyte reads the file (0) in reverse, (1) one byte at a time, (2) using system calls, (3) asynchronously. On my laptop, it reads about 1,400,000 bytes per second.
r09-stdiorevbyte reads the file (0) in reverse, (1) one byte at a time, (2) using stdio, (3) asynchronously. On my laptop, it reads about 3,500,000 bytes per second—more than twice as much!

We used strace to examine the system calls made by r09-stdiorevbyte. We discovered the stdio uses an aligned cache for reads. This means that, for reads, the stdio cache always aligns its cache so that the first offset in the cache is a multiple of the cache size, which is 4,096 bytes. This aligned cache is quite effective for forward reads and for reverse reads; in both cases, stdio’s prefetching works.

Stdio doesn’t work great for all access patterns. For example, for random one-byte reads distributed through in a large file, stdio will each time read 4,096 bytes, only one of which is likely to be useful. This incurs more per-unit costs than simply accessing the bytes directly using one-byte system calls.

The stdio cache is aligned for reads, but is it for writes? Check out the strace results for w10-stdiorevbyte to find out. You can do better than stdio here, at the cost of some complexity.

Memory mapping

Why is the stdio cache only 4,096 bytes? One could make it bigger, but very large stdio caches can cause problems by crowding out other data. Imagine an operating system where many programs are reading the same large file. (This isn’t crazy: shared library files, such as the C++ library, are large and read simultaneously by most or all programs.) If these programs each have independent copies of the file cached, those redundant copies would mean there’s less space available for the buffer cache and for other useful data.

The neat idea of memory-mapped I/O allows applications to cache files without redundancy by directly accessing the operating system’s buffer cache.

The relevant system calls are mmap(2) and munmap(2). mmap tells the operating system to place a region of a file into the application’s heap. But unlike with read, this placement involves no copying. Instead, the operating system plugs the relevant part of the buffer cache into that part of the application’s heap! The application is now sharing part of the buffer cache. The application can “read” or “write” the file simply by accessing that heap region.

r16-mmapbyte reads the file (1) one byte at a time, (2) using mmap, (3) asynchronously (the operating system takes care of prefetching, batching, and, for writes, write coalescing). On my laptop, it reads about 350,000,000 bytes per second—more even than stdio!

Memory mapping is very powerful, but it has limitations. Streams cannot be memory mapped: memory mapping only works for random-access files (and not even for all random-access files). Memory mapping is a little more dangerous; if your program has an error and modifies memory inappropriately, that might now corrupt a disk file. (If your program has such an error, it suffers from undefined behavior and could corrupt the file anyway, but memory mapping does make corruption slightly more likely.) Finally, memory mapping has far nastier error behavior. If the operating system encounters a disk error, such as “disk full”, then a write system call will return −1, which gives the program a chance to clean up. For a memory-mapped file, on the other hand, the program will observe a segmentation fault.

Advice

Memory-mapped I/O also offers applications an interesting interface for telling the operating system about future accesses to a file, which, if used judiciously, could let the operating system implement Bélády’s optimal algorithm for cache eviction. The relevant system call is madvise(2). madvise takes the address of a portion of a memory-mapped file, and some advice, such as MADV_WILLNEED (the application will need this data soon—prefetch it!) or MADV_SEQUENTIAL (the application will read this data sequentially—plan ahead for that!). For very large files and unusual access patterns, madvise can lead to nice performance improvements.

More recently, a similar interface has been introduced for regular system call I/O, posix_fadvise(2). (Mac OS X does not support this interface.)