Request costs
A simple conceptual formula lets us reason about the cost of an I/O request. We write:
C = R + U·N
where
- C is the overall cost of the request;
- R is per-request cost: the cost of any request, no matter how much data is involved;
- U is per-unit cost: the cost per unit of data requested (for example, the cost per byte read or written); and
- N is the number of units in the request.
Different storage technologies have different balances between per-request and per-unit costs. The costs can be measured in seconds, to evaluate a request’s latency, or money, to evaluate its expense.
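For example, using made-up but plausible numbers for a disk, suppose R is 5 ms per request and U is 0.01 ms per kilobyte. Then a 4-kilobyte request costs C = 5 + 0.01 × 4 = 5.04 ms, which is almost entirely per-request cost, while a 4-megabyte request costs C = 5 + 0.01 × 4,096 ≈ 46 ms, which is dominated by per-unit cost. Larger requests amortize the per-request cost over more units.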
High-throughput, high-latency technologies, such as 101Net or Sneakernet, have high per-request costs but very low per-unit costs.
Caches are useful when they lower the request costs seen by their users.
Cache benefits: How caches improve performance
We can break down the ways I/O caches improve application performance into some basic components. Caches support many conceptually different optimizations; the basic ones are prefetching, batching, and eliminating redundant writes.
Prefetching
Prefetching is the primary optimization caches offer read requests.
To prefetch, a cache loads data from slower storage before the application needs it. Prefetching fills the cache with data that has not been requested yet, but that the cache believes will be accessed soon. If the cache predicts accesses effectively, then most read requests will hit and be serviced quickly.
For example, when an application opens a disk file for reading, the operating system might prefetch the file, or part of the file, into the buffer cache. This essentially predicts that the application will soon read the file—a pretty good assumption!
A cache that cannot prefetch offers little benefit for read requests. If prefetching doesn’t work, then every access requires the cache to load data from slower storage; this can only be slower than having the application access slower storage directly, bypassing the cache.
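As a concrete illustration, here is a minimal sketch of a read cache that prefetches a whole block when a single byte is requested. The structure, names, and fixed 4,096-byte block size are assumptions for the sketch, not any particular library's implementation.

#include <sys/types.h>
#include <unistd.h>

// Hypothetical single-slot read cache: one cached block plus its position.
struct read_cache {
    int fd;              // file descriptor for the slower storage
    char buf[4096];      // cached data
    off_t start;         // file offset corresponding to buf[0]
    ssize_t size;        // number of valid bytes in buf (0 = empty)
};

// Return the byte at file offset `off`, or -1 at end of file.
// A miss triggers a prefetch: the cache loads a whole 4,096-byte block,
// betting that nearby bytes will be requested soon.
int cache_read_byte(struct read_cache* c, off_t off) {
    if (c->size <= 0 || off < c->start || off >= c->start + c->size) {
        c->start = off;
        c->size = pread(c->fd, c->buf, sizeof c->buf, off);  // prefetch a block
        if (c->size <= 0)
            return -1;
    }
    return (unsigned char) c->buf[off - c->start];
}

If the cache's prediction is right, most calls return from buf without touching slower storage at all.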
Batching
Batching is an optimization for both writes and reads. Batching improves throughput, but not latency.
To batch, a cache changes a large number of small requests into a smaller number of larger requests. That is, the cache absorbs many requests with small N (small numbers of units per request), and emits fewer requests with larger N. For example, a stdio cache might combine many small write requests into a single write system call with more bytes.
Batching is most effective when per-request cost R is high, because it makes fewer requests. (If you’re driving down the 101, you might as well take two disks.) It requires that the slower storage medium support requests of different sizes.
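Here is a minimal sketch of a write-batching cache in the same hypothetical style: it absorbs many one-byte writes and emits one larger write system call per 4,096 bytes.

#include <unistd.h>

// Hypothetical write-batching cache: bytes accumulate in `buf` and are sent
// to slower storage in one large request rather than many small ones.
struct write_cache {
    int fd;
    char buf[4096];
    size_t n;            // number of buffered bytes
};

// Pay the per-request cost R once for up to 4,096 units.
void cache_flush(struct write_cache* c) {
    if (c->n > 0) {
        write(c->fd, c->buf, c->n);   // one system call for the whole batch
        c->n = 0;
    }
}

// Each call costs almost nothing until the buffer fills.
void cache_write_byte(struct write_cache* c, char ch) {
    c->buf[c->n++] = ch;
    if (c->n == sizeof c->buf)
        cache_flush(c);
}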
Write coalescing
Write coalescing, also called eliminating redundant writes, is an optimization for write requests.
If the same address is written multiple times, only the last-written value counts. To coalesce writes, the cache delays sending a write request to slower storage on the assumption that the data will be overwritten again.
Write coalescing is related to batching, but it works even if per-request cost is low, because it also reduces per-unit cost by writing fewer units.
Write coalescing is extremely useful for processor caches. Consider the following C code:
extern int x;
for (x = 0; x < 10000; ++x) {
    ... [assume this doesn’t access x] ...
}
Without a processor cache, each write to x
would cause an expensive
write access to primary memory. The processor cache makes this much
faster by eliminating these redundant writes; it can make this code
behave something like the following:
extern int x;
int register_x;
for (register_x = 0; register_x < 10000; ++register_x) {
    ...
}
x = register_x;
Now all the redundant writes are coalesced into one.
Write coalescing is also extremely useful for disks because, as we saw in Storage 1, disks’ minimum write sizes can cause redundant writes. The operating system’s buffer cache eliminates most of these redundant writes.
Costs
Each of these strategies has costs as well as benefits.
Write coalescing can have a correctness cost (or consequence): data in the cache may be more up to date (contain later values) than the underlying slower storage. This matters for volatile caches running on top of stable storage, such as the buffer cache and stdio caches. Primary memory is volatile; if some of your writes are hanging out in volatile caches, they can be lost if the computer loses power at an inopportune time.
Prefetching has an opportunity cost. Prefetching data that isn’t actually needed later can be expensive on its own (it wastes time and energy), and it can crowd out more useful data from the cache.
Cache coherence
Caches aim to give their users better performance without changing their users’ view of underlying storage. Caches aren’t allowed to make up data: a read cache should never return bogus data not present in the underlying storage, and a write cache should never scribble garbage into the underlying storage. But some caches do change what their users can see. This involves a property called coherence.
A coherent cache always provides its read users with the most up-to-date view of underlying storage. A coherent cache provides the same semantics as accessing the underlying storage directly.
An incoherent cache does not provide its read users with the most up-to-date view of underlying storage. In particular, an incoherent cache can return stale data that a concurrent write has changed.
The coherence.cc
program in cs61-lectures/storage3
demonstrates that the
stdio cache is incoherent: stdio does not update its cache when the underlying
file is changed.
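Here is a sketch in the spirit of that demonstration (not the actual coherence.cc, and with a hypothetical file name): after stdio caches a block of the file, a change made through a separate file descriptor is invisible to the stdio stream.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
    // Assume data.txt exists and starts with the two characters "AB".
    FILE* f = fopen("data.txt", "r");
    printf("byte 0: %c\n", fgetc(f));   // stdio caches a whole block of the file

    // Change byte 1 of the underlying file, bypassing stdio.
    int fd = open("data.txt", O_WRONLY);
    pwrite(fd, "X", 1, 1);
    close(fd);

    // stdio serves byte 1 from its stale cache, so this prints 'B', not 'X'.
    printf("byte 1: %c\n", fgetc(f));
    fclose(f);
}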
The buffer cache is coherent, however. Every access to the disk is made through the operating system kernel, through system calls, and an important part of the kernel’s job is to ensure that the buffer cache is kept coherent—that write system calls made by one application are visible to all other applications after they are made.
The processor cache is also coherent (up to the limitations of the machine’s memory model, an advanced topic).
The stdio application interface does give applications some control over its coherence. Applications can request that stdio not cache a given file with setvbuf(3). Applications can also mark a file’s cache as invalid with fflush(3), causing written data to be flushed to the operating system with a system call and causing prefetched data to be read again.
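For instance, a sketch (with a hypothetical file name) of how an application worried about coherence might use these calls:

#include <stdio.h>

int main() {
    FILE* f = fopen("data.txt", "r+");

    // Option 1: turn off stdio's cache for this stream entirely. Every
    // fgetc/fputc now becomes a system call, trading performance for coherence.
    setvbuf(f, NULL, _IONBF, 0);

    // Option 2 (for a buffered stream): flush at the points where coherence
    // matters. fflush pushes cached writes to the OS and discards prefetched
    // data, so later reads observe the file's current contents.
    fflush(f);

    fclose(f);
}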
Random-access files, streams, and file positions
One of the Unix operating system’s big ideas was the unification of two different kinds of I/O into a single “file descriptor” abstraction. These kinds of I/O are called random-access files and streams. Metaphorically, a random-access file is like a book and a stream is like time itself.
A random-access file has a finite length, called its size. It is possible to efficiently skip around inside the file to read or write at any position, so random-access files are seekable. It is also often possible to map the file into memory (see below). A random-access file is like an array of bytes.
A stream is a possibly-infinite sequence of bytes. It is not possible to
skip around inside a stream. A stream is more like a queue of bytes: a
stream writer can only add more bytes to the end of the stream, and a stream
reader can only read and remove bytes from the beginning of the stream.
Streams are not seekable or mappable. The output of a program is generally a
stream. (We used yes
as an example.)
Some previous operating systems offered fundamentally different abstractions for different kinds of file and stream. Unix (like some other OSes) observed that many operations on random-access files and streams can have essentially similar semantics if random-access files are given an explicit feature called the file position. This is a file offset, maintained by the kernel, that defines the next byte to be read or written.
When a random-access file is opened, its file position is set to 0, the beginning of the file. A read system call that reads N bytes advances the file position by N, and similarly for write. An explicit seek system call is used to change the file position. When the file position reaches the end of the file, read will return 0.
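A short sketch (hypothetical file name, error checking omitted) of the file position in action:

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main() {
    int fd = open("data.txt", O_RDONLY);    // file position starts at 0
    char buf[10];

    ssize_t n = read(fd, buf, sizeof buf);  // reads bytes 0-9; position is now 10
    printf("read %zd bytes; position is now %ld\n", n, (long) lseek(fd, 0, SEEK_CUR));

    lseek(fd, 0, SEEK_SET);                 // seek back to the beginning
    n = read(fd, buf, sizeof buf);          // reads bytes 0-9 again

    lseek(fd, 0, SEEK_END);                 // jump to the end of the file
    n = read(fd, buf, sizeof buf);          // at the end of the file, read returns 0
    printf("read at end of file returned %zd\n", n);
    close(fd);
}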
Streams don’t have file positions—the seek system call will return -1 on streams. Instead, a read system call consumes the next N bytes from the stream, and a write system call adds N bytes to the end of the stream. When the read system call reaches the end of the stream (which only happens after the write end is closed), it will return 0.
What’s important here is that a sequence of read
system calls will return
the same results given a random-access file or a stream with the same
contents. And similarly, a sequence of write
system calls will produce the
same results whether they are directed to a random-access file or a stream.
Programs that simply read their inputs sequentially and write their outputs
sequentially—which are most programs!—work identically on files and streams.
This makes it easy to build very flexible programs that work in all kinds of
situations.
(But special system calls that work on specific file offsets are useful for random-access files, and now they exist: see pread(2) and pwrite(2).)
Read caches
The r* programs in cs61-lectures/storage3 demonstrate different mechanisms for reading files. We looked at several:
- r01-directsector reads the file (1) 512 bytes at a time (this unit is called a sector), (2) using system calls (read), (3) synchronously from the disk. The O_DIRECT flag disables the buffer cache, forcing synchronous reads; when O_DIRECT is on, the application can only read in units of 512 bytes. On my laptop, r01-directsector can read about 7,000,000 bytes per second.
- r02-sector reads the file (1) 512 bytes at a time, (2) using system calls, (3) asynchronously (allowing the operating system to use the buffer cache). On my laptop, r02-sector can read somewhere around 360,000,000 bytes per second (with a lot of variability).
- r07-stdioblock reads the file (1) 512 bytes at a time, (2) using stdio, (3) asynchronously. On my laptop, r07-stdioblock can read somewhere around 480,000,000 bytes per second (with a lot of variability).
- r03-byte reads the file (1) one byte at a time, (2) using system calls, (3) asynchronously. On my laptop, r03-byte can read around 2,400,000 bytes per second.
- r05-stdiobyte reads the file (1) one byte at a time, (2) using stdio, (3) asynchronously. On my laptop, r05-stdiobyte can read around 260,000,000 bytes per second.
These numbers give us a lot of information about relative costs of different
operations. Reading direct from disk is clearly much faster than writing
direct to disk. (Of course this might also be impacted by the virtual machine
on which we run.) And this also gives us some feeling for the cost of system
calls: reading a byte at a time is about 150x slower than reading 512 bytes at
a time. Assuming we knew the cost of the rest of the r*
programs (i.e., the
costs of running the loop and occasionally printing statistics, which are
about the same for all programs), then this information would let us deduce
the R and U components of system call costs.
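As a rough back-of-the-envelope illustration, ignoring that other overhead: r03-byte makes about 2,400,000 one-byte requests per second, or about 417 ns per request, so R + 1·U ≈ 417 ns; r02-sector makes roughly 700,000 512-byte requests per second, or about 1.42 µs per request, so R + 512·U ≈ 1.42 µs. Subtracting gives 511·U ≈ 1 µs, so U is roughly 2 ns per byte read and R is roughly 415 ns per read system call.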
Reverse reading
Some r*
programs feature different access patterns. We looked at:
- r08-revbyte reads the file (0) in reverse, (1) one byte at a time, (2) using system calls, (3) asynchronously. On my laptop, it reads about 1,400,000 bytes per second.
- r09-stdiorevbyte reads the file (0) in reverse, (1) one byte at a time, (2) using stdio, (3) asynchronously. On my laptop, it reads about 3,500,000 bytes per second—more than twice as much!
We used strace to examine the system calls made by r09-stdiorevbyte. We discovered that stdio uses an aligned cache for reads. This means that, for reads, stdio always aligns its cache so that the first offset in the cache is a multiple of the cache size, which is 4,096 bytes. This aligned cache is quite effective both for forward reads and for reverse reads; in both cases, stdio’s prefetching works.
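A tiny sketch (hypothetical names) of the aligned refill such a cache might perform on a miss:

#include <unistd.h>

// On a miss at file offset `off`, load the entire aligned 4,096-byte block
// containing that offset. Both the bytes after `off` (forward reads) and the
// bytes before it (reverse reads) then land in the cache.
ssize_t refill_aligned(int fd, char cache_buf[4096], off_t off, off_t* cache_start) {
    *cache_start = off - off % 4096;                  // align down to a block boundary
    return pread(fd, cache_buf, 4096, *cache_start);  // one system call per block
}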
Stdio doesn’t work well for all access patterns. For example, for random one-byte reads distributed throughout a large file, stdio will read 4,096 bytes each time, only one of which is likely to be useful. This incurs more per-unit cost than simply accessing the bytes directly using one-byte system calls.
The stdio cache is aligned for reads, but is it for writes? Check out the
strace
results for w10-stdiorevbyte
to find out. You can do better than
stdio here, at the cost of some complexity.
Memory mapping
Why is the stdio cache only 4,096 bytes? One could make it bigger, but very large stdio caches can cause problems by crowding out other data. Imagine an operating system where many programs are reading the same large file. (This isn’t crazy: shared library files, such as the C++ library, are large and read simultaneously by most or all programs.) If these programs each have independent copies of the file cached, those redundant copies would mean there’s less space available for the buffer cache and for other useful data.
The neat idea of memory-mapped I/O allows applications to cache files without redundancy by directly accessing the operating system’s buffer cache.
The relevant system calls are mmap(2) and munmap(2). mmap tells the operating system to place a region of a file into the application’s heap. But unlike with read, this placement involves no copying. Instead, the operating system plugs the relevant part of the buffer cache into that part of the application’s heap! The application is now sharing part of the buffer cache. The application can “read” or “write” the file simply by accessing that heap region.
- r16-mmapbyte reads the file (1) one byte at a time, (2) using mmap, (3) asynchronously (the operating system takes care of prefetching, batching, and, for writes, write coalescing). On my laptop, it reads about 350,000,000 bytes per second—more even than stdio!
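For concreteness, here is a minimal sketch (hypothetical file name, error handling omitted) of reading a file through mmap rather than read:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("data.txt", O_RDONLY);
    struct stat st;
    fstat(fd, &st);                       // the mapping needs the file's size

    // Share the kernel's cached copy of the file: no read calls, no copying
    // into an application buffer.
    char* data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    size_t nonzero = 0;
    for (off_t i = 0; i < st.st_size; ++i)  // "reading" is just a memory access
        nonzero += data[i] != 0;
    printf("%zu nonzero bytes\n", nonzero);

    munmap(data, st.st_size);
    close(fd);
}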
Memory mapping is very powerful, but it has limitations. Streams cannot be
memory mapped: memory mapping only works for random-access files (and not even
for all random-access files). Memory mapping is a little more dangerous; if
your program has an error and modifies memory inappropriately, that might now
corrupt a disk file. (If your program has such an error, it suffers from
undefined behavior and could corrupt the file anyway, but memory mapping does
make corruption slightly more likely.) Finally, memory mapping has far nastier
error behavior. If the operating system encounters a disk error, such as “disk
full”, then a write
system call will return −1, which gives the
program a chance to clean up. For a memory-mapped file, on the other hand, the
program will observe a segmentation fault.
Advice
Memory-mapped I/O also offers applications an interesting interface for
telling the operating system about future accesses to a file, which, if used
judiciously, could let the operating system implement Bélády’s optimal
algorithm for cache eviction. The relevant system call is madvise(2). madvise takes the address of a portion of a memory-mapped file, and some
advice, such as MADV_WILLNEED
(the application will need this data
soon—prefetch it!) or MADV_SEQUENTIAL
(the application will read this data
sequentially—plan ahead for that!). For very large files and unusual access
patterns, madvise
can lead to nice performance improvements.
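A short sketch (hypothetical names) of how an application might advise the kernel about an existing mapping:

#include <sys/mman.h>

// Advise the kernel about a memory-mapped region of length `len`.
void advise_mapping(char* data, size_t len) {
    // We will need this data soon: prefetch it.
    madvise(data, len, MADV_WILLNEED);
    // We will read it front to back: aggressive read-ahead will pay off.
    madvise(data, len, MADV_SEQUENTIAL);
}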
More recently, a similar interface has been introduced for regular system call I/O, posix_fadvise(2). (Mac OS X does not support this interface.)