To compete with an existing caching scheme, as you'll do in this Unit's pset, it is helpful to understand the patterns in a sequence of references (a reference stream), how an existing cache changes that reference stream, and what opportunities exist for further performance optimization.
Toward that end, this section starts with an introduction to a very useful tool, `strace`, for gathering a program's stream of references to a file. It then leads you through several important types of access patterns in the reference stream, and it has you classify several different traces gathered by `strace`. Finally, this section ends by having you write code for some of the operations involved in coding a single-slot cache.
Running `strace`
`strace` is a Linux program that's great for debugging the system calls made by another program. You run `strace` like this:
$ strace -o strace.out PROGRAMNAME ARGUMENTS...
This runs `PROGRAMNAME` with the given `ARGUMENTS`, but it simultaneously snoops on all the system calls `PROGRAMNAME` makes and writes a readable summary of those system calls to `strace.out`. Look at that output by running `less strace.out`.
The first thirty or more lines of a strace are boilerplate caused by program startup. It's usually pretty easy to tell where the boilerplate ends and program execution begins; for example, you might look for the first `open` calls that refer to data files, rather than program libraries.
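For instance, once the boilerplate ends you might see lines like these (a fabricated but representative excerpt; the file names, descriptors, and sizes will differ in your traces):

openat(AT_FDCWD, "data.txt", O_RDONLY) = 3
read(3, "h", 1) = 1
read(3, "e", 1) = 1
read(3, "l", 1) = 1

Each line shows one system call, its arguments, and its return value; here a program opens a data file and reads it one byte at a time.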
Access patterns
An access pattern is a high-level description of a class of reference strings. That’s a bit technical: at a higher level, think of an access pattern as a general flavor of file or data access, described by the kinds of addresses that are accessed.
The different ...61 programs distributed in your pset directory perform different kinds of access and have different access patterns. Some especially important access patterns are as follows:
Sequential access
A sequential access pattern accesses a contiguous increasing sequence of addresses with nothing skipped, like 1, 2, 3, 4, 5, 6… or 1009, 1010, 1011, 1012, 1013….
Random access
A random access pattern skips around in address space at random, like 1, 10403, 96, 2, 51934, ….
Reverse-sequential access
A reverse-sequential access pattern accesses a contiguous decreasing sequence of addresses with nothing skipped, like 60274, 60273, 60272, 60271, 60270, ….
Strided access
A strided access pattern accesses a sequence of addresses with a uniform skip between each referenced address. For example, the sequence 1, 1001, 2001, 3001, 4001, 5001, … is a strided access pattern with a stride of +1000. Sequential access is a kind of strided access with stride +1, and reverse-sequential access has stride -1.
Variants and combinations
These terms describe whole access patterns, but they are often used more loosely, to describe parts of a reference string. For instance, a reference string might start with a sequential region, then skip ahead by tens of thousands of bytes and have another sequential region, then do some stuff you don't understand, which you classify as "random."
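To make these patterns concrete, here is a minimal sketch of a strided reader at the system-call level (the file name `data.bin` and the +1000 stride are made up for illustration):

#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("data.bin", O_RDONLY);   // hypothetical input file
    if (fd < 0) {
        return 1;
    }
    unsigned char byte;
    // Strided access with stride +1000: read one byte at offsets
    // 0, 1000, 2000, .... With stride +1 this would be sequential access.
    for (off_t off = 0; pread(fd, &byte, 1, off) == 1; off += 1000) {
        // ... process `byte` ...
    }
    close(fd);
}

Run under strace, this program produces a stream of `pread` calls whose offsets step by 1000, which is exactly a strided reference string.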
Access size
To speed up the performance of a program, caches often transform the reads and writes done by the program at a particular level of the memory hierarchy into a different set of reads and writes to the next higher level of the memory hierarchy.
For example, consider the processor cache, which tries to satisfy simple reads and writes to main memory (DRAM technology) with faster responses from a cache made of SRAM circuits (a memory technology that is faster and more costly than DRAM). The processor cache mostly sees requested sizes of 1, 2, 4, or 8 bytes (corresponding to `movb`, `movw`, `movl`, and `movq`); but when a processor cache experiences a miss (i.e., it hasn't cached the requested data), it requests not a few bytes from main memory, but a large block of bytes (e.g., a block of 64 or 128 bytes). This block is 64-byte or 128-byte aligned, and it contains the requested data, which the cache returns to complete the original `mov*` request.
Let the requested size be the amount of data requested from above by the cache’s user, and the block size be the amount of data that the cache itself requests from its underlying storage. As the example shows, these access sizes need not be the same!
As an important side effect, access-size transformations often transform a reference string without temporal locality into one with such locality. For instance, consider a sequential access pattern starting at address 0x401001. The processor requests bytes in this order:
| Byte address | Cache line address (aligned by 64) |
|---|---|
| 0x401001 | 0x401000 |
| 0x401002 | 0x401000 |
| 0x401003 | 0x401000 |
| … | … |
| 0x40103e | 0x401000 |
| 0x40103f | 0x401000 |
| 0x401040 | 0x401040 |
| 0x401041 | 0x401040 |
| 0x401042 | 0x401040 |
| … | … |
Although the requested byte addresses do not exhibit temporal locality (i.e., no byte address is requested twice), the referenced cache lines do! The access-size transformation has turned spatial locality in the processor's reference string into temporal locality in the cache's reference string. This is why single-slot caches can be effective in practice even on reference strings that don't just access the same byte over and over and over again. In particular, if the processor requests 1 byte at a time from a sequential set of addresses and the cache reads 4096 bytes at once on the first miss, then every read but the first is fulfilled from the cache, saving 4095 accesses to slower storage!
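The table's second column can be computed by rounding each byte address down to a multiple of the line size. Here is a minimal sketch of that alignment arithmetic, assuming a 64-byte line; for power-of-two sizes, `addr & ~(linesize - 1)` is equivalent to `addr - addr % linesize`:

#include <cstdio>

int main() {
    const unsigned long linesize = 64;    // assumed cache line size
    unsigned long addrs[] = {0x401001, 0x40103f, 0x401040};
    for (unsigned long addr : addrs) {
        // Clear the low bits to round down to a line boundary.
        unsigned long line = addr & ~(linesize - 1);
        printf("%#lx -> line %#lx\n", addr, line);
    }
}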
A cache, unfortunately, doesn't always do the right thing. Consider a user that reads more than 4096 bytes in a single request. Our single-slot cache that fetches only 4096 bytes at a time now causes the system to make multiple reads from slower storage! Knowing the most common access patterns matters if you want to improve the average performance of your system.
Strace exercises
Test your knowledge by characterizing access patterns and cache types given some straces!
In the `s05` directory, you'll see a bunch of truncated strace output in files `straceNN.out`. For each output, characterize its access pattern, and describe what kind of I/O cache the program might have implemented, if any.
- Sequential, byte-at-a-time input, 4096-byte block output
- Sequential, 4096-byte block input, 1024-byte block output
- Sequential, 3-byte block input, 1024-byte block output
- 4096-byte stride, direct characterwise system calls
- Reverse-sequential, stdio (4096-byte single-block cache)
- Reverse-sequential, bytewise (no cache)
- Sequential, 4096-byte block input, byte-at-a-time output
- 100-byte stride, stdio
- Sequential, stdio (4096-byte block input and output)
- Sequential, 4096-byte block input, 4096-byte block output (could be stdio)
- 4096-byte stride, stdio (so it refreshes 4096 bytes on every call)
Single-slot cache
In the rest of this section, we're going to work on a specific representation of a single-slot I/O cache. The purpose is to explain such caches and to get you used to thinking in terms of file offsets. Here's the representation.
struct io61_file {
    int fd;
    static constexpr off_t bufsize = 4096;   // or whatever
    unsigned char cbuf[bufsize];
    off_t tag;       // file offset of first byte in cache (0 when file is opened)
    off_t end_tag;   // file offset one past last valid byte in cache
    off_t pos_tag;   // file offset of next char to read in cache
};
`pos_tag` refers to the current read or write position, equivalent to the file's file offset.
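For concreteness, here is a hypothetical, simplified open helper showing how the offsets might be initialized so the invariants below hold from the start (the pset's real interface may differ; this is just a sketch):

io61_file* io61_fdopen(int fd) {
    io61_file* f = new io61_file;
    f->fd = fd;
    f->tag = f->pos_tag = f->end_tag = 0;   // empty slot at file offset 0
    return f;
}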
Here are some facts about this cache representation.
- `tag <= end_tag`. This is an invariant: an expression that is always true for every cache. You could encode the invariant as an assertion: `assert(tag <= end_tag)`.
- `end_tag - tag <= bufsize`. Notice that the difference is NOT always equal to `bufsize`. Allowing this permits us to encode things like ...
- If `tag == end_tag`, then the cache's slot is empty. An empty slot says that nothing is yet cached.
- `tag <= pos_tag && pos_tag <= end_tag`. This is another invariant: the `pos_tag` lies between the `tag` and `end_tag`, inclusive.
- The file descriptor's current file position equals `end_tag` for read caches (and seekable files). It equals `tag` for write caches.
- The next `io61_read` or `io61_write` call should logically start at file offset `pos_tag`. That means the first byte read by `io61_read` should come from file offset `pos_tag`, and similarly for `io61_write`.
- If `i >= tag` and `i < end_tag`, then byte `cbuf[i - tag]` contains the byte at file offset `i` (the sketch below illustrates this mapping).
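To illustrate that last mapping, here is a hypothetical helper (not part of the pset's interface) that returns the cached byte at a given file offset, or -1 if that offset isn't cached:

int cached_byte_at(const io61_file* f, off_t i) {
    // Valid cached offsets lie in [tag, end_tag).
    if (i >= f->tag && i < f->end_tag) {
        return f->cbuf[i - f->tag];
    }
    return -1;   // offset not cached
}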
Fill the cache
Implement this function, which should fill the cache with data read from the file. You will make a `read` system call.
void io61_fill(io61_file* f) {
    // Fill the read cache with new data, starting from file offset `end_tag`.
    // Only called for read caches.

    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // Reset the cache to empty.
    f->tag = f->pos_tag = f->end_tag;

    // Recheck invariants (good practice!).
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
}
void io61_fill(io61_file* f) {
    // Fill the read cache with new data, starting from file offset `end_tag`.
    // Only called for read caches.

    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    /* ANSWER */
    // Reset the cache to empty.
    f->tag = f->pos_tag = f->end_tag;

    // Read data.
    ssize_t n = read(f->fd, f->cbuf, f->bufsize);
    if (n >= 0) {
        f->end_tag = f->tag + n;
    }

    // Recheck invariants (good practice!).
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
}
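Note that `read` may return fewer than `bufsize` bytes, and returns 0 at end of file; the answer handles both cases, since `end_tag` advances by exactly the number of bytes actually read, leaving an empty slot (`tag == end_tag`) at end of file.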
Easy read
Implement `io61_read`, assuming that the desired data is entirely contained within the current cache slot.
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // The desired data is guaranteed to lie within this cache slot.
    assert(sz <= f->bufsize && f->pos_tag + sz <= f->end_tag);
}
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // The desired data is guaranteed to lie within this cache slot.
    assert(sz <= f->bufsize && f->pos_tag + sz <= f->end_tag);

    /* ANSWER */
    memcpy(buf, &f->cbuf[f->pos_tag - f->tag], sz);
    f->pos_tag += sz;
    return sz;
}
Full read
Implement `io61_read` without that assumption. You will need to call `io61_fill`, and will use a loop. You may assume that no call to `read` ever returns an error.
Our tests do not check whether your IO61 library handles errors correctly. This means that you may assume that no `read` or `write` system call returns a permanent error. But the best implementations will handle errors gracefully: if a `read` system call returns a permanent error, then `io61_read` should return -1. (What to do with restartable errors is up to you, but most I/O libraries retry on encountering `EINTR`.)
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
}
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    /* ANSWER */
    size_t pos = 0;
    while (pos < sz) {
        if (f->pos_tag == f->end_tag) {
            io61_fill(f);
            if (f->pos_tag == f->end_tag) {
                break;
            }
        }
        // This would be faster if you used `memcpy`!
        buf[pos] = f->cbuf[f->pos_tag - f->tag];
        ++f->pos_tag;
        ++pos;
    }
    return pos;
}
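As the comment suggests, the byte-at-a-time loop can be sped up by copying a chunk per iteration. Here is one possible `memcpy`-based variant, a sketch under the same assumptions (it needs `<algorithm>` for `std::min` and `<cstring>` for `memcpy`):

ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    size_t pos = 0;
    while (pos < sz) {
        if (f->pos_tag == f->end_tag) {
            io61_fill(f);
            if (f->pos_tag == f->end_tag) {
                break;   // end of file
            }
        }
        // Copy as many cached bytes as possible in one step.
        size_t ncopy = std::min(sz - pos, size_t(f->end_tag - f->pos_tag));
        memcpy(&buf[pos], &f->cbuf[f->pos_tag - f->tag], ncopy);
        f->pos_tag += ncopy;
        pos += ncopy;
    }
    return pos;
}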
Easy write
When reading from a cached file, the library fills the cache using system calls and empties it to the user. Writing to a cached file is the converse: the library fills the cache with user data and empties it using system calls.
For a read cache, the cache buffer region between `pos_tag` and `end_tag` contains data that has not yet been read, and the region after `end_tag` (if any) is invalid. There are several reasonable choices for write caches; in these exercises we force `pos_tag == end_tag` as an additional invariant.
Implement `io61_write`, assuming that space for the desired data is available within the current cache slot.
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);

    // The desired data is guaranteed to fit within this cache slot.
    assert(sz <= f->bufsize && f->pos_tag + sz <= f->tag + f->bufsize);
}
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);

    // The desired data is guaranteed to fit within this cache slot.
    assert(sz <= f->bufsize && f->pos_tag + sz <= f->tag + f->bufsize);

    /* ANSWER */
    memcpy(&f->cbuf[f->pos_tag - f->tag], buf, sz);
    f->pos_tag += sz;
    f->end_tag += sz;
    return sz;
}
Flush
Implement a function that empties a write cache by flushing its data using a system call. As before, you may assume that the system call succeeds (all data is written).
void io61_flush(io61_file* f) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);
}
void io61_flush(io61_file* f) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);

    /* ANSWER */
    ssize_t n = write(f->fd, f->cbuf, f->pos_tag - f->tag);
    assert((size_t) n == f->pos_tag - f->tag);
    f->tag = f->pos_tag;
}
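If you drop the assumption that the system call writes everything, `write` may perform a short write or be interrupted. A more defensive variant might loop, as sketched below (error reporting is omitted; the `EINTR` retry follows the convention mentioned earlier, and `<cerrno>` and `<unistd.h>` are assumed):

void io61_flush(io61_file* f) {
    size_t total = f->pos_tag - f->tag;
    size_t nwritten = 0;
    while (nwritten < total) {
        ssize_t n = write(f->fd, f->cbuf + nwritten, total - nwritten);
        if (n >= 0) {
            nwritten += n;    // short write: continue from where we left off
        } else if (errno != EINTR) {
            break;            // permanent error; a full library would report it
        }                     // else: restartable error, retry
    }
    f->tag = f->pos_tag;
}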
Full write
Implement the full `io61_write`. You will call `io61_flush` and use a loop.
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);
}
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);

    /* ANSWER */
    size_t pos = 0;
    while (pos < sz) {
        if (f->end_tag == f->tag + f->bufsize) {
            io61_flush(f);
        }
        // This would be faster if you used `memcpy`!
        f->cbuf[f->pos_tag - f->tag] = buf[pos];
        ++f->pos_tag;
        ++f->end_tag;
        ++pos;
    }
    return pos;
}
Seeks
Now, add seek to the mix. How should a seek be implemented for this single-slot cache?
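One possible approach for a read cache, sketched under the invariants above (not the only reasonable design): if the target offset already lies within the slot, moving `pos_tag` suffices and no system call is needed; otherwise, empty the slot and reposition the file descriptor with `lseek` (from `<unistd.h>`). A write cache would flush first.

int io61_seek(io61_file* f, off_t off) {
    if (off >= f->tag && off <= f->end_tag) {
        // The target lies within (or just past) the cached data:
        // no system call required.
        f->pos_tag = off;
        return 0;
    }
    // Miss: empty the slot and reposition the kernel's file offset.
    // (A write cache would call io61_flush(f) before seeking.)
    if (lseek(f->fd, off, SEEK_SET) != off) {
        return -1;
    }
    f->tag = f->pos_tag = f->end_tag = off;
    return 0;
}

Note that after the miss path, `tag == pos_tag == end_tag == off` and the file descriptor's position equals `end_tag`, so the invariants above still hold.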
Running `blktrace`
If you're interested in exploring the pattern of references made to a physical hard disk, you can try running `blktrace`, a Linux program that helps debug the disk requests made by an operating system in the course of executing other programs. This is below the level of system calls: a single disk request might be made in response to multiple system calls, and many system calls (such as those that access the buffer cache) do not cause disk requests at all.
Install `blktrace` with `sudo apt install blktrace` (on Ubuntu), and run it like this:
$ df . # figure out what disk you’re on
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 61897344 18080492 41107316 31% /
$ sudo sh -c "blktrace /dev/sda1 &" # puts machine-readable output in `sda1.blktrace.*`
$ PROGRAMNAME ARGUMENTS...
$ sudo killall blktrace
=== sda1 ===
CPU 0: 282 events, 14 KiB data
CPU 1: 658 events, 31 KiB data
Total: 940 events (dropped 0), 45 KiB data
$ blkparse sda1 | less # parses data collected by blktrace
The default output of `blkparse` has everything anyone might want to know, and as a result is quite hard to read. Try this command line to restrict the output to disk requests ("`-a issue`", which show up as lines with "`D`") and request completions ("`-a complete`" and "`C`").
$ blkparse -a issue -a complete -f "%5T.%9t: %2a %3d: %8NB @%9S [%C]\n" sda1 | less
With those arguments, `blkparse` output looks like this:
0.000055499: D R: 16384B @ 33902656 [bash]
0.073827694: C R: 16384B @ 33902656 [swapper/0]
0.074596596: D RM: 4096B @ 33556536 [bash]
0.090967020: C RM: 4096B @ 33556536 [swapper/0]
0.091078346: D RM: 4096B @ 33556528 [bash]
0.091270951: C RM: 4096B @ 33556528 [swapper/0]
0.091432386: D WS: 4096B @ 24284592 [bash]
0.091619438: C WS: 4096B @ 24284592 [swapper/0]
0.100106375: D WS: 61440B @ 17181720 [kworker/0:1H]
0.100526435: C WS: 61440B @ 17181720 [swapper/0]
0.100583894: D WS: 4096B @ 17181840 [jbd2/sda1-8]
| | | | | |
| | | | | |
| | | | | +--- responsible command
| | | | |
| | | | +----------- disk offset of read or write
| | | +-------------------- number of bytes read or written
| | |
| | +------------------------------ Read or Write (plus other info)
| +---------------------------------- D = request issued to device,
| C = request completed
|
+------------------------------------------- time of request