This section introduces you to the notion of an access pattern and a classification of different access patterns. It ends by walking through code for a single-slot cache—code that can be the foundation of your problem set.
Simultaneously enrolled college students are responsible for attending section live (in-person or via zoom). If you cannot attend any section live in a particular week, you may watch a recording (Canvas > Zoom > Cloud Recordings) instead. Please self-report your live attendance (or that you watched a recording) using this attendance form.
Terminology: Blocks, slots, and addresses
A cache is a small amount of fast storage that helps speed up access to slower underlying storage. The underlying storage often has larger capacity than the cache, meaning that it can hold more total data. When discussing caches in general, we use these terms:
- A block is a unit of data storage. Both the underlying storage and the cache are divided into blocks. In some caches, blocks have fixed size; in others, blocks have variable size. A cache determines its block size, though in some cases it will use a natural block size taken from the underlying storage.
- Each block of underlying storage has an address. We represent addresses abstractly as integers, though in a specific cache, an address might be a memory address, an offset into a file, a disk position, or even a web page’s URL.
- Each block in the cache is a slot. A slot can either be empty or full. An empty slot has no data, while a full slot contains a cached version of a specific block of underlying storage. A full slot therefore has an associated address, which is the address of that block on underlying storage.
Access patterns
The sequence of addresses accessed in slower storage has enormous impact on cache effectiveness. For instance, repeated reads to the same address are easy to cache effectively (they exhibit temporal locality of reference), while a random access sequence might be effectively uncacheable.
We treat these issues formally using reference strings and access patterns.
A reference string is an ordered sequence of addresses representing access requests. For instance, the reference string 0R, 1R, 2R, 3R represents a sequence of four read requests for blocks—first for block 0, then blocks 1, 2, and 3. The operation (R or W) is often omitted.
An access pattern is a description of a class of reference strings. This section will focus on:
- Repeated access: The same address over and over, as in 1, 1, 1, 1, 1, ….
- Sequential access: A contiguous increasing sequence of addresses, as in 1, 2, 3, 4, 5, 6… or 1009, 1010, 1011, 1012, 1013…, with few or no skipped or repeated addresses.
- Reverse-sequential access: A contiguous decreasing sequence of addresses, as in 60274, 60273, 60272, 60271, 60270, ….
- Random access: An apparently-unpredictable sequence of addresses spread over some range.
Real reference strings often contain multiple patterns as subsequences. For example, the reference string 0, 1, 2, 3, …, 39998, 39999, 40000, 100000, 99999, 99998, 99997, … 1 consists of sequential access (0–40000) followed by reverse-sequential access (100000–1).
EXERCISE. Give a reference string (or describe one at a high level) that contains three distinct subsequences of sequential access.
Offline note. The problem set also contains strided access: A monotonically increasing or decreasing sequence of addresses with uniform distance, as in 0, 4096, 8192, 12288, 16384, …. The distance between adjacent addresses is called the stride.
Sequential and reverse-sequential access are kinds of strided access with stride +1 and −1, respectively, but they are important enough to be assigned their own names. We are more likely to describe an access pattern as strided if it has a large stride.
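The difference between these patterns can be made concrete with a short simulation. The sketch below is our own illustration (the helper `count_hits` is hypothetical, not part of the problem set): it counts how often a single-slot cache with a fixed block size would hit over a reference string of byte addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count hits for a hypothetical single-slot cache with a fixed block size.
// An access hits if its address falls within the currently cached block;
// otherwise the slot is reloaded with the containing block (a miss).
size_t count_hits(const std::vector<long>& refs, long blocksize) {
    long cached_block = -1;     // block number in the slot; -1 means empty
    size_t hits = 0;
    for (long addr : refs) {
        long block = addr / blocksize;
        if (block == cached_block) {
            ++hits;
        } else {
            cached_block = block;   // miss: load the containing block
        }
    }
    return hits;
}
```

With block size 4, the sequential string 0, 1, …, 15 hits 12 times out of 16 accesses (one miss per block), repeated access misses only once, and a strided string whose stride equals the block size never hits.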
File positions
Standard I/O caches are intended to speed up access to data read and written from file descriptors, which are the Unix representation of data streams and random-access files. System calls like `read` and `write` perform the reading and writing. Standard I/O caches can improve performance by reducing the number of system calls performed.
Besides transferring data from or to a file, `read` and `write` also affect the file descriptor’s file position. This is a numeric byte offset representing the current position in the file.1 A `read` or `write` system call starts reading or writing at the current file position, and then increments the file position by the number of characters processed.
The kernel maintains all file positions. User processes can only observe or affect a file position by making a system call, such as `read`, `write`, or `lseek`. (The `lseek` system call modifies the file position without transferring data, and is required to read or write a file out of sequential order.)
As a result, successive calls to `read` progress sequentially through the file. You can see this in the following example.
- Right before executing `read(fd, buf, 1)` on a just-opened file (the file position is shown in red):
- After `read(fd, buf, 1)`:
- After `read(fd, buf, 2)`:
- After `read(fd, buf, 3)`:
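You can check this behavior yourself. The sketch below is our own demonstration (not section code): it uses `lseek(fd, 0, SEEK_CUR)`, which moves the file position by zero bytes and returns it, to observe the kernel’s file position after each `read`.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Return the kernel's current file position for `fd` without changing it.
off_t current_position(int fd) {
    return lseek(fd, 0, SEEK_CUR);
}

// Read `n` (<= 16) bytes and return the file position afterward.
off_t position_after_read(int fd, size_t n) {
    unsigned char buf[16];
    (void) read(fd, buf, n);    // assume the read succeeds in full
    return current_position(fd);
}
```

Running the three reads from the example above (1, then 2, then 3 bytes) moves the position from 0 to 1, then 3, then 6, exactly as the diagrams show.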
strace exercises
Test your knowledge by characterizing access patterns and cache types given some system call traces produced by `strace`!
The `strace` program is run like this:
$ strace -o strace.out PROGRAMNAME ARGUMENTS...
This runs PROGRAMNAME with the given ARGUMENTS, but simultaneously snoops on all the system calls PROGRAMNAME makes and writes a readable summary of those system calls to `strace.out`.
Examine `strace`’s output using your favorite editor or by running `less strace.out`.
The first part of an strace comprises boilerplate caused by program startup. It’s usually pretty easy to tell where the boilerplate ends and program execution begins; for example, you might look for the first `open` calls that refer to data files, rather than program libraries; or the first accesses to standard input and standard output (file descriptors 0 and 1), which are not accessed during program startup. You can find out about any system call in an strace by reading its manual page (e.g., `man 2 read`), but it’s best to focus your attention on a handful of system calls that correspond to file access: `open`, `read`, `write`, `lseek`, `close`. (File descriptor notes)
Add more options to `strace` to focus its output on file-related system calls, or to trace any child processes created by PROGRAMNAME.
$ strace -e trace=open,openat,close,read,write -o strace.out PROGRAMNAME ARGUMENTS...
# Produce output only for those system calls
$ strace -f -o strace.out PROGRAMNAME ARGUMENTS...
# -f: Also trace child processes
$ strace -f -e trace=%process -o strace.out PROGRAMNAME ARGUMENTS...
# Produce output only for process-related system calls
Read `man strace` for much more information on how to run `strace`.
In the `cs61-sections/storages1` directory, you’ll see a bunch of truncated strace output in files `straceNN.out`.
EXERCISE. Describe the access pattern for standard input in `strace01.out`. What block size is being accessed? What access pattern? Is a standard I/O cache likely in use by this program?
EXERCISE. Describe the access pattern for standard input in `strace02.out`. What block size is being accessed? What access pattern? Is a standard I/O cache likely in use by this program?
HOME EXERCISE. Describe the access patterns for standard output in `strace01.out` and `strace02.out`. Are they the same or different? What block sizes are being accessed? What access patterns? Is a standard I/O cache likely in use?
HOME EXERCISES. Describe the access patterns for the other `strace*.out` files.
Single-slot cache
A single-slot cache is, as the name suggests, a cache comprising a single slot. The cache implemented by most standard I/O libraries (such as C’s stdio) has a single slot.
In the rest of section, we’re going to work on a specific representation of a single-slot I/O cache. The purpose is to explain such caches and to get you used to thinking in terms of file offsets.
Here’s the representation.
struct io61_fcache {
    int fd;

    // `bufsize` is the cache block size
    static constexpr off_t bufsize = 4096;
    // Cached data is stored in `cbuf`
    unsigned char cbuf[bufsize];

    // The following “tags” are addresses—file offsets—that describe the cache’s contents.
    // `tag`: File offset of first byte of cached data (0 when file is opened).
    off_t tag;
    // `end_tag`: File offset one past the last byte of cached data (0 when file is opened).
    off_t end_tag;
    // `pos_tag`: Cache position: the file offset of the cache.
    // In read caches, this is the file offset of the next character to be read.
    off_t pos_tag;
};
`pos_tag` refers to the effective file position for users of `io61` functions. That is, the next `io61_read` or `io61_write` function should begin at the file offset represented by `pos_tag`. Note that this is often different from the actual file position!
EXERCISE. Given the above description of the representation, how can you tell if an `io61_fcache` is empty (contains no cached data)?

A cache is empty if `tag == end_tag`. This makes tags work similarly to iterators—a C++ standard library data structure `d` is empty when `d.begin() == d.end()`.
EXERCISE. What data invariants might apply to this representation?

Here are two that follow immediately from the cache representation (we’ll see more below). These could, and probably should, be encoded as assertions!

- `tag <= end_tag`.
- `end_tag - tag <= bufsize`. The cache slot contains space for at most `bufsize` data bytes.
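As a sketch of what that encoding might look like, a hypothetical `io61_check` helper (our own illustration, not required by the pset) could assert both invariants at the start of every `io61` function:

```cpp
#include <cassert>
#include <sys/types.h>

// The single-slot cache representation from the section.
struct io61_fcache {
    int fd;
    static constexpr off_t bufsize = 4096;
    unsigned char cbuf[bufsize];
    off_t tag;       // offset of first cached byte
    off_t end_tag;   // offset one past the last cached byte
    off_t pos_tag;   // effective file position for io61 users
};

// Assert the representation invariants that hold for every io61_fcache.
void io61_check(const io61_fcache* f) {
    assert(f->tag <= f->end_tag);                 // tags are ordered
    assert(f->end_tag - f->tag <= f->bufsize);    // at most bufsize cached bytes
}
```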
Example single-slot cache for reading
To demonstrate how this single-slot cache representation might work, we walk through the file position example with a small `bufsize` of 4 for clarity. Recall that the goal of a cache for reads is to produce equivalent results to a series of system calls, so our `io61_read`s should return the same bytes in the same order as the `read`s above.
- The cache is initialized to an empty state, with `fd` 3, `bufsize` 4, and all of `tag`, `pos_tag`, and `end_tag` equal to 0. When the first `io61_read` request is passed to the cache, it must fill its buffer with a call to `read`.
- The read succeeds, filling all 4 bytes in `cbuf`. The `end_tag` is updated to indicate that the cache is full (contains 4 bytes of file data).
- Now there is enough data in the cache to fulfill the `io61_read` request. The request starts from the offset corresponding to `pos_tag`, and must increment `pos_tag` according to the amount read (just as `read` updates the kernel’s file position by the amount read).
- The next `io61_read` request can be satisfied entirely from the cache. No system calls occur, so the kernel’s file position is unchanged.
- The cache contains just 1 byte of data after `pos_tag`, so the next `io61_read` request can only be partially satisfied at first.
- Since short reads and writes are allowed by the specifications of `io61_read` and `io61_write`, it would be safe for `io61_read` to return now, with a short read of 1 byte! This still counts as equivalent to the version with system calls, because the same bytes are being returned in the same order.2
- But it’s probably better to try to return as much data as the user requested. This requires refilling the cache. We update `tag` before refilling the cache (because the offset of the first cached byte would equal the file position);
- and then, after the `read` succeeds, update `end_tag` (because the cache is filled).
- Finally, the last 2 bytes can be returned to the user.
EXERCISE. Having worked through an example, can you add any more data invariants for this cache representation?

- `tag <= pos_tag <= end_tag`. (For later parts of the pset, though, see footnote 3.)
- The cache slot contains `end_tag - tag` bytes of valid data.
- If `tag == end_tag`, then the cache slot is empty (contains no valid data).
- If `tag <= off < end_tag`, then byte `cbuf[off - tag]` corresponds to the byte of file data at offset `off`.
- The file descriptor’s file position equals `end_tag` for read caches of seekable files. (It equals `tag` for write caches of seekable files.)
- An `io61_read` or `io61_write` call should begin accessing the file at the offset equal to `pos_tag`. Thus, the first byte read by `io61_read` should come from file offset `pos_tag`.

These are harder to check with assertions.
Fill read cache
We strongly recommend that you try these exercises yourself before looking at the solutions. As you implement, remember the invariants: we find the invariants and representation extremely good guidance for a correct implementation.
EXERCISE SS1. Implement this function, which should fill the cache with data read from the file. You will make a `read` system call.

    void io61_fill(io61_fcache* f) {
        // Empty the cache entry, then fill it with new data.
        // The new data is taken from the file starting at file offset `end_tag`
        // (which equals the file position).
        // The cache position equals the first character read.
        // Only called for read caches.

        // Check invariants.
        assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
        assert(f->end_tag - f->pos_tag <= f->bufsize);

        // Reset the cache to empty.
        f->tag = f->pos_tag = f->end_tag;

        // Recheck invariants (good practice!).
        assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
        assert(f->end_tag - f->pos_tag <= f->bufsize);
    }
SUB-EXERCISE. Which cache tags will need to change in this function, and by how much?

SUB-EXERCISE. Do you need to use `memcpy` in this function?

This function can change all of `tag`, `pos_tag`, and `end_tag`. It changes `tag` because the cache is emptied and read starting from `end_tag`, so the new `tag` equals the old `end_tag`. It changes `end_tag` to reflect the data read. It changes `pos_tag` to change the cache position.

There is no need for `memcpy` in this function; it can read data directly into the cache!
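Putting those answers together, here is one possible completion of `io61_fill`, as a sketch that assumes `read` never returns an error (the struct is repeated so the sketch stands alone):

```cpp
#include <cassert>
#include <cstring>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

struct io61_fcache {
    int fd;
    static constexpr off_t bufsize = 4096;
    unsigned char cbuf[bufsize];
    off_t tag, end_tag, pos_tag;
};

// Empty the cache, then refill it from the file, starting at file offset
// `end_tag` (which equals the kernel's file position for a read cache).
void io61_fill(io61_fcache* f) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->tag <= f->bufsize);

    // Reset the cache to empty: the new `tag` is the old `end_tag`.
    f->tag = f->pos_tag = f->end_tag;

    // Read directly into the cache buffer (no memcpy needed).
    ssize_t n = read(f->fd, f->cbuf, f->bufsize);
    if (n > 0) {
        f->end_tag = f->tag + n;    // cache now holds `n` bytes
    }

    // Recheck invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->tag <= f->bufsize);
}
```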
Read from read cache (simple version)
EXERCISE SS2. Implement `io61_read`, assuming that the desired data is entirely contained within the current cache slot.

    ssize_t io61_read(io61_fcache* f, unsigned char* buf, size_t sz) {
        // Check invariants.
        assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
        assert(f->end_tag - f->pos_tag <= f->bufsize);
        // The desired data is guaranteed to lie within this cache slot.
        assert(sz <= f->bufsize && f->pos_tag + sz <= f->end_tag);
    }
SUB-EXERCISE. Which cache tags will need to change in this function, and by how much?

SUB-EXERCISE. Do you need to use `memcpy` in this function?
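As a sketch of one possible answer (the struct is repeated for self-containment): only `pos_tag` changes, by the amount read, and this time `memcpy` is needed to copy cached bytes into the caller’s buffer.

```cpp
#include <cassert>
#include <cstring>
#include <sys/types.h>

struct io61_fcache {
    int fd;
    static constexpr off_t bufsize = 4096;
    unsigned char cbuf[bufsize];
    off_t tag, end_tag, pos_tag;
};

// Read `sz` bytes into `buf`, assuming they all lie within the cache slot.
ssize_t io61_read(io61_fcache* f, unsigned char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    // The desired data is guaranteed to lie within this cache slot.
    assert(f->pos_tag + (off_t) sz <= f->end_tag);

    // `pos_tag - tag` is the offset of the next unread byte within `cbuf`.
    memcpy(buf, &f->cbuf[f->pos_tag - f->tag], sz);
    f->pos_tag += sz;     // the only tag that changes
    return sz;
}
```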
Read from read cache (full version)
EXERCISE SS3. Implement `io61_read` without that assumption. You will need to call `io61_fill`, and will use a loop. You may assume that no call to `read` ever returns an error.

Our tests do not check whether your IO61 library handles errors correctly. This means that you may assume that no `read` or `write` system call returns a permanent error. But the best implementations will handle errors gracefully: if a `read` system call returns a permanent error, then `io61_read` should return -1. (What to do with restartable errors is up to you, but most I/O libraries retry on encountering `EINTR`.)

    ssize_t io61_read(io61_fcache* f, unsigned char* buf, size_t sz) {
        // Check invariants.
        assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
        assert(f->end_tag - f->pos_tag <= f->bufsize);
    }
SUB-EXERCISE. Which cache tags will need to change in this function, and by how much?

SUB-EXERCISE. Do you need to use `memcpy` in this function?
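One way to complete the full version, shown as a sketch under the exercise’s no-errors assumption (the struct and a minimal `io61_fill` are repeated so the sketch stands alone): loop, refilling whenever the cache is exhausted, and stop early at end of file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

struct io61_fcache {
    int fd;
    static constexpr off_t bufsize = 4096;
    unsigned char cbuf[bufsize];
    off_t tag, end_tag, pos_tag;
};

// Minimal fill helper (see Exercise SS1).
void io61_fill(io61_fcache* f) {
    f->tag = f->pos_tag = f->end_tag;
    ssize_t n = read(f->fd, f->cbuf, f->bufsize);
    if (n > 0) {
        f->end_tag = f->tag + n;
    }
}

// Read up to `sz` bytes into `buf`. Returns the number of bytes read,
// which is less than `sz` only at end of file.
ssize_t io61_read(io61_fcache* f, unsigned char* buf, size_t sz) {
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    size_t pos = 0;      // bytes copied so far
    while (pos < sz) {
        if (f->pos_tag == f->end_tag) {
            io61_fill(f);
            if (f->pos_tag == f->end_tag) {
                break;   // end of file: the fill found no more data
            }
        }
        size_t n = std::min(sz - pos, (size_t) (f->end_tag - f->pos_tag));
        memcpy(&buf[pos], &f->cbuf[f->pos_tag - f->tag], n);
        f->pos_tag += n;
        pos += n;
    }
    return pos;
}
```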
Write to write cache (easy version)
When reading from a cached file, the library fills the cache using system calls and empties it to the user. Writing to a cached file is the converse: the library fills the cache with user data and empties it using system calls.
For a read cache, the cache buffer region between `pos_tag` and `end_tag` contains data that has not yet been read, and the region after `end_tag` (if any) is invalid. There are several reasonable choices for write caches; in these exercises we force `pos_tag == end_tag` as an additional invariant.
EXERCISE SS4. Implement `io61_write`, assuming that space for the desired data is available within the current cache slot.

    ssize_t io61_write(io61_fcache* f, const unsigned char* buf, size_t sz) {
        // Check invariants.
        assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
        assert(f->end_tag - f->pos_tag <= f->bufsize);
        // Write cache invariant.
        assert(f->pos_tag == f->end_tag);
        // The desired data is guaranteed to fit within this cache slot.
        assert(sz <= f->bufsize && f->pos_tag + sz <= f->tag + f->bufsize);
    }
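A sketch of one possible completion (struct repeated for self-containment): copy the user’s bytes into the slot, then advance `pos_tag` and `end_tag` together, preserving the write-cache invariant. No system call is needed in this version.

```cpp
#include <cassert>
#include <cstring>
#include <sys/types.h>

struct io61_fcache {
    int fd;
    static constexpr off_t bufsize = 4096;
    unsigned char cbuf[bufsize];
    off_t tag, end_tag, pos_tag;
};

// Write `sz` bytes from `buf` into the cache, assuming they fit in the slot.
ssize_t io61_write(io61_fcache* f, const unsigned char* buf, size_t sz) {
    // Check invariants, including the write-cache invariant.
    assert(f->tag <= f->pos_tag && f->pos_tag == f->end_tag);
    // The desired data is guaranteed to fit within this cache slot.
    assert(f->pos_tag + (off_t) sz <= f->tag + f->bufsize);

    memcpy(&f->cbuf[f->pos_tag - f->tag], buf, sz);
    f->pos_tag = f->end_tag = f->pos_tag + sz;   // both tags advance together
    return sz;
}
```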
Flush write cache
EXERCISE SS5. Implement a function that empties a write cache by flushing its data using a system call. As before, you may assume that the system call succeeds (all data is written).
    void io61_flush(io61_fcache* f) {
        // Check invariants.
        assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
        assert(f->end_tag - f->pos_tag <= f->bufsize);
        // Write cache invariant.
        assert(f->pos_tag == f->end_tag);
    }
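One possible completion, as a sketch that assumes the `write` system call transfers everything (as the exercise allows). The struct is repeated for self-containment.

```cpp
#include <cassert>
#include <cstring>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

struct io61_fcache {
    int fd;
    static constexpr off_t bufsize = 4096;
    unsigned char cbuf[bufsize];
    off_t tag, end_tag, pos_tag;
};

// Empty a write cache by writing its contents to the file.
void io61_flush(io61_fcache* f) {
    // Check invariants, including the write-cache invariant.
    assert(f->tag <= f->pos_tag && f->pos_tag == f->end_tag);
    assert(f->end_tag - f->tag <= f->bufsize);

    // The slot holds `pos_tag - tag` dirty bytes starting at offset `tag`,
    // which equals the kernel's file position for a write cache.
    ssize_t n = write(f->fd, f->cbuf, f->pos_tag - f->tag);
    assert(n == f->pos_tag - f->tag);   // assume the write succeeds in full

    // The cache is now empty; all three tags equal the new file position.
    f->tag = f->end_tag = f->pos_tag;
}
```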
Write to write cache (full version)
EXERCISE SS6. Implement `io61_write` without the assumption from Exercise SS4. You will call `io61_flush` and use a loop.

    ssize_t io61_write(io61_fcache* f, const unsigned char* buf, size_t sz) {
        // Check invariants.
        assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
        assert(f->end_tag - f->pos_tag <= f->bufsize);
        // Write cache invariant.
        assert(f->pos_tag == f->end_tag);
    }
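A sketch of one completion: loop, flushing with `io61_flush` whenever the slot fills, then copying as much as fits (the struct and a minimal flush helper are repeated so the sketch stands alone):

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

struct io61_fcache {
    int fd;
    static constexpr off_t bufsize = 4096;
    unsigned char cbuf[bufsize];
    off_t tag, end_tag, pos_tag;
};

// Minimal flush helper (see Exercise SS5).
void io61_flush(io61_fcache* f) {
    ssize_t n = write(f->fd, f->cbuf, f->pos_tag - f->tag);
    assert(n == f->pos_tag - f->tag);   // assume full writes, per the exercise
    f->tag = f->end_tag = f->pos_tag;
}

// Write `sz` bytes from `buf` through the cache, flushing when the slot fills.
ssize_t io61_write(io61_fcache* f, const unsigned char* buf, size_t sz) {
    assert(f->tag <= f->pos_tag && f->pos_tag == f->end_tag);
    size_t pos = 0;      // bytes consumed from `buf` so far
    while (pos < sz) {
        if (f->end_tag == f->tag + f->bufsize) {
            io61_flush(f);    // slot is full; empty it with a system call
        }
        size_t n = std::min(sz - pos, (size_t) (f->tag + f->bufsize - f->pos_tag));
        memcpy(&f->cbuf[f->pos_tag - f->tag], &buf[pos], n);
        f->pos_tag = f->end_tag = f->pos_tag + n;
        pos += n;
    }
    return pos;
}
```

Note that the caller must flush once more at close time to push the final partial slot to the file.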
Seeks
To continue with this implementation offline, try adding an implementation of `io61_seek`. How should this function change the representation? Think about the invariants!
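As a hint at one design, here is a sketch for read caches only (other designs also work; see footnote 3 on letting `pos_tag` roam out of range). If the target offset already lies within the cached range, only `pos_tag` must move; otherwise the cache must be emptied and the file descriptor repositioned with `lseek`.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

struct io61_fcache {
    int fd;
    static constexpr off_t bufsize = 4096;
    unsigned char cbuf[bufsize];
    off_t tag, end_tag, pos_tag;
};

// Reposition a read cache so the next io61_read starts at offset `off`.
int io61_seek(io61_fcache* f, off_t off) {
    if (off >= f->tag && off <= f->end_tag) {
        // Target is inside the cached range: no system call needed.
        f->pos_tag = off;
        return 0;
    }
    // Otherwise empty the cache and move the kernel's file position.
    // Afterward all three tags equal `off`, restoring every invariant
    // (including "file position == end_tag" for read caches).
    off_t r = lseek(f->fd, off, SEEK_SET);
    if (r == -1) {
        return -1;
    }
    f->tag = f->pos_tag = f->end_tag = off;
    return 0;
}
```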
Aside: Running blktrace
If you’re interested in exploring the pattern of references made to a physical drive (i.e., an SSD or a hard disk), you can try running `blktrace`, a Linux program that helps debug the drive requests made by an operating system in the course of executing other programs. This is below the level of system calls: a single drive request might be made in response to multiple system calls, and many system calls (such as those that access the buffer cache) do not cause drive requests at all.

Install `blktrace` with `sudo apt install blktrace` (on Ubuntu), and run it like this:
$ df . # figure out what disk you’re on
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 61897344 18080492 41107316 31% /
$ sudo sh -c "blktrace /dev/sda1 &" # puts machine-readable output in `sda1.blktrace.*`
$ PROGRAMNAME ARGUMENTS...
$ sudo killall blktrace
=== sda1 ===
CPU 0: 282 events, 14 KiB data
CPU 1: 658 events, 31 KiB data
Total: 940 events (dropped 0), 45 KiB data
$ blkparse sda1 | less # parses data collected by blktrace
The default output of `blkparse` has everything anyone might want to know, and as a result is quite hard to read. Try this command line to restrict the output to disk requests (“-a issue”, which show up as lines with “D”) and request completions (“-a complete” and “C”).

$ blkparse -a issue -a complete -f "%5T.%9t: %2a %3d: %8NB @%9S [%C]\n" sda1 | less

With those arguments, `blkparse` output looks like this:
0.000055499: D R: 16384B @ 33902656 [bash]
0.073827694: C R: 16384B @ 33902656 [swapper/0]
0.074596596: D RM: 4096B @ 33556536 [bash]
0.090967020: C RM: 4096B @ 33556536 [swapper/0]
0.091078346: D RM: 4096B @ 33556528 [bash]
0.091270951: C RM: 4096B @ 33556528 [swapper/0]
0.091432386: D WS: 4096B @ 24284592 [bash]
0.091619438: C WS: 4096B @ 24284592 [swapper/0]
0.100106375: D WS: 61440B @ 17181720 [kworker/0:1H]
0.100526435: C WS: 61440B @ 17181720 [swapper/0]
0.100583894: D WS: 4096B @ 17181840 [jbd2/sda1-8]
| | | | | |
| | | | | |
| | | | | +--- responsible command
| | | | |
| | | | +----------- disk offset of read or write
| | | +-------------------- number of bytes read or written
| | |
| | +------------------------------ Read or Write (plus other info)
| +---------------------------------- D = request issued to device,
| C = request completed
|
+------------------------------------------- time of request
1. File descriptors that represent infinite streams do not have file positions. ↩︎
2. It is not allowed to return 0 as a “short read”, because 0 always means end of file. ↩︎
3. After adding support for `io61_seek`, some people break this invariant: they let `pos_tag` go outside the range [`tag`, `end_tag`) after an `io61_seek` call. Since any character that is read or written must access an offset in that range, these people then check whether `tag <= pos_tag <= end_tag` inside `io61_read` and `io61_write`. If `pos_tag` has gone out of range, these functions must reinitialize the cache before they continue. ↩︎