To compete with an existing caching scheme, as you'll do in this Unit's pset, it is helpful to understand the patterns in a sequence of references (a reference stream), how an existing cache changes that reference stream, and what opportunities exist for further performance optimization.
Toward that end, this section starts with an introduction to a very useful tool, `strace`, for gathering a program's stream of references to a file. It then leads you through several important types of access patterns in the reference stream, and it has you classify several different traces gathered by `strace`. Finally, this section ends by having you write code for some of the operations involved in coding a single-slot cache.
Running `strace`
`strace` is a Linux program that's great for debugging the system calls made by another program. You run `strace` like this:
$ strace -o strace.out PROGRAMNAME ARGUMENTS...
This runs `PROGRAMNAME` with the given `ARGUMENTS`, but it simultaneously snoops on all the system calls `PROGRAMNAME` makes and writes a readable summary of those system calls to `strace.out`. Look at that output by running `less strace.out`.
The first thirty or more lines of a strace are boilerplate caused by program startup. It's usually pretty easy to tell where the boilerplate ends and program execution begins; for example, you might look for the first `open` calls that refer to data files, rather than program libraries.
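For instance, once the boilerplate ends you might see lines like these (a fabricated but representative excerpt; the file names, descriptors, and sizes will differ in your traces):

openat(AT_FDCWD, "data.txt", O_RDONLY) = 3
read(3, "h", 1) = 1
read(3, "e", 1) = 1
read(3, "l", 1) = 1

Each line shows one system call, its arguments, and its return value; here a program opens a data file and reads it one byte at a time.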
Access patterns
An access pattern is a high-level description of a class of reference strings. That’s a bit technical: at a higher level, think of an access pattern as a general flavor of file or data access, described by the kinds of addresses that are accessed.
The different ...61 programs distributed in your pset directory perform different kinds of access and have different access patterns. Some especially important access patterns are as follows:
Sequential access
A sequential access pattern accesses a contiguous increasing sequence of addresses with nothing skipped, like 1, 2, 3, 4, 5, 6… or 1009, 1010, 1011, 1012, 1013….
Random access
A random access pattern skips around in address space at random, like 1, 10403, 96, 2, 51934, ….
Reverse-sequential access
A reverse-sequential access pattern accesses a contiguous decreasing sequence of addresses with nothing skipped, like 60274, 60273, 60272, 60271, 60270, ….
Strided access
A strided access pattern accesses a sequence of addresses with a uniform skip between each referenced address. For example, the sequence 1, 1001, 2001, 3001, 4001, 5001, … is a strided access pattern with a stride of +1000. Sequential access is a kind of strided access with stride +1, and reverse-sequential access has stride -1.
Variants and combinations
These terms describe whole access patterns, but they are often used more loosely, to describe parts of a reference string. For instance, a reference string might start with a sequential region, then skip ahead by tens of thousands of bytes and have another sequential region, then do some stuff you don't understand, which you classify as "random."
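To make these patterns concrete, here is a minimal sketch of a strided reader at the system-call level (the file name `data.bin` and the +1000 stride are made up for illustration):

#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("data.bin", O_RDONLY);   // hypothetical input file
    if (fd < 0) {
        return 1;
    }
    unsigned char byte;
    // Strided access with stride +1000: read one byte at offsets
    // 0, 1000, 2000, .... With stride +1 this would be sequential access.
    for (off_t off = 0; pread(fd, &byte, 1, off) == 1; off += 1000) {
        // ... process `byte` ...
    }
    close(fd);
}

Run under strace, this program produces a stream of `pread` calls whose offsets step by 1000, which is exactly a strided reference string.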
Access size
To speed up the performance of a program, caches often transform the reads and writes done by the program at a particular level of the memory hierarchy into a different set of reads and writes to the next higher level of the memory hierarchy.
For example, consider the processor cache, which tries to satisfy simple reads and writes to main memory (DRAM technology) with faster responses from a cache made of SRAM circuits (a memory technology that is faster and more costly than DRAM). The processor cache mostly sees requested sizes of 1, 2, 4, or 8 bytes (corresponding to `movb`, `movw`, `movl`, and `movq`); but when a processor cache experiences a miss (i.e., it hasn't cached the requested data), it requests not a few bytes from main memory, but a large block of bytes (e.g., a block of 64 or 128 bytes). This block is 64-byte or 128-byte aligned, and it contains the requested data, which the cache returns to complete the original `mov*` request.
Let the requested size be the amount of data requested from above by the cache’s user, and the block size be the amount of data that the cache itself requests from its underlying storage. As the example shows, these access sizes need not be the same!
As an important side effect, access-size transformations often transform a reference string without temporal locality into one with such locality. For instance, consider a sequential access pattern starting at address 0x401001. The processor requests bytes in this order:
| Byte address | Cache line address (aligned by 64) |
|---|---|
| 0x401001 | 0x401000 |
| 0x401002 | 0x401000 |
| 0x401003 | 0x401000 |
| … | … |
| 0x40103e | 0x401000 |
| 0x40103f | 0x401000 |
| 0x401040 | 0x401040 |
| 0x401041 | 0x401040 |
| 0x401042 | 0x401040 |
| … | … |
Although the requested byte addresses do not exhibit temporal locality (i.e., no byte address is requested twice), the referenced cache lines do! The access-size transformation has turned spatial locality in the processor's reference string into temporal locality in the cache's reference string. This is why single-slot caches can be effective in practice even on reference strings that don't just access the same byte over and over and over again. In particular, if the processor requests 1 byte at a time from a sequential set of addresses and the cache reads 4096 bytes at once on the first miss, then every read but the first is fulfilled from the cache, saving 4095 accesses to slower storage!
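The table's second column can be computed by rounding each byte address down to a multiple of the line size. Here is a minimal sketch of that alignment arithmetic, assuming a 64-byte line; for power-of-two sizes, `addr & ~(linesize - 1)` is equivalent to `addr - addr % linesize`:

#include <cstdio>

int main() {
    const unsigned long linesize = 64;    // assumed cache line size
    unsigned long addrs[] = {0x401001, 0x40103f, 0x401040};
    for (unsigned long addr : addrs) {
        // Clear the low bits to round down to a line boundary.
        unsigned long line = addr & ~(linesize - 1);
        printf("%#lx -> line %#lx\n", addr, line);
    }
}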
A cache, unfortunately, doesn't always do the right thing. Consider a user that reads more than 4096 bytes in a single request. Our single-slot cache that fetches only 4096 bytes at a time now causes the system to make multiple reads from slower storage! Knowing the most common access patterns matters if you want to improve the average performance of your system.
Strace exercises
Test your knowledge by characterizing access patterns and cache types given some straces!
In the `s05` directory, you'll see a bunch of truncated strace output in files `straceNN.out`. For each output, characterize its access pattern, and describe what kind of I/O cache the program might have implemented, if any.
- Sequential, byte-at-a-time input, 4096-byte block output
- Sequential, 4096-byte block input, 1024-byte block output
- Sequential, 3-byte block input, 1024-byte block output
- 4096-byte stride, direct characterwise system calls
- Reverse-sequential, stdio (4096-byte single-block cache)
- Reverse-sequential, bytewise (no cache)
- Sequential, 4096-byte block input, byte-at-a-time output
- 100-byte stride, stdio
- Sequential, stdio (4096-byte block input and output)
- Sequential, 4096-byte block input, 4096-byte block output (could be stdio)
- 4096-byte stride, stdio (so it refreshes 4096 bytes on every call)
Single-slot cache
In the rest of this section, we're going to work on a specific representation of a single-slot I/O cache. The purpose is to explain such caches and to get you used to thinking in terms of file offsets. Here's the representation.
struct io61_file {
    int fd;
    static constexpr off_t bufsize = 4096;   // or whatever
    unsigned char cbuf[bufsize];
    off_t tag;       // file offset of first byte in cache (0 when file is opened)
    off_t end_tag;   // file offset one past last valid byte in cache
    off_t pos_tag;   // file offset of next char to read in cache
};
`pos_tag` refers to the current read or write position, equivalent to the file's file offset.
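For concreteness, here is a hypothetical, simplified open helper showing how the offsets might be initialized so the invariants below hold from the start (the pset's real interface may differ; this is just a sketch):

io61_file* io61_fdopen(int fd) {
    io61_file* f = new io61_file;
    f->fd = fd;
    f->tag = f->pos_tag = f->end_tag = 0;   // empty slot at file offset 0
    return f;
}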
Here are some facts about this cache representation.
- `tag <= end_tag`. This is an invariant: an expression that is always true for every cache. You could encode the invariant as an assertion: `assert(tag <= end_tag)`.
- `end_tag - tag <= bufsize`. Notice that the difference is NOT always equal to `bufsize`. Allowing this permits us to encode things like ...
- If `tag == end_tag`, then the cache's slot is empty. An empty slot says that nothing is yet cached.
- `tag <= pos_tag && pos_tag <= end_tag`. This is another invariant: the `pos_tag` lies between the `tag` and `end_tag`, inclusive.
- The file descriptor's current file position equals `end_tag` for read caches (and seekable files). It equals `tag` for write caches.
- The next `io61_read` or `io61_write` call should logically start at file offset `pos_tag`. That means the first byte read by `io61_read` should come from file offset `pos_tag`, and similarly for `io61_write`.
- If `i >= tag` and `i < end_tag`, then byte `cbuf[i - tag]` contains the byte at file offset `i` (the sketch below illustrates this mapping).
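To illustrate that last mapping, here is a hypothetical helper (not part of the pset's interface) that returns the cached byte at a given file offset, or -1 if that offset isn't cached:

int cached_byte_at(const io61_file* f, off_t i) {
    // Valid cached offsets lie in [tag, end_tag).
    if (i >= f->tag && i < f->end_tag) {
        return f->cbuf[i - f->tag];
    }
    return -1;   // offset not cached
}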
Fill the cache
Implement this function, which should fill the cache with data read from the file. You will make a `read` system call.
void io61_fill(io61_file* f) {
    // Fill the read cache with new data, starting from file offset `end_tag`.
    // Only called for read caches.

    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // Reset the cache to empty.
    f->tag = f->pos_tag = f->end_tag;

    // Recheck invariants (good practice!).
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
}
void io61_fill(io61_file* f) {
    // Fill the read cache with new data, starting from file offset `end_tag`.
    // Only called for read caches.

    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    /* ANSWER */
    // Reset the cache to empty.
    f->tag = f->pos_tag = f->end_tag;

    // Read data.
    ssize_t n = read(f->fd, f->cbuf, f->bufsize);
    if (n >= 0) {
        f->end_tag = f->tag + n;
    }

    // Recheck invariants (good practice!).
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
}
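Note that `read` may return fewer than `bufsize` bytes, and returns 0 at end of file; the answer handles both cases, since `end_tag` advances by exactly the number of bytes actually read, leaving an empty slot (`tag == end_tag`) at end of file.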
Easy read
Implement `io61_read`, assuming that the desired data is entirely contained within the current cache slot.
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // The desired data is guaranteed to lie within this cache slot.
    assert(sz <= f->bufsize && f->pos_tag + sz <= f->end_tag);
}
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // The desired data is guaranteed to lie within this cache slot.
    assert(sz <= f->bufsize && f->pos_tag + sz <= f->end_tag);

    /* ANSWER */
    memcpy(buf, &f->cbuf[f->pos_tag - f->tag], sz);
    f->pos_tag += sz;
    return sz;
}
Full read
Implement `io61_read` without that assumption. You will need to call `io61_fill`, and will use a loop. You may assume that no call to `read` ever returns an error.
Our tests do not check whether your IO61 library handles errors correctly. This means that you may assume that no `read` or `write` system call returns a permanent error. But the best implementations will handle errors gracefully: if a `read` system call returns a permanent error, then `io61_read` should return -1. (What to do with restartable errors is up to you, but most I/O libraries retry on encountering `EINTR`.)
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
}
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    /* ANSWER */
    size_t pos = 0;
    while (pos < sz) {
        if (f->pos_tag == f->end_tag) {
            io61_fill(f);
            if (f->pos_tag == f->end_tag) {
                break;
            }
        }
        // This would be faster if you used `memcpy`!
        buf[pos] = f->cbuf[f->pos_tag - f->tag];
        ++f->pos_tag;
        ++pos;
    }
    return pos;
}
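As the comment suggests, the byte-at-a-time loop can be sped up by copying a chunk per iteration. Here is one possible `memcpy`-based variant, a sketch under the same assumptions (it needs `<algorithm>` for `std::min` and `<cstring>` for `memcpy`):

ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    size_t pos = 0;
    while (pos < sz) {
        if (f->pos_tag == f->end_tag) {
            io61_fill(f);
            if (f->pos_tag == f->end_tag) {
                break;   // end of file
            }
        }
        // Copy as many cached bytes as possible in one step.
        size_t ncopy = std::min(sz - pos, size_t(f->end_tag - f->pos_tag));
        memcpy(&buf[pos], &f->cbuf[f->pos_tag - f->tag], ncopy);
        f->pos_tag += ncopy;
        pos += ncopy;
    }
    return pos;
}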
Easy write
When reading from a cached file, the library fills the cache using system calls and empties it to the user. Writing to a cached file is the converse: the library fills the cache with user data and empties it using system calls.
For a read cache, the cache buffer region between `pos_tag` and `end_tag` contains data that has not yet been read, and the region after `end_tag` (if any) is invalid. There are several reasonable choices for write caches; in these exercises we force `pos_tag == end_tag` as an additional invariant.
Implement `io61_write`, assuming that space for the desired data is available within the current cache slot.
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);

    // The desired data is guaranteed to fit within this cache slot.
    assert(sz <= f->bufsize && f->pos_tag + sz <= f->tag + f->bufsize);
}
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);

    // The desired data is guaranteed to fit within this cache slot.
    assert(sz <= f->bufsize && f->pos_tag + sz <= f->tag + f->bufsize);

    /* ANSWER */
    memcpy(&f->cbuf[f->pos_tag - f->tag], buf, sz);
    f->pos_tag += sz;
    f->end_tag += sz;
    return sz;
}
Flush
Implement a function that empties a write cache by flushing its data using a system call. As before, you may assume that the system call succeeds (all data is written).
void io61_flush(io61_file* f) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);
}
void io61_flush(io61_file* f) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);

    /* ANSWER */
    ssize_t n = write(f->fd, f->cbuf, f->pos_tag - f->tag);
    assert((size_t) n == f->pos_tag - f->tag);
    f->tag = f->pos_tag;
}
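If you drop the assumption that the system call writes everything, `write` may perform a short write or be interrupted. A more defensive variant might loop, as sketched below (error reporting is omitted; the `EINTR` retry follows the convention mentioned earlier, and `<cerrno>` and `<unistd.h>` are assumed):

void io61_flush(io61_file* f) {
    size_t total = f->pos_tag - f->tag;
    size_t nwritten = 0;
    while (nwritten < total) {
        ssize_t n = write(f->fd, f->cbuf + nwritten, total - nwritten);
        if (n >= 0) {
            nwritten += n;    // short write: continue from where we left off
        } else if (errno != EINTR) {
            break;            // permanent error; a full library would report it
        }                     // else: restartable error, retry
    }
    f->tag = f->pos_tag;
}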
Full write
Implement the full `io61_write`. You will call `io61_flush` and use a loop.
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);
}
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);

    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);

    /* ANSWER */
    size_t pos = 0;
    while (pos < sz) {
        if (f->end_tag == f->tag + f->bufsize) {
            io61_flush(f);
        }
        // This would be faster if you used `memcpy`!
        f->cbuf[f->pos_tag - f->tag] = buf[pos];
        ++f->pos_tag;
        ++f->end_tag;
        ++pos;
    }
    return pos;
}
Seeks
Now, add seek to the mix. How should a seek be implemented for this single-slot cache?
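One possible approach for a read cache, sketched under the invariants above (not the only reasonable design): if the target offset already lies within the slot, moving `pos_tag` suffices and no system call is needed; otherwise, empty the slot and reposition the file descriptor with `lseek` (from `<unistd.h>`). A write cache would flush first.

int io61_seek(io61_file* f, off_t off) {
    if (off >= f->tag && off <= f->end_tag) {
        // The target lies within (or just past) the cached data:
        // no system call required.
        f->pos_tag = off;
        return 0;
    }
    // Miss: empty the slot and reposition the kernel's file offset.
    // (A write cache would call io61_flush(f) before seeking.)
    if (lseek(f->fd, off, SEEK_SET) != off) {
        return -1;
    }
    f->tag = f->pos_tag = f->end_tag = off;
    return 0;
}

Note that after the miss path, `tag == pos_tag == end_tag == off` and the file descriptor's position equals `end_tag`, so the invariants above still hold.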
Running `blktrace`
If you're interested in exploring the pattern of references made to a physical hard disk, you can try running `blktrace`, a Linux program that helps debug the disk requests made by an operating system in the course of executing other programs. This is below the level of system calls: a single disk request might be made in response to multiple system calls, and many system calls (such as those that access the buffer cache) do not cause disk requests at all.
Install `blktrace` with `sudo apt install blktrace` (on Ubuntu), and run it like this:
$ df . # figure out what disk you’re on
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 61897344 18080492 41107316 31% /
$ sudo sh -c "blktrace /dev/sda1 &" # puts machine-readable output in `sda1.blktrace.*`
$ PROGRAMNAME ARGUMENTS...
$ sudo killall blktrace
=== sda1 ===
CPU 0: 282 events, 14 KiB data
CPU 1: 658 events, 31 KiB data
Total: 940 events (dropped 0), 45 KiB data
$ blkparse sda1 | less # parses data collected by blktrace
The default output of `blkparse` has everything anyone might want to know, and as a result is quite hard to read. Try this command line to restrict the output to disk requests ("`-a issue`", which show up as lines with "`D`") and request completions ("`-a complete`" and "`C`").
$ blkparse -a issue -a complete -f "%5T.%9t: %2a %3d: %8NB @%9S [%C]\n" sda1 | less
With those arguments, `blkparse` output looks like this:
0.000055499: D R: 16384B @ 33902656 [bash]
0.073827694: C R: 16384B @ 33902656 [swapper/0]
0.074596596: D RM: 4096B @ 33556536 [bash]
0.090967020: C RM: 4096B @ 33556536 [swapper/0]
0.091078346: D RM: 4096B @ 33556528 [bash]
0.091270951: C RM: 4096B @ 33556528 [swapper/0]
0.091432386: D WS: 4096B @ 24284592 [bash]
0.091619438: C WS: 4096B @ 24284592 [swapper/0]
0.100106375: D WS: 61440B @ 17181720 [kworker/0:1H]
0.100526435: C WS: 61440B @ 17181720 [swapper/0]
0.100583894: D WS: 4096B @ 17181840 [jbd2/sda1-8]
| | | | | |
| | | | | |
| | | | | +--- responsible command
| | | | |
| | | | +----------- disk offset of read or write
| | | +-------------------- number of bytes read or written
| | |
| | +------------------------------ Read or Write (plus other info)
| +---------------------------------- D = request issued to device,
| C = request completed
|
+------------------------------------------- time of request