System calls
System calls are special “function calls” that are handled not by the calling program, but rather by the operating system. Programs interact with other programs, and with the outside world, by making system calls. Whenever a program prints something or accepts input, a system call was involved. (A program’s registers and memory are private to that program, so normal function calls and instruction executions do not interact with the outside world.)
System calls are invoked by special hardware instructions, such as the x86-64 `syscall` instruction. When the processor encounters a `syscall`, it saves the machine state and switches context to the kernel, which is the special operating system program that executes with all privilege, allowing it to control hardware. The kernel interprets the system call according to its rules and resumes the calling process when complete.
How to read an strace
`strace` is a Linux program that’s great for debugging the system calls made by another program. You run `strace` like this:

$ strace -o strace.out PROGRAMNAME ARGUMENTS...

This runs PROGRAMNAME with the given ARGUMENTS, but it simultaneously snoops on all the system calls PROGRAMNAME makes and writes a readable summary of those system calls to `strace.out`. Look at that output by running `less strace.out`.
The first thirty or more lines of an strace are boilerplate caused by program startup. It’s usually pretty easy to tell where the boilerplate ends and program execution begins; for example, you might look for the first `open` calls that refer to data files, rather than program libraries.
How to read a blktrace
`blktrace` is a Linux program that helps debug the disk requests made by an operating system in the course of executing other programs. This is below the level of system calls: a single disk request might be made in response to multiple system calls, and many system calls (such as those that access the buffer cache) do not cause disk requests at all.

Install blktrace with `sudo apt install blktrace` (on Ubuntu), and run it like this:
$ df . # figure out what disk you’re on
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 61897344 18080492 41107316 31% /
$ sudo sh -c "blktrace /dev/sda1 &" # puts machine-readable output in `sda1.blktrace.*`
$ PROGRAMNAME ARGUMENTS...
$ sudo killall blktrace
=== sda1 ===
CPU 0: 282 events, 14 KiB data
CPU 1: 658 events, 31 KiB data
Total: 940 events (dropped 0), 45 KiB data
$ blkparse sda1 | less
The default output of `blkparse` has everything anyone might want to know, and as a result is quite hard to read. Try this command line to restrict the output to disk requests (“`-a issue`”, which show up as lines with “`D`”) and request completions (“`-a complete`” and “`C`”).
$ blkparse -a issue -a complete -f "%5T.%9t: %2a %3d: %8NB @%9S [%C]\n" sda1 | less
With those arguments, `blkparse` output looks like this:
0.000055499: D R: 16384B @ 33902656 [bash]
0.073827694: C R: 16384B @ 33902656 [swapper/0]
0.074596596: D RM: 4096B @ 33556536 [bash]
0.090967020: C RM: 4096B @ 33556536 [swapper/0]
0.091078346: D RM: 4096B @ 33556528 [bash]
0.091270951: C RM: 4096B @ 33556528 [swapper/0]
0.091432386: D WS: 4096B @ 24284592 [bash]
0.091619438: C WS: 4096B @ 24284592 [swapper/0]
0.100106375: D WS: 61440B @ 17181720 [kworker/0:1H]
0.100526435: C WS: 61440B @ 17181720 [swapper/0]
0.100583894: D WS: 4096B @ 17181840 [jbd2/sda1-8]
| | | | | |
| | | | | |
| | | | | +--- responsible command
| | | | |
| | | | +----------- disk offset of read or write
| | | +-------------------- number of bytes read or written
| | |
| | +------------------------------ Read or Write (plus other info)
| +---------------------------------- D = request issued to device,
| C = request completed
|
+------------------------------------------- time of request
Access patterns
An access pattern is a high-level description of a class of reference strings. That’s a bit technical: at a higher level, think of an access pattern as a general flavor of file or data access, described by the kinds of addresses that are accessed.
The different ...61 programs in your problem set perform different kinds of access, and have different access patterns. Some especially important access patterns are as follows.
Sequential access
A sequential access pattern accesses a contiguous increasing sequence of addresses with nothing skipped, like 1, 2, 3, 4, 5, 6… or 1009, 1010, 1011, 1012, 1013….
Random access
A random access pattern skips around in address space at random, like 1, 10403, 96, 2, 51934, ….
Reverse-sequential access
A reverse-sequential access pattern accesses a contiguous decreasing sequence of addresses with nothing skipped, like 60274, 60273, 60272, 60271, 60270, ….
Strided access
A strided access pattern accesses an increasing sequence of addresses, but with large uniform skips, like 1, 1001, 2001, 3001, 4001, 5001, …. The “stride” of an access pattern is the distance between adjacent accesses. Sequential access is a kind of strided access but with “stride 1.”
Variants and combinations
These terms describe whole access patterns, but they are often used more loosely, to describe parts of a reference string. For instance, we might see a reference string that starts with a sequential region, then skips ahead by tens of thousands of bytes and has another sequential region, then does some stuff that you don’t understand so you call it “random,” etc.
Access size
Many caches transform access sizes, or units of access, in the process of handling an access request. Let the requested size be the amount of data requested from above by the cache’s user, and the implementation size be the amount of data that the cache itself requests from its underlying storage. Then these sizes need not be the same!
For example, consider the processor cache. The processor cache mostly has requested sizes of 1, 2, 4, or 8 bytes (corresponding to `movb`, `movw`, `movl`, and `movq`); but when a processor cache experiences a miss, its implementation size is 64 or 128 bytes: it always loads 64 or 128 aligned bytes of memory from the primary memory, then answers the request using that region.
As an important side effect, access size transformations often transform a reference string effectively to have more repeated accesses. For instance, consider a sequential access pattern of bytes on the processor, starting at address 0x401001. Then the processor accesses bytes in this order:
| Byte address | Cache line address (aligned by 64) |
|---|---|
| 0x401001 | 0x401000 |
| 0x401002 | 0x401000 |
| 0x401003 | 0x401000 |
| … | … |
| 0x40103e | 0x401000 |
| 0x40103f | 0x401000 |
| 0x401040 | 0x401040 |
| 0x401041 | 0x401040 |
| 0x401042 | 0x401040 |
| … | … |
This is why single-slot caches can be effective in practice even on reference strings that don’t just access the same byte over and over and over again.
Strace exercises
Test your knowledge by characterizing access patterns and cache types given some straces!
In the `s03` directory, you’ll see a bunch of truncated strace output in files `straceNN.out`. For each output, characterize its access pattern, and describe what kind of I/O cache the program might have implemented, if any.
- Sequential, byte-at-a-time input, 4096-byte block output
- Sequential, 4096-byte block input, 1024-byte block output
- Sequential, 3-byte block input, 1024-byte block output
- 4096-byte stride, direct characterwise system calls
- Reverse-sequential, stdio (4096-byte single-block cache)
- Reverse-sequential, bytewise (no cache)
- Sequential, 4096-byte block input, byte-at-a-time output
- 100-byte stride, stdio
- Sequential, stdio (4096-byte block input and output)
- Sequential, 4096-byte block input, 4096-byte block output (could be stdio)
- 4096-byte stride, stdio (so it refreshes 4096 bytes on every call)
Single-slot cache
In the rest of this section, we’re going to work on a specific representation of a single-slot I/O cache. The purpose is to explain such caches and to get you used to thinking in terms of file offsets. Here’s the representation.
struct io61_file {
int fd;
static constexpr off_t bufsize = 4096; // or whatever
unsigned char cbuf[bufsize];
off_t tag; // file offset of first byte in cache (0 when file is opened)
off_t end_tag; // file offset one past last valid byte in cache
off_t pos_tag; // file offset of next char to read in cache
};
`pos_tag` refers to the current read position. We include it to ensure that the single-slot cache can alter the access size and respond to any requested number of bytes.
Here are some facts about this cache representation.

- `tag <= end_tag`. (This is an invariant: an expression that is always true for every cache. You could encode the invariant as an assertion: `assert(tag <= end_tag)`.)
- `end_tag - tag <= bufsize`.
- If `tag == end_tag`, then the cache’s slot is empty.
- `tag <= pos_tag && pos_tag <= end_tag` (the `pos_tag` lies between the `tag` and `end_tag`).
- The file descriptor’s current file position equals `end_tag` for read caches (and seekable files). It equals `tag` for write caches.
- The next `io61_read` or `io61_write` call should logically start at file offset `pos_tag`. That means the first byte read by `io61_read` should come from file offset `pos_tag`, and similarly for `io61_write`.
- If `i >= tag` and `i < end_tag`, then byte `cbuf[i - tag]` contains the byte at file offset `i`.
Fill the cache
Implement this function, which should fill the cache with data read from the file. You will make a `read` system call.
void io61_fill(io61_file* f) {
// Fill the read cache with new data, starting from file offset `end_tag`.
// Only called for read caches.
// Check invariants.
assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
assert(f->end_tag - f->pos_tag <= f->bufsize);
// Reset the cache to empty.
f->tag = f->pos_tag = f->end_tag;
// Recheck invariants (good practice!).
assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
assert(f->end_tag - f->pos_tag <= f->bufsize);
}
void io61_fill(io61_file* f) {
    // Fill the read cache with new data, starting from file offset `end_tag`.
    // Only called for read caches.
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
    /* ANSWER */
    // Reset the cache to empty.
    f->tag = f->pos_tag = f->end_tag;
    // Read data.
    ssize_t n = read(f->fd, f->cbuf, f->bufsize);
    if (n >= 0) {
        f->end_tag = f->tag + n;
    }
    // Recheck invariants (good practice!).
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
}
Easy read
Implement `io61_read`, assuming that the desired data is entirely contained within the current cache slot.
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
// Check invariants.
assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
assert(f->end_tag - f->pos_tag <= f->bufsize);
// The desired data is guaranteed to lie within this cache slot.
assert(sz <= f->bufsize && f->pos_tag + sz <= f->end_tag);
}
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
    // The desired data is guaranteed to lie within this cache slot.
    assert(sz <= f->bufsize && f->pos_tag + sz <= f->end_tag);
    /* ANSWER */
    memcpy(buf, &f->cbuf[f->pos_tag - f->tag], sz);
    f->pos_tag += sz;
    return sz;
}
Full read
Implement `io61_read` without that assumption. You will need to call `io61_fill`, and will use a loop. You may assume that no call to `read` ever returns an error.
Our tests do not check whether your IO61 library handles errors correctly. This means that you may assume that no `read` or `write` system call returns a permanent error. But the best implementations will handle errors gracefully: if a `read` system call returns a permanent error, then `io61_read` should return -1. (What to do with restartable errors is up to you, but most I/O libraries retry on encountering `EINTR`.)
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
// Check invariants.
assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
assert(f->end_tag - f->pos_tag <= f->bufsize);
}
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
    /* ANSWER */
    size_t pos = 0;
    while (pos < sz) {
        if (f->pos_tag == f->end_tag) {
            io61_fill(f);
            if (f->pos_tag == f->end_tag) {
                break;
            }
        }
        // This would be faster if you used `memcpy`!
        buf[pos] = f->cbuf[f->pos_tag - f->tag];
        ++f->pos_tag;
        ++pos;
    }
    return pos;
}
Easy write
When reading from a cached file, the library fills the cache using system calls and empties it to the user. Writing to a cached file is the converse: the library fills the cache with user data and empties it using system calls.
For a read cache, the cache buffer region between `pos_tag` and `end_tag` contains data that has not yet been read, and the region after `end_tag` (if any) is invalid. There are several reasonable choices for write caches; in these exercises we force `pos_tag == end_tag` as an additional invariant.

Implement `io61_write`, assuming that space for the desired data is available within the current cache slot.
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
// Check invariants.
assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
assert(f->end_tag - f->pos_tag <= f->bufsize);
// Write cache invariant.
assert(f->pos_tag == f->end_tag);
// The desired data is guaranteed to fit within this cache slot.
assert(sz <= f->bufsize && f->pos_tag + sz <= f->tag + f->bufsize);
}
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);
    // The desired data is guaranteed to fit within this cache slot.
    assert(sz <= f->bufsize && f->pos_tag + sz <= f->tag + f->bufsize);
    /* ANSWER */
    memcpy(&f->cbuf[f->pos_tag - f->tag], buf, sz);
    f->pos_tag += sz;
    f->end_tag += sz;
    return sz;
}
Flush
Implement a function that empties a write cache by flushing its data using a system call. As before, you may assume that the system call succeeds (all data is written).
void io61_flush(io61_file* f) {
// Check invariants.
assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
assert(f->end_tag - f->pos_tag <= f->bufsize);
// Write cache invariant.
assert(f->pos_tag == f->end_tag);
}
void io61_flush(io61_file* f) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);
    /* ANSWER */
    ssize_t n = write(f->fd, f->cbuf, f->pos_tag - f->tag);
    assert((size_t) n == (size_t) (f->pos_tag - f->tag));
    f->tag = f->pos_tag;
}
Full write
Implement a full version of `io61_write`. You will call `io61_flush` and use a loop.
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
// Check invariants.
assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
assert(f->end_tag - f->pos_tag <= f->bufsize);
// Write cache invariant.
assert(f->pos_tag == f->end_tag);
}
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);
    /* ANSWER */
    size_t pos = 0;
    while (pos < sz) {
        if (f->end_tag == f->tag + f->bufsize) {
            io61_flush(f);
        }
        // This would be faster if you used `memcpy`!
        f->cbuf[f->pos_tag - f->tag] = buf[pos];
        ++f->pos_tag;
        ++f->end_tag;
        ++pos;
    }
    return pos;
}
Seeks
Now, add seek to the mix. How should a seek be implemented for this single-slot cache?