System calls
System calls are special “function calls” that are handled not by the calling program, but rather by the operating system. Programs interact with other programs, and with the outside world, by making system calls. Whenever a program prints something or accepts input, a system call was involved. (A program’s registers and memory are private to that program, so normal function calls and instruction executions do not interact with the outside world.)
System calls are invoked by special hardware instructions, such as the x86-64 `syscall` instruction. When the processor encounters a `syscall`, it saves the machine state and switches context to the kernel, which is the special operating system program that executes with all privilege, allowing it to control hardware. The kernel interprets the system call according to its rules and resumes the calling process when complete.
How to read an strace
`strace` is a Linux program that’s great for debugging the system calls made by another program. You run `strace` like this:

$ strace -o strace.out PROGRAMNAME ARGUMENTS...

This runs PROGRAMNAME with the given ARGUMENTS, but it simultaneously snoops on all the system calls PROGRAMNAME makes and writes a readable summary of those system calls to `strace.out`. Look at that output by running `less strace.out`.
The first thirty or more lines of an strace are boilerplate caused by program startup. It’s usually pretty easy to tell where the boilerplate ends and program execution begins; for example, you might look for the first `open` calls that refer to data files, rather than program libraries.
How to read a blktrace
`blktrace` is a Linux program that helps debug the disk requests made by an operating system in the course of executing other programs. This is below the level of system calls: a single disk request might be made in response to multiple system calls, and many system calls (such as those that access the buffer cache) do not cause disk requests at all.

Install blktrace with `sudo apt install blktrace` (on Ubuntu), and run it like this:
$ df . # figure out what disk you’re on
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 61897344 18080492 41107316 31% /
$ sudo sh -c "blktrace /dev/sda1 &" # puts machine-readable output in `sda1.blktrace.*`
$ PROGRAMNAME ARGUMENTS...
$ sudo killall blktrace
=== sda1 ===
CPU 0: 282 events, 14 KiB data
CPU 1: 658 events, 31 KiB data
Total: 940 events (dropped 0), 45 KiB data
$ blkparse sda1 | less
The default output of `blkparse` has everything anyone might want to know, and as a result is quite hard to read. Try this command line to restrict the output to disk requests (“`-a issue`”, which show up as lines with “`D`”) and request completions (“`-a complete`” and “`C`”).
$ blkparse -a issue -a complete -f "%5T.%9t: %2a %3d: %8NB @%9S [%C]\n" sda1 | less
With those arguments, `blkparse` output looks like this:
0.000055499: D R: 16384B @ 33902656 [bash]
0.073827694: C R: 16384B @ 33902656 [swapper/0]
0.074596596: D RM: 4096B @ 33556536 [bash]
0.090967020: C RM: 4096B @ 33556536 [swapper/0]
0.091078346: D RM: 4096B @ 33556528 [bash]
0.091270951: C RM: 4096B @ 33556528 [swapper/0]
0.091432386: D WS: 4096B @ 24284592 [bash]
0.091619438: C WS: 4096B @ 24284592 [swapper/0]
0.100106375: D WS: 61440B @ 17181720 [kworker/0:1H]
0.100526435: C WS: 61440B @ 17181720 [swapper/0]
0.100583894: D WS: 4096B @ 17181840 [jbd2/sda1-8]
| | | | | |
| | | | | |
| | | | | +--- responsible command
| | | | |
| | | | +----------- disk offset of read or write
| | | +-------------------- number of bytes read or written
| | |
| | +------------------------------ Read or Write (plus other info)
| +---------------------------------- D = request issued to device,
| C = request completed
|
+------------------------------------------- time of request
Access patterns
An access pattern is a high-level description of a class of reference strings. That’s a bit technical: at a higher level, think of an access pattern as a general flavor of file or data access, described by the kinds of addresses that are accessed.
The different ...61 programs in your problem set perform different kinds of access, and have different access patterns. Some especially important access patterns are as follows.
Sequential access
A sequential access pattern accesses a contiguous increasing sequence of addresses with nothing skipped, like 1, 2, 3, 4, 5, 6… or 1009, 1010, 1011, 1012, 1013….
Random access
A random access pattern skips around in address space at random, like 1, 10403, 96, 2, 51934, ….
Reverse-sequential access
A reverse-sequential access pattern accesses a contiguous decreasing sequence of addresses with nothing skipped, like 60274, 60273, 60272, 60271, 60270, ….
Strided access
A strided access pattern accesses an increasing sequence of addresses, but with large uniform skips, like 1, 1001, 2001, 3001, 4001, 5001, …. The “stride” of an access pattern is the distance between adjacent accesses. Sequential access is a kind of strided access but with “stride 1.”
Variants and combinations
These terms describe whole access patterns, but they are often used more loosely, to describe parts of a reference string. For instance, we might see a reference string that starts with a sequential region, then skips ahead by tens of thousands of bytes and has another sequential region, then does some stuff that you don’t understand so you call it “random,” etc.
Access size
Many caches transform access sizes, or units of access, in the process of handling an access request. Let the requested size be the amount of data requested from above by the cache’s user, and the implementation size be the amount of data that the cache itself requests from its underlying storage. Then these sizes need not be the same!
For example, consider the processor cache. The processor cache mostly has requested sizes of 1, 2, 4, or 8 bytes (corresponding to `movb`, `movw`, `movl`, and `movq`); but when a processor cache experiences a miss, its implementation size is 64 or 128 bytes: it always loads 64 or 128 aligned bytes of memory from the primary memory, then answers the request using that region.
As an important side effect, access size transformations often transform a reference string effectively to have more repeated accesses. For instance, consider a sequential access pattern of bytes on the processor, starting at address 0x401001. Then the processor accesses bytes in this order:
| Byte address | Cache line address (aligned by 64) |
|---|---|
| 0x401001 | 0x401000 |
| 0x401002 | 0x401000 |
| 0x401003 | 0x401000 |
| … | … |
| 0x40103e | 0x401000 |
| 0x40103f | 0x401000 |
| 0x401040 | 0x401040 |
| 0x401041 | 0x401040 |
| 0x401042 | 0x401040 |
| … | … |
This is why single-slot caches can be effective in practice even on reference strings that don’t just access the same byte over and over and over again.
Strace exercises
Test your knowledge by characterizing access patterns and cache types given some straces!
In the `s03` directory, you’ll see a bunch of truncated strace output in files `straceNN.out`. For each output, characterize its access pattern, and describe what kind of I/O cache the program might have implemented, if any.
- Sequential, byte-at-a-time input, 4096-byte block output
- Sequential, 4096-byte block input, 1024-byte block output
- Sequential, 3-byte block input, 1024-byte block output
- 4096-byte stride, direct characterwise system calls
- Reverse-sequential, stdio (4096-byte single-block cache)
- Reverse-sequential, bytewise (no cache)
- Sequential, 4096-byte block input, byte-at-a-time output
- 100-byte stride, stdio
- Sequential, stdio (4096-byte block input and output)
- Sequential, 4096-byte block input, 4096-byte block output (could be stdio)
- 4096-byte stride, stdio (so it refreshes 4096 bytes on every call)
Single-slot cache
In the rest of this section, we’re going to work on a specific representation of a single-slot I/O cache. The purpose is to explain such caches and to get you used to thinking in terms of file offsets. Here’s the representation.
struct io61_file {
int fd;
static constexpr off_t bufsize = 4096; // or whatever
unsigned char cbuf[bufsize];
off_t tag; // file offset of first byte in cache (0 when file is opened)
off_t end_tag; // file offset one past last valid byte in cache
off_t pos_tag; // file offset of next char to read in cache
};
`pos_tag` refers to the current read position. We include it to ensure that the single-slot cache can alter the access size and respond to any requested number of bytes.
Here are some facts about this cache representation.

- `tag <= end_tag`. (This is an invariant: an expression that is always true for every cache. You could encode the invariant as an assertion: `assert(tag <= end_tag)`.)
- `end_tag - tag <= bufsize`.
- If `tag == end_tag`, then the cache’s slot is empty.
- `tag <= pos_tag && pos_tag <= end_tag` (the `pos_tag` lies between the `tag` and `end_tag`).
- The file descriptor’s current file position equals `end_tag` for read caches (and seekable files). It equals `tag` for write caches.
- The next `io61_read` or `io61_write` call should logically start at file offset `pos_tag`. That means the first byte read by `io61_read` should come from file offset `pos_tag`, and similarly for `io61_write`.
- If `i >= tag` and `i < end_tag`, then byte `cbuf[i - tag]` contains the byte at file offset `i`.
Fill the cache
Implement this function, which should fill the cache with data read from the file. You will make a `read` system call.
void io61_fill(io61_file* f) {
// Fill the read cache with new data, starting from file offset `end_tag`.
// Only called for read caches.
// Check invariants.
assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
assert(f->end_tag - f->pos_tag <= f->bufsize);
// Reset the cache to empty.
f->tag = f->pos_tag = f->end_tag;
// Recheck invariants (good practice!).
assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
assert(f->end_tag - f->pos_tag <= f->bufsize);
}
void io61_fill(io61_file* f) {
    // Fill the read cache with new data, starting from file offset `end_tag`.
    // Only called for read caches.
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
    /* ANSWER */
    // Reset the cache to empty.
    f->tag = f->pos_tag = f->end_tag;
    // Read data.
    ssize_t n = read(f->fd, f->cbuf, f->bufsize);
    if (n >= 0) {
        f->end_tag = f->tag + n;
    }
    // Recheck invariants (good practice!).
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
}
Easy read
Implement `io61_read`, assuming that the desired data is entirely contained within the current cache slot.
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
// Check invariants.
assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
assert(f->end_tag - f->pos_tag <= f->bufsize);
// The desired data is guaranteed to lie within this cache slot.
assert(sz <= f->bufsize && f->pos_tag + sz <= f->end_tag);
}
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
    // The desired data is guaranteed to lie within this cache slot.
    assert(sz <= f->bufsize && f->pos_tag + sz <= f->end_tag);
    /* ANSWER */
    memcpy(buf, &f->cbuf[f->pos_tag - f->tag], sz);
    f->pos_tag += sz;
    return sz;
}
Full read
Implement `io61_read` without that assumption. You will need to call `io61_fill`, and will use a loop. You may assume that no call to `read` ever returns an error.
Our tests do not check whether your IO61 library handles errors correctly. This means that you may assume that no `read` or `write` system call returns a permanent error. But the best implementations will handle errors gracefully: if a `read` system call returns a permanent error, then `io61_read` should return -1. (What to do with restartable errors is up to you, but most I/O libraries retry on encountering `EINTR`.)
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
// Check invariants.
assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
assert(f->end_tag - f->pos_tag <= f->bufsize);
}
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
    /* ANSWER */
    size_t pos = 0;
    while (pos < sz) {
        if (f->pos_tag == f->end_tag) {
            io61_fill(f);
            if (f->pos_tag == f->end_tag) {
                break;
            }
        }
        // This would be faster if you used `memcpy`!
        buf[pos] = f->cbuf[f->pos_tag - f->tag];
        ++f->pos_tag;
        ++pos;
    }
    return pos;
}
Easy write
When reading from a cached file, the library fills the cache using system calls and empties it to the user. Writing to a cached file is the converse: the library fills the cache with user data and empties it using system calls.
For a read cache, the cache buffer region between `pos_tag` and `end_tag` contains data that has not yet been read, and the region after `end_tag` (if any) is invalid. There are several reasonable choices for write caches; in these exercises we force `pos_tag == end_tag` as an additional invariant.

Implement `io61_write`, assuming that space for the desired data is available within the current cache slot.
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
// Check invariants.
assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
assert(f->end_tag - f->pos_tag <= f->bufsize);
// Write cache invariant.
assert(f->pos_tag == f->end_tag);
// The desired data is guaranteed to fit within this cache slot.
assert(sz <= f->bufsize && f->pos_tag + sz <= f->tag + f->bufsize);
}
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);
    // The desired data is guaranteed to fit within this cache slot.
    assert(sz <= f->bufsize && f->pos_tag + sz <= f->tag + f->bufsize);
    /* ANSWER */
    memcpy(&f->cbuf[f->pos_tag - f->tag], buf, sz);
    f->pos_tag += sz;
    f->end_tag += sz;
    return sz;
}
Flush
Implement a function that empties a write cache by flushing its data using a system call. As before, you may assume that the system call succeeds (all data is written).
void io61_flush(io61_file* f) {
// Check invariants.
assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
assert(f->end_tag - f->pos_tag <= f->bufsize);
// Write cache invariant.
assert(f->pos_tag == f->end_tag);
}
void io61_flush(io61_file* f) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);
    /* ANSWER */
    ssize_t n = write(f->fd, f->cbuf, f->pos_tag - f->tag);
    assert((size_t) n == (size_t) (f->pos_tag - f->tag));
    f->tag = f->pos_tag;
}
Full write
Implement a full version of `io61_write`. You will call `io61_flush` and use a loop.
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
// Check invariants.
assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
assert(f->end_tag - f->pos_tag <= f->bufsize);
// Write cache invariant.
assert(f->pos_tag == f->end_tag);
}
ssize_t io61_write(io61_file* f, const char* buf, size_t sz) {
    // Check invariants.
    assert(f->tag <= f->pos_tag && f->pos_tag <= f->end_tag);
    assert(f->end_tag - f->pos_tag <= f->bufsize);
    // Write cache invariant.
    assert(f->pos_tag == f->end_tag);
    /* ANSWER */
    size_t pos = 0;
    while (pos < sz) {
        if (f->end_tag == f->tag + f->bufsize) {
            io61_flush(f);
        }
        // This would be faster if you used `memcpy`!
        f->cbuf[f->pos_tag - f->tag] = buf[pos];
        ++f->pos_tag;
        ++f->end_tag;
        ++pos;
    }
    return pos;
}
Seeks
Now, add seek to the mix. How should a seek be implemented for this single-slot cache?