Exercises: Storage – CS 61 2019

Exercises not as directly relevant to this year’s class are marked with ⚠️.

IO-1. I/O caching

Mary Ruefle, a poet who lives in Vermont, is working on her caching I/O library for CS 61. She wants to implement a cache with N slots. Since searching those slots might slow down her library, she writes a function that maps addresses to slots. Here’s some of her code.

#define SLOTSIZ 4096
struct io61_slot {
    char buf[SLOTSIZ];
    off_t pos; // = (off_t) -1 for empty slots
    ssize_t sz;
};

#define NSLOTS 64
struct io61_file {
    int fd;
    off_t pos; // current file position
    io61_slot slots[NSLOTS];
};

static inline int find_slot(off_t off) {
    return off % NSLOTS;
}

int io61_readc(io61_file* f) {
    int slotindex = find_slot(f->pos);
    io61_slot* s = &f->slots[slotindex];

    if (f->pos < s->pos || f->pos >= s->pos + s->sz) {
        // slot contains wrong data, need to refill it
        off_t new_pos = lseek(f->fd, f->pos, SEEK_SET);
        assert(new_pos != (off_t) -1); // only handle seekable files for now
        ssize_t r = read(f->fd, s->buf, SLOTSIZ);
        if (r == -1 || r == 0) {
            return EOF;
        }
        s->pos = f->pos;
        s->sz = r;
    }

    int ch = (unsigned char) s->buf[f->pos - s->pos];
    ++f->pos;
    return ch;
}

Before she can run and debug this code, Mary is led “to an emergency of feeling that … results in a poem.” She’ll return to CS61 and fix her implementation soon, but in the meantime, let’s answer some questions about it.

QUESTION IO-1A. True or false: Mary’s cache is a direct-mapped cache.

QUESTION IO-1B. What changes to Mary’s code could change your answer to Part A? Circle all that apply.

New code for find_slot (keeping io61_readc the same)
New code in io61_readc (keeping find_slot the same)
New code in io61_readc and new code for find_slot
None of the above

QUESTION IO-1C. Which problems would occur when Mary’s code was used to sequentially read a seekable file of size 2MiB (2×2²⁰ = 2097152 bytes) one character at a time? Circle all that apply.

Excessive process CPU usage (>10x stdio)
Many system calls to read data (>10x stdio)
Incorrect data (byte x read at a position where the file has byte y≠x)
Read too much data (more bytes read than file contains)
Read too little data (fewer bytes read than file contains)
Crash/undefined behavior
None of the above

QUESTION IO-1D. Which of these new implementations for find_slot would fix at least one of these problems with reading sequential files? Circle all that apply.

return (off * 2654435761) % NSLOTS; /* integer hash function from Stack Overflow */
return (off / SLOTSIZ) % NSLOTS;
return off & (NSLOTS - 1);
return 0;
return (off >> 12) & 0x3F;
None of the above

IO-2. Caches and reference strings

QUESTION IO-2A. True or false: A direct-mapped cache with N or more slots can handle any reference string containing ≤N distinct addresses with no misses except for cold misses.

QUESTION IO-2B. True or false: A fully-associative cache with N or more slots can handle any reference string containing ≤N distinct addresses with no misses except for cold misses.

Consider the following 5 reference strings.

Name	String
α	1
β	1, 2
γ	1, 2, 3, 4, 5
δ	2, 4
ε	5, 2, 4, 2

QUESTION IO-2C. Which of the strings might indicate a sequential access pattern? Circle all that apply.

α	β	γ	δ	ε	None of these

QUESTION IO-2D. Which of the strings might indicate a strided access pattern with stride >1? Circle all that apply.

α	β	γ	δ	ε	None of these

The remaining questions concern concatenated permutations of these five strings. For example, the permutation αβγδε refers to this reference string:

1, 1, 2, 1, 2, 3, 4, 5, 2, 4, 5, 2, 4, 2.
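The concatenation can be sketched in a few lines of C++. This is only an illustration: the helper name `concat_permutation`, the table name `reference_strings`, and the use of the letters a through e in place of α through ε are assumptions made here, not part of the exercise.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// The five reference strings from the table; 'a'..'e' stand in for α..ε.
static const std::map<char, std::vector<int>> reference_strings = {
    {'a', {1}},
    {'b', {1, 2}},
    {'c', {1, 2, 3, 4, 5}},
    {'d', {2, 4}},
    {'e', {5, 2, 4, 2}},
};

// Concatenate the named strings in the order given by `perm`.
// Throws std::out_of_range if `perm` contains an unknown name.
std::vector<int> concat_permutation(const std::string& perm) {
    std::vector<int> refs;
    for (char name : perm) {
        const std::vector<int>& s = reference_strings.at(name);
        refs.insert(refs.end(), s.begin(), s.end());
    }
    return refs;
}
```

Calling `concat_permutation("abcde")` yields the 14-element string shown above. Every permutation has the same length, since only the order of the pieces changes.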

We pass such permutations through an initially-empty, fully-associative cache with 3 slots, and observe the numbers of hits.

QUESTION IO-2E. How many cold misses might a permutation observe? Circle all that apply.

0	1	2	3	4	5	Some other number

Under LRU eviction, the permutation αβεγδ observes 5 hits as follows. (We annotate each access with “h” for hit or “m” for miss.)

1m; 1h, 2m; 5m, 2h, 4m, 2h; 1m, 2h, 3m, 4m, 5m; 2m, 4h.
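The annotated trace above can be replayed with a small simulator. This is a minimal sketch, assuming a fully-associative cache that starts empty; the function name `lru_hits` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Count hits for `refs` on an initially empty, fully-associative cache
// with `nslots` slots and LRU eviction.
size_t lru_hits(const std::vector<int>& refs, size_t nslots) {
    assert(nslots > 0);
    std::vector<int> cache;  // ordered from least to most recently used
    size_t hits = 0;
    for (int r : refs) {
        auto it = std::find(cache.begin(), cache.end(), r);
        if (it != cache.end()) {
            ++hits;
            cache.erase(it);             // re-appended below as most recent
        } else if (cache.size() == nslots) {
            cache.erase(cache.begin());  // evict the least recently used
        }
        cache.push_back(r);
    }
    return hits;
}
```

Running it on the string for αβεγδ with 3 slots reproduces the 5 hits in the trace.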

QUESTION IO-2F. How many hits does this permutation observe under FIFO eviction?

QUESTION IO-2G. Give a permutation that will observe 8 hits under LRU eviction, which is the maximum for any permutation. There are several possible answers. (Write your answer as a permutation of αβγδε. For partial credit, find a permutation that has 7 hits, etc.)

QUESTION IO-2H. Give a permutation that will observe 2 hits under LRU eviction, which is the minimum for any permutation. There is one unique answer. (Write your answer as a permutation of αβγδε. For partial credit, find a permutation that has 3 hits, etc.)

IO-3. Processor cache

The git version control system is based on commit hashes, which are 160-bit (20-byte) hash values used to identify commits. In this problem you’ll consider the processor cache behavior of several versions of a “grading server” that maps commits to grades. Here’s the first version:

struct commit_info {
    char hash[20];
    int grade[11];
};

commit_info* commits;
size_t N;

int get_grade1(const char* hash, int pset) {
    for (size_t i = 0; i != N; ++i) {
        if (memcmp(commits[i].hash, hash, 20) == 0) {
            return commits[i].grade[pset];
        }
    }
    return -1;
}

We will ask questions about the average number of distinct cache lines accessed by variants of get_grade(hash, pset). You should make the following assumptions:

The hash argument is uniformly drawn from the set of known commits. That is, the probability that hash equals the ith commit’s hash is 1/N.
Only count cache lines accessible via commits. Don’t worry about cache lines used for local variables, for parameters, for global variables, or for instructions. For instance, do not count the hash argument or the global-data cache line that stores the commits variable itself.
The commits pointer is 64-byte aligned and cache lines are 64 bytes long.
Commit hashes are mathematically indistinguishable from random numbers. Thus, the probability that two different hashes have the same initial k bits equals 1/2^k.
We’ll ignore small errors; N/2 and (N+1)/2 will be considered equivalent.
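When counting cache lines it helps to know how the structure sizes relate to the line size. The sketch below checks this for `commit_info`, assuming a typical platform where `int` is 4 bytes; the constant `CACHE_LINE` and the helper `cache_line_of` are names introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>

struct commit_info {
    char hash[20];
    int grade[11];
};

constexpr size_t CACHE_LINE = 64;

// With 4-byte int (an assumption about the platform): 20 + 11 * 4 = 64 bytes.
static_assert(sizeof(commit_info) == CACHE_LINE,
              "commit_info is expected to be exactly one cache line long");

// Index of the cache line holding the byte at `byte_offset`,
// measured from a 64-byte-aligned base address.
constexpr size_t cache_line_of(size_t byte_offset) {
    return byte_offset / CACHE_LINE;
}
```

The same arithmetic applies to the other structures in this problem: divide a byte offset from the aligned base by 64 to find which line it lands on.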

QUESTION IO-3A. What is the expected number of cache lines accessed by get_grade1, in terms of N?

The second version:

struct commit_info {
   char hash[20];
   int grade[11];
};

commit_info** commits;
size_t N;

int get_grade2(const char hash[20], int pset) {
    for (size_t i = 0; i != N; ++i) {
        if (memcmp(commits[i]->hash, hash, 20) == 0) {
            return commits[i]->grade[pset];
        }
    }
    return -1;
}

QUESTION IO-3B. What is the expected number of cache lines accessed by get_grade2, in terms of N?

The third version:

struct commit_info {
    char hash[20];
    int grade[11];
};

struct commit_hint {
    char hint[8];
    commit_info* commit;
};

commit_hint* commits;
size_t N;

int get_grade3(const char* hash, int pset) {
    for (size_t i = 0; i != N; ++i) {
        if (memcmp(commits[i].hint, hash, 8) == 0
            && memcmp(commits[i].commit->hash, hash, 20) == 0) {
            return commits[i].commit->grade[pset];
        }
    }
    return -1;
}

QUESTION IO-3C. What is the expected number of cache lines accessed by get_grade3, in terms of N? (You may assume that N≤2000.)

The fourth version is a hash table.

struct commit_info {
    char hash[20];
    int grade[11];
};

commit_info** commits;
size_t commits_hashsize;

int get_grade4(const char* hash, int pset) {
    // choose initial bucket
    size_t bucket;
    memcpy(&bucket, hash, sizeof(bucket));
    bucket = bucket % commits_hashsize;
    // search for the commit starting from that bucket
    while (commits[bucket] != nullptr) {
        if (memcmp(commits[bucket]->hash, hash, 20) == 0) {
            return commits[bucket]->grade[pset];
        }
        bucket = (bucket + 1) % commits_hashsize;
    }
    return -1;
}

QUESTION IO-3D. Assume that a call to get_grade4 encounters B - 1 expected hash collisions (i.e., examines B buckets total, including the bucket that actually contains hash). What is the expected number of cache lines accessed by get_grade4, in terms of N and B?

IO-4. IO caching and strace

Elif Batuman is investigating several program executables left behind by her ex-roommate Fyodor. She runs each executable under strace in the following way:

strace -o strace.txt ./EXECUTABLE files/text1meg.txt > files/out.txt

Help her figure out properties of these programs based on their system call traces.
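For reference when reading the traces, here is a sketch of a copy loop that issues one `read` system call per block and writes each block back out. It is not Fyodor's code; the function name `copy_blocks` and its return convention are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>
#include <vector>

// Copy everything from `infd` to `outfd` in blocks of `blocksize` bytes.
// Returns the number of read system calls issued (including the final
// read that reports end of file), or -1 on error.
ssize_t copy_blocks(int infd, int outfd, size_t blocksize) {
    std::vector<char> buf(blocksize);
    ssize_t nreads = 0;
    while (true) {
        ssize_t nr = read(infd, buf.data(), buf.size());
        ++nreads;
        if (nr == 0) {
            return nreads;  // end of file
        }
        if (nr < 0) {
            return -1;
        }
        ssize_t off = 0;
        while (off < nr) {  // retry after short writes
            ssize_t nw = write(outfd, buf.data() + off, nr - off);
            if (nw < 0) {
                return -1;
            }
            off += nw;
        }
    }
}
```

Under strace, a program built around this loop shows alternating `read` and `write` lines whose size arguments equal `blocksize`.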

QUESTION IO-4A. Program ./mysterya:

open("files/text1meg.txt", O_RDONLY)    = 3
brk(0)                                  = 0x8193000
brk(0x81b5000)                          = 0x81b5000
read(3, "A", 1)                         = 1
write(1, "A", 1)                        = 1
read(3, "\n", 1)                        = 1
write(1, "\n", 1)                       = 1
read(3, "A", 1)                         = 1
write(1, "A", 1)                        = 1
read(3, "'", 1)                         = 1
write(1, "'", 1)                        = 1
read(3, "s", 1)                         = 1
write(1, "s", 1)                        = 1
...

Circle at least one option in each column.

Sequential IO
Reverse sequential IO
Strided IO

No read cache
Unaligned read cache
Aligned read cache

No write cache
Write cache

Cache size 4096
Cache size 2048
Cache size 1024
Other

QUESTION IO-4B. Program ./mysteryb:

open("files/text1meg.txt", O_RDONLY)    = 3
brk(0)                                  = 0x96c5000
brk(0x96e6000)                          = 0x96e6000
read(3, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 2048) = 2048
write(1, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 2048) = 2048
read(3, "kad\nAkron\nAkron's\nAl\nAl's\nAla\nAl"..., 2048) = 2048
write(1, "kad\nAkron\nAkron's\nAl\nAl's\nAla\nAl"..., 2048) = 2048
...

Circle at least one option in each column.

Sequential IO
Reverse sequential IO
Strided IO

No read cache
Unaligned read cache
Aligned read cache

No write cache
Write cache

Cache size 4096
Cache size 2048
Cache size 1024
Other

QUESTION IO-4C. Program ./mysteryc:

open("files/text1meg.txt", O_RDONLY)    = 3
brk(0)                                  = 0x9064000
brk(0x9085000)                          = 0x9085000
fstat64(3, {st_mode=S_IFREG|0664, st_size=1048576, ...}) = 0
lseek(3, 1046528, SEEK_SET)             = 1046528
read(3, "ingau\nRheingau's\nRhenish\nRhianno"..., 2048) = 2048
write(1, "oR\ntlevesooR\ns'yenooR\nyenooR\ns't"..., 2048) = 2048
lseek(3, 1044480, SEEK_SET)             = 1044480
read(3, "Quinton\nQuinton's\nQuirinal\nQuisl"..., 2048) = 2048
write(1, "ehR\neehR\naehR\ns'hR\nhR\nsdlonyeR\ns"..., 2048) = 2048
lseek(3, 1042432, SEEK_SET)             = 1042432
read(3, "emyslid's\nPrensa\nPrensa's\nPrenti"..., 2048) = 2048
write(1, "\ns'nailitniuQ\nnailitniuQ\nnniuQ\ns"..., 2048) = 2048
lseek(3, 1040384, SEEK_SET)             = 1040384
read(3, "Pindar's\nPinkerton\nPinocchio\nPin"..., 2048) = 2048
write(1, "rP\ndilsymerP\ns'regnimerP\nregnime"..., 2048) = 2048
...

Circle at least one option in each column.

Sequential IO
Reverse sequential IO
Strided IO

No read cache
Unaligned read cache
Aligned read cache

No write cache
Write cache

Cache size 4096
Cache size 2048
Cache size 1024
Other

QUESTION IO-4D. Program ./mysteryd:

open("files/text1meg.txt", O_RDONLY)    = 3
brk(0)                                  = 0x9a0e000
brk(0x9a2f000)                          = 0x9a2f000
fstat64(3, {st_mode=S_IFREG|0664, st_size=1048576, ...}) = 0
lseek(3, 1048575, SEEK_SET)             = 1048575
read(3, "o", 2048)                      = 1
lseek(3, 1048574, SEEK_SET)             = 1048574
read(3, "Ro", 2048)                     = 2
lseek(3, 1048573, SEEK_SET)             = 1048573
read(3, "\nRo", 2048)                   = 3
...
lseek(3, 1046528, SEEK_SET)             = 1046528
read(3, "ingau\nRheingau's\nRhenish\nRhianno"..., 2048) = 2048
write(1, "oR\ntlevesooR\ns'yenooR\nyenooR\ns't"..., 2048) = 2048
lseek(3, 1046527, SEEK_SET)             = 1046527
read(3, "eingau\nRheingau's\nRhenish\nRhiann"..., 2048) = 2048
lseek(3, 1046526, SEEK_SET)             = 1046526
read(3, "heingau\nRheingau's\nRhenish\nRhian"..., 2048) = 2048
...

Circle at least one option in each column.

Sequential IO
Reverse sequential IO
Strided IO

No read cache
Unaligned read cache
Aligned read cache

No write cache
Write cache

Cache size 4096
Cache size 2048
Cache size 1024
Other

QUESTION IO-4E. Program ./mysterye:

open("files/text1meg.txt", O_RDONLY)    = 3
brk(0)                                  = 0x93e5000
brk(0x9407000)                          = 0x9407000
read(3, "A", 1)                         = 1
read(3, "\n", 1)                        = 1
read(3, "A", 1)                         = 1
...
read(3, "A", 1)                         = 1
read(3, "l", 1)                         = 1
write(1, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 1024) = 1024
read(3, "t", 1)                         = 1
read(3, "o", 1)                         = 1
read(3, "n", 1)                         = 1
...

Circle at least one option in each column.

Sequential IO
Reverse sequential IO
Strided IO

No read cache
Unaligned read cache
Aligned read cache

No write cache
Write cache

Cache size 4096
Cache size 2048
Cache size 1024
Other

QUESTION IO-4F. Program ./mysteryf:

open("files/text1meg.txt", O_RDONLY)    = 3
brk(0)                                  = 0x9281000
brk(0x92a3000)                          = 0x92a3000
read(3, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 4096) = 4096
write(1, "A", 1)                        = 1
write(1, "\n", 1)                       = 1
write(1, "A", 1)                        = 1
...
write(1, "A", 1)                        = 1
write(1, "l", 1)                        = 1
read(3, "ton's\nAludra\nAludra's\nAlva\nAlvar"..., 4096) = 4096
write(1, "t", 1)                        = 1
write(1, "o", 1)                        = 1
write(1, "n", 1)                        = 1
...

Circle at least one option in each column.

Sequential IO
Reverse sequential IO
Strided IO

No read cache
Unaligned read cache
Aligned read cache

No write cache
Write cache

Cache size 4096
Cache size 2048
Cache size 1024
Other

IO-5. Processor cache

These questions use the following C definition for an N×M matrix (the matrix has N rows and M columns).

struct matrix {
    unsigned N;
    unsigned M;
    double elt[0];
};

matrix* matrix_create(unsigned N, unsigned M) {
    matrix* m = (matrix*) malloc(sizeof(matrix) + N * M * sizeof(double));
    m->N = N;
    m->M = M;
    for (size_t i = 0; i < N * M; ++i) {
        m->elt[i] = 0.0;
    }
    return m;
}

Typically, matrix data is stored in row-major order: element m_ij (at row i and column j) is stored in m->elt[i*m->M + j]. We might write this in C using an inline function:

inline double* melt1(matrix* m, unsigned i, unsigned j) {
    return &m->elt[i * m->M + j];
}

But that’s not the only possible method to store matrix data. Here are several more.

inline double* melt2(matrix* m, unsigned i, unsigned j) {
    return &m->elt[i + j * m->N];
}

inline double* melt3(matrix* m, unsigned i, unsigned j) {
    return &m->elt[i + ((m->N - i + j) % m->M) * m->N];
}

inline double* melt4(matrix* m, unsigned i, unsigned j) {
    return &m->elt[i + ((i + j) % m->M) * m->N];
}

inline double* melt5(matrix* m, unsigned i, unsigned j) {
    assert(m->M % 8 == 0);
    unsigned k = (i/8) * (m->M/8) + (j/8);
    return &m->elt[k*64 + (i % 8) * 8 + j % 8];
}

QUESTION IO-5A. Which method (of melt1–melt5) will have the best processor cache behavior if most matrix accesses use loops like this?

for (unsigned j = 0; j < 100; ++j) {
    for (unsigned i = 0; i < 100; ++i) {
        f(*melt(m, i, j));
    }
}

QUESTION IO-5B. Which method will have the best processor cache behavior if most matrix accesses use loops like this?

for (unsigned i = 0; i < 100; ++i) {
    f(*melt(m, i, i));
}

QUESTION IO-5C. Which method will have the best processor cache behavior if most matrix accesses use loops like this?

for (unsigned i = 0; i < 100; ++i) {
    for (unsigned j = 0; j < 100; ++j) {
        f(*melt(m, i, j));
    }
}

QUESTION IO-5D. Which method will have the best processor cache behavior if most matrix accesses use loops like this?

for (int di = -3; di <= 3; ++di) {
    for (int dj = -3; dj <= 3; ++dj) {
        f(*melt(m, I + di, J + dj));
    }
}

QUESTION IO-5E. Here is a matrix-multiply function in ikj order.

matrix* matrix_multiply(matrix* a, matrix* b) {
    assert(a->M == b->N);
    matrix* c = matrix_create(a->N, b->M);
    for (unsigned i = 0; i != a->N; ++i) {
        for (unsigned k = 0; k != a->M; ++k) {
            for (unsigned j = 0; j != b->M; ++j) {
                *melt(c, i, j) += *melt(a, i, k) * *melt(b, k, j);
            }
        }
    }
    return c;
}

This loop order is cache-optimal when data is stored in melt1 order. What loop order is cache-optimal for melt2?

QUESTION IO-5F. You notice that accessing a matrix element using melt1 is very slow. After some debugging, it seems like the processor on which you are running code has a very slow integer multiply instruction. Briefly describe a change to struct matrix that would let you write a version of melt1 with no integer multiply instruction. You may add members, change sizes, or anything you like.

IO-6. Caching

Assume a cache with four slots, where each letter below indicates an access to a block. Answer the following questions about this sequence of accesses.

E D C B A E D A A A B C D E

QUESTION IO-6A. What is the hit rate assuming an LRU replacement policy?

QUESTION IO-6B. What blocks will you have in the cache at the end of the run?

QUESTION IO-6C. What is the best possible hit rate attainable if you could see into the future?

IO-7. Caching

Intel and CrossPoint have announced a new persistent memory technology with performance approaching that of DRAM. Your job is to calculate some performance metrics to help system architects decide how best to incorporate this new technology into their platform.

Let's say that it takes 64ns to access one (32-bit) word of main memory (DRAM) and 256ns to access one (32-bit) word of this new persistent memory, which we'll call NVM (non-volatile memory). The block size of the NVM is 256 bytes. The NVM designers are quite smart: although it takes a long time to access the first byte, when you are accessing NVM sequentially, the devices perform read-ahead and stream data efficiently at 32 GB/second, which is identical to the bandwidth of DRAM.

QUESTION IO-7A. Let's say that we are performing random accesses of 32 bits (on a 32-bit processor). What fraction of the accesses must be to main memory (as opposed to NVM) to achieve performance within 10% of DRAM?

Let X be the fraction of accesses to DRAM, so the average access time is 64X + 256(1−X) ns. We want that to be at most 1.1 × 64 = 70.4 ns (within 10% of DRAM), so we solve 64X + 256(1−X) = 70.4:
64X + 256 − 256X = 70.4
256X − 64X = 256 − 70.4
192X = 185.6
X = 185.6/192 ≈ 0.967
So we need a hit rate in main memory of about 97%.
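The same algebra can be checked numerically. This is a small sketch under the model used in the solution, where the average is a weighted sum of the two access times; the function name `required_fast_fraction` is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Smallest fraction X of accesses that must go to the fast memory so that
// X * fast_ns + (1 - X) * slow_ns is at most target_ns.
// Derived by solving the linear equation for X; assumes slow_ns > fast_ns.
double required_fast_fraction(double target_ns, double fast_ns, double slow_ns) {
    return (slow_ns - target_ns) / (slow_ns - fast_ns);
}
```

With a target of 1.1 × 64 ns, a fast time of 64 ns, and a slow time of 256 ns, the function returns 185.6/192, matching the derivation.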

QUESTION IO-7B. Let's say that a program writes every byte of a 256-byte block in units of 32 bits. How much faster will a write-back cache perform relative to a write-through cache? (An approximate order of magnitude will be sufficient; showing work can earn partial credit.)

QUESTION IO-7C. Why might you not want to use a write-back cache?

IO-8. Reference strings

The following questions concern the FIFO (First In First Out), LRU (Least Recently Used), and LFU (Least Frequently Used) cache eviction policies.

⚠️ LFU (not covered in 2019) evicts the item that was accessed least frequently.

Your answers should refer to seven-item reference strings made up of digits in the range 0–9. An example answer might be “1231231”. In each case, the reference string is processed by a 3-slot cache that’s initially empty.

QUESTION IO-8A. Give a reference string that has a 1/7 hit rate in all three policies.

QUESTION IO-8B. Give a reference string that has a 6/7 hit rate in all three policies.

QUESTION IO-8C. Give a reference string that has different hit rates under LRU and LFU policies, and compute the hit rates.

String:

LRU hit rate:

LFU hit rate:

We’re looking for a string whose least-recently used item is not its least-frequently used item.

String: 1123411

LRU hit rate: 2/7

Access:  1  1  2  3  4  1  1
Slot α:  ①  1  ·  ·  ④  ·  ·
Slot β:  ·  ·  ②  ·  ·  ①  1
Slot γ:  ·  ·  ·  ③  ·  ·  ·

LFU hit rate: 3/7

Access:  1  1  2  3  4  1  1
Slot α:  ①  1  ·  ·  ·  1  1
Slot β:  ·  ·  ②  ·  ④  ·  ·
Slot γ:  ·  ·  ·  ③  ·  ·  ·
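The LFU count for this string can be replayed with a short simulation. This is a minimal sketch, not a definitive LFU specification: the name `lfu_hits`, the per-slot use counter that restarts at 1 when a block is loaded, and the tie-break toward the lowest-numbered slot are assumptions chosen to match the trace above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct lfu_slot {
    int block;
    size_t uses;  // accesses since this block was loaded into the slot
};

// Count hits for `refs` on an initially empty cache with `nslots` slots
// and LFU eviction (ties evict the lowest-numbered slot).
size_t lfu_hits(const std::vector<int>& refs, size_t nslots) {
    assert(nslots > 0);
    std::vector<lfu_slot> slots;
    size_t hits = 0;
    for (int r : refs) {
        lfu_slot* found = nullptr;
        for (lfu_slot& s : slots) {
            if (s.block == r) {
                found = &s;
            }
        }
        if (found) {
            ++hits;
            ++found->uses;
        } else if (slots.size() < nslots) {
            slots.push_back({r, 1});  // fill an empty slot
        } else {
            lfu_slot* victim = &slots[0];
            for (lfu_slot& s : slots) {
                if (s.uses < victim->uses) {
                    victim = &s;
                }
            }
            *victim = {r, 1};  // replace the least frequently used block
        }
    }
    return hits;
}
```

On the string 1123411 with 3 slots, the function counts 3 hits, the LFU figure given above.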

QUESTION IO-8D. Give a reference string that has different hit rates under FIFO and LRU policies, and compute the hit rates.

String:

FIFO hit rate:

LRU hit rate:

We’re looking for a string where an item is reused after other items are loaded into the cache. This will make it less of a target for LRU eviction than FIFO eviction.

String: 1231411

FIFO hit rate: 2/7

Access:  1  2  3  1  4  1  1
Slot α:  ①  ·  ·  1  ④  ·  ·
Slot β:  ·  ②  ·  ·  ·  ①  1
Slot γ:  ·  ·  ③  ·  ·  ·  ·

LRU hit rate: 3/7

Access:  1  2  3  1  4  1  1
Slot α:  ①  ·  ·  1  ·  1  1
Slot β:  ·  ②  ·  ·  ④  ·  ·
Slot γ:  ·  ·  ③  ·  ·  ·  ·
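The FIFO count for this string can be checked the same way. The sketch below assumes an initially empty, fully-associative cache; the name `fifo_hits` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Count hits for `refs` on an initially empty cache with `nslots` slots
// and FIFO eviction.
size_t fifo_hits(const std::vector<int>& refs, size_t nslots) {
    assert(nslots > 0);
    std::deque<int> cache;  // front = block loaded longest ago
    size_t hits = 0;
    for (int r : refs) {
        if (std::find(cache.begin(), cache.end(), r) != cache.end()) {
            ++hits;  // a hit does not change the eviction order
        } else {
            if (cache.size() == nslots) {
                cache.pop_front();  // evict the oldest load
            }
            cache.push_back(r);
        }
    }
    return hits;
}
```

The only difference from an LRU simulator is the hit path: FIFO leaves the queue untouched on a hit, which is why the reuse of 1 does not protect it from eviction.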

QUESTION IO-8E. Now let's assume that you know a reference string in advance. Given a 3-slot cache and the following reference string, what caching algorithm discussed in class and/or exercises would produce the best hit rate, and what would that hit rate be?

“12341425321521”

IO-9. Caching: Access times and hit rates

Recall that x86-64 instructions can access memory in units of 1, 2, 4, or 8 bytes at a time. Assume we are running on an x86-64-like machine with 1024-byte cache lines. Our machine takes 32ns to access a unit if the cache hits, regardless of unit size. If the cache misses, an additional 8160ns are required to load the cache, for a total of 8192ns.

QUESTION IO-9A. What is the average access time per access to access all the data in a cache line as an array of 256 integers, starting from an empty cache?

QUESTION IO-9B. What unit size (1, 2, 4, or 8) minimizes the access time to access all data in a cache line, starting from an empty cache?

QUESTION IO-9C. What unit size (1, 2, 4, or 8) maximizes the hit rate to access all data in a cache line, starting from an empty cache?

IO-10. Single-slot cache code

Donald Duck is working on a single-slot cache for reading. He’s using the pos_tag/end_tag representation, which is:

struct io61_file {
   int fd;
   unsigned char cbuf[BUFSIZ];
   off_t tag;      // file offset of first character in cache (same as before)
   off_t end_tag;  // file offset one past last valid char in cache; end_tag - tag == old `csz`
   off_t pos_tag;  // file offset of next char to read in cache; pos_tag - tag == old `cpos`
};
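The comments on the three tags imply some relationships among them. The sketch below spells those out as a checking function; `io61_tags_ok` is a hypothetical helper, and the bound on `end_tag - tag` is an assumption that follows from the cache holding at most `BUFSIZ` bytes.

```cpp
#include <cassert>
#include <cstdio>       // BUFSIZ
#include <sys/types.h>  // off_t

struct io61_file {
    int fd;
    unsigned char cbuf[BUFSIZ];
    off_t tag;      // file offset of first character in cache
    off_t end_tag;  // file offset one past last valid char in cache
    off_t pos_tag;  // file offset of next char to read in cache
};

// Returns true if the tags are mutually consistent:
// tag <= pos_tag <= end_tag, and the cached range fits in cbuf.
bool io61_tags_ok(const io61_file* f) {
    return f->tag <= f->pos_tag
        && f->pos_tag <= f->end_tag
        && f->end_tag - f->tag <= (off_t) BUFSIZ;
}
```

A check like this can be asserted at the start and end of each library function while debugging.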

Here’s our solution code; in case you want to scribble, the code is copied in the appendix.

 1.  ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
 2.      size_t pos = 0;
 3.      while (pos != sz) {
 4.          if (f->pos_tag < f->end_tag) {
 5.              ssize_t n = sz - pos;
 6.              if (n > f->end_tag - f->pos_tag)
 7.                  n = f->end_tag - f->pos_tag;
 8.              memcpy(&buf[pos], &f->cbuf[f->pos_tag - f->tag], n);
 9.              f->pos_tag += n;
10.              pos += n;
11.          } else {
12.              f->tag = f->end_tag;
13.              ssize_t n = read(f->fd, f->cbuf, BUFSIZ);
14.              if (n > 0)
15.                  f->end_tag += n;
16.              else
17.                  return pos ? pos : n;
18.          }
19.      }
20.      return pos;
21.  }

Donald has ideas for “simplifying” this code. Specifically, he wants to try each of the following independently:

Replacing line 4 with “if (f->pos_tag <= f->end_tag) {”.
Removing lines 6–7.
Removing line 9.
Removing lines 16–17.

QUESTION IO-10A. Which simplifications could lead to undefined behavior? List all that apply or say “none.”

QUESTION IO-10B. Which simplifications could cause io61_read to loop forever without causing undefined behavior? List all that apply or say “none.”

QUESTION IO-10C. Which simplifications could lead to io61_read returning incorrect data in buf, meaning that the data read by a series of io61_read calls won’t equal the data in the file? List all that apply or say “none.”

QUESTION IO-10D. Chastened, Donald decides to optimize the code for a specific situation, namely when io61_read is called with a sz that is larger than BUFSIZ. He wants to add code after line 11, like so, so that fewer read system calls will happen for large sz:

11.          } else if (sz - pos > BUFSIZ) {
                 // DONALD’S CODE HERE




11A.         } else {
12.              f->tag = f->end_tag;
                 ....

Finish Donald’s code. Your code should maintain the relevant invariants between tag, pos_tag, end_tag, and the file position, but you need not keep tag aligned.

IO-11. Caching

QUESTION IO-11A. If it takes 200ns to access main memory, which of the following two caches will produce a lower average access time?

A cache with a 10ns access time that produces a 90% hit rate
A cache with a 20ns access time that produces a 98% hit rate

Let's compute the average access time for each case:
0.9 × 10 + 0.1 × 200 = 9 + 20 = 29 ns
0.98 × 20 + 0.02 × 200 = 19.6 + 4 = 23.6 ns
The 20ns cache produces a lower average access time.
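The comparison uses a simple weighted-average model, which can be written as a one-line function. This sketch follows the solution's model, in which a miss costs the main-memory time alone; the name `avg_access_ns` is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Average access time in ns: hits cost cache_ns, misses cost memory_ns.
double avg_access_ns(double hit_rate, double cache_ns, double memory_ns) {
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns;
}
```

A model that also charges the cache lookup on a miss would add `cache_ns` to the miss term; under that variant the totals change but the second cache still comes out ahead.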

QUESTION IO-11B. Let’s say that you have a direct-mapped cache with four slots. A page with page number N must reside in the slot numbered N % 4. What is the best hit rate this could achieve given the following sequence of page accesses?

3 6 7 5 3 2 1 1 1 8

Since it's direct mapped, each item can go in only one slot, so if we list the slots for each access, we get:
Access:   3  6  7  5  3  2  1  1  1  8
Slot:     3  2  3  1  3  2  1  1  1  0
Hit/miss: M  M  M  M  M  M  M  H  H  M
The only hits are the two 1s, so your hit rate is 2/10 or 20% or .2.
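The table can be reproduced with a short direct-mapped simulation. This is a minimal sketch that assumes the cache starts empty; the function name `direct_mapped_hits` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count hits for `pages` on an initially empty direct-mapped cache in
// which page number N may only reside in slot N % nslots.
size_t direct_mapped_hits(const std::vector<unsigned>& pages, size_t nslots) {
    assert(nslots > 0);
    std::vector<unsigned> slot(nslots);
    std::vector<bool> valid(nslots, false);
    size_t hits = 0;
    for (unsigned p : pages) {
        size_t i = p % nslots;
        if (valid[i] && slot[i] == p) {
            ++hits;
        } else {
            slot[i] = p;  // the only candidate slot is overwritten
            valid[i] = true;
        }
    }
    return hits;
}
```

There is no eviction policy to choose: the slot index is fully determined by the page number, so this count is the only hit rate the cache can achieve on the sequence.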

QUESTION IO-11C. What is the best hit rate a fully-associative four-slot cache could achieve for that sequence of page accesses? (A fully-associative cache may put any page in any slot. You may assume you know the full reference stream in advance.)

QUESTION IO-11D. What hit rate would the fully-associative four-slot cache achieve if it used the LRU eviction policy?

IO-12. I/O traces

QUESTION IO-12A. Which of the following programs cannot be distinguished by the output of the strace utility, not considering open calls? List all that apply; if multiple indistinguishable groups exist (e.g., A, B, & C can’t be distinguished, and D & E can’t be distinguished, but the groups can be distinguished from each other), list them all.

Sequential byte writes using stdio
Sequential byte writes using system calls
Sequential byte writes using system calls and O_SYNC
Sequential block writes using stdio and block size 2
Sequential block writes using system calls and block size 2
Sequential block writes using system calls and O_SYNC and block size 2
Sequential block writes using stdio and block size 4096
Sequential block writes using system calls and block size 4096
Sequential block writes using system calls and O_SYNC and block size 4096

QUESTION IO-12B. Which of the programs in Part A cannot be distinguished using blktrace output? List all that apply.

QUESTION IO-12C. The buffer cache is coherent. Which of the following operating system changes could make the buffer cache incoherent? List all that apply.

Application programs can obtain direct read access to the buffer cache
Application programs can obtain direct write access to the disk, bypassing the buffer cache
Other computers can communicate with the disk independently
The computer has an uninterruptible power supply (UPS), ensuring that the operating system can write the contents of the buffer cache to disk if main power is lost

QUESTION IO-12D. The stdio cache is incoherent. Which of the operating system changes from Part C could make the stdio cache coherent? List all that apply.

IO-13. Reference strings and eviction

QUESTION IO-13A. When demonstrating cache eviction in class, we modeled a completely reactive cache, meaning that the cache performed at most one load from slow storage per access. Name a class of reference string that will have a 0% hit rate on any cold reactive cache. For partial credit, give several examples of such reference strings.

QUESTION IO-13B. What cache optimization can be used to improve the hit rate for the class of reference string in Part A? One word is enough; put the best choice.

QUESTION IO-13C. Give a single reference string with the following properties:

There exists a cache size and eviction policy that gives a 70% hit rate for the string.
There exists a cache size and eviction policy that gives a 0% hit rate for the string.

QUESTION IO-13D. Put the following eviction algorithms in order of how much space they require for per-slot metadata, starting with the least space and ending with the most space. (Assume the slot order is fixed, so once a block is loaded into slot i, it stays in slot i until it is evicted.) For partial credit say what you think the metadata would be.

FIFO
LRU
Random

IO-14. Cache code

Several famous musicians have just started working on CS61 Problem Set 3. They share the following code for their read-only, sequential, single-slot cache:

struct io61_file {
    int fd;
    unsigned char buf[4096];
    size_t pos;    // position of next character to read in `buf`
    size_t sz;     // number of valid characters in `buf`
};

int io61_readc(io61_file* f) {
    if (f->pos >= f->sz) {
        f->pos = f->sz = 0;
        ssize_t nr = read(f->fd, f->buf, sizeof(f->buf));
        if (nr <= 0) {
            f->sz = 0;
            return -1;
        } else {
            f->sz = nr;
        }
    }
    int ch = f->buf[f->pos];
    ++f->pos;
    return ch;
}

But they have different io61_read implementations. Donald (Lambert)’s is:

ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    return read(f->fd, buf, sz);
}

Solange (Knowles)’s is:

ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    for (size_t pos = 0; pos < sz; ++pos, ++buf) {
        *buf = io61_readc(f);
    }
    return sz;
}

Caroline (Shaw)’s is:

ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    if (f->pos >= f->sz) {
        return read(f->fd, buf, sz);
    } else {
        int ch = io61_readc(f);
        if (ch < 0) {
            return 0;
        }
        *buf = ch;
        return io61_read(f, buf + 1, sz - 1) + 1;
    }
}

You are testing each of these musicians’ codes by executing a sequence of io61_readc and/or io61_read calls on an input file and printing the resulting characters to standard output. There are no seeks, and your test programs print until end of file, so your tests’ output should equal the input file’s contents.

You should assume for these questions that no read system call ever returns -1.

QUESTION IO-14A. Describe an access pattern—that is, a sequence of io61_readc and/or io61_read calls (with lengths)—for which Donald’s code can return incorrect data.

QUESTION IO-14B. Which of these musicians’ codes can generate an output file with incorrect length?

For the remaining parts, assume the problem or problems in Part B have been corrected, so that all musicians’ codes generate output files with correct lengths.

QUESTION IO-14C. Give an access pattern for which Solange’s code will return correct data and outperform Donald’s, or vice versa, and say whose code will win.

QUESTION IO-14D. Suggest a small change (≤10 characters) to Caroline’s code that would, most likely, make it perform at least as well as both Solange’s and Donald’s codes on all access patterns. Explain briefly.

IO-15. Caches

Parts A–C concern different implementations of Pset 3’s stdio cache. Assume a program that reads a 32768-byte file a character at a time, like this:

while (io61_readc(inf) != EOF) {
}

This program will call io61_readc 32769 times. (32769 = 2¹⁵ + 1 = 8×2¹² + 1; the +1 accounts for the EOF return.) But the cache implementation might make many fewer system calls.
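The call count can be confirmed with a stand-in for the loop above. This sketch does not use the io61 library; `count_readc_calls` is a hypothetical helper that models a reader returning one byte per call and then EOF.

```cpp
#include <cassert>
#include <cstddef>

// Number of character-read calls a `while (readc() != EOF)` loop makes
// on a file of `file_size` bytes: one per byte, plus the call that
// finally returns EOF.
size_t count_readc_calls(size_t file_size) {
    size_t pos = 0;    // bytes consumed so far
    size_t calls = 0;
    while (true) {
        ++calls;                 // one readc-style call
        if (pos == file_size) {  // this call would return EOF
            return calls;
        }
        ++pos;
    }
}
```

For a 32768-byte file this gives 32769 calls, independent of how the cache behind the calls is organized.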

QUESTION IO-15A. How many read system calls are required assuming a single-slot, 4096-byte io61 cache?

QUESTION IO-15B. How many read system calls are required assuming an eight-slot, 4096-byte io61 cache?

QUESTION IO-15C. How many mmap system calls are required assuming an mmap-based io61 cache?

Parts D–F concern cache implementations and styles. We discussed many caches in class, including:

The buffer cache
The processor cache
Single-slot aligned stdio caches
Single-slot unaligned stdio caches
Circular bounded buffers

QUESTION IO-15D. Which of those caches are implemented entirely in hardware? List all that apply.

QUESTION IO-15E. Which of those software caches could help speed up reverse sequential access to a disk file? List all that apply.

QUESTION IO-15F. Which of those software caches could help speed up access to a pipe or network socket? List all that apply.

IO-16. LRU

These questions concern the least recently used (LRU) and first-in first-out (FIFO) cache eviction policies.
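For experimenting with the two policies, here is a minimal simulator sketch. It is not part of the problem set: the names `policy` and `count_hits` are hypothetical, each character of the reference string is treated as one block, and the optional `initial` argument lists the starting cache contents in eviction order (front evicted first).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical helper, not from the problem set.
enum class policy { fifo, lru };

// Count hits for `refs` (one character per block, no separators) on a
// cache with `nslots` slots. `initial` holds the starting contents in
// eviction order: the front is the next block to be evicted.
std::size_t count_hits(const std::string& refs, std::size_t nslots, policy p,
                       const std::string& initial = "") {
    assert(nslots > 0 && initial.size() <= nslots);
    std::deque<char> q(initial.begin(), initial.end());
    std::size_t hits = 0;
    for (char block : refs) {
        auto it = std::find(q.begin(), q.end(), block);
        if (it != q.end()) {
            ++hits;
            if (p == policy::lru) {
                // Under LRU a hit refreshes recency; FIFO ignores hits.
                q.erase(it);
                q.push_back(block);
            }
        } else {
            if (q.size() == nslots) {
                q.pop_front();   // evict the oldest / least recently used block
            }
            q.push_back(block);
        }
    }
    return hits;
}
```

Calling `count_hits` twice on the same reference string, once with `policy::fifo` and once with `policy::lru`, gives the two hit counts to compare; dividing by `refs.size()` gives the hit rate.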

QUESTION IO-16A. List all that apply.

LRU is better than FIFO for a workload that consists of reading a file in sequential order.
If two LRU caches process the same reference string starting from an empty state, then the cache with more slots always has a better hit rate.
If two LRU caches process the same reference string starting from an empty state, then the cache with more slots never has a worse hit rate.
LRU and FIFO should have the same hit rate on average for a workload that consists of reading a file in random order.
None of the above.

For the next two questions, consider a cache with 5 slots that has just processed the reference string 12345. (Thus, its slots contain 1, 2, 3, 4, and 5.)

QUESTION IO-16B. Write a reference string that will observe a higher hit rate under LRU than under FIFO if executed on this cache.

QUESTION IO-16C. Write a reference string that will observe a higher hit rate under FIFO than under LRU if executed on this cache.

The remaining questions in this problem concern the operating system’s buffer cache. LRU requires detecting each use of a cache block (to track the time the block was most recently used). In the buffer cache, the “blocks” are physical memory pages, and blocks are “used” by reads, writes, and accesses to mmaped memory.

QUESTION IO-16D. Which of these changes would let a WeensyOS-like operating system reliably track when buffer-cache physical memory pages are used? List all that apply.

Adding a system call track(uintptr_t physical_address) that a process should call when it accesses a physical page.
Adding a member boottime_t lru_time to struct proc. (boottime_t is a type that measures the time since boot.)
Adding a member boottime_t lru_time to struct pageinfo.
Modifying kernel system call implementations to update the relevant lru_time members when buffer-cache pages are accessed.
None of these changes will help.

QUESTION IO-16E. The mmap system call complicates LRU tracking for buffer-cache pages. Why? List all that apply.

mmap maps buffer-cache pages directly into a process’s address space.
Accessing memory in mmaped regions does not normally invoke the kernel.
Accessing memory in mmaped regions does not use a page table.
mmap starts with two of the letter m, causing LRU to become confused about which m was used least recently.
None of the above.

IO-17. Reference strings and hit rates

QUESTION IO-17A. Write a purely-sequential reference string containing at least five accesses.

QUESTION IO-17B. What is the hit rate for this reference string? Tell us the eviction algorithm and number of slots you’ve chosen.

The next two questions concern this ten-element reference string:

1 2 1 2 3 4 1 5 1 1

We consider executing this reference string starting with different cache contents.

QUESTION IO-17C. A three-slot LRU cache processes this reference string and observes a 70% hit rate. What are the initial contents of the cache?

QUESTION IO-17D. A three-slot FIFO cache processes this reference string with initial contents 4 1 2 and observes a 60% hit rate. Which slot was next up for eviction when the reference string began?

The eviction algorithms we saw in class are entirely reactive: they only insert a block when that block is referenced. This limits how well the cache can perform. A read cache can also be proactive by inserting blocks before they’re needed, possibly speeding up later accesses. This is the essence of prefetching.

In a proactive caching model, the cache can evict and load two or more blocks per access in the reference string. A prefetching policy decides which additional, non-accessed blocks to load.

QUESTION IO-17E. Describe an access pattern for which the following prefetching policy would be effective.

When accessing block A, also load block A+1.
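The policy above can be tried out with a small simulator. This is only a sketch under stated assumptions: blocks are numbered with integers, eviction is LRU, and a prefetched block counts as "used" at the moment it is loaded. The names `lru_cache` and `count_hits_prefetch` are hypothetical and do not come from the problem set.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical LRU cache model: front of `order` is least recently used.
struct lru_cache {
    std::size_t nslots;
    std::deque<long> order;

    bool contains(long block) const {
        return std::find(order.begin(), order.end(), block) != order.end();
    }
    // Mark `block` most recently used, loading it (and evicting) if needed.
    void touch(long block) {
        auto it = std::find(order.begin(), order.end(), block);
        if (it != order.end()) {
            order.erase(it);
        } else if (order.size() == nslots) {
            order.pop_front();
        }
        order.push_back(block);
    }
};

// Count hits for `refs`. Only accesses in the reference string can be
// hits; prefetch loads are never counted as hits or misses.
std::size_t count_hits_prefetch(const std::vector<long>& refs,
                                std::size_t nslots, bool prefetch_next) {
    assert(nslots > 0);
    lru_cache c{nslots, {}};
    std::size_t hits = 0;
    for (long a : refs) {
        if (c.contains(a)) {
            ++hits;
        }
        c.touch(a);
        if (prefetch_next) {
            c.touch(a + 1);   // "When accessing block A, also load block A+1."
        }
    }
    return hits;
}
```

Running the same reference string with `prefetch_next` set to `false` and then `true` shows whether the policy helps or hurts for that string. A different eviction policy, or a different rule for the recency of prefetched blocks, could change the counts.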

QUESTION IO-17F. Write a reference string and name an eviction policy for which this prefetching policy would be less effective (have a lower hit rate) than no prefetching at all.

IO-18. Coherence

QUESTION IO-18A. Which of the kinds of cache we discussed in class are typically coherent?

QUESTION IO-18B. Which of the kinds of cache we discussed in class are typically single-slot?

Stdio-like caches are not coherent. The remaining questions concern potential mechanisms to make them coherent with respect to disk files.

Pedantic note. Sometimes a read-from-cache operation will occur concurrently with (at the same time as) a write to stable storage. The read operation counts as coherent whether or not it reflects the concurrent write, because logically the read and write occurred “at the same time” (neither is older).

QUESTION IO-18C. First, the new bool changed() system call returns true if and only if a write was performed on some file in the last second.

Describe briefly how changed could be used to make a stdio cache coherent, or explain why it could not.

QUESTION IO-18D. Second, the new int open_with_timestamp(const char* filename, unsigned long* timestamp, ...) system call is like open, except that every time a change is made to the underlying file, the value in *timestamp is updated to the time, measured in milliseconds since last boot, of the last write operation on the file represented by the returned file descriptor.

Describe briefly how open_with_timestamp could be used to make a stdio cache coherent, or explain why it could not.

QUESTION IO-18E. Describe briefly how mmap could be used to make a stdio cache coherent, or explain why it could not.

IO-19. System calls

QUESTION IO-19A. The following system calls have just been made:

int fd = open("f.txt", O_WRONLY | O_CREAT | O_TRUNC, 0666);
ssize_t nw = write(fd, "CS121 is awesome!", 17); // returned 17

What series of system calls would ensure that, after all system calls complete, the file f.txt contains the text “CS 61 is terrible” (without the quotation marks)? Minimize the number of bytes written.

QUESTION IO-19B. Which of the following file access patterns might have similar output from the strace utility? List all that apply or say “none.”

Sequential byte writes using stdio
Sequential byte writes using mmap
Sequential byte writes using system calls

QUESTION IO-19C. Which of the following file access patterns might have similar output from the strace utility? List all that apply or say “none.”

Sequential byte writes using stdio
Sequential block writes using stdio
Sequential byte writes using system calls
Sequential block writes using system calls

QUESTION IO-19D. Which of the following file access patterns might have similar output from the strace utility? List all that apply or say “none.”

Reverse-sequential byte writes using stdio
Reverse-sequential block writes using stdio
Reverse-sequential byte writes using system calls
Reverse-sequential block writes using system calls

2 and 4 might look similar if the block writes used the stdio block size. In that case, stdio will emit one lseek and one write per block write, resulting in a sequence like this:

lseek(4, 966656, SEEK_SET)              = 966656
write(4, "A\nA's\nAMD\nAMD's\nAOL\nAOL's\nAachen"..., 4096) = 4096
lseek(4, 962560, SEEK_SET)              = 962560
write(4, "\nAllyson's\nAlma\nAlma's\nAlmach\nAl"..., 4096) = 4096

That’s the same as we would expect from system calls.
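As an illustration of the system-call side of that comparison, here is a sketch of reverse-sequential block writes that makes one lseek and one write per block. The function names and the scratch-file driver are hypothetical, and the sketch assumes the data length is a multiple of the block size and treats short writes as errors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: write `len` bytes of `data` to `fd` from the last
// block to the first, `blocksz` bytes at a time. Returns the number of
// system calls made, or -1 on error.
long reverse_block_write(int fd, const char* data, std::size_t len,
                         std::size_t blocksz) {
    assert(blocksz > 0 && len % blocksz == 0);
    long nsyscalls = 0;
    for (std::size_t off = len; off > 0; ) {
        off -= blocksz;
        if (lseek(fd, (off_t) off, SEEK_SET) == (off_t) -1) {
            return -1;
        }
        ++nsyscalls;
        if (write(fd, data + off, blocksz) != (ssize_t) blocksz) {
            return -1;   // short writes count as errors in this sketch
        }
        ++nsyscalls;
    }
    return nsyscalls;
}

// Usage sketch: write two 4096-byte blocks to a scratch file at `path`,
// read the file back, and check both the contents and the call count.
bool demo_reverse_block_write(const char* path) {
    static char data[8192], back[8192];
    std::memset(data, 'A', 4096);
    std::memset(data + 4096, 'B', 4096);
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        return false;
    }
    long n = reverse_block_write(fd, data, sizeof(data), 4096);
    bool ok = n == 4
        && pread(fd, back, sizeof(back), 0) == (ssize_t) sizeof(back)
        && std::memcmp(data, back, sizeof(data)) == 0;
    close(fd);
    unlink(path);
    return ok;
}
```

Running a program like this under strace should show alternating lseek and write lines, highest offset first, in the same shape as the trace above.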

Reverse-sequential byte writes using stdio will cause many more lseeks, one per byte, because stdio always emits one lseek system call per fseek library call. For instance, here’s how reverse-sequential byte reads look:

lseek(3, 970752, SEEK_SET)              = 970752
read(3, "zigzags\nzilch\nzilch's\nzillion\nzi"..., 4096) = 826
lseek(3, 971578, SEEK_SET)              = 971578
lseek(3, 971578, SEEK_SET)              = 971578
lseek(3, 971578, SEEK_SET)              = 971578
lseek(3, 971578, SEEK_SET)              = 971578
...

The actual difference between #1 and #2/#4 will be even greater, because when doing reverse-sequential byte writes, stdio also calls write once per byte—the cache is effectively disabled. But you didn’t need to know that to answer the question.

lseek(4, 971577, SEEK_SET)              = 971577
write(4, "A", 1)                        = 1
lseek(4, 971576, SEEK_SET)              = 971576
write(4, "\n", 1)                       = 1
lseek(4, 971575, SEEK_SET)              = 971575
write(4, "A", 1)                        = 1
lseek(4, 971574, SEEK_SET)              = 971574
write(4, "'", 1)                        = 1
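A stdio program with this access pattern might look like the following sketch. The function name is hypothetical. The exact system calls a C library makes for fseek and fputc depend on the implementation, so the per-byte lseek and write pairs described above are what the traced system did, not something every stdio must do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: write `len` bytes of `data` to `f` one byte at a
// time, from the last byte to the first, seeking before every byte.
// Returns true on success.
bool reverse_byte_write_stdio(FILE* f, const char* data, std::size_t len) {
    assert(f != nullptr);
    for (std::size_t off = len; off > 0; ) {
        --off;
        if (std::fseek(f, (long) off, SEEK_SET) != 0) {
            return false;
        }
        if (std::fputc((unsigned char) data[off], f) == EOF) {
            return false;
        }
    }
    return std::fflush(f) == 0;
}
```

Each fseek moves backward past the byte that was just buffered, which is why the buffered byte has to be written out before the next one can be cached.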

	E	D	C	B	A	E	D	A	A	A	B	C	D	E
1	Ⓔ				Ⓐ			A	A	A				Ⓔ
2		Ⓓ				Ⓔ						Ⓒ
3			Ⓒ				Ⓓ						D
4				Ⓑ							B

	E	D	C	B	A	E	D	A	A	A	B	C	D	E
1	Ⓔ					E								E
2		Ⓓ					D						D
3			Ⓒ		Ⓐ			A	A	A		Ⓒ
4				Ⓑ							B
