Exercises not as directly relevant to this year’s class are marked with ⚠️.
IO-1. I/O caching
Mary Ruefle, a poet who lives in Vermont, is working on her caching I/O library for CS 61. She wants to implement a cache with N slots. Since searching those slots might slow down her library, she writes a function that maps addresses to slots. Here’s some of her code.
#define SLOTSIZ 4096
struct io61_slot {
char buf[SLOTSIZ];
off_t pos; // = (off_t) -1 for empty slots
ssize_t sz;
};
#define NSLOTS 64
struct io61_file {
int fd;
off_t pos; // current file position
io61_slot slots[NSLOTS];
};
static inline int find_slot(off_t off) {
return off % NSLOTS;
}
int io61_readc(io61_file* f) {
int slotindex = find_slot(f->pos);
io61_slot* s = &f->slots[slotindex];
if (f->pos < s->pos || f->pos >= s->pos + s->sz) {
// slot contains wrong data, need to refill it
off_t new_pos = lseek(f->fd, f->pos, SEEK_SET);
assert(new_pos != (off_t) -1); // only handle seekable files for now
ssize_t r = read(f->fd, s->buf, SLOTSIZ);
if (r == -1 || r == 0) {
return EOF;
}
s->pos = f->pos;
s->sz = r;
}
int ch = (unsigned char) s->buf[f->pos - s->pos];
++f->pos;
return ch;
}
Before she can run and debug this code, Mary is led “to an emergency of feeling that … results in a poem.” She’ll return to CS61 and fix her implementation soon, but in the meantime, let’s answer some questions about it.
QUESTION IO-1A. True or false: Mary’s cache is direct-mapped, meaning that each cacheable data chunk has exactly one slot that can cache it.
True
QUESTION IO-1B. What changes to Mary’s code could change your answer to Part A? Circle all that apply.
1. New code for `find_slot` (keeping `io61_readc` the same)
2. New code in `io61_readc` (keeping `find_slot` the same)
3. New code in `io61_readc` and new code for `find_slot`
4. None of the above

#2 and #3
QUESTION IO-1C. Which problems would occur when Mary’s code was used to sequentially read a seekable file of size 2 MiB (2×2^20 = 2097152 bytes) one character at a time? Circle all that apply.
- Excessive process CPU usage (>10x stdio)
- Many system calls to read data (>10x stdio)
- Incorrect data (byte x read at a position where the file has byte y≠x)
- Read too much data (more bytes read than file contains)
- Read too little data (fewer bytes read than file contains)
- Crash/undefined behavior
- None of the above
The basic problem with Mary’s code is that adjacent bytes are stored in different slots, because the hash function maps adjacent offsets to different slots. This means that when reading sequentially, a byte at a time, Mary’s code will read a full block for each byte!
#2 is the best answer. #1 is close, but excessive system calls cause more kernel CPU usage.
QUESTION IO-1D. Which of these new implementations for `find_slot` would fix at least one of these problems with reading sequential files? Circle all that apply.

1. `return (off * 2654435761) % NSLOTS; /* integer hash function from Stack Overflow */`
2. `return (off / SLOTSIZ) % NSLOTS;`
3. `return off & (NSLOTS - 1);`
4. `return 0;`
5. `return (off >> 12) & 0x3F;`
6. None of the above
We need a hash function that usually maps adjacent offsets into the same slot. #2, #4, #5.
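For concreteness, here is choice #2 written out as a drop-in replacement for Mary’s `find_slot` (a sketch; `SLOTSIZ` and `NSLOTS` are the macros defined above). Because every offset inside a 4096-byte block now maps to the same slot, a sequential character-at-a-time reader refills a slot only once per block instead of once per byte.

```cpp
// Option #2: all offsets within the same SLOTSIZ-byte block map to the
// same slot, so sequential byte reads mostly hit the cached block.
static inline int find_slot(off_t off) {
    return (off / SLOTSIZ) % NSLOTS;
}
// e.g. offsets 0..4095 -> slot 0, offsets 4096..8191 -> slot 1, ...,
// wrapping back to slot 0 at offset 64 * 4096.
```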
IO-2. Caches and reference strings
QUESTION IO-2A. True or false: A direct-mapped cache with N or more slots can handle any reference string containing ≤N distinct addresses with no misses except for cold misses.
False: direct-mapped caches can have conflict misses
QUESTION IO-2B. True or false: A fully-associative cache with N or more slots can handle any reference string containing ≤N distinct addresses with no misses except for cold misses.
True
Consider the following 5 reference strings.
Name | String |
---|---|
α | 1 |
β | 1, 2 |
γ | 1, 2, 3, 4, 5 |
δ | 2, 4 |
ε | 5, 2, 4, 2 |
QUESTION IO-2C. Which of the strings might indicate a sequential access pattern? Circle all that apply.
- α
- β
- γ
- δ
- ε
- None of these
(α), β, γ
QUESTION IO-2D. Which of the strings might indicate a strided access pattern with stride >1? Circle all that apply.
- α
- β
- γ
- δ
- ε
- None of these
(α), δ
One very clever person pointed out that β and γ could also represent large strides: for example, consider a file with 10 bytes accessed with stride 11!
The remaining questions concern concatenated permutations of these five strings. For example, the permutation αβγδε refers to this reference string:
1, 1, 2, 1, 2, 3, 4, 5, 2, 4, 5, 2, 4, 2.
We pass such permutations through an initially-empty, fully-associative cache with 3 slots, and observe the numbers of hits.
QUESTION IO-2E. How many cold misses might a permutation observe? Circle all that apply.
- 0
- 1
- 2
- 3
- 4
- 5
- Some other number
5. The first time a reference string address is encountered, it must cause a cold miss.
Under LRU eviction, the permutation αβεγδ observes 5 hits as follows. (We annotate each access with “h” for hit or “m” for miss.)
1m; 1h, 2m; 5m, 2h, 4m, 2h; 1m, 2h, 3m, 4m, 5m; 2m, 4h.
QUESTION IO-2F. How many hits does this permutation observe under FIFO eviction?
4 hits
QUESTION IO-2G. Give a permutation that will observe 8 hits under LRU eviction, which is the maximum for any permutation. There are several possible answers. (Write your answer as a permutation of αβγδε. For partial credit, find a permutation that has 7 hits, etc.)
The following four permutations observe 8 hits under LRU: αβγδε, αβγεδ, βαγδε, βαγεδ. 28 permutations observe 7 hits; 25 observe 6 hits; and 38 observe 5 hits.
QUESTION IO-2H. Give a permutation that will observe 2 hits under LRU eviction, which is the minimum for any permutation. There is one unique answer. (Write your answer as a permutation of αβγδε. For partial credit, find a permutation that has 3 hits, etc.)
δαεγβ. 4 permutations observe 3 hits and 20 observe 4 hits.
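Hit counts like these are easy to check mechanically. The sketch below (not part of the original solutions) replays a reference string through a fully-associative cache under LRU or FIFO eviction and counts hits; running it on αβεγδ with 3 slots reproduces the 5 LRU hits and 4 FIFO hits above.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Count hits for a reference string on an `nslots`-slot fully-associative
// cache. `slots` holds cached addresses ordered oldest-first, so evicting
// the front implements FIFO; for LRU we also move hits to the back.
size_t count_hits(const std::string& refs, size_t nslots, bool lru) {
    std::vector<char> slots;
    size_t hits = 0;
    for (char a : refs) {
        auto it = std::find(slots.begin(), slots.end(), a);
        if (it != slots.end()) {
            ++hits;
            if (lru) {               // LRU: mark as most recently used
                slots.erase(it);
                slots.push_back(a);
            }                        // FIFO: leave the load order unchanged
        } else {
            if (slots.size() == nslots) {
                slots.erase(slots.begin());   // evict the oldest entry
            }
            slots.push_back(a);
        }
    }
    return hits;
}

int main() {
    // αβεγδ = "1" + "12" + "5242" + "12345" + "24"
    const std::string refs = "11252421234524";
    printf("LRU hits:  %zu\n", count_hits(refs, 3, true));   // prints 5
    printf("FIFO hits: %zu\n", count_hits(refs, 3, false));  // prints 4
}
```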
IO-3. Processor cache
The git version control system is based on commit hashes, which are 160-bit (20-byte) hash values used to identify commits. In this problem you’ll consider the processor cache behavior of several versions of a “grading server” that maps commits to grades. Here’s the first version:
struct commit_info {
char hash[20];
int grade[11];
};
commit_info* commits;
size_t N;
int get_grade1(const char* hash, int pset) {
for (size_t i = 0; i != N; ++i) {
if (memcmp(commits[i].hash, hash, 20) == 0) {
return commits[i].grade[pset];
}
}
return -1;
}
We will ask questions about the average number of distinct 64-byte cache lines accessed by variants of `get_grade(hash, pset)`. You should make the following assumptions:

- The `hash` argument is uniformly drawn from the set of known commits. That is, the probability that `hash` equals the ith commit’s hash is 1/N.
- Only count cache lines accessible via `commits`. Don’t worry about cache lines used for local variables, for parameters, for global variables, or for instructions. For instance, do not count the `hash` argument or the global-data cache line that stores the `commits` variable itself.
- The `commits` pointer is 64-byte aligned and cache lines are 64 bytes long.
- Commit hashes are mathematically indistinguishable from random numbers. Thus, the probability that two different hashes have the same initial k bits equals \(1/2^k\).
- We’ll ignore small errors; N/2 and (N+1)/2 will be considered equivalent.
QUESTION IO-3A. What is the expected number of cache lines accessed by `get_grade1`, in terms of N?
The answer is \((N+1)/2\), which is roughly \(N/2\).
Formally, let \(C\) be a random variable representing the number of cache lines accessed. This is a discrete random variable: there are \(N\) possibilities \(C_1,\dots,C_N\), corresponding to the number of cache lines required to access commits \(1,\dots,N\). Then, given our assumptions, the expected value of \(C\) is
\[ E[C] = \sum_{i=1}^N Pr[\text{commit is $i$th commit}] \times C_i = \frac{1}{N}\sum_{i=1}^N C_i. \]
So far, this analysis holds for any `commit_info` structure. For `get_grade1` specifically, commits are stored contiguously in order and `sizeof(commit_info)` is 64 bytes (20 + 11×4), so accessing the \(i\)th commit accesses \(i\) cache lines, and \(C_i = i\):

\[ E[C] = \frac{1}{N} \sum_{i=1}^N i = \frac{1}{N}\frac{N(N+1)}{2} = \frac{N+1}{2}. \]
The second version:
struct commit_info {
char hash[20];
int grade[11];
};
commit_info** commits;
size_t N;
int get_grade2(const char hash[20], int pset) {
for (size_t i = 0; i != N; ++i) {
if (memcmp(commits[i]->hash, hash, 20) == 0) {
return commits[i]->grade[pset];
}
}
return -1;
}
QUESTION IO-3B. What is the expected number of cache lines accessed by `get_grade2`, in terms of N?
Roughly 9N/16.
We still examine ≈N/2 `commit_info` objects, but in addition we examine cache lines to evaluate the pointers to those objects. There are 8 such pointers per cache line (8×8=64), and we examine around N/2 pointers, for ≈(N/2)/8 = N/16 additional cache lines.
Formally \( C_i = \lceil i / 8 \rceil + i \), so
\[ E[C] = \frac{N+1}{2} + \frac{1}{N}\sum_{i=1}^N \lceil i/8 \rceil. \]
There’s no convenient closed form for the sum of the ceiling function, but we know that \( \lceil i/8 \rceil \leq (i + 7)/8 \), so
\[ E[C] \leq \frac{N+1}{2} + \frac{1}{N}\sum_{i=1}^N \frac{i+7}{8} = \frac{N+1}{2} + \frac{N+15}{16} = \frac{9N + 23}{16}. \]
The third version:
struct commit_info {
char hash[20];
int grade[11];
};
struct commit_hint {
char hint[8];
commit_info* commit;
};
commit_hint* commits;
size_t N;
int get_grade3(const char* hash, int pset) {
for (size_t i = 0; i != N; ++i) {
if (memcmp(commits[i].hint, hash, 8) == 0
&& memcmp(commits[i].commit->hash, hash, 20) == 0) {
return commits[i].commit->grade[pset];
}
}
return -1;
}
QUESTION IO-3C. What is the expected number of cache lines accessed by `get_grade3`, in terms of N? (You may assume that N ≤ 2000.)
Roughly \( 1 + \lceil N / 8 \rceil \).
The assumption that N≤2000 means we’re exceedingly unlikely to encounter a hint collision (i.e. a commit with the same hint, but a different commit value). That means that we will examine ≈N/2 `commit_hint` objects but only one `commit_info` object. `commit_hint` objects are 16 bytes big, so there are four of them per cache line, and we examine roughly (N/2)/4 = N/8 cache lines for hint objects.

Formally \( C_i = \lceil i / 4 \rceil + 1 \), so
\[ \begin{aligned} E[C] &= \frac{1}{N} \sum_{i=1}^N \left( \left\lceil \frac{i}{4} \right\rceil + 1 \right) = 1 + \frac{1}{N} \sum_{i=1}^N \left\lceil \frac{i}{4} \right\rceil \\ &\leq 1 + \frac{1}{N} \sum_{i=1}^N \frac{i+3}{4} = 1 + \frac{N+7}{8}. \end{aligned} \]
The fourth version is a hash table.
struct commit_info {
char hash[20];
int grade[11];
};
commit_info** commits;
size_t commits_hashsize;
int get_grade4(const char* hash, int pset) {
// choose initial bucket
size_t bucket;
memcpy(&bucket, hash, sizeof(bucket));
bucket = bucket % commits_hashsize;
// search for the commit starting from that bucket
while (commits[bucket] != nullptr) {
if (memcmp(commits[bucket]->hash, hash, 20) == 0) {
return commits[bucket]->grade[pset];
}
bucket = (bucket + 1) % commits_hashsize;
}
return -1;
}
QUESTION IO-3D. Assume that a call to `get_grade4` encounters `B - 1` expected hash collisions (i.e., examines `B` buckets total, including the bucket that actually contains `hash`). What is the expected number of cache lines accessed by `get_grade4`, in terms of N and B?
Roughly \( B + \lceil B/8 \rceil + 1 \); more roughly, \( 9B/8 \).
The `while` loop will execute \(B\) times total. Each time through, it will access one bucket and one `commit_info` object. A bucket is represented as a pointer to a `commit_info` object, so there are eight such pointers per cache line. It’s difficult to express an exact formula for the number of cache lines accessed; for instance, if the initial `bucket` index is 0, then \(B=8\) will access just one cache line for all buckets, but if the initial `bucket` index is 7, then \(B=8\) will access two cache lines. We can see, though, that the number of bucket cache lines accessed is always \( \leq \lceil B/8 \rceil + 1\).

Formally, \( C_i \leq B + \lceil B/8 \rceil + 1 \). Since this is independent of \(N\), \( E[C] \leq B + \lceil B/8 \rceil + 1 \approx 9B/8 \).
IO-4. IO caching and strace
Elif Batuman is investigating several program executables left behind by
her ex-roommate Fyodor. She runs each executable under strace
in the
following way:
strace -o strace.txt ./EXECUTABLE files/text1meg.txt > files/out.txt
Help her figure out properties of these programs based on their system call traces.
QUESTION IO-4A. Program ./mysterya
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x8193000
brk(0x81b5000) = 0x81b5000
read(3, "A", 1) = 1
write(1, "A", 1) = 1
read(3, "\n", 1) = 1
write(1, "\n", 1) = 1
read(3, "A", 1) = 1
write(1, "A", 1) = 1
read(3, "'", 1) = 1
write(1, "'", 1) = 1
read(3, "s", 1) = 1
write(1, "s", 1) = 1
...
Circle at least one option in each column.
1, Sequential IO; a, No read cache; i, No write cache; D, Other
QUESTION IO-4B. Program ./mysteryb
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x96c5000
brk(0x96e6000) = 0x96e6000
read(3, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 2048) = 2048
write(1, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 2048) = 2048
read(3, "kad\nAkron\nAkron's\nAl\nAl's\nAla\nAl"..., 2048) = 2048
write(1, "kad\nAkron\nAkron's\nAl\nAl's\nAla\nAl"..., 2048) = 2048
...
Circle at least one option in each column.
1, Sequential IO; b/c, (Un)aligned read cache; ii, Write cache; B, Cache size 2048.
We don’t know whether the read cache is aligned because the reads are sequential. Only non-sequential reads distinguish between aligned and unaligned read caches.
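For intuition, a program shaped like the sketch below (a guess at the structure, not Fyodor’s actual code) would produce a trace like mysteryb’s: one 2048-byte `read` followed by one 2048-byte `write`, repeated sequentially, i.e. a 2048-byte read cache and a 2048-byte write cache.

```cpp
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc < 2) {
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);   // appears as fd 3 in the trace
    char buf[2048];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        write(1, buf, n);   // short writes ignored for brevity
    }
    close(fd);
}
```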
QUESTION IO-4C. Program ./mysteryc
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x9064000
brk(0x9085000) = 0x9085000
fstat64(3, {st_mode=S_IFREG|0664, st_size=1048576, ...}) = 0
lseek(3, 1046528, SEEK_SET) = 1046528
read(3, "ingau\nRheingau's\nRhenish\nRhianno"..., 2048) = 2048
write(1, "oR\ntlevesooR\ns'yenooR\nyenooR\ns't"..., 2048) = 2048
lseek(3, 1044480, SEEK_SET) = 1044480
read(3, "Quinton\nQuinton's\nQuirinal\nQuisl"..., 2048) = 2048
write(1, "ehR\neehR\naehR\ns'hR\nhR\nsdlonyeR\ns"..., 2048) = 2048
lseek(3, 1042432, SEEK_SET) = 1042432
read(3, "emyslid's\nPrensa\nPrensa's\nPrenti"..., 2048) = 2048
write(1, "\ns'nailitniuQ\nnailitniuQ\nnniuQ\ns"..., 2048) = 2048
lseek(3, 1040384, SEEK_SET) = 1040384
read(3, "Pindar's\nPinkerton\nPinocchio\nPin"..., 2048) = 2048
write(1, "rP\ndilsymerP\ns'regnimerP\nregnime"..., 2048) = 2048
...
Circle at least one option in each column.
2, Reverse sequential IO; c, Aligned read cache; ii, Write cache; B, Cache size 2048
This time we can assume the read cache is aligned. The offsets are aligned to multiples of 2048, though the results look like reverse-sequential bytewise I/O.
QUESTION IO-4D. Program ./mysteryd
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x9a0e000
brk(0x9a2f000) = 0x9a2f000
fstat64(3, {st_mode=S_IFREG|0664, st_size=1048576, ...}) = 0
lseek(3, 1048575, SEEK_SET) = 1048575
read(3, "o", 2048) = 1
lseek(3, 1048574, SEEK_SET) = 1048574
read(3, "Ro", 2048) = 2
lseek(3, 1048573, SEEK_SET) = 1048573
read(3, "\nRo", 2048) = 3
...
lseek(3, 1046528, SEEK_SET) = 1046528
read(3, "ingau\nRheingau's\nRhenish\nRhianno"..., 2048) = 2048
write(1, "oR\ntlevesooR\ns'yenooR\nyenooR\ns't"..., 2048) = 2048
lseek(3, 1046527, SEEK_SET) = 1046527
read(3, "eingau\nRheingau's\nRhenish\nRhiann"..., 2048) = 2048
lseek(3, 1046526, SEEK_SET) = 1046526
read(3, "heingau\nRheingau's\nRhenish\nRhian"..., 2048) = 2048
...
Circle at least one option in each column.
2, Reverse sequential IO; b, Unaligned read cache; ii, Write cache; B, Cache size 2048
QUESTION IO-4E. Program ./mysterye
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x93e5000
brk(0x9407000) = 0x9407000
read(3, "A", 1) = 1
read(3, "\n", 1) = 1
read(3, "A", 1) = 1
...
read(3, "A", 1) = 1
read(3, "l", 1) = 1
write(1, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 1024) = 1024
read(3, "t", 1) = 1
read(3, "o", 1) = 1
read(3, "n", 1) = 1
...
Circle at least one option in each column.
1, Sequential IO; a, No read cache; ii, Write cache; C, (write) cache size 1024
QUESTION IO-4F. Program ./mysteryf
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x9281000
brk(0x92a3000) = 0x92a3000
read(3, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 4096) = 4096
write(1, "A", 1) = 1
write(1, "\n", 1) = 1
write(1, "A", 1) = 1
...
write(1, "A", 1) = 1
write(1, "l", 1) = 1
read(3, "ton's\nAludra\nAludra's\nAlva\nAlvar"..., 4096) = 4096
write(1, "t", 1) = 1
write(1, "o", 1) = 1
write(1, "n", 1) = 1
...
Circle at least one option in each column.
1, Sequential IO; b/c, (un)aligned read cache; i, No write cache; A, Cache size 4096
IO-5. Processor cache
The following questions use the following C definition for an N×M matrix (the matrix has N rows and M columns).
struct matrix {
unsigned N;
unsigned M;
double elt[0];
};
matrix* matrix_create(unsigned N, unsigned M) {
matrix* m = (matrix*) malloc(sizeof(matrix) + N * M * sizeof(double));
m->N = N;
m->M = M;
for (size_t i = 0; i < N * M; ++i) {
m->elt[i] = 0.0;
}
return m;
}
Typically, matrix data is stored in row-major order: element \(m_{ij}\) (at row i and column j) is stored in `m->elt[i*m->M + j]`. We might write this in C using an inline function:
inline double* melt1(matrix* m, unsigned i, unsigned j) {
return &m->elt[i * m->M + j];
}
But that’s not the only possible method to store matrix data. Here are several more.
inline double* melt2(matrix* m, unsigned i, unsigned j) {
return &m->elt[i + j * m->N];
}
inline double* melt3(matrix* m, unsigned i, unsigned j) {
return &m->elt[i + ((m->N - i + j) % m->M) * m->N];
}
inline double* melt4(matrix* m, unsigned i, unsigned j) {
return &m->elt[i + ((i + j) % m->M) * m->N];
}
inline double* melt5(matrix* m, unsigned i, unsigned j) {
assert(m->M % 8 == 0);
unsigned k = (i/8) * (m->M/8) + (j/8);
return &m->elt[k*64 + (i % 8) * 8 + j % 8];
}
QUESTION IO-5A. Which method (of `melt1`–`melt5`) will have the best processor cache behavior if most matrix accesses use loops like this?
for (unsigned j = 0; j < 100; ++j) {
for (unsigned i = 0; i < 100; ++i) {
f(*melt(m, i, j));
}
}
melt2
QUESTION IO-5B. Which method will have the best processor cache behavior if most matrix accesses use loops like this?
for (unsigned i = 0; i < 100; ++i) {
f(*melt(m, i, i));
}
melt3
QUESTION IO-5C. Which method will have the best processor cache behavior if most matrix accesses use loops like this?
for (unsigned i = 0; i < 100; ++i) {
for (unsigned j = 0; j < 100; ++j) {
f(*melt(m, i, j));
}
}
`melt1` (but `melt5` is almost as good)
QUESTION IO-5D. Which method will have the best processor cache behavior if most matrix accesses use loops like this?
for (int di = -3; di <= 3; ++di) {
for (int dj = -3; dj <= 3; ++dj) {
f(*melt(m, I + di, J + dj));
}
}
melt5
QUESTION IO-5E. Here is a matrix-multiply function in ikj order.
matrix* matrix_multiply(matrix* a, matrix* b) {
assert(a->M == b->N);
matrix* c = matrix_create(a->N, b->M);
for (unsigned i = 0; i != a->N; ++i) {
for (unsigned k = 0; k != a->M; ++k) {
for (unsigned j = 0; j != b->M; ++j) {
*melt(c, i, j) += *melt(a, i, k) * *melt(b, k, j);
}
}
}
return c;
}
This loop order is cache-optimal when data is stored in `melt1` order. What loop order is cache-optimal for `melt2`?
jki is best; kji is a close second.
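For reference, here is the multiply rewritten in jki order (a sketch that spells out `melt2` and reuses `matrix_create` from above). With `melt2`’s column-major layout, the innermost `i` loop walks adjacent elements of both `c` and `a`, while the `b` element stays fixed, which is why this order is cache-friendly.

```cpp
matrix* matrix_multiply_jki(matrix* a, matrix* b) {
    assert(a->M == b->N);
    matrix* c = matrix_create(a->N, b->M);
    for (unsigned j = 0; j != b->M; ++j) {
        for (unsigned k = 0; k != a->M; ++k) {
            for (unsigned i = 0; i != a->N; ++i) {
                // melt2 addresses: c -> i + j*N, a -> i + k*N (consecutive in i)
                *melt2(c, i, j) += *melt2(a, i, k) * *melt2(b, k, j);
            }
        }
    }
    return c;
}
```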
QUESTION IO-5F. You notice that accessing a matrix element using `melt1` is very slow. After some debugging, it seems like the processor on which you are running code has a very slow integer multiply instruction. Briefly describe a change to `struct matrix` that would let you write a version of `melt1` with no integer multiply instruction. You may add members, change sizes, or anything you like.
The integer multiplies in this code occur when computing the position of an element in the matrix, in the `i * m->M` computation. So we need to get rid of those integer multiplies. Example answers:

- Add a `double** rows` member that points to each row, changing `&m->elt[i * m->M + j]` to something like `&m->rows[i][j]`.
- Round M up to a power of 2 and use shifts.
- Use floating-point multiply and round down :)
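A sketch of the first suggestion: keep the row-major storage, but add a per-row pointer array so `melt1` needs a pointer lookup and an add instead of a multiply. (The `rows` member and allocation details below are illustrative, not from the original.)

```cpp
#include <cstdlib>

struct matrix {
    unsigned N;
    unsigned M;
    double** rows;     // rows[i] points at the first element of row i
    double elt[0];     // row-major element storage, as before
};

matrix* matrix_create(unsigned N, unsigned M) {
    matrix* m = (matrix*) malloc(sizeof(matrix) + N * M * sizeof(double));
    m->N = N;
    m->M = M;
    m->rows = (double**) malloc(N * sizeof(double*));
    for (size_t i = 0; i < N; ++i) {
        m->rows[i] = &m->elt[i * M];   // multiplies happen once, at creation
    }
    for (size_t i = 0; i < (size_t) N * M; ++i) {
        m->elt[i] = 0.0;
    }
    return m;
}

inline double* melt1(matrix* m, unsigned i, unsigned j) {
    return &m->rows[i][j];             // pointer chase + add; no multiply
}
```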
IO-6. Caching
Assume that we have a cache that holds four slots. Assume that each letter below indicates an access to a block. Answer the following questions as they pertain to the following sequence of accesses.
E D C B A E D A A A B C D E
QUESTION IO-6A. What is the hit rate assuming an LRU replacement policy?
| | E | D | C | B | A | E | D | A | A | A | B | C | D | E |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ⓔ | | | | Ⓐ | | | A | A | A | | | | Ⓔ |
| 2 | | Ⓓ | | | | Ⓔ | | | | | | Ⓒ | | |
| 3 | | | Ⓒ | | | | Ⓓ | | | | | | D | |
| 4 | | | | Ⓑ | | | | | | | B | | | |

(Circled letters mark misses that load the slot; plain letters mark hits.) The hit rate is 5/14.
QUESTION IO-6B. What pages will you have in the cache at the end of the run?
B C D E
QUESTION IO-6C. What is the best possible hit rate attainable if you could see into the future?
With Bélády’s algorithm we get:
| | E | D | C | B | A | E | D | A | A | A | B | C | D | E |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ⓔ | | | | | E | | | | | | | | E |
| 2 | | Ⓓ | | | | | D | | | | | | D | |
| 3 | | | Ⓒ | | Ⓐ | | | A | A | A | | Ⓒ | | |
| 4 | | | | Ⓑ | | | | | | | B | | | |

So 8/14 (4/7).
IO-7. Caching
Intel and CrossPoint have announced a new persistent memory technology with performance approaching that of DRAM. Your job is to calculate some performance metrics to help system architects decide how to best incorporate this new technology into their platform.
Let's say that it takes 64ns to access one (32-bit) word of main memory (DRAM) and 256ns to access one (32-bit) word of this new persistent memory, which we'll call NVM (non-volatile memory). The block size of the NVM is 256 bytes. The NVM designers are quite smart and although it takes a long time to access the first byte, when you are accessing NVM sequentially, the devices perform read ahead and stream data efficiently -- at 32 GB/second, which is identical to the bandwidth of DRAM.
QUESTION IO-7A. Let's say that we are performing random accesses of 32 bits (on a 32-bit processor). What fraction of the accesses must be to main memory (as opposed to NVM) to achieve performance within 10% of DRAM?
Let X be the fraction of accesses to DRAM: access time = 64X + 256(1−X). We want that to be ≤ 1.1 × 64 ns = 70.4 ns (within 10% of DRAM). So solve 64X + 256(1−X) = 70.4:

64X + 256 − 256X = 70.4
192X = 256 − 70.4 = 185.6
X = 185.6/192 ≈ 0.97

So we need a hit rate in main memory of about 97%.
QUESTION IO-7B. Let’s say that they write every byte of a 256-byte block in units of 32 bits. How much faster will a write-back cache perform relative to a write-through cache? (An approximate order of magnitude will be sufficient; showing work can earn partial credit.)
Write-through: every 4-byte write goes straight to NVM, so writing a 256-byte block costs 64 writes × 256 ns = 2^14 ns = 16384 ns, roughly 16 microseconds. With a write-back cache, the 64 word writes go to DRAM instead, taking 64 × 64 ns = 2^12 ns ≈ 4 microseconds, and the block is then streamed from DRAM into NVM at 32 GB/s. 32 GB/s is about 32 KB per microsecond, and we only need to move 1/4 KB, so streaming takes about 1/128 of a microsecond, roughly 8 ns. Once you see that the streaming time is tiny relative to even a single word write, the real difference is just the relative cost of writing to DRAM versus NVM, which is 1:4. So the write-back cache is about 4 times faster than the write-through cache (it runs at almost full DRAM speed). You could get full credit by saying exactly that: the time to stream the data out of DRAM into NVM at sequential speed is negligible, so the ultimate difference is the 4:1 ratio between NVM and DRAM write costs.
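The same numbers as a compact worked computation:

\[ \begin{aligned} t_{\text{write-through}} &= 64 \times 256\,\text{ns} = 2^{14}\,\text{ns} \approx 16\,\mu\text{s},\\ t_{\text{write-back}} &\approx 64 \times 64\,\text{ns} + \frac{256\,\text{B}}{32\,\text{GB/s}} \approx 4096\,\text{ns} + 8\,\text{ns} \approx 4\,\mu\text{s},\\ \text{speedup} &\approx 16384 / 4104 \approx 4\times. \end{aligned} \]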
QUESTION IO-7C. Why might you not want to use a write-back cache?
A write-through cache will have very different persistence guarantees. If you need every 4-byte write to be persistent, then you have no choice but to implement a write-through cache.
IO-8. Reference strings
The following questions concern the FIFO (First In First Out), LRU (Least Recently Used), and LFU (Least Frequently Used) cache eviction policies.
⚠️ LFU (not covered in 2020) evicts the item that was accessed least frequently.
Your answers should refer to seven-item reference strings made up of digits in the range 0–9. An example answer might be “1231231”. In each case, the reference string is processed by a 3-slot cache that’s initially empty.
QUESTION IO-8A. Give a reference string that has a 1/7 hit rate in all three policies.
1123456
QUESTION IO-8B. Give a reference string that has a 6/7 hit rate in all three policies.
1111111
QUESTION IO-8C. Give a reference string that has different hit rates under LRU and LFU policies, and compute the hit rates.
String:
LRU hit rate:
LFU hit rate:
We’re looking for a string whose least-recently used item is not its least-frequently used item.
String: 1123411

LRU hit rate: 2/7

| | 1 | 1 | 2 | 3 | 4 | 1 | 1 |
|---|---|---|---|---|---|---|---|
| α | ① | 1 | | | ④ | | |
| β | | | ② | | | ① | 1 |
| γ | | | | ③ | | | |

LFU hit rate: 3/7

| | 1 | 1 | 2 | 3 | 4 | 1 | 1 |
|---|---|---|---|---|---|---|---|
| α | ① | 1 | | | | 1 | 1 |
| β | | | ② | | ④ | | |
| γ | | | | ③ | | | |
QUESTION IO-8D. Give a reference string that has different hit rates under FIFO and LRU policies, and compute the hit rates.
String:
FIFO hit rate:
LRU hit rate:
We’re looking for a string where an item is reused after other items are loaded into the cache. This will make it less of a target for LRU eviction than FIFO eviction.
String: 1231411

FIFO hit rate: 2/7

| | 1 | 2 | 3 | 1 | 4 | 1 | 1 |
|---|---|---|---|---|---|---|---|
| α | ① | | | 1 | ④ | | |
| β | | ② | | | | ① | 1 |
| γ | | | ③ | | | | |

LRU hit rate: 3/7

| | 1 | 2 | 3 | 1 | 4 | 1 | 1 |
|---|---|---|---|---|---|---|---|
| α | ① | | | 1 | | 1 | 1 |
| β | | ② | | | ④ | | |
| γ | | | ③ | | | | |
QUESTION IO-8E. Now let's assume that you know a reference string in advance. Given a 3-slot cache and the following reference string, what caching algorithm discussed in class and/or exercises would produce the best hit rate, and what would that hit rate be?
“12341425321521”
Bélády’s optimal algorithm (ACCENTS REQUIRED FOR FULL CREDIT!)
1m 2m 3m 4m [124] 1h 4h 2h 5m [125] 3m [123] 2h 1h 5m [125] 2h 1h
7/14 = 1/2
IO-9. Caching: Access times and hit rates
Recall that x86-64 instructions can access memory in units of 1, 2, 4, or 8 bytes at a time. Assume we are running on an x86-64-like machine with 1024-byte cache lines. Our machine takes 32ns to access a unit if the cache hits, regardless of unit size. If the cache misses, an additional 8160ns are required to load the cache, for a total of 8192ns.
QUESTION IO-9A. What is the average access time per access to access all the data in a cache line as an array of 256 integers, starting from an empty cache?
(8192 ns × 1 + 32 ns × 255)/256 = 63.875 ns per access. (The first access to the line misses and loads it; the remaining 255 accesses hit.)
QUESTION IO-9B. What unit size (1, 2, 4, or 8) minimizes the access time to access all data in a cache line, starting from an empty cache?
8
QUESTION IO-9C. What unit size (1, 2, 4, or 8) maximizes the hit rate to access all data in a cache line, starting from an empty cache?
1
IO-10. Single-slot cache code
Donald Duck is working on a single-slot cache for reading. He’s using the `pos_tag`/`end_tag` representation, which is:
struct io61_file {
int fd;
unsigned char cbuf[BUFSIZ];
off_t tag; // file offset of first character in cache (same as before)
off_t end_tag; // file offset one past last valid char in cache; end_tag - tag == old `csz`
off_t pos_tag; // file offset of next char to read in cache; pos_tag - tag == old `cpos`
};
Here’s our solution code; in case you want to scribble, the code is copied in the appendix.
1. ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
2. size_t pos = 0;
3. while (pos != sz) {
4. if (f->pos_tag < f->end_tag) {
5. ssize_t n = sz - pos;
6. if (n > f->end_tag - f->pos_tag)
7. n = f->end_tag - f->pos_tag;
8. memcpy(&buf[pos], &f->cbuf[f->pos_tag - f->tag], n);
9. f->pos_tag += n;
10. pos += n;
11. } else {
12. f->tag = f->end_tag;
13. ssize_t n = read(f->fd, f->cbuf, BUFSIZ);
14. if (n > 0)
15. f->end_tag += n;
16. else
17. return pos ? pos : n;
18. }
19. }
20. return pos;
21. }
Donald has ideas for “simplifying” this code. Specifically, he wants to try each of the following independently:
A. Replacing line 4 with “`if (f->pos_tag <= f->end_tag) {`”.
B. Removing lines 6–7.
C. Removing line 9.
D. Removing lines 16–17.
QUESTION IO-10A. Which simplifications could lead to undefined behavior? List all that apply or say “none.”
B (removing 6–7): you read beyond the cache buffer.
QUESTION IO-10B. Which simplifications could cause io61_read
to
loop forever without causing undefined behavior? List all that apply or
say “none.”
A (replacing 4): you spin forever after exhausting the cache
D (removing 16–17): you spin forever if the file runs out of data or has a persistent error
QUESTION IO-10C. Which simplifications could lead to io61_read
returning incorrect data in buf
, meaning that the data read by a
series of io61_read
calls won’t equal the data in the file? List
all that apply or say “none.”
B (removing 6–7): you read garbage beyond the cache buffer
C (removing 9): you read the same data over & over again.
QUESTION IO-10D. Chastened, Donald decides to optimize the code for a specific situation, namely when `io61_read` is called with a `sz` that is larger than `BUFSIZ`. He wants to add code after line 11, like so, so that fewer `read` system calls will happen for large `sz`:
11. } else if (sz - pos > BUFSIZ) {
// DONALD’S CODE HERE
11A. } else {
12. f->tag = f->end_tag;
....
Finish Donald’s code. Your code should maintain the relevant invariants between `tag`, `pos_tag`, `end_tag`, and the file position, but you need not keep `tag` aligned.
ssize_t n = read(f->fd, &buf[pos], sz - pos);
if (n > 0) {
    f->tag = f->pos_tag = f->end_tag = f->end_tag + n;
    pos += n;
} else {
    return pos ? pos : n;
}
IO-11. Caching
QUESTION IO-11A. If it takes 200ns to access main memory, which of the following two caches will produce a lower average access time?
- A cache with a 10ns access time that produces a 90% hit rate
- A cache with a 20ns access time that produces a 98% hit rate
Let's compute average access time for each case:
.9 × 10 + .1 × 200 = 9 + 20 = 29 ns
.98 × 20 + .02 × 200 = 19.6 + 4 = 23.6 ns
The 20ns cache produces a lower average access time.
QUESTION IO-11B. Let’s say that you have a direct-mapped cache with four slots. A page with page number N must reside in the slot numbered N % 4. What is the best hit rate this could achieve given the following sequence of page accesses?

3 6 7 5 3 2 1 1 1 8
Since it's direct-mapped, each item can go in only one slot, so if we list the slots for each access, we get:

| Access | 3 | 6 | 7 | 5 | 3 | 2 | 1 | 1 | 1 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|
| Slot | 3 | 2 | 3 | 1 | 3 | 2 | 1 | 1 | 1 | 0 |
| Hit/miss | M | M | M | M | M | M | M | H | H | M |
The only hits are the two 1s, so your hit rate is 2/10 or 20% or .2.
QUESTION IO-11C. What is the best hit rate a fully-associative four-slot cache could achieve for that sequence of page accesses? (A fully-associative cache may put any page in any slot. You may assume you know the full reference stream in advance.)
Now we can get hits for 3, 1, and 1, so our hit rate goes to 3/10 or 30%.
QUESTION IO-11D. What hit rate would the fully-associative four-slot cache achieve if it used the LRU eviction policy?
Still 30% (3/10, .3)
IO-12. I/O traces
QUESTION IO-12A. Which of the following programs cannot be distinguished
by the output of the strace
utility, not considering open
calls? List all
that apply; if multiple indistinguishable groups exist (e.g., A, B, & C can’t
be distinguished, and D & E can’t be distinguished, but the groups can be
distinguished from each other), list them all.
1. Sequential byte writes using stdio
2. Sequential byte writes using system calls
3. Sequential byte writes using system calls and `O_SYNC`
4. Sequential block writes using stdio and block size 2
5. Sequential block writes using system calls and block size 2
6. Sequential block writes using system calls and `O_SYNC` and block size 2
7. Sequential block writes using stdio and block size 4096
8. Sequential block writes using system calls and block size 4096
9. Sequential block writes using system calls and `O_SYNC` and block size 4096
If you don’t consider `open`:

- 1, 4, 7, 8, 9 can’t be distinguished (writes with size 4096).
- 2, 3 can’t be distinguished (writes with size 1).
- 5, 6 can’t be distinguished (writes with size 2).

If you do consider `open`:

- 1, 4, 7, 8 can’t be distinguished (same open, writes with size 4096).

Everything else is uniquely distinguishable (either the write size differs or the open flags differ).
QUESTION IO-12B. Which of the programs in Part A cannot be distinguished
using blktrace
output? List all that apply.
1, 2, 4, 5, 7, 8, and possibly 9 can’t be distinguished.
QUESTION IO-12C. The buffer cache is coherent. Which of the following operating system changes could make the buffer cache incoherent? List all that apply.
- Application programs can obtain direct read access to the buffer cache
- Application programs can obtain direct write access to the disk, bypassing the buffer cache
- Other computers can communicate with the disk independently
- The computer has a uninterruptible power supply (UPS), ensuring that the operating system can write the contents of the buffer cache to disk if main power is lost
#2, #3
QUESTION IO-12D. The stdio cache is incoherent. Which of the operating system changes from Part C could make the stdio cache coherent? List all that apply.
#1
IO-13. Reference strings and eviction
QUESTION IO-13A. When demonstrating cache eviction in class, we modeled a completely reactive cache, meaning that the cache performed at most one load from slow storage per access. Name a class of reference string that will have a 0% hit rate on any cold reactive cache. For partial credit, give several examples of such reference strings.
Sequential access
QUESTION IO-13B. What cache optimization can be used to improve the hit rate for the class of reference string in Part A? One word is enough; put the best choice.
Prefetching. Batching is an OK answer, but mostly because it involves prefetching when done for reads.
QUESTION IO-13C. Give a single reference string with the following properties:
- There exists a cache size and eviction policy that gives a 70% hit rate for the string.
- There exists a cache size and eviction policy that gives a 0% hit rate for the string.
Example: 1231231231. 70% on any 3-slot cache; 0% on a 1-slot cache.
QUESTION IO-13D. Put the following eviction algorithms in order of how much space they require for per-slot metadata, starting with the least space and ending with the most space. (Assume the slot order is fixed, so once a block is loaded into slot i, it stays in slot i until it is evicted.) For partial credit say what you think the metadata would be.
- FIFO
- LRU
- Random
Random, then FIFO, then LRU. Random needs no additional metadata; FIFO can deal with a single integer for the whole cache, pointing to the next index to use; LRU needs a per-slot least-recently-used timestamp.
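As a sketch of what that metadata might look like (names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>

struct random_slot {
    // Random eviction: no per-slot metadata at all.
};

struct fifo_cache {
    size_t next_victim;   // one round-robin index for the whole cache
};

struct lru_slot {
    uint64_t last_use;    // per-slot timestamp or access counter; evict the minimum
};
```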
IO-14. Cache code
Several famous musicians have just started working on CS61 Problem Set 3. They share the following code for their read-only, sequential, single-slot cache:
struct io61_file {
int fd;
unsigned char buf[4096];
size_t pos; // position of next character to read in `buf`
size_t sz; // number of valid characters in `buf`
};
int io61_readc(io61_file* f) {
if (f->pos >= f->sz) {
f->pos = f->sz = 0;
ssize_t nr = read(f->fd, f->buf, sizeof(f->buf));
if (nr <= 0) {
f->sz = 0;
return -1;
} else {
f->sz = nr;
}
}
int ch = f->buf[f->pos];
++f->pos;
return ch;
}
But they have different io61_read
implementations. Donald
(Lambert)’s is:
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
return read(f->fd, buf, sz);
}
Solange (Knowles)’s is:
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
for (size_t pos = 0; pos < sz; ++pos, ++buf) {
*buf = io61_readc(f);
}
return sz;
}
Caroline (Shaw)’s is:
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
if (f->pos >= f->sz) {
return read(f->fd, buf, sz);
} else {
int ch = io61_readc(f);
if (ch < 0) {
return 0;
}
*buf = ch;
return io61_read(f, buf + 1, sz - 1) + 1;
}
}
You are testing each of these musicians’ codes by executing a sequence
of io61_readc
and/or io61_read
calls on an input file and
printing the resulting characters to standard output. There are no
seeks, and your test programs print until end of file, so your tests’
output should equal the input file’s contents.
You should assume for these questions that no read
system call
ever returns -1.
QUESTION IO-14A. Describe an access pattern—that is, a sequence of
io61_readc
and/or io61_read
calls (with lengths)—for which
Donald’s code can return incorrect data.
`io61_readc`, `io61_read(1)`, …: Any alternation between readc and read is a disaster.
QUESTION IO-14B. Which of these musicians’ codes can generate an output file with incorrect length?
Solange’s code never returns end-of-file; she mistakes EOF for a valid return value. If a program using Donald’s code calls `io61_readc` once and then switches to `io61_read`, then they will read too few bytes (bytes in the readc buffer won’t be returned).
For the remaining parts, assume the problem or problems in Part B have been corrected, so that all musicians’ codes generate output files with correct lengths.
QUESTION IO-14C. Give an access pattern for which Solange’s code will return correct data and outperform Donald’s, or vice versa, and say whose code will win.
Solange’s code will outperform Donald’s when `io61_read` is called with small sizes. Donald’s code will outperform Solange’s when `io61_read` is called with large sizes.
QUESTION IO-14D. Suggest a small change (≤10 characters) to Caroline’s code that would, most likely, make it perform at least as well as both Solange’s and Donald’s codes on all access patterns. Explain briefly.
I would change Caroline’s test from `if (f->pos >= f->sz)` to `if (f->pos >= f->sz && sz >= 1024)` (or 4096, or something similar). This uses the cache when read is called with small sizes, but avoids the extra copy when read is called with large sizes.
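For concreteness, here is Caroline’s function with that change applied (a sketch; the 4096 threshold is one reasonable choice, and the code reuses the `io61_file` and `io61_readc` definitions above):

```cpp
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    if (f->pos >= f->sz && sz >= 4096) {
        // Cache is empty and the request is large: read straight into `buf`.
        return read(f->fd, buf, sz);
    } else {
        // Otherwise serve (and refill) through the byte cache as before.
        int ch = io61_readc(f);
        if (ch < 0) {
            return 0;
        }
        *buf = ch;
        return io61_read(f, buf + 1, sz - 1) + 1;
    }
}
```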
IO-15. Caches
Parts A–C concern different implementations of Pset 3’s stdio cache. Assume a program that reads a 32768-byte file a character at a time, like this:
while (io61_readc(inf) != EOF) {
}
This program will call `io61_readc` 32769 times. (32769 = 2^15 + 1 = 8×2^12 + 1; the +1 accounts for the EOF return.) But the cache implementation might make many fewer system calls.
QUESTION IO-15A. How many read
system calls are required assuming a
single-slot, 4096-byte io61 cache?
9
QUESTION IO-15B. How many read
system calls are required assuming an
eight-slot, 4096-byte io61 cache?
9
QUESTION IO-15C. How many mmap
system calls are required assuming an
mmap
-based io61 cache?
1
Parts D–F concern cache implementations and styles. We discussed many caches in class, including:
- The buffer cache
- The processor cache
- Single-slot aligned stdio caches
- Single-slot unaligned stdio caches
- Circular bounded buffers
QUESTION IO-15D. Which of those caches are implemented entirely in hardware? List all that apply.
B. Everything else involves some software.
QUESTION IO-15E. Which of those software caches could help speed up reverse sequential access to a disk file? List all that apply.
A, C. Circular bounded buffers are used in the kernel for pipes, which aren’t seekable and don’t support sequential access. It’s alignment that makes single-slot caches good for reverse-sequential access; an unaligned cache might miss every time. (Seek to position 4095, miss, load cache with bytes 4095–8190, read byte 4095; seek to position 4094, miss, load cache with bytes 4094–8189, read byte 4094; etc.)
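A sketch of the difference alignment makes when a single-slot cache decides where to start a fill (helper names are hypothetical):

```cpp
#include <sys/types.h>

// Aligned cache: round the requested offset down to a block boundary.
// A read at offset 4095 fills bytes [0, 4096), so the next reverse-
// sequential reads (4094, 4093, ...) all hit.
off_t aligned_fill_start(off_t pos) {
    return pos - pos % 4096;
}

// Unaligned cache: fill starting at the requested offset. A read at
// offset 4095 fills bytes [4095, 8191), so reads at 4094, 4093, ...
// miss every time.
off_t unaligned_fill_start(off_t pos) {
    return pos;
}
```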
QUESTION IO-15F. Which of those software caches could help speed up access to a pipe or network socket? List all that apply.
C, D, E. Stdio caches speed up all access to files, pipes, and sockets; they live above the kernel. Pipes and network sockets only support sequential access, so aligned vs. unaligned doesn’t matter.
IO-16. LRU
These questions concern the least recently used (LRU) and first-in first-out (FIFO) cache eviction policies.
QUESTION IO-16A. List all that apply.
- LRU is better than FIFO for a workload that consists of reading a file in sequential order.
- If two LRU caches process the same reference string starting from an empty state, then the cache with more slots always has a better hit rate.
- If two LRU caches process the same reference string starting from an empty state, then the cache with more slots never has a worse hit rate.
- LRU and FIFO should have the same hit rate on average for a workload that consists of reading a file in random order.
- None of the above.
3, 4.
For the next two questions, consider a cache with 5 slots that has just processed the reference string 12345. (Thus, its slots contain 1, 2, 3, 4, and 5.)
QUESTION IO-16B. Write a reference string that will observe a higher hit rate under LRU than under FIFO if executed on this cache.
161
QUESTION IO-16C. Write a reference string that will observe a higher hit rate under FIFO than under LRU if executed on this cache.
165432
The remaining questions in this problem concern the operating system’s buffer cache. LRU requires detecting each use of a cache block (to track the time the block was most recently used). In the buffer cache, the “blocks” are physical memory pages, and blocks are “used” by reads, writes, and accesses to `mmap`ed memory.
QUESTION IO-16D. Which of these changes would let a WeensyOS-like operating system reliably track when buffer-cache physical memory pages are used? List all that apply.
1. Adding a system call `track(uintptr_t physical_address)` that a process should call when it accesses a physical page.
2. Adding a member `boottime_t lru_time` to `struct proc`. (`boottime_t` is a type that measures the time since boot.)
3. Adding a member `boottime_t lru_time` to `struct pageinfo`.
4. Modifying kernel system call implementations to update the relevant `lru_time` members when buffer-cache pages are accessed.
5. None of these changes will help.
3, 4.
#1 is not a great fit, because system calls are under control of the process (a process could access a page without making the system call), and the question asked for reliable tracking.
#2 is not a great fit because it counts per process, not per page.
QUESTION IO-16E. The `mmap` system call complicates LRU tracking for buffer-cache pages. Why? List all that apply.

1. `mmap` maps buffer-cache pages directly into a process’s address space.
2. Accessing memory in `mmap`ed regions does not normally invoke the kernel.
3. Accessing memory in `mmap`ed regions does not use a page table.
4. `mmap` starts with two of the letter `m`, causing LRU to become confused about which `m` was used least recently.
5. None of the above.
4pts. 1, 2
IO-17. Reference strings and hit rates
QUESTION IO-17A. Write a purely-sequential reference string containing at least five accesses.
3pts. 1 2 3 4 5
QUESTION IO-17B. What is the hit rate for this reference string? Tell us the eviction algorithm and number of slots you’ve chosen.
2pts. 0/5 (0%). This is the same for all algorithms and # slots, for reactive caches like the ones we used in class!
The next two questions concern this ten-element reference string:

1 2 1 2 3 4 1 5 1 1

We consider executing this reference string starting with different cache contents.
QUESTION IO-17C. A three-slot LRU cache processes this reference string and observes a 70% hit rate. What are the initial contents of the cache?
3pts. 1 2 3. In 3-slot LRU, whatever the initial contents of the cache are, the contents after “…4 1 5” will be “4 1 5”, and all of “4 1 5” will miss. (4 will evict 1 as LRU; 1 will evict 2; 5 will evict 3.) After which “1 1” will hit. So we know the miss rate is at least 3/10. The only way to get 3/10, rather than a higher hit rate, is to start with 1 2 3.
QUESTION IO-17D. A three-slot FIFO cache processes this reference string with initial contents 4 1 2 and observes a 60% hit rate. Which slot was next up for eviction when the reference string began?
2pts. The first.
- 1h 2h 1h 2h 3m (3 1 2) 4m (3 4 2) 1m (3 4 1) 5m (5 4 1) 1h 1h: 60% hit
- 1h 2h 1h 2h 3m (4 3 2) 4h 1m (4 3 1) 5m (5 3 1) 1h 1h: 70% hit
- 1h 2h 1h 2h 3m (4 1 3) 4h 1h 5m (5 1 3) 1h 1h: 80% hit
The eviction algorithms we saw in class are entirely reactive: they only insert a block when that block is referenced. This limits how well the cache can perform. A read cache can also be proactive by inserting blocks before they’re needed, possibly speeding up later accesses. This is the essence of prefetching.
In a proactive caching model, the cache can evict and load two or more blocks per access in the reference string. A prefetching policy decides which additional, non-accessed blocks to load.
QUESTION IO-17E. Describe an access pattern for which the following prefetching policy would be effective.
When accessing block A, also load block A+1.
3pts. Sequential
QUESTION IO-17F. Write a reference string and name an eviction policy for which this prefetching policy would be less effective (have a lower hit rate) than no prefetching at all.
2pts. Ex: Any two-slot cache that seesaws between accesses, e.g. 1 3 1 3 1 3 or 1 2 1 2 1 2.
IO-18. Coherence
QUESTION IO-18A. Which of the kinds of cache we discussed in class are typically coherent?
3pts. Processor cache and buffer cache
QUESTION IO-18B. Which of the kinds of cache we discussed in class are typically single-slot?
3pts. Stdio cache
Stdio-like caches are not coherent. The remaining questions concern potential mechanisms to make them coherent with respect to disk files.
Pedantic note. Sometimes a read-from-cache operation will occur concurrently with (at the same time as) a write to stable storage. The read operation counts as coherent whether or not it reflects the concurrent write, because logically the read and write occurred “at the same time” (neither is older).
QUESTION IO-18C. First, the new bool changed()
system call returns true if
and only if a write
was performed on some file in the last second.
Describe briefly how changed
could be used to make a stdio
cache coherent, or explain why it could not.
3pts. `changed()` is difficult to use, because for example what if:

- stdio loaded a read cache for file `f`,
- someone else writes `f`,
- 5 seconds go by?

Then `changed` will return false. To actually use `changed`, the user would need to check both `changed()` and track the time the read cache was loaded, or call `changed()` at least once a second.
QUESTION IO-18D. Second, the new `int open_with_timestamp(const char* filename, unsigned long* timestamp, ...)` system call is like `open`, except that every time a change is made to the underlying `filename`, the value in `*timestamp` is updated to the time, measured in milliseconds since last boot, of the last `write` operation on the file represented by file descriptor `fd`. Describe briefly how `open_with_timestamp` could be used to make a stdio cache coherent, or explain why it could not.
3pts. Easy to use.

- To open a file using stdio, use `open_with_timestamp`. The stdio file object contains space for the cache slot; space for the timestamp (which will be updated by the kernel); and space for a snapshot of the timestamp (which is updated when the cache is filled).
- When filling the cache, copy the current timestamp into the snapshot.
- Before reading from the filled cache, compare the current timestamp with the timestamp snapshot. Flush and refill if they differ.
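A sketch of that scheme in code. `open_with_timestamp` is the hypothetical system call from the question; the rest mirrors the pset 3 single-slot cache.

```cpp
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical system call from the question.
int open_with_timestamp(const char* filename, unsigned long* timestamp, int flags);

struct io61_file {
    int fd;
    unsigned char cbuf[BUFSIZ];
    off_t tag, end_tag, pos_tag;
    unsigned long timestamp;   // kept up to date by the kernel on every write
    unsigned long snapshot;    // value of `timestamp` when `cbuf` was filled
};

io61_file* io61_open_check(const char* filename) {
    io61_file* f = new io61_file;
    f->fd = open_with_timestamp(filename, &f->timestamp, O_RDONLY);
    f->tag = f->end_tag = f->pos_tag = 0;
    f->snapshot = f->timestamp;
    return f;
}

int io61_readc(io61_file* f) {
    if (f->timestamp != f->snapshot && f->pos_tag != f->end_tag) {
        // The file changed after this cache was filled: drop the unread
        // cached bytes and reposition the kernel offset to the read position.
        lseek(f->fd, f->pos_tag, SEEK_SET);
        f->tag = f->end_tag = f->pos_tag;
    }
    if (f->pos_tag == f->end_tag) {
        f->snapshot = f->timestamp;             // snapshot taken at fill time
        ssize_t n = read(f->fd, f->cbuf, BUFSIZ);
        if (n <= 0) {
            return EOF;
        }
        f->tag = f->pos_tag;
        f->end_tag = f->tag + n;
    }
    return f->cbuf[f->pos_tag++ - f->tag];
}
```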
Some students said neither `changed` nor `open_with_timestamp` would work because what if a write came in just after the timestamp was checked? That’s kind of fair, but the “pedantic note” says concurrent operations count as coherent.

Some students said that neither would work because newer data might be hanging out in some other stdio cache. But coherence requires that the cache be no older than the slower storage.
QUESTION IO-18E. Describe briefly how mmap
could be used to make a stdio
cache coherent, or explain why it could not.
3pts. The best answer is Yes, because the buffer cache is coherent. Some students argued No, because we don’t have system call atomicity guarantees for `mmap` regions. That’s full credit, although atomicity and coherence are different concepts.
IO-19. System calls
QUESTION IO-19A. The following system calls have just been made:
int fd = open("f.txt", O_WRONLY | O_CREAT | O_TRUNC);
ssize_t nw = write(fd, "CS121 is awesome!", 17); // returned 17
What series of system calls would ensure that, after all system calls complete, the file `f.txt` contains the text “CS 61 is terrible” (without the quotation marks)? Minimize the number of bytes written.
This solution minimizes the number of bytes written, to 10:
lseek(fd, 2, SEEK_SET);
write(fd, " 6", 2);
lseek(fd, 9, SEEK_SET);
write(fd, "terrible", 8);
QUESTION IO-19B. Which of the following file access patterns might have similar
output from the strace
utility? List all that apply or say “none.”
- Sequential byte writes using stdio
- Sequential byte writes using `mmap`
- Sequential byte writes using system calls
None. All those patterns are easy to distinguish; `mmap` will have no `write` system calls, stdio will write blocks, and only #3 will write bytes.
QUESTION IO-19C. Which of the following file access patterns might have similar
output from the strace
utility? List all that apply or say “none.”
- Sequential byte writes using stdio
- Sequential block writes using stdio
- Sequential byte writes using system calls
- Sequential block writes using system calls
1, 2, 4.
QUESTION IO-19D. Which of the following file access patterns might have similar
output from the strace
utility? List all that apply or say “none.”
- Reverse-sequential byte writes using stdio
- Reverse-sequential block writes using stdio
- Reverse-sequential byte writes using system calls
- Reverse-sequential block writes using system calls
2 and 4 might look similar if the block writes used the stdio block size. In that case, stdio will emit one `lseek` and one `write` per block write, resulting in a sequence like this:

lseek(4, 966656, SEEK_SET) = 966656
write(4, "A\nA's\nAMD\nAMD's\nAOL\nAOL's\nAachen"..., 4096) = 4096
lseek(4, 962560, SEEK_SET) = 962560
write(4, "\nAllyson's\nAlma\nAlma's\nAlmach\nAl"..., 4096) = 4096
That’s the same as we would expect from system calls.
Reverse-sequential byte writes using stdio will cause many more `lseek`s, one per byte, because stdio always emits one `lseek` system call per `fseek` library call. For instance, here’s how reverse-sequential byte reads look:

lseek(3, 970752, SEEK_SET) = 970752
read(3, "zigzags\nzilch\nzilch's\nzillion\nzi"..., 4096) = 826
lseek(3, 971578, SEEK_SET) = 971578
lseek(3, 971578, SEEK_SET) = 971578
lseek(3, 971578, SEEK_SET) = 971578
lseek(3, 971578, SEEK_SET) = 971578
...
The actual difference between #1 and #2/#4 will be even greater, because when doing reverse-sequential byte writes, stdio also calls `write` once per byte (the cache is effectively disabled). But you didn’t need to know that to answer the question.

lseek(4, 971577, SEEK_SET) = 971577
write(4, "A", 1) = 1
lseek(4, 971576, SEEK_SET) = 971576
write(4, "\n", 1) = 1
lseek(4, 971575, SEEK_SET) = 971575
write(4, "A", 1) = 1
lseek(4, 971574, SEEK_SET) = 971574
write(4, "'", 1) = 1
IO-20. Caches Big, Fast, and Cheap
QUESTION IO-20A. We discussed several kinds of computer storage in class, including:
- Disks and SSDs
- primary Memory
- processor Cache memory
- Registers
Put those storage technologies in order by latency, slowest first. (An answer might be “DMCR”.)
DMCR
QUESTION IO-20B. Put those storage technologies in order by cost per byte, cheapest first.
DMCR
QUESTION IO-20C. Put those storage technologies in order by capacity in bytes on a typical computer, smallest first.
RCMD
QUESTION IO-20D. Which storage technology acts as the typical slow memory (the “underlying storage”) for each of the following caches? If unsure, explain briefly.
- Buffer cache
- Stdio cache
- Processor cache
- D
- M
- M
QUESTION IO-20E. True or false? Given a cache with two or more blocks, implementing it as a fully associative cache would always produce as good or better hit rates than a direct-mapped implementation.
False. Consider a cache with 3 slots and the reference stream 1, 4, 2, 3, 5, 4. A fully-associative FIFO cache would have no hits, but a direct-mapped cache, which requires a reference A to go to slot A mod 3, would have 1 hit (the second reference to 4).
True gets credit only if answer specifies an optimal replacement policy.
QUESTION IO-20F. True or false? Prefetching bytes from a file on disk into the buffer cache can cause the buffer cache to become incoherent.
False
IO-21. Refrigeration
Your minifridge can hold up to 4 items. Your picky roommate insists that Apple juice goes in the door, Berries go in the fruits and vegetables drawer, Cheese goes in the dairy drawer, and Dill pickles go on the top shelf.
QUESTION IO-21A. Think of the minifridge as a cache for food bought at your corner grocery store. What kind of cache has your roommate forced you to implement?
Direct-mapped
You decide that you’ve had enough of your roommate and move. The new place’s full-size fridge can hold all your favorite foods at once (not just 4 items), but the new counter can only hold 4 items at a time. When 4 items are already on the counter, you must bring an item back to the fridge before getting a new item.
QUESTION IO-21B. To cook lunch, you must use the following food items in the
following order:
Apple juice, Berries, Cheese, Dill pickles, Eggs, Fish sticks,
A, B, E, D, A, C, D, E
(So the full reference recipe is ABCDEFABEDACDE.) Assuming a FIFO
(round-robin) eviction policy, what items are in the fridge after
the last cooking step?
BCDE is on the counter, so AF is in the fridge.
QUESTION IO-21C. What is the hit rate for FIFO on this reference recipe?
| | A | B | C | D | E | F | A | B | E | D | A | C | D | E |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ⓐ | | | | Ⓔ | | | | E | Ⓓ | | | D | |
| 2 | | Ⓑ | | | | Ⓕ | | | | | | Ⓒ | | |
| 3 | | | Ⓒ | | | | Ⓐ | | | | A | | | Ⓔ |
| 4 | | | | Ⓓ | | | | Ⓑ | | | | | | |

3/14
If you misread the reference recipe, using the last 8 items has a hit rate of 3/8.
QUESTION IO-21D. What is the hit rate for LRU on this reference recipe?
| | A | B | C | D | E | F | A | B | E | D | A | C | D | E |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ⓐ | | | | Ⓔ | | | | E | | | | | E |
| 2 | | Ⓑ | | | | Ⓕ | | | | Ⓓ | | | D | |
| 3 | | | Ⓒ | | | | Ⓐ | | | | A | | | |
| 4 | | | | Ⓓ | | | | Ⓑ | | | | Ⓒ | | |

4/14
If you misread the reference recipe, using the last 8 items has a hit rate of 3/8.
NOT A QUESTION. (0 points) Could this meal possibly be delicious?
QUESTION IO-21E. True or false for this cache? List all that apply.
- All recipes that use each ingredient exactly once will have the same hit rate under FIFO and LRU.
- All recipes that use each ingredient exactly twice will have the same hit rate under FIFO and LRU.
- All recipes that use fewer than 5 ingredients will have the same hit rate under FIFO and LRU.
- None of the above.
1 and 3.
1 is true, because if each ingredient is used exactly once, the counter cache will have 100% cold misses.
2 is not true. For example: ABCDAEBCDEFF.
FIFO (hit rate 6/12):

| | A | B | C | D | A | E | B | C | D | E | F | F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ⓐ | | | | A | Ⓔ | | | | E | | |
| 2 | | Ⓑ | | | | | B | | | | Ⓕ | F |
| 3 | | | Ⓒ | | | | | C | | | | |
| 4 | | | | Ⓓ | | | | | D | | | |

LRU (hit rate 3/12):

| | A | B | C | D | A | E | B | C | D | E | F | F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ⓐ | | | | A | | | | Ⓓ | | | |
| 2 | | Ⓑ | | | | Ⓔ | | | | E | | |
| 3 | | | Ⓒ | | | | Ⓑ | | | | Ⓕ | F |
| 4 | | | | Ⓓ | | | | Ⓒ | | | | |
3 is true, because all the used ingredients will fit into the cache once they’re loaded using cold misses.