Many exercises that seem less appropriate this year, or which cover topics that we haven’t covered in class, are marked with ⚠️. However, we may have missed some.
IO-1. I/O caching
Mary Ruefle, a poet who lives in Vermont, is working on her caching I/O library for CS 61. She wants to implement a cache with N slots. Since searching those slots might slow down her library, she writes a function that maps addresses to slots. Here’s some of her code.
#define SLOTSIZ 4096
struct io61_slot {
char buf[SLOTSIZ];
off_t pos; // = (off_t) -1 for empty slots
ssize_t sz;
};
#define NSLOTS 64
struct io61_file {
int fd;
off_t pos; // current file position
io61_slot slots[NSLOTS];
};
static inline int find_slot(off_t off) {
return off % NSLOTS;
}
int io61_readc(io61_file* f) {
int slotindex = find_slot(f->pos);
io61_slot* s = &f->slots[slotindex];
if (f->pos < s->pos || f->pos >= s->pos + s->sz) {
// slot contains wrong data, need to refill it
off_t new_pos = lseek(f->fd, f->pos, SEEK_SET);
assert(new_pos != (off_t) -1); // only handle seekable files for now
ssize_t r = read(f->fd, s->buf, SLOTSIZ);
if (r == -1 || r == 0) {
return EOF;
}
s->pos = f->pos;
s->sz = r;
}
int ch = (unsigned char) s->buf[f->pos - s->pos];
++f->pos;
return ch;
}
Before she can run and debug this code, Mary is led “to an emergency of feeling that … results in a poem.” She’ll return to CS61 and fix her implementation soon, but in the meantime, let’s answer some questions about it.
QUESTION IO-1A. True or false: Mary’s cache is a direct-mapped cache.
True
QUESTION IO-1B. What changes to Mary’s code could change your answer to Part A? Circle all that apply.
- New code for
find_slot
(keepingio61_readc
the same) - New code in
io61_readc
(keepingfind_slot
the same) - New code in
io61_readc
and new code forfind_slot
- None of the above
#2 and #3
QUESTION IO-1C. Which problems would occur when Mary’s code was used to sequentially read a seekable file of size 2MiB (2×220 = 2097152 bytes) one character at a time? Circle all that apply.
- Excessive CPU usage (>10x stdio)
- Many system calls to read data (>10x stdio)
- Incorrect data (byte x read at a position where the file has byte y≠x)
- Read too much data (more bytes read than file contains)
- Read too little data (fewer bytes read than file contains)
- Crash/undefined behavior
- None of the above
#2 only
QUESTION IO-1D. Which of these new implementations for find_slot
would fix at least one of these problems with reading sequential files?
Circle all that apply.
return (off * 2654435761) % NSLOTS; /* integer hash function from Stack Overflow */
return (off / SLOTSIZ) % NSLOTS;
return off & (NSLOTS - 1);
return 0;
return (off >> 12) & 0x3F;
- None of the above
#2, #4, #5
IO-2. Caches and reference strings
QUESTION IO-2A. True or false: A direct-mapped cache with N or more slots can handle any reference string containing ≤N distinct addresses with no misses except for cold misses.
False: direct-mapped caches can have conflict misses
QUESTION IO-2B. True or false: A fully-associative cache with N or more slots can handle any reference string containing ≤N distinct addresses with no misses except for cold misses.
True
Consider the following 5 reference strings.
Name | String |
---|---|
α | 1 |
β | 1, 2 |
γ | 1, 2, 3, 4, 5 |
δ | 2, 4 |
ε | 5, 2, 4, 2 |
QUESTION IO-2C. Which of the strings might indicate a sequential access pattern? Circle all that apply.
α |
β |
γ |
δ |
ε |
None of these |
(α), β, γ
QUESTION IO-2D. Which of the strings might indicate a strided access pattern with stride >1? Circle all that apply.
α |
β |
γ |
δ |
ε |
None of these |
(α), δ
One very clever person pointed out that β and γ could also represent large strides: for example, consider a file with 10 bytes accessed with stride 11!
The remaining questions concern concatenated permutations of these five strings. For example, the permutation αβγδε refers to this reference string:
1, 1, 2, 1, 2, 3, 4, 5, 2, 4, 5, 2, 4, 2. |
We pass such permutations through an initially-empty, fully-associative cache with 3 slots, and observe the numbers of hits.
QUESTION IO-2E. How many cold misses might a permutation observe? Circle all that apply.
0 |
1 |
2 |
3 |
4 |
5 |
Some other number |
5. The first time a reference string address is encountered, it must cause a cold miss.
Under LRU eviction, the permutation αβεγδ observes 5 hits as follows. (We annotate each access with “h” for hit or “m” for miss.)
1m; 1h, 2m; 5m, 2h, 4m, 2h; 1m, 2h, 3m, 4m, 5m; 2m, 4h. |
QUESTION IO-2F. How many hits does this permutation observe under FIFO eviction?
4 hits
QUESTION IO-2G. Give a permutation that will observe 8 hits under LRU eviction, which is the maximum for any permutation. There are several possible answers. (Write your answer as a permutation of αβγδε. For partial credit, find a permutation that has 7 hits, etc.)
The following four permutations observe 8 hits under LRU: αβγδε, αβγεδ, βαγδε, βαγεδ. 28 permutations observe 7 hits; 25 observe 6 hits; and 38 observe 5 hits.
QUESTION IO-2H. Give a permutation that will observe 2 hits under LRU eviction, which is the minimum for any permutation. There is one unique answer. (Write your answer as a permutation of αβγδε. For partial credit, find a permutation that has 3 hits, etc.)
δαεγβ. 4 permutations observe 3 hits and 20 observe 4 hits.
IO-3. Processor cache
The git version control system is based on commit hashes, which are 160-bit (20-byte) hash values used to identify commits. In this problem you’ll consider the processor cache behavior of several versions of a “grading server” that maps commits to grades. Here’s the first version:
struct commit_info {
char hash[20];
int grade[11];
};
commit_info* commits;
size_t N;
int get_grade1(const char* hash, int pset) {
for (size_t i = 0; i != N; ++i) {
if (memcmp(commits[i].hash, hash, 20) == 0) {
return commits[i].grade[pset];
}
}
return -1;
}
We will ask questions about the average number of cache lines accessed
by variants of get_grade(hash, pset)
. You should make the following
assumptions:
- The
hash
argument is uniformly drawn from the set of known commits. That is, the probability thathash
equals the ith commit’s hash is 1/N
. - Only count cache lines accessible via
commits
. Don’t worry about cache lines used for local variables, for parameters, for global variables, or for instructions. For instance, do not count thehash
argument or the global-data cache line that stores thecommits
variable itself. - The
commits
pointer is 64-byte aligned and cache lines are 64 bytes long. - Commit hashes are mathematically indistinguishable from random numbers. Thus, the probability that two different hashes have the same initial k bits equals 1/2k.
- We’ll ignore small errors; N/2 and (N+1)/2 will be considered equivalent.
QUESTION IO-3A. What is the expected number of cache lines accessed
by get_grade1
, in terms of N
?
The answer is \((N+1)/2\), which is roughly \(N/2\).
Formally, let \(C\) be a random variable representing the number of cache lines accessed. This is a discrete random variable: there are \(N\) possibilities \(C_1,\dots,C_N\), corresponding to the number of cache lines required to access commits \(1,\dots,N\). Then, given our assumptions, the expected value of \(C\) is
\[ E[C] = \sum_{i=1}^N Pr[\text{commit is $i$th commit}] \times C_i = \frac{1}{N}\sum_{i=1}^N C_i. \]
So far, this analysis holds for any
commit_info
structure. Forget_grade1
specifically, commits are stored in order, so accessing the \(i\)th commit accesses \(i\) cache lines, and \(C_i = i\):\[ E[C] = \frac{1}{N} \sum_{i=1}^N i = \frac{1}{N}\frac{N(N+1)}{2} = \frac{N+1}{2}. \]
The second version:
struct commit_info {
char hash[20];
int grade[11];
};
commit_info** commits;
size_t N;
int get_grade2(const char hash[20], int pset) {
for (size_t i = 0; i != N; ++i) {
if (memcmp(commits[i]->hash, hash, 20) == 0) {
return commits[i]->grade[pset];
}
}
return -1;
}
QUESTION IO-3B. What is the expected number of cache lines accessed
by get_grade2
, in terms of N
?
Roughly 9N/16.
We still examine ≈N/2 commit_info objects, but in addition we examine cache lines to evaluate the pointers to those objects. There are 8 such pointers per cache line (8×8=64), and we examine around N/2 pointers, for ≈N/8/2 = N/16 additional cache lines.
Formally \( C_i = \lceil i / 8 \rceil + i \), so
\[ E[C] = \frac{N+1}{2} + \frac{1}{N}\sum_{i=1}^N \lceil i/8 \rceil. \]
There’s no convenient closed form for the sum of the ceiling function, but we know that \( \lceil i/8 \rceil \leq (i + 7)/8 \), so
\[ E[C] \leq \frac{N+1}{2} + \frac{1}{N}\sum_{i=1}^N \frac{i+7}{8} = \frac{N+1}{2} + \frac{N+15}{16} = \frac{9N + 23}{16}. \]
The third version:
struct commit_info {
char hash[20];
int grade[11];
};
struct commit_hint {
char hint[8];
commit_info* commit;
};
commit_hint* commits;
size_t N;
int get_grade3(const char* hash, int pset) {
for (size_t i = 0; i != N; ++i) {
if (memcmp(commits[i].hint, hash, 8) == 0
&& memcmp(commits[i].commit->hash, hash, 20) == 0) {
return commits[i].commit->grade[pset];
}
}
return -1;
}
QUESTION IO-3C. What is the expected number of cache lines accessed
by get_grade3
, in terms of N
? (You may assume that N
≤2000.)
Roughly \( 1 + \lceil N / 8 \rceil \).
The assumption that N≤2000 means we’re exceedingly unlikely to encounter a hint collision (i.e. a commit with the same hint, but different commit value). That means that we will examine N/2
commit_hint
objects but only onecommit_info
object.commit_hint
objects are 16B big, so there are four of them per cache line, and we examine roughly N/4/2 = N/8 cache lines for hint objects.Formally \( C_i = \lceil i / 4 \rceil + 1 \), so
\[ \begin{aligned} E[C] &= \frac{1}{N} \sum_{i=1}^N \left( \left\lceil \frac{i}{4} \right\rceil + 1 \right) = 1 + \frac{1}{N} \sum_{i=1}^N \left\lceil \frac{i}{4} \right\rceil \\ &\leq 1 + \frac{1}{N} \sum_{i=1}^N \frac{i+3}{4} = 1 + \frac{N+7}{8}. \end{aligned} \]
The fourth version is a hash table.
struct commit_info {
char hash[20];
int grade[11];
};
commit_info** commits;
size_t commits_hashsize;
int get_grade4(const char* hash, int pset) {
// choose initial bucket
size_t bucket;
memcpy(&bucket, hash, sizeof(bucket));
bucket = bucket % commits_hashsize;
// search for the commit starting from that bucket
while (commits[bucket] != nullptr) {
if (memcmp(commits[bucket]->hash, hash, 20) == 0) {
return commits[bucket]->grade[pset];
}
bucket = (bucket + 1) % commits_hashsize;
}
return -1;
}
QUESTION IO-3D. Assume that a call to get_grade4
encounters B - 1
expected hash collisions (i.e., examines B
buckets total, including the
bucket that actually contains hash
). What is the expected number of cache
lines accessed by get_grade4
, in terms of N
and B
?
Roughly \( B + \lceil B/8 \rceil + 1 \); more roughly, \( 9B/8 \).
The
while
loop will execute \(B\) times total. Each time through, it will access one bucket and onecommit_info
object. A bucket is represented as a pointer to acommit_info
object, so there are eight such pointers per cache line. It’s difficult to express an exact formula for the number of cache lines accessed; for instance, if the initialbucket
index is 0, then \(B=8\) will access just one cache line for all buckets, but if the initialbucket
index is 7, then \(B=8\) will access two cache lines. We can see, though, that the number of bucket cache lines accessed is always \( \leq \lceil B/8 \rceil + 1\).Formally, \( C_i \leq B + \lceil B/8 \rceil + 1 \). Since this is independent of \(N\), \( E[C] \leq B + \lceil B/8 \rceil + 1 \approx 9B/8 \).
IO-4. IO caching and strace
Elif Batuman is investigating several program executables left behind by
her ex-roommate Fyodor. She runs each executable under strace
in the
following way:
strace -o strace.txt ./EXECUTABLE files/text1meg.txt > files/out.txt
Help her figure out properties of these programs based on their system call traces.
QUESTION IO-4A. Program ./mysterya
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x8193000
brk(0x81b5000) = 0x81b5000
read(3, "A", 1) = 1
write(1, "A", 1) = 1
read(3, "\n", 1) = 1
write(1, "\n", 1) = 1
read(3, "A", 1) = 1
write(1, "A", 1) = 1
read(3, "'", 1) = 1
write(1, "'", 1) = 1
read(3, "s", 1) = 1
write(1, "s", 1) = 1
...
Circle at least one option in each column.
|
|
|
|
1, Sequential IO; a, No read cache; i, No write cache; D, Other
QUESTION IO-4B. Program ./mysteryb
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x96c5000
brk(0x96e6000) = 0x96e6000
read(3, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 2048) = 2048
write(1, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 2048) = 2048
read(3, "kad\nAkron\nAkron's\nAl\nAl's\nAla\nAl"..., 2048) = 2048
write(1, "kad\nAkron\nAkron's\nAl\nAl's\nAla\nAl"..., 2048) = 2048
...
Circle at least one option in each column.
|
|
|
|
1, Sequential IO; b/c, (Un)aligned read cache; ii, Write cache; B, Cache size 2048
QUESTION IO-4C. Program ./mysteryc
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x9064000
brk(0x9085000) = 0x9085000
fstat64(3, {st_mode=S_IFREG|0664, st_size=1048576, ...}) = 0
lseek(3, 1046528, SEEK_SET) = 1046528
read(3, "ingau\nRheingau's\nRhenish\nRhianno"..., 2048) = 2048
write(1, "oR\ntlevesooR\ns'yenooR\nyenooR\ns't"..., 2048) = 2048
lseek(3, 1044480, SEEK_SET) = 1044480
read(3, "Quinton\nQuinton's\nQuirinal\nQuisl"..., 2048) = 2048
write(1, "ehR\neehR\naehR\ns'hR\nhR\nsdlonyeR\ns"..., 2048) = 2048
lseek(3, 1042432, SEEK_SET) = 1042432
read(3, "emyslid's\nPrensa\nPrensa's\nPrenti"..., 2048) = 2048
write(1, "\ns'nailitniuQ\nnailitniuQ\nnniuQ\ns"..., 2048) = 2048
lseek(3, 1040384, SEEK_SET) = 1040384
read(3, "Pindar's\nPinkerton\nPinocchio\nPin"..., 2048) = 2048
write(1, "rP\ndilsymerP\ns'regnimerP\nregnime"..., 2048) = 2048
...
Circle at least one option in each column.
|
|
|
|
2, Reverse sequential IO; c, Aligned read cache; ii, Write cache; B, Cache size 2048
QUESTION IO-4D. Program ./mysteryd
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x9a0e000
brk(0x9a2f000) = 0x9a2f000
fstat64(3, {st_mode=S_IFREG|0664, st_size=1048576, ...}) = 0
lseek(3, 1048575, SEEK_SET) = 1048575
read(3, "o", 2048) = 1
lseek(3, 1048574, SEEK_SET) = 1048574
read(3, "Ro", 2048) = 2
lseek(3, 1048573, SEEK_SET) = 1048573
read(3, "\nRo", 2048) = 3
...
lseek(3, 1046528, SEEK_SET) = 1046528
read(3, "ingau\nRheingau's\nRhenish\nRhianno"..., 2048) = 2048
write(1, "oR\ntlevesooR\ns'yenooR\nyenooR\ns't"..., 2048) = 2048
lseek(3, 1046527, SEEK_SET) = 1046527
read(3, "eingau\nRheingau's\nRhenish\nRhiann"..., 2048) = 2048
lseek(3, 1046526, SEEK_SET) = 1046526
read(3, "heingau\nRheingau's\nRhenish\nRhian"..., 2048) = 2048
...
Circle at least one option in each column.
|
|
|
|
2, Reverse sequential IO; b, Unaligned read cache; ii, Write cache; B, Cache size 2048
QUESTION IO-4E. Program ./mysterye
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x93e5000
brk(0x9407000) = 0x9407000
read(3, "A", 1) = 1
read(3, "\n", 1) = 1
read(3, "A", 1) = 1
...
read(3, "A", 1) = 1
read(3, "l", 1) = 1
write(1, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 1024) = 1024
read(3, "t", 1) = 1
read(3, "o", 1) = 1
read(3, "n", 1) = 1
...
Circle at least one option in each column.
|
|
|
|
1, Sequential IO; a, No read cache; ii, Write cache; C, (write) cache size 1024
QUESTION IO-4F. Program ./mysteryf
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x9281000
brk(0x92a3000) = 0x92a3000
read(3, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 4096) = 4096
write(1, "A", 1) = 1
write(1, "\n", 1) = 1
write(1, "A", 1) = 1
...
write(1, "A", 1) = 1
write(1, "l", 1) = 1
read(3, "ton's\nAludra\nAludra's\nAlva\nAlvar"..., 4096) = 4096
write(1, "t", 1) = 1
write(1, "o", 1) = 1
write(1, "n", 1) = 1
...
Circle at least one option in each column.
|
|
|
|
1, Sequential IO; b/c, (un)aligned read cache; i, No write cache; A, Cache size 4096
IO-5. Processor cache
The following questions use the following C definition for an N
xM
matrix (the matrix has N
rows and M
columns).
struct matrix {
unsigned N;
unsigned M;
double elt[0];
};
matrix* matrix_create(unsigned N, unsigned M) {
matrix* m = (matrix*) malloc(sizeof(matrix) + N * M * sizeof(double));
m->N = N;
m->M = M;
for (size_t i = 0; i < N * M; ++i) {
m->elt[i] = 0.0;
}
return m;
}
Typically, matrix data is stored in row-major order: element
mij (at row i and column
j) is stored in m->elt[i*m->M + j]
. We might write this in C
using an inline function:
inline double* melt1(matrix* m, unsigned i, unsigned j) {
return &m->elt[i * m->M + j];
}
But that’s not the only possible method to store matrix data. Here are several more.
inline double* melt2(matrix* m, unsigned i, unsigned j) {
return &m->elt[i + j * m->N];
}
inline double* melt3(matrix* m, unsigned i, unsigned j) {
return &m->elt[i + ((m->N - i + j) % m->M) * m->N];
}
inline double* melt4(matrix* m, unsigned i, unsigned j) {
return &m->elt[i + ((i + j) % m->M) * m->N];
}
inline double* melt5(matrix* m, unsigned i, unsigned j) {
assert(m->M % 8 == 0);
unsigned k = (i/8) * (m->M/8) + (j/8);
return &m->elt[k*64 + (i % 8) * 8 + j % 8];
}
QUESTION IO-5A. Which method (of melt1
–melt5
) will have the best
processor cache behavior if most matrix accesses use loops like this?
for (unsigned j = 0; j < 100; ++j) {
for (unsigned i = 0; i < 100; ++i) {
f(*melt(m, i, j));
}
}
melt2
QUESTION IO-5B. Which method will have the best processor cache behavior if most matrix accesses use loops like this?
for (unsigned i = 0; i < 100; ++i) {
f(*melt(m, i, i));
}
melt3
QUESTION IO-5C. Which method will have the best processor cache behavior if most matrix accesses use loops like this?
for (unsigned i = 0; i < 100; ++i) {
for (unsigned j = 0; j < 100; ++j) {
f(*melt(m, i, j));
}
}
melt1
(butmelt5
is almost as good)
QUESTION IO-5D. Which method will have the best processor cache behavior if most matrix accesses use loops like this?
for (int di = -3; di <= 3; ++di) {
for (int dj = -3; dj <= 3; ++dj) {
f(*melt(m, I + di, J + dj));
}
}
melt5
QUESTION IO-5E. Here is a matrix-multiply function in ikj order.
matrix* matrix_multiply(matrix* a, matrix* b) {
assert(a->M == b->N);
matrix* c = matrix_create(a->N, b->M);
for (unsigned i = 0; i != a->N; ++i) {
for (unsigned k = 0; k != a->M; ++k) {
for (unsigned j = 0; j != b->M; ++j) {
*melt(c, i, j) += *melt(a, i, k) * *melt(b, k, j);
}
}
}
}
This loop order is cache-optimal when data is stored in melt1
order.
What loop order is cache-optimal for melt2
?
jki is best; kji is a close second.
QUESTION IO-5F. You notice that accessing a matrix element using
melt1
is very slow. After some debugging, it seems like the processor
on which you are running code has a very slow multiply instruction.
Briefly describe a change to struct matrix
that would let you write a
version of melt1
with no multiply instruction. You may add members,
change sizes, or anything you like.
Example answers:
- Add a
double** rows
member that points to each row so you don't need to multiply- Round M up to a power of 2 and use shifts
IO-6. Caching
Assume that we have a cache that holds four slots. Assume that each letter below indicates an access to a block. Answer the following questions as they pertain to the following sequence of accesses.
E D C B A E D A A A B C D E
QUESTION IO-6A. What is the hit rate assuming an LRU replacement policy?
E D C B A E D A A A B C D E 1 Ⓔ Ⓐ A A A Ⓔ 2 Ⓓ Ⓔ Ⓒ 3 Ⓒ Ⓓ D 4 Ⓑ B The hit rate is 5/14.
QUESTION IO-6B. What pages will you have in the cache at the end of the run?
B C D E
QUESTION IO-6C. What is the best possible hit rate attainable if you could see into the future?
With Bélády’s algorithm we get:
E D C B A E D A A A B C D E 1 Ⓔ E E 2 Ⓓ D D 3 Ⓒ Ⓐ A A A Ⓒ 4 Ⓑ B So 8/14 (4/7).
IO-7. Caching
Intel and CrossPoint have announced a new persistent memory technology with performance approaching that of DRAM. Your job is to calculate some performance metrics to help system architectects decide how to best incorporate this new technology into their platform.
Let's say that it takes 64ns to access one (32-bit) word of main memory (DRAM) and 256ns to access one (32-bit) word of this new persistent memory, which we'll call NVM (non-volatile memory). The block size of the NVM is 256 bytes. The NVM designers are quite smart and although it takes a long time to access the first byte, when you are accessing NVM sequentially, the devices perform read ahead and stream data efficiently -- at 32 GB/second, which is identical to the bandwidth of DRAM.
QUESTION IO-7A. Let's say that we are performing random accesses of 32 bits (on a 32-bit processor). What fraction of the accesses must be to main memory (as opposed to NVM) to achieve performance within 10% of DRAM?
Let X be the fraction of accesses to DRAM: access time = 64X + 256(1-X). We want that to be <= 1.1*64 (within 10% of DRAM). So, 1.1*64 = 70.4. So, let's solve for: 64X + 256(1-X) = 70.4.
64X + 256 - 256X = 70.4. (256X - 64X) = 256 - 70.4 192X = 186 X = 186/192 about .97
So, we need a hit rate in main memory of 97%
QUESTION IO-7B. Let's say that they write every byte of a 256 block in units of 32 bits. How much faster will write-back cache perform relative to a write-through cache? (An approximate order of magnitude will be sufficient; showing work can earn partial credit.)
Write-through is going to cost 256ns/4 byte write = 256 * 64 = 28*26 = 2^14 = 16 K ns which is roughly 16 microseconds. If we assume a write-back, then it will take us 64 * 64ns to write into the DRAM, but then we get to stream the data from DRAM into the NVM at a rate of 32 GB/sec. So, 64*64 ns = 2^12 ns = 4 microseconds to write into DRAM. Let's convert 32 GB/second ito KB -- that's about 32 KB/microsecond. We need 1/4 of 1 KB which is 1/128 of a microsecond, which is about 8 ns. So, it's really really really fast to stream the data -- once you know that, then you also realize that the real difference is just the relative cost of writing to DRAM versus the cost of writing to NVM. So, the writeback cache is almost 4 times faster than the write through cache. You can get full credit by saying something like: the time to stream the data out of the DRAM into the NVM at the sequential speed is tiny relative to the time to write even a single word to DRAM, so the ultimate difference is the difference in writing to DRAM relative to NVM which is a ratio of 4:1. So, the writeback cache is about 4 times faster (because it is running at almost the full DRAM speed).
QUESTION IO-7C. Why might you not want to use a write-back cache?
A write-through cache will have very different persistence guarantees. If you need every 4- byte write to be persistent, then you have no choice but to implement a write-through cache.
IO-8. Reference strings
The following questions concern the FIFO (First In First Out), LRU (Least Recently Used), and LFU (Least Frequently Used) cache eviction policies.
Your answers should refer to seven-item reference strings made up of digits in the range 0–9. An example answer might be “1231231”. In each case, the reference string is processed by a 3-slot cache that’s initially empty.
QUESTION IO-8A. Give a reference string that has a 1/7 hit rate in all three policies.
1123456
QUESTION IO-8B. Give a reference string that has a 6/7 hit rate in all three policies.
1111111
QUESTION IO-8C. Give a reference string that has different hit rates under LRU and LFU policies, and compute the hit rates.
String:
LRU hit rate:
LFU hit rate:
String: 1234111
LRU hit rate: 2/7
LFU hit rate: 3/7
QUESTION IO-8D. Give a reference string that has different hit rates under FIFO and LRU policies, and compute the hit rates.
String:
FIFO hit rate:
LRU hit rate:
String: 1231411
FIFO hit rate: 2/7
LRU hit rate: 3/7
QUESTION IO-8E. Now let's assume that you know a reference string in advance. Given a 3-slot cache and the following reference string, what caching algorithm discussed in class and/or exercises would produce the best hit rate, and would would that hit rate be?
“12341425321521”
Bélády’s optimal algorithm (ACCENTS REQUIRED FOR FULL CREDIT!)(!*#^‡°
1m 2m 3m 4m [124] 1h 4h 2h 5m [125] 3m [123] 2h 1h 5m [125] 2h 1h
7/14 = 1/2
IO-9. Caching: Access times and hit rates
Recall that x86-64 instructions can access memory in units of 1, 2, 4, or 8 bytes at a time. Assume we are running on an x86-64-like machine with 1024-byte cache lines. Our machine takes 32ns to access a unit if the cache hits, regardless of unit size. If the cache misses, an additional 8160ns are required to load the cache, for a total of 8192ns.
QUESTION IO-9A. What is the average access time per access to access all the data in a cache line as an array of 256 integers, starting from an empty cache?
(8192ns * 1 + 32ns * 255)/256 (= 63.875)
QUESTION IO-9B. What unit size (1, 2, 4, or 8) minimizes the access time to access all data in a cache line, starting from an empty cache?
8
QUESTION IO-9C. What unit size (1, 2, 4, or 8) maximizes the hit rate to access all data in a cache line, starting from an empty cache?
1
IO-10. Single-slot cache code
Donald Duck is working on a single-slot cache for reading. He’s using
the pos_tag
/end_tag
representation, which
is:
struct io61_file {
int fd;
unsigned char cbuf[BUFSIZ];
off_t tag; // file offset of first character in cache (same as before)
off_t end_tag; // file offset one past last valid char in cache; end_tag - tag == old `csz`
off_t pos_tag; // file offset of next char to read in cache; pos_tag - tag == old `cpos`
};
Here’s our solution code; in case you want to scribble, the code is copied in the appendix.
1. ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
2. size_t pos = 0;
3. while (pos != sz) {
4. if (f->pos_tag < f->end_tag) {
5. ssize_t n = sz - pos;
6. if (n > f->end_tag - f->pos_tag)
7. n = f->end_tag - f->pos_tag;
8. memcpy(&buf[pos], &f->cbuf[f->pos_tag - f->tag], n);
9. f->pos_tag += n;
10. pos += n;
11. } else {
12. f->tag = f->end_tag;
13. ssize_t n = read(f->fd, f->cbuf, BUFSIZ);
14. if (n > 0)
15. f->end_tag += n;
16. else
17. return pos ? pos : n;
18. }
19. }
20. return pos;
21. }
Donald has ideas for “simplifying” this code. Specifically, he wants to try each of the following independently:
- Replacing line 4 with “
if (f->pos_tag <= f->end_tag) {
”. - Removing lines 6–7.
- Removing line 9.
- Removing lines 16–17.
QUESTION IO-10A. Which simplifications could lead to undefined behavior? List all that apply or say “none.”
B (removing 6–7): you read beyond the cache buffer.
QUESTION IO-10B. Which simplifications could cause io61_read
to
loop forever without causing undefined behavior? List all that apply or
say “none.”
A (replacing 4): you spin forever after exhausting the cache
D (removing 16–17): you spin forever if the file runs out of data or has a persistent error
QUESTION IO-10C. Which simplifications could lead to io61_read
returning incorrect data in buf
, meaning that the data read by a
series of io61_read
calls won’t equal the data in the file? List
all that apply or say “none.”
B (removing 6–7): you read garbage beyond the cache buffer
C (removing 9): you read the same data over & over again.
QUESTION IO-10D. Chastened, Donald decides to optimize the code for
a specific situation, namely when io61_read
is called with a sz
that is larger than BUFSIZ
. He wants to add code after line 11, like
so, so that fewer read
system calls will happen for large sz
:
11. } else if (sz - pos > BUFSIZ) {
// DONALD’S CODE HERE
11A. } else {
12. f->tag = f->end_tag;
....
Finish Donald’s code. Your code should maintain the relevant invariants
between tag
, pos_tag
, end_tag
, and the file position, but
you need not keep tag
aligned.
ssize_t n = read(f->fd, &buf[pos], sz - pos); if (n > 0) { f->tag = f->pos_tag = f->end_tag = f->end_tag + n; pos += n; } else { return pos ? pos : n; }
IO-11. Caching
QUESTION IO-11A. If it takes 200ns to access main memory, which of the following two caches will produce a lower average access time?
- A cache with a 10ns access time that produces a 90% hit rate
- A cache with a 20ns access time that produces a 98% hit rate
Let's compute average access time for each case:
.9 * 10 + .1 * 200 = 9 + 20 = 29 .98 * 20 + .02 * 200 = 19.6 + 4 = 23.6
The 20ns cache produces a lower average access time.
QUESTION IO-11B. Let’s say that you have a direct-mapped cache with
four slots. A page with page number N
must reside in the slot numbered
N % 4
. What is the best hit rate this could achieve given the
following sequence of page accesses?
3 6 7 5 3 2 1 1 1 8
Since it's direct mapped, each item can go in only one slot, so if we list the slots for each access, we get:
3 2 3 1 3 2 1 1 1 0
The only hits are the 2 1's, so your hit rate is 2/10 or 20% or .2.
QUESTION IO-11C. What is the best hit rate a fully-associative four-slot cache could achieve for that sequence of page accesses? (A fully-associative cache may put any page in any slot. You may assume you know the full reference stream in advance.)
Now we can get hits for 3, 1, and 1, so our hit rate goes to 3/10 or 30%.
QUESTION IO-11D. What hit rate would the fully-associative four-slot cache achieve if it used the LRU eviction policy?
Still 30% (3/10, .3)
IO-12. I/O traces
QUESTION IO-12A. Which of the following programs cannot be distinguished
by the output of the strace
utility, not considering open
calls? List all
that apply; if multiple indistinguishable groups exist (e.g., A, B, & C can’t
be distinguished, and D & E can’t be distinguished, but the groups can be
distinguished from each other), list them all.
- Sequential byte writes using stdio
- Sequential byte writes using system calls
- Sequential byte writes using system calls and
O_SYNC
- Sequential block writes using stdio and block size 2
- Sequential block writes using system calls and block size 2
- Sequential block writes using system calls and
O_SYNC
and block size 2 - Sequential block writes using stdio and block size 4096
- Sequential block writes using system calls and block size 4096
- Sequential block writes using system calls and
O_SYNC
and block size 4096
1, 4, 7, 8 can’t be distinguished.
If you consider
open
, then the O_SYNC cases and the others become distinguishable. Assuming that, 2&3 can’t be distinguished, 4&5 can’t be distinguished, and 1,4,7,8,9 can’t be distinguished.
QUESTION IO-12B. Which of the programs in Part A cannot be distinguished
using blktrace
output? List all that apply.
1, 2, 4, 5, 7, 8, and possibly 9 can’t be distinguished.
QUESTION IO-12C. The buffer cache is coherent. Which of the following operating system changes could make the buffer cache incoherent? List all that apply.
- Application programs can obtain direct read access to the buffer cache
- Application programs can obtain direct write access to the disk, bypassing the buffer cache
- Other computers can communicate with the disk independently
- The computer has a uninterruptible power supply (UPS), ensuring that the operating system can write the contents of the buffer cache to disk if main power is lost
#2, #3
QUESTION IO-12D. The stdio cache is incoherent. Which of the operating system changes from Part C could make the stdio cache coherent? List all that apply.
#1
IO-13. Reference strings and eviction
QUESTION IO-13A. When demonstrating cache eviction in class, we modeled a completely reactive cache, meaning that the cache performed at most one load from slow storage per access. Name a class of reference string that will have a 0% hit rate on any cold reactive cache. For partial credit, give several examples of such reference strings.
Sequential access
QUESTION IO-13B. What cache optimization can be used to improve the hit rate for the class of reference string in Part A? One word is enough; put the best choice.
Prefetching. Batching is an OK answer, but mostly because it involves prefetching when done for reads.
QUESTION IO-13C. Give a single reference string with the following properties:
- There exists a cache size and eviction policy that gives a 70% hit rate for the string.
- There exists a cache size and eviction policy that gives a 0% hit rate for the string.
Example: 1231231231. 70% on any 3-slot cache; 0% on a 1-slot cache.
QUESTION IO-13D. Put the following eviction algorithms in order of how much space they require for per-slot metadata, starting with the least space and ending with the most space. (Assume the slot order is fixed, so once a block is loaded into slot i, it stays in slot i until it is evicted.) For partial credit say what you think the metadata would be.
- FIFO
- LRU
- Random
Random, then FIFO, then LRU. Random needs no additional metadata; FIFO can deal with a single integer for the whole cache, pointing to the next index to use; LRU nees a least-recently-used time.
IO-14. Cache code
Several famous musicians have just started working on CS61 Problem Set
- They share the following code for their read-only, sequential, single-slot cache:
struct io61_file {
int fd;
unsigned char buf[4096];
size_t pos; // position of next character to read in `buf`
size_t sz; // number of valid characters in `buf`
};
int io61_readc(io61_file* f) {
if (f->pos >= f->sz) {
f->pos = f->sz = 0;
ssize_t nr = read(f->fd, f->buf, sizeof(f->buf));
if (nr <= 0) {
f->sz = 0;
return -1;
} else {
f->sz = nr;
}
}
int ch = f->buf[f->pos];
++f->pos;
return ch;
}
But they have different io61_read
implementations. Donald
(Lambert)’s is:
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
return read(f->fd, buf, sz);
}
Solange (Knowles)’s is:
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
for (size_t pos = 0; pos < sz; ++pos, ++buf) {
*buf = io61_readc(f);
}
return sz;
}
Caroline (Shaw)’s is:
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
if (f->pos >= f->sz) {
return read(f->fd, buf, sz);
} else {
int ch = io61_readc(f);
if (ch < 0) {
return 0;
}
*buf = ch;
return io61_read(f, buf + 1, sz - 1) + 1;
}
}
You are testing each of these musicians’ codes by executing a sequence
of io61_readc
and/or io61_read
calls on an input file and
printing the resulting characters to standard output. There are no
seeks, and your test programs print until end of file, so your tests’
output should equal the input file’s contents.
You should assume for these questions that no read
system call
ever returns -1.
QUESTION IO-14A. Describe an access pattern—that is, a sequence of
io61_readc
and/or io61_read
calls (with lengths)—for which
Donald’s code can return incorrect data.
io61_readc
,io61_read(1)
, …: Any alternation between readc and read is a disaster.
QUESTION IO-14B. Which of these musicians’ codes can generate an output file with incorrect length?
For the remaining parts, assume the problem in Part B has been corrected, so that all musicians’ codes generate output files with correct lengths.
Solange’s code never returns end-of-file; she mistakes EOF for a valid return value. If a program using Donald’s code calls
io61_readc
once and then switches toio61_read
, then they will read too few bytes (bytes in the readc buffer won’t be returned).
QUESTION IO-14C. Give an access pattern for which Solange’s code will return correct data and outperform Donald’s, or vice versa, and say whose code will win.
Solange’s code will outperform Donald’s when
io61_read
is called with small sizes. Donald’s code will outperform Solange’s whenio61_read
is called with large sizes.
QUESTION IO-14D. Suggest a small change (≤10 characters) to Caroline’s code that would, most likely, make it perform at least as well as both Solange’s and Donald’s codes on all access patterns. Explain briefly.
I would change Caroline’s test from
if (f->pos >= f->sz)
toif (f->pos >= f->sz && sz >= 1024)
(or 4096, or something similar). This uses the cache when read is called with small sizes, but avoids the extra copy when read is called with large sizes.
IO-15. Caches
Parts A–C concern different implementations of Pset 3’s stdio cache. Assume a program that reads a 32768-byte file a character at a time, like this:
while (io61_readc(inf) != EOF) {
}
This program will call io61_readc
32769 times. (32769 =
215 + 1 = 8×212 + 1; the +1 accounts for the EOF
return.) But the cache implementation might make many fewer system
calls.
QUESTION IO-15A. How many read
system calls are required assuming a
single-slot, 4096-byte io61 cache?
9
QUESTION IO-15B. How many read
system calls are required assuming an
eight-slot, 4096-byte io61 cache?
9
QUESTION IO-15C. How many mmap
system calls are required assuming an
mmap
-based io61 cache?
1
Parts D–F concern cache implementations and styles. We discussed many caches in class, including:
- The buffer cache
- The processor cache
- Single-slot aligned stdio caches
- Single-slot unaligned stdio caches
- Circular bounded buffers
QUESTION IO-15D. Which of those caches are implemented entirely in hardware? List all that apply.
B
QUESTION IO-15E. Which of those software caches could help speed up reverse sequential access to a disk file? List all that apply.
A, C
QUESTION IO-15F. Which of those software caches could help speed up access to a pipe or network socket? List all that apply.
C, D, E