2017/Section7
Contents
Midterm
We’ve heard a lot of thoughts and despair on the midterm, and want to share with you how we think about exams in this class.
 Our exams are difficult by design—but this midterm turned out more difficult.
 We want our exams to test knowledge at many levels, so we ask hard questions and assign partial credit to answers that demonstrate some understanding. We expect medium to low averages and empty answers.
 We understand that it can feel bad when time runs out with some problems incomplete, and have heard from many students who felt like their exam performance showed they didn’t understand the class material. (Some of these students are doing well above median!) That’s not how we feel. We feel like many of your partial answers showed good understanding of course concepts.
 There are several kinds of work in the class, and we feel that each kind of work shows some students at their best. The problem sets are deeply important to us. The exams are also important, but there is more time pressure, which is harmful for some students.
 So what can you do? No matter how hard an exam is, it is finite, and this one is over. It is entirely possible to bomb an exam and complete the course with an Arange or Brange grade. Concentrate your effort on what suits your own learning style, which might be problem sets, extra credit, course participation (questions in class or section; Piazza questions and answers).
 The final will not be this bad.
Processor cache performance
As you have seen already, accessing values that are stored in the processor cache is much faster than accessing values in memory. In the following examples (brought to you by CS165), assume that a
is a large array with n
elements.
Between (1) and (2), which is faster?
(1):
for (int i = 0; i < n; i++) { min = a[i] < min ? a[i] : min; } for (int i = 0; i < n ; i++) { max = a[i] > max ? a[i] : max; }
(2):
for (int i = 0; i < n; i++) { min = a[i] < min ? a[i] : min; max = a[i] > max ? a[i] : max; }
Now let's also use an array b
, which has the same number of elements as a
. Between (3) and (4), which is faster? Are there any drawbacks to using the solution that better utilizes the processor cache?
(3):
for (int i = 0; i < n; i++) { min = a[i] < min ? a[i] : min; } for(int i = 0; i < n; i++) { max = b[i] > max ? b[i] : max; }
(4):
for (int i = 0; i < n; i++) { min = a[i] < min ? a[i] : min; max = b[i] > max ? b[i] : max; }
Data structure cache performance
It is important to use data structures that will take full advantage of the cache. Let’s look at some popular data structures and analyze their performance with a processor cache.
A “keyvalue” data structure, or “KV store”, maps keys to values. Many keyvalue structures support these operations:

get(key) > optional value
: Returns the value associated withkey
, if one exists. 
put(key, value)
: Modifies the structure to associatevalue
withkey
.
Often a keyvalue data structure uses strings for keys and values, but there are other possibilities. For example, you can imagine one that maps ints to ints.
Consider the following intbased keyvalue data structures. In each case, we have written the get
function for you.
(1) A 1D array:
#define MAXKEY … struct kv1 { int v[MAXKEY]; }; int get(struct kv1* kv, int key) { if (key >= 0 && key < MAXKEY) { return kv>v[key]; } else { return NONEXISTENT_VALUE; } }
(2) A 2D array stored in “rowmajor” order (this data structure uses two ints for its key):
#define NROWS … #define NCOLS … struct kv2 { int v[NROWS * NCOLS]; }; int get(struct kv2* kv, int key_row, int key_col) { if (key_row >= 0 && key_row < NROWS && key_col >= 0 && key_col < NCOLS) { return kv>v[key_row * NCOLS + key_col]; } else { return NONEXISTENT_VALUE; } }
(3) A singlylinked list:
struct kv3 { int k; int v; struct kv3* next; }; int get(struct kv3* kv, int key) { for (; kv; kv = kv>next) { if (kv>k == key) { return kv>v; } } return NONEXISTENT_VALUE; }
(4) A chained hash table: In a chained hash table every bucket contains a linked list of all the elements which hash to that bucket. If you try to add an element to a bucket, then you just add it to the linked list in that bucket.
typedef struct kv4_node { int k; int v; struct kv4_node* next; } kv4_node; struct kv4 { kv4_node* buckets[NBUCKETS]; }; unsigned hash(int key) { /* return hash code */ } int get(struct kv4* kv, int key) { int h = hash(key) % NBUCKETS; for (kv4_node* n = kv>buckets[h]; n; n = n>next) { if (n>k == key) { return n>v; } } return NONEXISTENT_VALUE; }
(5) An opencoded hash table: In an opencoded hash table, every bucket only holds one value. If a collision happens, then you just find another bucket that doesn't yet have a value in it. In this case we are using linear probing. This means that when a collision happens, then we look at the next bucket and see if that one is free. If it is, then we will store the value there and if it is not, then we will repeat this process and look at the next one.
#define INVALID_KEY INT_MIN typedef struct kv5_slot { int k; int v; } kv5_slot; struct kv5 { kv5_slot buckets[NBUCKETS]; }; unsigned hash(int key) { /* return hash code */ } int get(struct kv5* kv, int key) { unsigned h = hash(key) % NBUCKETS; // assume at least 1 bucket is empty while (1) { if (kv>buckets[h].k == key) { return kv>buckets[k].v; } else if (kv>buckets[h].k == INVALID_KEY) { return NONEXISTENT_VALUE; } else { h = (h + 1) % NBUCKETS; } } }
(6) A binary tree:
struct kv6 { int k; int v; struct kv6* left_child; struct kv6* right_child; }; int get(struct kv6* kv, int key) { while (kv) { if (kv>k == key) { return kv>v; } else if (kv>k < key) { kv = kv>left_child; } else { kv = kv>right_child; } } return NONEXISTENT_VALUE; }
Draw a picture of each of these structures so you understand their design.
Now assume that each keyvalue structure holds N elements, where N is much larger than the processor cache; and imagine a program that repeatedly calls get
on a structure. How many processor cache misses will get
cause? Specifically, for each structure:
 How many cache misses should be expected for the first call to
get
?  How many cache misses per
get
should be expected ifget
is called many times with random keys?  How many cache misses per
get
should be expected ifget
is called many times with sequential keys?  How many cache misses per
get
should be expected ifget
is called many times with the same key?  (#2 only) How many cache misses per
get
should be expected ifget
is called for every element in a row? For every element in a column?
Now give examples of workloads that would be best for each keyvalue structure.
Matrix multiplication
Pull the latest version of the sections repository by entering in the terminal
git pull
The cs61section/s07
directory contains a program called matrixmultiply
(source in matrixmultiply.c
). This program creates two random square matrices, multiplies them together, prints some information about their product, then exits.
Your job is to transform the matrix_multiply
function so that it runs as fast as possible, while producing the same results as the original program (which you can run as ./basematrixmultiply
) for all arguments.
Matrix multiplication background
A matrix is a 2D array of numbers. An mxn matrix has m rows and n columns; the value in row i and column j is written m_{ij}, or, in more Clike notation, m[i, j] or m[i][j].
A square matrix has the same number of rows and columns: m = n.
The product of two matrices a and b, written a×b or a*b, is calculated by the following rules.
 a must have the same number of columns as b has rows.
 If a has dimensions mxn, and b has dimensions nxp, then a×b has dimensions mxp: it has a’s row count and b’s column count.
 Let c = a×b. Then we compute c[i,j] as:
c_{ij} = ∑_{0≤k<n} a_{ik} * b_{kj}.
Some pictures might help. Here’s how the dimensions work:
And here’s the values used to compute element c_{ij} of the destination matrix:
Matrixes in most languages are stored in rowmajor order. That means that we store the matrix in a single array, where the values for each row are grouped together. For instance, if a
were a 4x4 square matrix, we’d store its elements in this order:
 a[0,0]  a[0,1]  a[0,2]  a[0,3]  a[1,0]  a[1,1]  a[1,2]  a[1,3]  a[2,0]  a[2,1]  a[2,2]  a[2,3]  a[3,0]  a[3,1]  a[3,2]  a[3,3] 
Thus, element a_{ij} is stored in array index a[i*NCOLS + j]
, where NCOLS
is the number of columns (here, 4).
Code
 The function
me
is used to access an element of a matrix.  You should only modify function
matrix_multiply
.  Think about the cache!
Once you get the “aha” insight you should be able to make ./matrixmultiply
take no more than 0.1x the time of ./basematrixmultiply
, while producing statistics that are 0% off.
Try to explain why the performance improvement is as big as it is.
 Find the fastest matrix multiplier you can.
 Find the slowest matrix multiplier you can.
 Explain why these multipliers are fast and slow.
 If you run
./matrixmultiply n N
several times with the sameN
, you should notice that the corner statistics all stay about the same—even though the input matrices are being initialized randomly, and differently, each time. Why?