Midterm

We’ve heard a lot of thoughts and despair on the midterm, and want to share with you how we think about exams in this class.

Our exams are difficult by design—but this midterm turned out more difficult.
We want our exams to test knowledge at many levels, so we ask hard questions and assign partial credit to answers that demonstrate some understanding. We expect medium to low averages and empty answers.
We understand that it can feel bad when time runs out with some problems incomplete, and have heard from many students who felt like their exam performance showed they didn’t understand the class material. (Some of these students are doing well above median!) That’s not how we feel. We feel like many of your partial answers showed good understanding of course concepts.
There are several kinds of work in the class, and we feel that each kind of work shows some students at their best. The problem sets are deeply important to us. The exams are also important, but there is more time pressure, which is harmful for some students.
So what can you do? No matter how hard an exam is, it is finite, and this one is over. It is entirely possible to bomb an exam and complete the course with an A-range or B-range grade. Concentrate your effort on what suits your own learning style, which might be problem sets, extra credit, course participation (questions in class or section; Piazza questions and answers).
The final will not be this bad.

Processor cache performance

As you have seen already, accessing values that are stored in the processor cache is much faster than accessing values in memory. In the following examples (brought to you by CS165), assume that a is a large array with n elements.

Between (1) and (2), which is faster?

(1):

for (int i = 0; i < n; i++) {
    min = a[i] < min ? a[i] : min;
}
for (int i = 0; i < n ; i++) {
    max = a[i] > max ? a[i] : max;
}

(2):

for (int i = 0; i < n; i++) {
    min = a[i] < min ? a[i] : min;
    max = a[i] > max ? a[i] : max;
}

Now let's also use an array b, which has the same number of elements as a. Between (3) and (4), which is faster? Are there any drawbacks to using the solution that better utilizes the processor cache?

(3):

for (int i = 0; i < n; i++) {
    min = a[i] < min ? a[i] : min;
}
for(int i = 0; i < n; i++) {
    max = b[i] > max ? b[i] : max;
}

(4):

for (int i = 0; i < n; i++) {
    min = a[i] < min ? a[i] : min;
    max = b[i] > max ? b[i] : max;
}

Data structure cache performance

It is important to use data structures that will take full advantage of the cache. Let’s look at some popular data structures and analyze their performance with a processor cache.

A “key-value” data structure, or “KV store”, maps keys to values. Many key-value structures support these operations:

get(key) -\> optional value: Returns the value associated with key, if one exists.
put(key, value): Modifies the structure to associate value with key.

Often a key-value data structure uses strings for keys and values, but there are other possibilities. For example, you can imagine one that maps ints to ints.

Consider the following int-based key-value data structures. In each case, we have written the get function for you.

(1) A 1D array:

#define MAXKEY …
struct kv1 {
    int v[MAXKEY];
};
int get(struct kv1* kv, int key) {
    if (key >= 0 && key < MAXKEY) {
        return kv->v[key];
    } else {
        return NONEXISTENT_VALUE;
    }
}

(2) A 2D array stored in “row-major” order (this data structure uses two ints for its key):

#define NROWS …
#define NCOLS …
struct kv2 {
    int v[NROWS * NCOLS];
};
int get(struct kv2* kv, int key_row, int key_col) {
    if (key_row >= 0 && key_row < NROWS && key_col >= 0 && key_col < NCOLS) {
        return kv->v[key_row * NCOLS + key_col];
    } else {
        return NONEXISTENT_VALUE;
    }
}

(3) A singly-linked list:

struct kv3 {
    int k;
    int v;
    struct kv3* next;
};
int get(struct kv3* kv, int key) {
    for (; kv; kv = kv->next) {
        if (kv->k == key) {
            return kv->v;
        }
    }
    return NONEXISTENT_VALUE;
}

(4) A chained hash table: In a chained hash table every bucket contains a linked list of all the elements which hash to that bucket. If you try to add an element to a bucket, then you just add it to the linked list in that bucket.

typedef struct kv4_node {
    int k;
    int v;
    struct kv4_node* next;
} kv4_node;
struct kv4 {
    kv4_node* buckets[NBUCKETS];
};
unsigned hash(int key) { /* return hash code */ }
int get(struct kv4* kv, int key) {
    int h = hash(key) % NBUCKETS;
    for (kv4_node* n = kv->buckets[h]; n; n = n->next) {
        if (n->k == key) {
            return n->v;
        }
    }
    return NONEXISTENT_VALUE;
}

(5) An open-coded hash table: In an open-coded hash table, every bucket only holds one value. If a collision happens, then you just find another bucket that doesn't yet have a value in it. In this case we are using linear probing. This means that when a collision happens, then we look at the next bucket and see if that one is free. If it is, then we will store the value there and if it is not, then we will repeat this process and look at the next one.

#define INVALID_KEY INT_MIN
typedef struct kv5_slot {
    int k;
    int v;
} kv5_slot;
struct kv5 {
    kv5_slot buckets[NBUCKETS];
};
unsigned hash(int key) { /* return hash code */ }
int get(struct kv5* kv, int key) {
    unsigned h = hash(key) % NBUCKETS;
    // assume at least 1 bucket is empty
    while (1) {
        if (kv->buckets[h].k == key) {
            return kv->buckets[k].v;
        } else if (kv->buckets[h].k == INVALID_KEY) {
            return NONEXISTENT_VALUE;
        } else {
            h = (h + 1) % NBUCKETS;
        }
    }
}

(6) A binary tree:

struct kv6 {
    int k;
    int v;
    struct kv6* left_child;
    struct kv6* right_child;
};
int get(struct kv6* kv, int key) {
    while (kv) {
        if (kv->k == key) {
            return kv->v;
        } else if (kv->k < key) {
            kv = kv->left_child;
        } else {
            kv = kv->right_child;
        }
    }
    return NONEXISTENT_VALUE;
}

Draw a picture of each of these structures so you understand their design.

Now assume that each key-value structure holds N elements, where N is much larger than the processor cache; and imagine a program that repeatedly calls get on a structure. How many processor cache misses will get cause? Specifically, for each structure:

How many cache misses should be expected for the first call to get?
How many cache misses per get should be expected if get is called many times with random keys?
How many cache misses per get should be expected if get is called many times with sequential keys?
How many cache misses per get should be expected if get is called many times with the same key?
(#2 only) How many cache misses per get should be expected if get is called for every element in a row? For every element in a column?

Now give examples of workloads that would be best for each key-value structure.

Matrix multiplication

Pull the latest version of the sections repository by entering in the terminal

git pull

The cs61-section/s07 directory contains a program called matrixmultiply (source in matrixmultiply.c). This program creates two random square matrices, multiplies them together, prints some information about their product, then exits.

Your job is to transform the matrix_multiply function so that it runs as fast as possible, while producing the same results as the original program (which you can run as ./base-matrixmultiply) for all arguments.

Matrix multiplication background

A matrix is a 2D array of numbers. An mxn matrix has m rows and n columns; the value in row i and column j is written m_ij, or, in more C-like notation, m[i, j] or m[i][j].

A square matrix has the same number of rows and columns: m = n.

The product of two matrices a and b, written a×b or a*b, is calculated by the following rules.

a must have the same number of columns as b has rows.
If a has dimensions mxn, and b has dimensions nxp, then a×b has dimensions mxp: it has a’s row count and b’s column count.
Let c = a×b. Then we compute c[i,j] as:

c_ij = ∑_0≤k<n a_ik * b_kj.

Some pictures might help. Here’s how the dimensions work:

And here’s the values used to compute element c_ij of the destination matrix:

Matrixes in most languages are stored in row-major order. That means that we store the matrix in a single array, where the values for each row are grouped together. For instance, if a were a 4x4 square matrix, we’d store its elements in this order:

| a[0,0] | a[0,1] | a[0,2] | a[0,3] | a[1,0] | a[1,1] | a[1,2] | a[1,3] | a[2,0] | a[2,1] | a[2,2] | a[2,3] | a[3,0] | a[3,1] | a[3,2] | a[3,3] |

Thus, element a_ij is stored in array index a[i\*NCOLS + j], where NCOLS is the number of columns (here, 4).

Code

The function me is used to access an element of a matrix.
You should only modify function matrix_multiply.
Think about the cache!

Once you get the “aha” insight you should be able to make ./matrixmultiply take no more than 0.1x the time of ./base-matrixmultiply, while producing statistics that are 0% off.

Try to explain why the performance improvement is as big as it is.

Find the fastest matrix multiplier you can.
Find the slowest matrix multiplier you can.
Explain why these multipliers are fast and slow.
If you run ./matrixmultiply -n N several times with the same N, you should notice that the corner statistics all stay about the same—even though the input matrices are being initialized randomly, and differently, each time. Why?