# 2017/Section7


# Midterm

We’ve heard a lot of thoughts and despair on the midterm, and want to share with you how we think about exams in this class.

- Our exams are difficult by design, but this midterm turned out more difficult than we intended.
- We want our exams to test knowledge at many levels, so we ask hard questions and assign partial credit to answers that demonstrate some understanding. We expect medium to low averages and empty answers.
- We understand that it can feel bad when time runs out with some problems incomplete, and have heard from many students who felt like their exam performance showed they didn’t understand the class material. (Some of these students are doing well above median!) That’s not how we feel. We feel like many of your partial answers showed good understanding of course concepts.
- There are several kinds of work in the class, and we feel that each kind of work shows some students at their best. The problem sets are deeply important to us. The exams are also important, but there is more time pressure, which is harmful for some students.
- So what can you do? No matter how hard an exam is, it is finite, and this one is over. It is entirely possible to bomb an exam and complete the course with an A-range or B-range grade. Concentrate your effort on what suits your own learning style, which might be problem sets, extra credit, course participation (questions in class or section; Piazza questions and answers).
- The final will not be this bad.

# Processor cache performance

As you have seen already, accessing values that are stored in the processor cache is much faster than accessing values in memory. In the following examples (brought to you by CS165), assume that `a` is a large array with `n` elements.

Between (1) and (2), which is faster?

(1):

```c
for (int i = 0; i < n; i++) {
    min = a[i] < min ? a[i] : min;
}
for (int i = 0; i < n; i++) {
    max = a[i] > max ? a[i] : max;
}
```

(2):

```c
for (int i = 0; i < n; i++) {
    min = a[i] < min ? a[i] : min;
    max = a[i] > max ? a[i] : max;
}
```

Now let's also use an array `b`, which has the same number of elements as `a`. Between (3) and (4), which is faster? Are there any drawbacks to using the solution that better utilizes the processor cache?

(3):

```c
for (int i = 0; i < n; i++) {
    min = a[i] < min ? a[i] : min;
}
for (int i = 0; i < n; i++) {
    max = b[i] > max ? b[i] : max;
}
```

(4):

```c
for (int i = 0; i < n; i++) {
    min = a[i] < min ? a[i] : min;
    max = b[i] > max ? b[i] : max;
}
```
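One way to check your intuition about these variants is to time them. Below is a rough harness sketch for (1) versus (2) (our own code, not part of the section handout), using `clock()` from `<time.h>`; the same pattern extends to (3) and (4) with a second array.

```c
#include <limits.h>
#include <time.h>

/* Variant (1): two separate passes over the array. */
double time_two_passes(const int *a, int n, int *minp, int *maxp) {
    clock_t t0 = clock();
    int min = INT_MAX, max = INT_MIN;
    for (int i = 0; i < n; i++) {
        min = a[i] < min ? a[i] : min;
    }
    for (int i = 0; i < n; i++) {
        max = a[i] > max ? a[i] : max;
    }
    *minp = min;
    *maxp = max;
    return (double) (clock() - t0) / CLOCKS_PER_SEC;
}

/* Variant (2): one fused pass computing both min and max. */
double time_fused(const int *a, int n, int *minp, int *maxp) {
    clock_t t0 = clock();
    int min = INT_MAX, max = INT_MIN;
    for (int i = 0; i < n; i++) {
        min = a[i] < min ? a[i] : min;
        max = a[i] > max ? a[i] : max;
    }
    *minp = min;
    *maxp = max;
    return (double) (clock() - t0) / CLOCKS_PER_SEC;
}
```

Run each variant several times on a large `n` and compare the medians; a single run can be noisy.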

## Data structure cache performance

It is important to use data structures that will take full advantage of the cache. Let’s look at some popular data structures and analyze their performance with a processor cache.

A “key-value” data structure, or “KV store”, maps keys to values. Many key-value structures support these operations:

- `get(key) -> optional value`: Returns the value associated with `key`, if one exists.
- `put(key, value)`: Modifies the structure to associate `value` with `key`.

Often a key-value data structure uses strings for keys and values, but there are other possibilities. For example, you can imagine one that maps ints to ints.

Consider the following int-based key-value data structures. In each case, we have written the `get` function for you.

(1) A **1D array**:

```c
#define MAXKEY …

struct kv1 {
    int v[MAXKEY];
};

int get(struct kv1* kv, int key) {
    if (key >= 0 && key < MAXKEY) {
        return kv->v[key];
    } else {
        return NONEXISTENT_VALUE;
    }
}
```
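The handout defines only `get`. For the 1D array, a `put` is just a bounds-checked store; here is one possible sketch (ours, with `MAXKEY` set to an arbitrary placeholder since the handout elides it):

```c
#define MAXKEY 1024              /* placeholder size; the handout leaves this as `…` */

struct kv1 {
    int v[MAXKEY];
};

/* Possible `put`: returns 0 on success, -1 if the key is out of range. */
int kv1_put(struct kv1* kv, int key, int value) {
    if (key >= 0 && key < MAXKEY) {
        kv->v[key] = value;
        return 0;
    }
    return -1;
}
```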

(2) A **2D array** stored in “row-major” order (this data structure uses *two* ints for its key):

```c
#define NROWS …
#define NCOLS …

struct kv2 {
    int v[NROWS * NCOLS];
};

int get(struct kv2* kv, int key_row, int key_col) {
    if (key_row >= 0 && key_row < NROWS
        && key_col >= 0 && key_col < NCOLS) {
        return kv->v[key_row * NCOLS + key_col];
    } else {
        return NONEXISTENT_VALUE;
    }
}
```

(3) A **singly-linked list**:

```c
struct kv3 {
    int k;
    int v;
    struct kv3* next;
};

int get(struct kv3* kv, int key) {
    for (; kv; kv = kv->next) {
        if (kv->k == key) {
            return kv->v;
        }
    }
    return NONEXISTENT_VALUE;
}
```

(4) A **chained hash table**:
In a chained hash table, every bucket contains a linked list of all the elements that hash to that bucket. To add an element, you simply add it to the linked list in its bucket.

```c
typedef struct kv4_node {
    int k;
    int v;
    struct kv4_node* next;
} kv4_node;

struct kv4 {
    kv4_node* buckets[NBUCKETS];
};

unsigned hash(int key) {
    /* return hash code */
}

int get(struct kv4* kv, int key) {
    int h = hash(key) % NBUCKETS;
    for (kv4_node* n = kv->buckets[h]; n; n = n->next) {
        if (n->k == key) {
            return n->v;
        }
    }
    return NONEXISTENT_VALUE;
}
```
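A matching `put` for the chained table might look like the sketch below (our own code; `NBUCKETS` and the `hash` body are placeholders we supply so the fragment compiles). It overwrites the value if the key is already present, and otherwise prepends a new node to the bucket's list:

```c
#include <stdlib.h>

#define NBUCKETS 1024                        /* placeholder bucket count */

typedef struct kv4_node {
    int k;
    int v;
    struct kv4_node* next;
} kv4_node;

struct kv4 {
    kv4_node* buckets[NBUCKETS];
};

/* Stand-in hash function; any integer mixing function would do. */
unsigned hash(int key) {
    return (unsigned) key * 2654435761u;
}

/* Possible `put`: returns 0 on success, -1 on allocation failure. */
int kv4_put(struct kv4* kv, int key, int value) {
    unsigned h = hash(key) % NBUCKETS;
    for (kv4_node* n = kv->buckets[h]; n; n = n->next) {
        if (n->k == key) {
            n->v = value;                    /* key exists: overwrite */
            return 0;
        }
    }
    kv4_node* n = malloc(sizeof(kv4_node));
    if (!n) {
        return -1;
    }
    n->k = key;
    n->v = value;
    n->next = kv->buckets[h];                /* prepend to the chain */
    kv->buckets[h] = n;
    return 0;
}
```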

(5) An **open-coded hash table**:
In an open-coded hash table, each bucket holds at most one value. If a collision happens, you find another bucket that doesn't yet have a value in it. Here we use linear probing: when a collision happens, we look at the next bucket; if it is free, we store the value there, and if not, we repeat the process with the bucket after that.

```c
#define INVALID_KEY INT_MIN

typedef struct kv5_slot {
    int k;
    int v;
} kv5_slot;

struct kv5 {
    kv5_slot buckets[NBUCKETS];
};

unsigned hash(int key) {
    /* return hash code */
}

int get(struct kv5* kv, int key) {
    unsigned h = hash(key) % NBUCKETS;
    // assume at least 1 bucket is empty
    while (1) {
        if (kv->buckets[h].k == key) {
            return kv->buckets[h].v;
        } else if (kv->buckets[h].k == INVALID_KEY) {
            return NONEXISTENT_VALUE;
        } else {
            h = (h + 1) % NBUCKETS;
        }
    }
}
```
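A linear-probing `put` could be sketched as follows (our own code; `NBUCKETS` and the `hash` body are placeholders). It probes from the home bucket until it finds either the key itself or an empty slot, and fails if the table is full:

```c
#include <limits.h>

#define NBUCKETS 1024                        /* placeholder bucket count */
#define INVALID_KEY INT_MIN                  /* marks an empty slot, as in the handout */

typedef struct kv5_slot {
    int k;
    int v;
} kv5_slot;

struct kv5 {
    kv5_slot buckets[NBUCKETS];
};

/* Stand-in hash function; any integer mixing function would do. */
unsigned hash(int key) {
    return (unsigned) key * 2654435761u;
}

/* Mark every slot empty before first use. */
void kv5_init(struct kv5* kv) {
    for (int i = 0; i < NBUCKETS; i++) {
        kv->buckets[i].k = INVALID_KEY;
    }
}

/* Possible `put`: returns 0 on success, -1 if every slot is occupied. */
int kv5_put(struct kv5* kv, int key, int value) {
    unsigned h = hash(key) % NBUCKETS;
    for (int probes = 0; probes < NBUCKETS; probes++) {
        if (kv->buckets[h].k == key || kv->buckets[h].k == INVALID_KEY) {
            kv->buckets[h].k = key;
            kv->buckets[h].v = value;
            return 0;
        }
        h = (h + 1) % NBUCKETS;              /* linear probing: try the next bucket */
    }
    return -1;
}
```

Note that `INVALID_KEY` itself cannot be stored as a real key under this convention.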

(6) A **binary tree**:

```c
struct kv6 {
    int k;
    int v;
    struct kv6* left_child;
    struct kv6* right_child;
};

int get(struct kv6* kv, int key) {
    while (kv) {
        if (kv->k == key) {
            return kv->v;
        } else if (kv->k < key) {
            kv = kv->right_child;
        } else {
            kv = kv->left_child;
        }
    }
    return NONEXISTENT_VALUE;
}
```
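The handout again shows only `get`. A possible `put` (our own sketch, using the usual search-tree convention that keys smaller than a node's key go to its left child) walks down the tree, overwriting on a match and attaching a new leaf otherwise:

```c
#include <stdlib.h>

struct kv6 {
    int k;
    int v;
    struct kv6* left_child;
    struct kv6* right_child;
};

/* Possible `put`: `*rootp` may be NULL (empty tree).  Returns 0 on success,
   -1 on allocation failure. */
int kv6_put(struct kv6** rootp, int key, int value) {
    while (*rootp) {
        if ((*rootp)->k == key) {
            (*rootp)->v = value;             /* key exists: overwrite */
            return 0;
        } else if ((*rootp)->k < key) {
            rootp = &(*rootp)->right_child;  /* larger keys to the right */
        } else {
            rootp = &(*rootp)->left_child;   /* smaller keys to the left */
        }
    }
    struct kv6* n = malloc(sizeof(struct kv6));
    if (!n) {
        return -1;
    }
    n->k = key;
    n->v = value;
    n->left_child = n->right_child = NULL;
    *rootp = n;                              /* attach the new leaf */
    return 0;
}
```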

Draw a picture of each of these structures so you understand their design.

Now assume that each key-value structure holds N elements, where N is much larger than the processor cache; and imagine a program that repeatedly calls `get` on a structure. How many processor cache misses will `get` cause? Specifically, for each structure:

- How many cache misses should be expected for the first call to `get`?
- How many cache misses per `get` should be expected if `get` is called many times with random keys?
- How many cache misses per `get` should be expected if `get` is called many times with *sequential* keys?
- How many cache misses per `get` should be expected if `get` is called many times with *the same* key?
- (#2 only) How many cache misses per `get` should be expected if `get` is called for every element in a row? For every element in a column?

Now give examples of workloads that would be best for each key-value structure.

# Matrix multiplication

Pull the latest version of the sections repository by entering in the terminal:

```sh
git pull
```

The `cs61-section/s07` directory contains a program called `matrixmultiply` (source in `matrixmultiply.c`). This program creates two random square matrices, multiplies them together, prints some information about their product, then exits.

Your job is to transform the `matrix_multiply` function so that it runs *as fast as possible*, while producing the same results as the original program (which you can run as `./base-matrixmultiply`) *for all arguments.*

## Matrix multiplication background

A matrix is a 2D array of numbers. An *m*×*n* matrix has *m* rows and *n* columns; the value in row *i* and column *j* is written *m*_{ij}, or, in more C-like notation, *m*[*i*, *j*] or *m*[*i*][*j*].

A *square* matrix has the same number of rows and columns: *m* = *n*.

The *product* of two matrices *a* and *b*, written *a*×*b* or *a*\**b*, is calculated by the following rules.

- *a* must have the same number of columns as *b* has rows.
- If *a* has dimensions *m*×*n*, and *b* has dimensions *n*×*p*, then *a*×*b* has dimensions *m*×*p*: it has *a*’s row count and *b*’s column count.
- Let *c* = *a*×*b*. Then we compute *c*[*i*, *j*] as: *c*_{ij} = ∑_{0≤k<n} *a*_{ik} \* *b*_{kj}.

Some pictures might help: one showing how the dimensions work, and one showing the values used to compute element *c*_{ij} of the destination matrix. (Figures omitted.)

Matrices in most languages are stored in *row-major order*. That means that we store the matrix in a single array, where the values for each row are grouped together. For instance, if `a` were a 4×4 square matrix, we’d store its elements in this order:

| a[0,0] | a[0,1] | a[0,2] | a[0,3] | a[1,0] | a[1,1] | a[1,2] | a[1,3] | a[2,0] | a[2,1] | a[2,2] | a[2,3] | a[3,0] | a[3,1] | a[3,2] | a[3,3] |

Thus, element *a*_{ij} is stored in array element `a[i*NCOLS + j]`, where `NCOLS` is the number of columns (here, 4).
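Putting the row-major layout and the product rule together, a naive multiply might look like the sketch below (our own illustration, not the repository's `matrixmultiply.c`; the element type and function name are assumptions):

```c
/* Naive c = a * b for square n-by-n matrices stored in row-major order.
   c[i*n + j] accumulates the dot product of row i of a with column j of b. */
void matrix_multiply_naive(double* c, const double* a, const double* b, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

Notice how the innermost loop strides through `b` by `n` elements at a time; that access pattern is worth thinking about when you work on the exercise below.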

## Code

- The function `me` is used to access an element of a matrix.
- You should only modify the function `matrix_multiply`.
- *Think about the cache!*

Once you get the “aha” insight, you should be able to make `./matrixmultiply` take no more than 0.1x the time of `./base-matrixmultiply`, while producing statistics that are 0% off.

Try to explain why the performance improvement is as big as it is.

- Find the *fastest* matrix multiplier you can.
- Find the *slowest* matrix multiplier you can.
- Explain *why* these multipliers are fast and slow.
- If you run `./matrixmultiply -n N` several times with the same `N`, you should notice that the corner statistics all stay about the same—even though the input matrices are being initialized randomly, and differently, each time. Why?