Learning Objectives

Prior to today, we've been focusing on the effects of software caches -- those in the operating system and those in standard I/O, and with assignment 3, those that you can build. There are many orders of magnitude difference between disk and memory times, so examining software caches for files is particularly satisfying (even if you have a solid state disk, you've got an order of magnitude or two to play with). However, hardware caches also affect performance, sometimes in significant ways. Today's exercises are to give you some exposure to the effects and prepare you to write cache-friendly code.

Demonstrate how access patterns affect performance
Determine how C arrays are allocated in memory
Modify programs to produce more cache friendly code

Getting Started

Today's code can be found in cs61-exercises/l12.

The program we're going to play with today is a matrix multiplication program. If you don't know what matrix multiplication is, that's fine. We'll explain it and you'll be all ready to go.

A matrix is simply a 2-dimensional array, or a grid, like this: In mathematics and in code, we frequently use the letters i and j to represent the row and column indices respectively. So, we would access the box labeled 1,2 with the following code:

// Declare an array with 3 rows and 5 columns:


int array[3][5];
int i, j;
i = 1;
j = 2;

array[i][j];

You don't have to use I and j as your variable names, of course, but it is traditional to do so.

We multiply matrices by multiplying a row of one matrix by a column of a second matrix. Let's say that we have two matrices A and B. Let's say that A has 3 rows and 5 columns as the picture above shows. To multiply A by B, B must have 5 rows (you'll see why in a minute), but just to make it interesting, let's say that B has 5 rows and 4 columns, like so: Now, we compute result[i][j] by multiplying row i of A by column j of B. And we multiply two arrays (which each of a row and a column are) by performing pairwise multiplication and summing the results. So, to make this explicit:

result[1][2] = A[1][0] * B[0][2] + A[1][1] * B[1][2] + A[1][2] * B[2][2] + A[1][3] * B[3][2] + A[1][4] * B[4][2]

Notice that if we are computing a cell whose indices are i and j, we multiply together A[i][K] and B[K][j] while K iterates from 0 to the maximum column(row) index of A(B). The picture below illustrates computing result[1][2].

Multi-dimensional array layout in C

You already know how to declare one-dimensional arrays: int a[100];. You can also declare two (or more)-dimensional arrays: int a[rows][columns].

Now, we know that one-dimensional arrays in C are allocated contiguously. But how do you suppose two dimensional arrays are allocated? How might you figure this out?

Yup -- we're not going to tell you -- we want you to:

1. Write a simple program that iterates over every element in a two dimensional array, printing out each address and value. We've started the function for you in layout.c; you get to finish.

2. Once you finish writing it, use your knowledge of assembly language to figure out how the compiler has laid the data out in memory. There are two basic ways to lay out an array, row major format and column major format. In row major format, the items in each row are contiguous; in column major format, the items in each column are contiguous. Which does C do?

You might find it instructive to print out the size of one dimension of the two dimensional array! That is, add a printf statement to your program that prints out the size of array[0].

3. What do you suppose the type of array[0] is?

Why do we care?

You might ask why we care. Recall that we are in the midst of a unit on caching. When all your data fits in memory (i.e., you are not accessing a disk or flash drive), using the various hardware caches effectively can have a profound impact on performance, and today we're going to provide opportunities for you to experience this first hand.

We've provided you with two programs: layout-row and layout-column, each of which iterate over a large array in row/column order respectively. Predict how the two programs will perform relative to each other. Then run them using the UNIX time command to see if your predictions were correct. (time ./layout-column).

Can you explain why one runs faster than the other?

Matrix Multiplication

Next we're going to play with a matrix multiplication program and examine the access patterns that the computation generates. Before we begin, let's talk about passing around addresses of multi-dimensional arrays.

It would be useful to have a generic routine that could perform matrix multiplication on an array of any size. Unfortunately, you can't pass a two dimensional array with dimensions of unknown size at compile time. So, now that we know how C lays out two dimensional arrays, we can treat a one dimensional array like a two dimensional array (if this makes your head hurt -- be patient, we're getting there).

Let's say that we want to represent a two dimensional array of integers with 2 rows and 3 columns and a one dimensional array, A. Given how C lays out 2D arrays, we would want to place row 0 in the first 3 locations and row 1 in the next three locations as shown below.

1. Can you write an expression in terms of i and j that will let you index into the one dimensional array A to locate the value stored int he two dimensional array for A[i][j]?

When you have an answer to that, examine the file mm-ijk.c which a function that computes matrix multiplication on square arrays. You will notice that this function multiplies double precision floating point numbers, not integers, but the calculation that it makes is the same as what we described.

2. Make mm-ijk and time how long the program takes to execute.

3. It turns out that matrix multiple is one of those key operations that people have worked very hard to optimize. Let's see how fast a really good matrix multiplication is. To run this next test, you'll need to install a couple of libraries. The following two commands should load and install the libraries you'll need.

sudo apt-get install -y libblas-dev

sudo apt-get install -y liblapacke-dev

Now, make mm-blas and time that. How much faster is it?

4. Let's see how much faster we can make our simple matrix multiply, just by paying attention to the caching behavior. Using scrap paper or your whiteboard, draw the memory access pattern generated by mm-ijk (you needn't use huge arrays, just assume that a row is too large to fit in a single cache line).

5. See if you can come up with an access pattern that performs better than mm-ijk -- just by changing the order of the triple nested loops. (There are only 6 combinations, but before you brute force it, see if you can predict which ones will give better performance and which ones will yield worse performance.) Eventually, try all six combinations and see how well your predictions match actual performance. Can you explain the performance you get in each case?

Linked Lists

Linked lists are a popular data structure. If performance doesn't matter, linked lists may be fine data structures, but they aren't particularly cache friendly. Let's see if we can demonstrate this.

1. In your exercise directory, you will find a program called numwave. Build it. numwave simply adds some number of values to a data structure and then removes them. It is designed to support a variety of data structures and access patterns. By default, it uses a linked list and adds elements to the front of the list and removes them from the front of the list. Unsurprisingly, this is quite fast. Run time ./numwave to see just how fast this is.

2. Linked lists are great as long as you can take advantage of how easy it is to insert and remove from the front. But what if we wanted to keep the data sorted? See how much this slows down the implementation by running time ./numwave -f sort. That may still seem quite fast to you, but what happens if we change the number of values from the default 10000 to 100,000? Try it and see (the README.txt file explains the command line parameters in detail.

(If you get bored, you should feel free to kill it. It will finish, but it does take a while.)

3. Now, let's compare the performance of the linked list implementation with an equally simple vector (array) representation. Without any parameters, the vector representation also inserts elements at the front of the vector. While this is advantageous for linked lists, it's kind of a bummer for the vector, because you have to shift all the values up one slot in the array. Take a look at the code in numwave-vector.c to see how we do this (the routines in question are vector_add_first and vector_remove_first.

What happens when we run vector in the default mode time ./numwave -t vector? How does it compare to the linked list in the default mode?

4. Next, run with the vector representation in its fast mode (appending entries to the end of the vector instead of putting them at the front). In some ways this is a much fairer comparison, because it compares the best case for each representation. How do the two representations compare? (time ./numwave -t vector -f fast)

5. Now, let's see what happens when we sort the data values. Run each representation using -f sort. What happened? Why? Does the effect get bigger or smaller when you increase the number of values? Why?

Finishing up

Please take a moment and complete this survey.