Learning Objectives
Prior to today, we've been focusing on the effects of software caches -- those in the operating system, those in standard I/O, and, with assignment 3, those that you can build yourself. There are many orders of magnitude difference between disk and memory access times, so examining software caches for files is particularly satisfying (even if you have a solid-state disk, you've got an order of magnitude or two to play with). However, hardware caches also affect performance, sometimes in significant ways. Today's exercises give you some exposure to these effects and prepare you to write cache-friendly code.
- Demonstrate how access patterns affect performance
- Determine how C arrays are allocated in memory
- Modify programs to produce more cache-friendly code
Getting Started
Today's code can be found in cs61-exercises/l12.
The program we're going to play with today is a matrix multiplication program. If you don't know what matrix multiplication is, that's fine. We'll explain it and you'll be all ready to go.
A matrix is simply a 2-dimensional array, or a grid, like this:
In mathematics and in code, we frequently use the letters i and j to represent the row and column indices respectively. So, we would access the box labeled 1,2 with the following code:
// Declare an array with 3 rows and 5 columns:
int array[3][5];
int i = 1;
int j = 2;
int value = array[i][j];  // the element in row 1, column 2
You don't have to use i and j as your variable names, of course, but it is traditional to do so.
We multiply matrices by multiplying a row of one matrix by a column of a second matrix. Let's say that we have two matrices, A and B, where A has 3 rows and 5 columns as the picture above shows. To multiply A by B, B must have 5 rows (you'll see why in a minute), but just to make it interesting, let's say that B has 5 rows and 4 columns, like so:
Now, we compute result[i][j] by multiplying row i of A by column j of B. And we multiply two arrays (which is what a row and a column each are) by performing pairwise multiplication and summing the results. So, to make this explicit:
result[1][2] = A[1][0] * B[0][2] + A[1][1] * B[1][2] + A[1][2] * B[2][2] + A[1][3] * B[3][2] + A[1][4] * B[4][2]
Notice that if we are computing a cell whose indices are i and j, we multiply together A[i][k] and B[k][j] while k iterates from 0 up to the maximum column index of A (which must equal the maximum row index of B). The picture below illustrates computing result[1][2].
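Putting this all together, the whole computation is just three nested loops. Here is a minimal sketch for the 3x5-times-5x4 example above (the variable names are our own, and we assume A and B have already been filled in):

// Multiply the 3x5 matrix A by the 5x4 matrix B, giving a 3x4 result.
int A[3][5], B[5][4], result[3][4];
for (int i = 0; i < 3; i++) {         // for each row of A
    for (int j = 0; j < 4; j++) {     // for each column of B
        result[i][j] = 0;
        for (int k = 0; k < 5; k++)   // pairwise multiply and sum
            result[i][j] += A[i][k] * B[k][j];
    }
}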
Multi-dimensional array layout in C
You already know how to declare one-dimensional arrays: int a[100];. You can also declare two (or more)-dimensional arrays: int a[rows][columns];.
Now, we know that one-dimensional arrays in C are allocated contiguously. But how do you suppose two-dimensional arrays are allocated? How might you figure this out?
Yup -- we're not going to tell you -- we want you to:
1. Write a simple program that iterates over every element in a two-dimensional array, printing out each address and value. We've started the function for you in layout.c; you get to finish.
2. Once you finish writing it, use your knowledge of assembly language to figure out how the compiler has laid the data out in memory. There are two basic ways to lay out an array, row major format and column major format. In row major format, the items in each row are contiguous; in column major format, the items in each column are contiguous. Which does C do?
You might find it instructive to print out the size of one dimension of the two-dimensional array! That is, add a printf statement to your program that prints out the size of array[0].
3. What do you suppose the type of array[0] is?
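Once you've written your own version, here is one possible shape for such a program to compare against -- but write yours first! This is a sketch; we're guessing at the skeleton in layout.c, and yours may well differ:

#include <stdio.h>

int main(void) {
    int array[3][5];
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 5; j++)
            // Watching which addresses are adjacent reveals the layout.
            printf("&array[%d][%d] = %p\n", i, j, (void *) &array[i][j]);
    // Bonus: the size of one "row" of the array.
    printf("sizeof array[0] = %zu\n", sizeof array[0]);
    return 0;
}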
Why do we care?
You might ask why we care. Recall that we are in the midst of a unit on caching. When all your data fits in memory (i.e., you are not accessing a disk or flash drive), using the various hardware caches effectively can have a profound impact on performance, and today we're going to provide opportunities for you to experience this firsthand.
We've provided you with two programs: layout-row and layout-column, each of which iterates over a large array in row or column order respectively. Predict how the two programs will perform relative to each other. Then run them using the UNIX time command to see if your predictions were correct (time ./layout-column).
Can you explain why one runs faster than the other?
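For reference, the two programs differ essentially in which index the inner loop varies. Roughly like this -- a sketch, not the actual source of layout-row and layout-column, and DIM stands in for whatever dimension they use:

// Assume: int array[DIM][DIM]; long sum = 0; DIM is our placeholder constant.

// Row order: the inner loop walks along a row, touching adjacent addresses.
for (int i = 0; i < DIM; i++)
    for (int j = 0; j < DIM; j++)
        sum += array[i][j];

// Column order: consecutive accesses are DIM * sizeof(int) bytes apart.
for (int j = 0; j < DIM; j++)
    for (int i = 0; i < DIM; i++)
        sum += array[i][j];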
Matrix Multiplication
Next we're going to play with a matrix multiplication program and examine the access patterns that the computation generates. Before we begin, let's talk about passing around addresses of multi-dimensional arrays.
It would be useful to have a generic routine that could perform matrix multiplication on an array of any size. Unfortunately, you can't pass a two-dimensional array whose dimensions are unknown at compile time. So, now that we know how C lays out two-dimensional arrays, we can treat a one-dimensional array like a two-dimensional array (if this makes your head hurt, be patient -- we're getting there).
Let's say that we want to represent a two-dimensional array of integers with 2 rows and 3 columns in a one-dimensional array, A. Given how C lays out 2D arrays, we would want to place row 0 in the first 3 locations and row 1 in the next 3 locations, as shown below.
1. Can you write an expression in terms of i and j that will let you index into the one-dimensional array A to locate the value stored in the two-dimensional array at A[i][j]?
When you have an answer to that, examine the file mm-ijk.c, which contains a function that computes matrix multiplication on square arrays. You will notice that this function multiplies double-precision floating-point numbers, not integers, but the calculation it makes is the same as what we described.
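For reference, here is one way to write that computation using the row-major indexing from question 1 -- so don't peek until you have your own answer! This is a sketch with our own names; the actual function in mm-ijk.c may be organized differently:

// A, B, and C are n-by-n matrices stored in row-major order in
// one-dimensional arrays: element (i, j) lives at index i * n + j.
void mm(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}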
2. Make mm-ijk and time how long the program takes to execute.
3. It turns out that matrix multiply is one of those key operations that people have worked very hard to optimize. Let's see how fast a really good matrix multiplication is. To run this next test, you'll need to install a couple of libraries. The following two commands should download and install the libraries you'll need.
sudo apt-get install -y libblas-dev
sudo apt-get install -y liblapacke-dev
Now, make mm-blas and time that. How much faster is it?
4. Let's see how much faster we can make our simple matrix multiply, just by paying attention to the caching behavior. Using scrap paper or your whiteboard, draw the memory access pattern generated by mm-ijk (you needn't use huge arrays; just assume that a row is too large to fit in a single cache line).
5. See if you can come up with an access pattern that performs better than mm-ijk -- just by changing the order of the triply nested loops; one such reordering is sketched below. (There are only 6 orderings, but before you brute-force it, see if you can predict which ones will give better performance and which ones will yield worse performance.) Eventually, try all six orderings and see how well your predictions match actual performance. Can you explain the performance you get in each case?
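To be concrete about what "changing the order" means, here is the computation from the earlier sketch with the loops in ikj order. Whether it runs faster or slower is for you to predict and then measure:

// ikj order: both inner loops now walk along rows, left to right.
// Note that C must be zeroed before these loops run.
for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++) {
        double a = A[i * n + k];          // reused across the whole j loop
        for (int j = 0; j < n; j++)
            C[i * n + j] += a * B[k * n + j];
    }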
Linked Lists
Linked lists are a popular data structure. If performance doesn't matter, linked lists may be fine data structures, but they aren't particularly cache-friendly. Let's see if we can demonstrate this.
1. In your exercise directory, you will find a program called numwave. Build it. numwave simply adds some number of values to a data structure and then removes them. It is designed to support a variety of data structures and access patterns. By default, it uses a linked list, adding elements to the front of the list and removing them from the front of the list. Unsurprisingly, this is quite fast. Run time ./numwave to see just how fast this is.
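Why is the default mode so fast? Inserting at the front of a singly linked list requires no traversal and no copying at all. A minimal sketch (numwave's actual node type and function names surely differ):

typedef struct node {
    int value;
    struct node *next;
} node;

// Insert at the front: constant time, just two pointer writes.
node *list_add_first(node *head, node *n) {
    n->next = head;
    return n;   // n is the new head of the list
}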
2. Linked lists are great as long as you can take advantage of how easy it is to insert and remove from the front. But what if we wanted to keep the data sorted? See how much this slows down the implementation by running time ./numwave -f sort. That may still seem quite fast to you, but what happens if we change the number of values from the default 10,000 to 100,000? Try it and see (the README.txt file explains the command line parameters in detail).
(If you get bored, you should feel free to kill it. It will finish, but it does take a while.)
3. Now, let's compare the performance of the linked list implementation with an equally simple vector (array) representation. Without any parameters, the vector representation also inserts elements at the front of the vector. While this is advantageous for linked lists, it's kind of a bummer for the vector, because you have to shift all the values up one slot in the array (roughly as sketched after this item). Take a look at the code in numwave-vector.c to see how we do this (the routines in question are vector_add_first and vector_remove_first).
What happens when we run vector in the default mode (time ./numwave -t vector)? How does it compare to the linked list in the default mode?
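The shifting described above looks roughly like this (a sketch, not the actual code from numwave-vector.c):

#include <string.h>

// Insert `value` at the front of a vector currently holding `count` ints
// (capacity permitting): every existing element slides up one slot first,
// an O(count) copy that touches the entire array.
void vector_add_first(int *v, size_t count, int value) {
    memmove(v + 1, v, count * sizeof(int));
    v[0] = value;
}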
4. Next, run with the vector representation in its fast mode (appending entries to the end of the vector instead of putting them at the front). In some ways this is a much fairer comparison, because it compares the best case for each representation. How do the two representations compare? (time ./numwave -t vector -f fast)
5. Now, let's see what happens when we sort the data values. Run each representation using -f sort. What happened? Why? Does the effect get bigger or smaller when you increase the number of values? Why?
Finishing up
Please take a moment and complete this survey.