Prior to today, we've been focusing on the effects of software caches -- those in the operating system and those in standard I/O, and with assignment 3, those that you can build. There are many orders of magnitude difference between disk and memory times, so examining software caches for files is particularly satisfying (even if you have a solid state disk, you've got an order of magnitude or two to play with). However, hardware caches also affect performance, sometimes in significant ways. Today's exercises are to give you some exposure to the effects and prepare you to write cache-friendly code.
- Demonstrate how access patterns affect performance
- Determine how C arrays are allocated in memory
- Modify programs to produce more cache friendly code
Today's code can be found in
The program we're going to play with today is a matrix multiplication program. If you don't know what matrix multiplication is, that's fine. We'll explain it and you'll be all ready to go.
A matrix is simply a 2-dimensional array, or a grid, like this:
In mathematics and in code, we frequently use the letters
j to represent the row and column indices respectively. So, we would access the box labeled
1,2 with the following code:
// Declare an array with 3 rows and 5 columns:
int i, j;
i = 1;
j = 2;
You don't have to use
j as your variable names, of course, but it is traditional to do so.
We multiply matrices by multiplying a row of one matrix by a column of a second matrix. Let's say that we have two matrices A and B. Let's say that A has 3 rows and 5 columns as the picture above shows. To multiply A by B, B must have 5 rows (you'll see why in a minute), but just to make it interesting, let's say that B has 5 rows and 4 columns, like so:
Now, we compute
result[i][j] by multiplying row
A by column
B. And we multiply two arrays (which each of a row and a column are) by performing pairwise multiplication and summing the results. So, to make this explicit:
result = A * B + A * B + A * B + A * B + A * B
Notice that if we are computing a cell whose indices are
j, we multiply together
B[K][j] while K iterates from 0 to the maximum column(row) index of A(B).
The picture below illustrates computing
Multi-dimensional array layout in C
You already know how to declare one-dimensional arrays:
You can also declare two (or more)-dimensional arrays:
Now, we know that one-dimensional arrays in C are allocated contiguously. But how do you suppose two dimensional arrays are allocated? How might you figure this out?
Yup -- we're not going to tell you -- we want you to:
1. Write a simple program that iterates over every element in a two dimensional array,
printing out each address and value. We've started the function for you in
layout.c; you get to finish.
2. Once you finish writing it, use your knowledge of assembly language to figure out how the compiler has laid the data out in memory. There are two basic ways to lay out an array, row major format and column major format. In row major format, the items in each row are contiguous; in column major format, the items in each column are contiguous. Which does C do?
You might find it instructive to print out the size of one dimension of the two dimensional array! That is,
printf statement to your program that prints out the size of
3. What do you suppose the type of
Why do we care?
You might ask why we care. Recall that we are in the midst of a unit on caching. When all your data fits in memory (i.e., you are not accessing a disk or flash drive), using the various hardware caches effectively can have a profound impact on performance, and today we're going to provide opportunities for you to experience this first hand.
We've provided you with two programs:
layout-column, each of which iterate over a large array in row/column order respectively. Predict how the two programs will perform relative to each other. Then run them using the UNIX
time command to see if your predictions were correct. (
Can you explain why one runs faster than the other?
Next we're going to play with a matrix multiplication program and examine the access patterns that the computation generates. Before we begin, let's talk about passing around addresses of multi-dimensional arrays.
It would be useful to have a generic routine that could perform matrix multiplication on an array of any size. Unfortunately, you can't pass a two dimensional array with dimensions of unknown size at compile time. So, now that we know how C lays out two dimensional arrays, we can treat a one dimensional array like a two dimensional array (if this makes your head hurt -- be patient, we're getting there).
Let's say that we want to represent a two dimensional array of integers with 2 rows and 3 columns and a one dimensional array, A. Given how C lays out 2D arrays, we would want to place row 0 in the first 3 locations and row 1 in the next three locations as shown below.
1. Can you write an expression in terms of i and j that will let you index into the one dimensional array A to locate the value stored int he two dimensional array for A[i][j]?
When you have an answer to that, examine the file
mm-ijk.c which a function that computes matrix multiplication on square arrays. You will notice that this function multiplies double precision floating point numbers, not integers, but the calculation that it makes is the same as what we described.
mm-ijk and time how long the program takes to execute.
3. It turns out that matrix multiple is one of those key operations that people have worked very hard to optimize. Let's see how fast a really good matrix multiplication is. To run this next test, you'll need to install a couple of libraries. The following two commands should load and install the libraries you'll need.
sudo apt-get install -y libblas-dev
sudo apt-get install -y liblapacke-dev
mm-blas and time that. How much faster is it?
4. Let's see how much faster we can make our simple matrix multiply, just by paying attention to the caching behavior. Using scrap paper or your whiteboard, draw the memory access pattern generated by
mm-ijk (you needn't use huge arrays, just assume that a row is too large to fit in a single cache line).
5. See if you can come up with an access pattern that performs better than
mm-ijk -- just by changing the order of the triple nested loops. (There are only 6 combinations, but before you brute force it, see if you can predict which ones will give better performance and which ones will yield worse performance.) Eventually, try all six combinations and see how well your predictions match actual performance. Can you explain the performance you get in each case?
Linked lists are a popular data structure. If performance doesn't matter, linked lists may be fine data structures, but they aren't particularly cache friendly. Let's see if we can demonstrate this.
1. In your exercise directory, you will find a program called
numwave. Build it.
numwave simply adds some number of values to a data structure and then removes them.
It is designed to support a variety of data structures and access patterns.
By default, it uses a linked list and adds elements to the front of the list and removes them from the front of the list.
Unsurprisingly, this is quite fast. Run
time ./numwave to see just how fast this is.
2. Linked lists are great as long as you can take advantage of how easy it is to insert and remove from the front. But what if we wanted to keep the data sorted? See how much this slows down the implementation by running
time ./numwave -f sort. That may still seem quite fast to you, but what happens if we change the number of values from the default 10000 to 100,000? Try it and see (the
README.txt file explains the command line parameters in detail.
(If you get bored, you should feel free to kill it. It will finish, but it does take a while.)
3. Now, let's compare the performance of the linked list implementation with an equally simple vector (array) representation. Without any parameters, the vector representation also inserts elements at the front of the vector. While this is advantageous for linked lists, it's kind of a bummer for the vector, because you have to shift all the values up one slot in the array. Take a look at the code in
numwave-vector.c to see how we do this (the routines in question are
What happens when we run vector in the default mode
time ./numwave -t vector? How does it compare to the linked list in the default mode?
4. Next, run with the vector representation in its fast mode (appending entries to the end of the vector instead of putting them
at the front). In some ways this is a much fairer comparison, because it compares the best case for each representation. How do the two representations compare? (
time ./numwave -t vector -f fast)
5. Now, let's see what happens when we sort the data values. Run each representation using
What happened? Why? Does the effect get bigger or smaller when you increase the number of values? Why?
Please take a moment and complete this survey.