2012/Lecture22
Systems Programming and Machine Organization
This is the 2012 version of the course. Main site
Lecture 22 Thoughts
We ran several matrix multiplication examples in class and reasoned about their performance. The book's presentation of this material is in Section 6.6.2. Here are results of running the l22/
programs on 1000-element matrices on several machines. The “(cse)” variants use common subexpression elimination to avoid some memory dereferences; this makes a small difference for some matrix orders. Slowdowns are measured as ratios of time to the fastest time for the given machine, which is ikj (with cse).
iMac | Appliance on iMac in VMware Fusion |
ThinkPad T61 | ||||
---|---|---|---|---|---|---|
Time | Slowdown | Time | Slowdown | Time | Slowdown | |
Order ijk | 6.80s | (9.58x) | 13.50s | (15.52x) | 6.66s | (3.81x) |
Order ijk (cse) | 6.84s | (9.63x) | 13.30s | (15.29x) | 6.51s | (3.82x) |
Order ikj | 0.94s | (1.32x) | 1.03s | (1.18x) | 1.84s | (1.05x) |
Order ikj (cse) | 0.71s | (1.00x) | 0.87s | (1.00x) | 1.75s | (1.00x) |
Order jik | 6.62s | (9.32x) | 13.30s | (15.29x) | 5.53s | (3.16x) |
Order jik (cse) | 6.63s | (9.33x) | 14.31s | (16.45x) | 5.44s | (3.11x) |
Order jki | 15.26s | (21.49x) | 32.20s | (37.01x) | 11.54s | (6.59x) |
Order jki (cse) | 15.34s | (21.61x) | 31.10s | (35.75x) | 11.41s | (6.52x) |
Order kij | 1.00s | (1.41x) | 1.12s | (1.29x) | 3.82s | (2.18x) |
Order kij (cse) | 0.80s | (1.12x) | 0.90s | (1.03x) | 3.77s | (2.15x) |
Order kji | 15.31s | (21.56x) | 31.35s | (36.03x) | 13.45s | (7.69x) |
Order kji (cse) | 15.30s | (21.55x) | 31.18s | (35.84x) | 12.75s | (7.29x) |
Performance differences between T61 and iMac are clear and interesting. On the iMac (and in the book), ijk and jik perform quite similarly; so do jki and kij, and kij and ikj. But on the T61 there’s a pretty big difference within each pair. More surprisingly, on the iMac and in the book the jik order is much slower than kij (12x); on the T61 it’s much closer (just 1.4x slower). Most likely the T61 has a materially different cache design than the iMac. (Less associativity?…Not clear.)
BLAS
But a speed demon of matrix multiplication is the BLAS (Basic Linear Algebra Support) library, which someone has spent the time to optimize.
iMac | Appliance on iMac in VMware Fusion |
ThinkPad T61 | ||||
---|---|---|---|---|---|---|
Time | Speedup | Time | Speedup | Time | Speedup | |
BLAS library | 0.06s | 11.83x | 0.25s | 3.48x | 0.32s | 5.47x |
(Speedup is relative to ikj with cse. The ThinkPad T61 used OpenBLAS; the default Ubuntu libblas library provided no speedup. The iMac used the default lapack
and blas
libraries.)
How does BLAS get such an impressive speedup over the fastest loop ordering? We can find out by looking at its source, or by stepping through its execution using gdb. It uses techniques including loop unrolling (§5.8 of CS:APP2e), loop blocking, and Single-Instruction, Multiple-Data (SIMD) instructions (Web Aside OPT:SIMD), such as MOVAPD, MULPD, and ADDPD.
More optimization
But if we ask GCC nicely to try extra extra hard, it too can unroll loops, tile loop code, and use SIMD instructions!
Our matrixmultiply-ikj-extraopt.c
code does this. I tried several combinations of options. A small, yet effective, set on my iMac is -fprofile-use -Ofast -march=corei7
. The -fprofile-use
option (which must be coupled with a separate run using -fprofile-generate
) turns on profile-guided optimization. The -Ofast
option turns on as much optimization as possible. The -march=corei7
option says use the Intel Core i7’s full instruction set, including SIMD instructions.
The result? Pretty good speedup for unmodified source code. matrixmultiply-ikj-extraopt
has the same code as matrixmultiply-ikj
, but takes 0.44s on the iMac. This is a 1.61x speedup over matrixmultiply-ikj
. Of course the BLAS library still does far better!
Floating point precision
But, possibly to your surprise, the BLAS library generates different results.
openat:l22 kohler$ ./matrixmultiply-ikj-extraopt c[0, 0] == 332.833 c[999, 999] == 9.98999e+08 openat:l22 kohler$ ./matrixmultiply-blas c[0, 0] == 332.834 c[999, 999] == 9.98999e+08
Note the last digit of c[0, 0]
. Computer floating point arithmetic is inexact, and code transformations that change the order in which floating point numbers are added or multiplied can change the results. This usually doesn’t matter, but it can matter a lot. For instance, given three double variables double a, b, c
, consider a + b + c
with a = exp2(100)
(= 2^{100}), b = -exp2(100)
, and c = 1
. Then (a + b) + c == 1
, but a + (b + c) == 0
! Here 1
is clearly the better, more precise answer, and really good numerical computation libraries will specialize their computation strategy according to their actual inputs so that they get a more precise result. We don’t know whether BLAS’s result is “better” or “worse” here, but it is definitely different.
We used hexadecimal floating point notation to check whether our programs generated exactly the same results at the extreme corners for our input matrices. The result: All 6 of the default orders generate the exact same numbers, with and without -cse
.
c[0, 0] == 332.833 (0x1.4cd5604189373p+8) c[999, 999] == 9.98999e+08 (0x1.dc5c292eab026p+29)
But ikj-extraopt
generates different results; c[0, 0]
is off by one bit.
c[0, 0] == 332.833 (0x1.4cd5604189374p+8) c[999, 999] == 9.98999e+08 (0x1.dc5c292eab026p+29)
And blas
diverges further.
c[0, 0] == 332.834 (0x1.4cd5604189375p+8) c[999, 999] == 9.98999e+08 (0x1.dc5c292eab021p+29)