Overview
In this lecture, we work through some examples of processor cache performance and transition to the process control unit.
Full lecture notes on storage — Textbook readings
Processor cache
- Processor caches divide primary memory into blocks of ~64 aligned bytes
- 128 bytes on some machines
- Processor implements prefetching strategies
- Sequential access often beneficial
- User can help:
prefetch
instruction
accessor.cc
- Initialize array of N integers
- Sum up N elements of the array in one of three orders
-u
(up order): 0…(N-1)-d
(down order): (N-1)…0-r
(random order): N random choices from [0, N-1)
Question
- What is the relative speed of these orders, and why?
list-inserter.cc
- Initialize sorted linked list of N integers
- Insert N elements in one of three orders
-u
(up order): 0…N-1-d
(down order): N-1…0-r
(random order): N random choices from [0, N-1)
Question
- What is the relative speed of these orders, and why?
Machine learning
Machine learning Matrix multiplication
- Matrix is 2D array of numbers
- matrix has
rows and columns
- is the value in row and column
- Product of two matrices
- If is , then must have dimensions (same number of rows as has columns), and has dimensions
Pictorial matrix multiplication
Matrix storage
- How to store a 2D matrix in “1D” memory?
- Row-major order
- Store row values contiguously in memory
- Single-array representation
M
xN
matrix stored in arraym[M*N]
- Matrix element stored in array element
m[i*N + j]
- 4x4 matrix: :=
a[0]
; :=a[1]
; :=a[2]
; …; :=a[14]
; :=a[15]
Implementing matrix multiplication
// clear `c`
for (size_t i = 0; i != c.size(); ++i) {
for (size_t j = 0; j != c.size(); ++j) {
c.at(i, j) = 0;
}
}
// compute product and update `c`
for (size_t i = 0; i != c.size(); ++i) {
for (size_t j = 0; j != c.size(); ++j) {
for (size_t k = 0; k != c.size(); ++k) {
c.at(i, j) += a.at(i, k) * b.at(k, j);
}
}
}
Question
- Can you speed up
matrixmultiply.cc
by 10x?