Overview
In this lecture, we describe access costs and cache optimization strategies, then discuss cache coherence.
Review: Access costs
- C_N: Cost of accessing N units of data at once (N \geq 1)
- C_N = \infty if it is impossible to access N units at once
- C_N = R + NU often roughly holds, where
- R is the cost per request, independent of the number of units accessed
- U is the cost per unit
Financial examples
- ATM withdrawal out-of-network
- $1 fee per transaction
- Withdrawal limit of $400/day
- Let C_N represent the fee of withdrawing \$N
- C_N = ?
- C_N = \begin{cases} 1 & \text{if } N \leq 400 \\ \infty & \text{otherwise}\end{cases}
- R = 1, U = 0
- Credit card fees
- $0 fee per transaction
- 17%+ annual percentage rate (APR)
- Let C_N represent the 1-year cost of carrying a balance of \$N
- C_N = ?
- C_N = 0.17N
- R = 0, U = 0.17
Mail examples
- USPS first-class mail
- $0.60 for envelopes less than 3.5 oz
- $4.80 + $0.30/oz for packages
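Following the pattern of the earlier examples, the package rate fits the linear model directly:
- For a package of N ounces, C_N = 4.80 + 0.30N, so R = 4.80 and U = 0.30
- For envelopes, C_N = 0.60 up to the weight limit (R = 0.60, U = 0), and \infty beyond it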
Latency and throughput
- In computer storage, we often care about how long it takes to move data
- Latency: another name for delay
- Seconds to complete a request
- Throughput: how many units of data can be moved per second
Relationship of latency and throughput
- If one unit of data is moved per request, and requests cannot execute in parallel, then throughput = 1/latency
- If per-request delay is much smaller than per-unit delay (R \ll U), then C_N \approx NU
- Throughput with N-unit requests = N/C_N \approx 1/U
- Latency = C_1 \approx U
- Example: main memory access
- If per-unit delay is much smaller than per-request delay (R \gg U, so that NU \ll R for moderate N), then C_N \approx R
- Throughput with N-unit requests = N/C_N \approx N/R
- Latency = C_1 \approx R
- Examples: system calls, reading/writing a drive
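As a numeric illustration (made-up values), suppose a device costs R = 10 ms per request and U = 1 µs per byte:
- Latency = C_1 = R + U \approx 10 ms
- Throughput with single-byte requests \approx 1/C_1 \approx 100 bytes/sec
- With 4096-byte requests, C_N \approx 10 ms + 4.1 ms \approx 14.1 ms, so throughput \approx 4096/0.0141 \approx 290 KB/sec
- Larger requests made throughput roughly 2900\times better without changing latency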
Reducing access costs
- Reduce number of requests
- Avoid redundant requests
- Do the same work in fewer requests
Cache optimizations
- Batching (read or write)
- Prefetching (read)
- Write coalescing (write)
- Parallel access (read or write)
Batching
- Combine multiple requests into one request
- Example: Read 4096 bytes in 1 request instead of 4096 single-byte requests
- Pays the per-request cost R once rather than once per unit
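A minimal sketch of the difference, assuming a POSIX system; the file name data.bin and the helper time_read are hypothetical:

#include <unistd.h>
#include <fcntl.h>
#include <algorithm>
#include <chrono>
#include <cstdio>

// Time how long it takes to read 4096 bytes using `unit` bytes per request.
static double time_read(const char* path, size_t unit) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    char buf[4096];
    auto start = std::chrono::steady_clock::now();
    size_t total = 0;
    while (total < sizeof(buf)) {
        ssize_t r = read(fd, buf + total, std::min(unit, sizeof(buf) - total));
        if (r <= 0) break;
        total += r;
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    close(fd);
    return std::chrono::duration<double>(elapsed).count();
}

int main() {
    // 4096 one-byte requests: pays the per-request cost R 4096 times
    printf("byte at a time: %g sec\n", time_read("data.bin", 1));
    // One 4096-byte request: pays R once
    printf("one block:      %g sec\n", time_read("data.bin", 4096));
}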
Prefetching
- Fetch data before it is needed
- Example: Assume sequential access and read more data than user requested
- Ideally, cache reads underlying storage before user reads from cache
- Reduces number of requests
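A sketch of the idea under a sequential-access assumption (the struct and its names are hypothetical, loosely modeled on the stdio cache): when the caller asks for one byte and the slot is empty, fetch a whole block.

#include <unistd.h>
#include <cstddef>

struct read_cache {
    int fd;             // underlying file, opened by the caller
    char buf[4096];     // the single cache slot
    size_t size = 0;    // number of valid bytes in buf
    size_t pos = 0;     // next unread byte within buf

    // Return the next byte of the file, or -1 at end of file.
    int read_byte() {
        if (pos == size) {
            // Miss: prefetch a whole block, even though the caller
            // asked for only one byte (assumes sequential access)
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n <= 0) return -1;
            size = n;
            pos = 0;
        }
        return (unsigned char) buf[pos++];   // hit: no system call
    }
};

With this structure, 4096 sequential read_byte calls cost one system call, not 4096.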
Write coalescing
- Do not write data that will be overwritten later
- Example: Assume a cache line is updated multiple times
- Think of a local variable updated in a loop, or multiple local variables in the same cache line updated close together in time
- Ideally, cache writes underlying storage after user writes to cache
- Reduces number of requests
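A sketch of the idea (hypothetical struct; error handling elided): repeated updates land in the cached slot, and the underlying storage sees only the final values, in one request, at flush time.

#include <unistd.h>
#include <cstddef>

struct write_cache {
    int fd;                  // underlying file, opened by the caller
    char buf[4096] = {};     // the single cache slot
    bool dirty = false;

    void update(size_t off, char c) {
        buf[off] = c;        // overwrites any earlier update to `off`
        dirty = true;        // no system call here
    }

    void flush() {
        if (dirty) {
            // One request covers every update since the last flush;
            // intermediate values are never written at all
            ssize_t n = write(fd, buf, sizeof(buf));
            (void) n;        // error handling elided in this sketch
            dirty = false;
        }
    }
};

A loop that updates the same byte 1000 times causes one write() at flush time, not 1000.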
Parallel access
- Perform accesses in parallel with other work
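A sketch using a background thread (assuming C++11 <future>; the file name is hypothetical): start a read, do other work while it is in flight, then collect the result.

#include <unistd.h>
#include <fcntl.h>
#include <future>

int main() {
    int fd = open("data.bin", O_RDONLY);   // hypothetical input file
    char buf[4096];

    // Issue the access in the background...
    auto pending = std::async(std::launch::async, [&] {
        return read(fd, buf, sizeof(buf));
    });

    // ...overlap it with other work here...

    ssize_t n = pending.get();   // ...then wait for the access to finish
    (void) n;
    close(fd);
}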
Question
- How might these optimization strategies hurt performance?
- How might these optimizations change semantics (produce different observable results)?
Cache correctness
- Caches aim for transparency
- Result of accesses should be independent of cache
- Cached reads return same data as direct reads
- Cached writes perform same ultimate modifications as direct writes
- A fully transparent cache is called coherent
- Not every cache is coherent!
- Processor cache is coherent
- Buffer cache is coherent
- Standard I/O cache…?
coherence.cc
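A sketch of the kind of experiment such a file might contain (this is not the course's coherence.cc, and the file name is made up): modify a file behind the stdio cache's back and observe a stale read.

#include <cstdio>
#include <unistd.h>
#include <fcntl.h>

int main() {
    // Set up a file containing "AAAA"
    int fd = open("coherence-test.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);
    write(fd, "AAAA", 4);

    FILE* f = fopen("coherence-test.txt", "r");
    int first = fgetc(f);          // stdio prefetches a whole block

    // Modify the file directly, bypassing the stdio cache
    pwrite(fd, "BBBB", 4, 0);

    int second = fgetc(f);         // served from the stale stdio cache
    printf("first %c, second %c\n", first, second);   // likely "A, A"
    fclose(f);
    close(fd);
}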
Exploring performance of file access
- w-osbyte vs. w-osblock
- w-stdiobyte vs. w-stdioblock
Exploring performance of access patterns
- r-osbyte vs. r-osbyterev
- r-stdiobyte vs. r-stdiobyterev
Multi-slot caches
- The stdio cache is a single-slot cache
- Still quite effective for many workloads!
- Many real caches have more than one slot
- Processor cache, buffer cache, …
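For contrast, a sketch of a direct-mapped multi-slot read cache (hypothetical structure and sizes; real processor and buffer caches are considerably richer): the accessed offset selects which slot may hold it.

#include <unistd.h>
#include <cstddef>

struct cache_slot {
    off_t tag = -1;      // aligned file offset cached in this slot (-1 = empty)
    char buf[4096];
};

struct multi_slot_cache {
    int fd;                   // underlying file, opened by the caller
    cache_slot slots[64];     // 64 slots instead of stdio's one

    // Return the byte at file offset `off`, or -1 on error/EOF.
    // Partial reads near end of file are not handled in this sketch.
    int read_byte(off_t off) {
        off_t tag = off - off % 4096;              // align to the slot size
        cache_slot& s = slots[(tag / 4096) % 64];  // direct-mapped: offset picks the slot
        if (s.tag != tag) {                        // miss: fill this slot
            if (pread(fd, s.buf, sizeof(s.buf), tag) <= 0)
                return -1;
            s.tag = tag;
        }
        return (unsigned char) s.buf[off - tag];   // hit: no system call
    }
};

Multiple slots let the cache hold several distinct blocks at once, so alternating accesses to different regions do not evict each other the way they would in a single-slot cache.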