Overview
In this lecture, we describe access costs and cache optimization strategies, then discuss cache coherence.
Review: Access costs
- C_N: Cost of accessing N units of data at once (N \geq 1)
- C_N = \infty if it is impossible to access N units at once
- C_N = R + NU often roughly holds, where
- R is the cost per request, independent of the number of units accessed
- U is the cost per unit
Financial examples
- ATM withdrawal out-of-network
- $1 fee per transaction
- Withdrawal limit of $400/day
- Let C_N represent the fee of withdrawing \$N
- C_N = ?
- C_N = \begin{cases} 1 & \text{if } N \leq 400 \\ \infty & \text{otherwise}\end{cases}
- R = 1, U = 0
- Credit card fees
- $0 fee per transaction
- 17%+ annual percentage rate (APR)
- Let C_N represent the 1-year cost of carrying a balance of \$N
- C_N = ?
- C_N = 0.17N
- R = 0, U = 0.17
Mail examples
- USPS first-class mail
- $0.60 for envelopes less than 3.5 oz
- $4.80 + $0.30/oz for packages
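Following the pattern of the earlier examples, the package rate fits the linear model directly:
- For a package of N ounces, C_N = 4.80 + 0.30N, so R = 4.80 and U = 0.30
- For envelopes, C_N = 0.60 up to the weight limit (R = 0.60, U = 0), and \infty beyond it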
Latency and throughput
- In computer storage, we often care about how long it takes to move data
- Latency: another name for delay
- Seconds to complete a request
- Throughput: how many units of data can be moved per second
Relationship of latency and throughput
- If one unit of data is moved per request, and requests cannot execute in parallel, then throughput = 1/latency
- If per-request delay is much smaller than per-unit delay (R \ll U), then C_N \approx NU
- Throughput with N-unit requests = N/C_N \approx 1/U
- Latency = C_1 \approx U
- Example: main memory access
- If per-unit delay is much smaller than per-request delay (R \gg U, so that NU \ll R for moderate N), then C_N \approx R
- Throughput with N-unit requests = N/C_N \approx N/R
- Latency = C_1 \approx R
- Examples: system calls, reading/writing a drive
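As a numeric illustration (made-up values), suppose a device costs R = 10 ms per request and U = 1 µs per byte:
- Latency = C_1 = R + U \approx 10 ms
- Throughput with single-byte requests \approx 1/C_1 \approx 100 bytes/sec
- With 4096-byte requests, C_N \approx 10 ms + 4.1 ms \approx 14.1 ms, so throughput \approx 4096/0.0141 \approx 290 KB/sec
- Larger requests made throughput roughly 2900\times better without changing latency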
Reducing access costs
- Reduce number of requests
- Avoid redundant requests
- Do the same work in fewer requests
Cache optimizations
- Batching (read or write)
- Prefetching (read)
- Write coalescing (write)
- Parallel access (read or write)
Batching
- Combine multiple requests into one request
- Example: Read 4096 bytes in 1 request instead of 4096 single-byte requests
- Pays the per-request cost R once rather than once per unit
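A minimal sketch of the difference, assuming a POSIX system; the file name data.bin and the helper time_read are hypothetical:

#include <unistd.h>
#include <fcntl.h>
#include <algorithm>
#include <chrono>
#include <cstdio>

// Time how long it takes to read 4096 bytes using `unit` bytes per request.
static double time_read(const char* path, size_t unit) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    char buf[4096];
    auto start = std::chrono::steady_clock::now();
    size_t total = 0;
    while (total < sizeof(buf)) {
        ssize_t r = read(fd, buf + total, std::min(unit, sizeof(buf) - total));
        if (r <= 0) break;
        total += r;
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    close(fd);
    return std::chrono::duration<double>(elapsed).count();
}

int main() {
    // 4096 one-byte requests: pays the per-request cost R 4096 times
    printf("byte at a time: %g sec\n", time_read("data.bin", 1));
    // One 4096-byte request: pays R once
    printf("one block:      %g sec\n", time_read("data.bin", 4096));
}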
Prefetching
- Fetch data before it is needed
- Example: Assume sequential access and read more data than user requested
- Ideally, cache reads underlying storage before user reads from cache
- Reduces number of requests
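A sketch of the idea under a sequential-access assumption (the struct and its names are hypothetical, loosely modeled on the stdio cache): when the caller asks for one byte and the slot is empty, fetch a whole block.

#include <unistd.h>
#include <cstddef>

struct read_cache {
    int fd;             // underlying file, opened by the caller
    char buf[4096];     // the single cache slot
    size_t size = 0;    // number of valid bytes in buf
    size_t pos = 0;     // next unread byte within buf

    // Return the next byte of the file, or -1 at end of file.
    int read_byte() {
        if (pos == size) {
            // Miss: prefetch a whole block, even though the caller
            // asked for only one byte (assumes sequential access)
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n <= 0) return -1;
            size = n;
            pos = 0;
        }
        return (unsigned char) buf[pos++];   // hit: no system call
    }
};

With this structure, 4096 sequential read_byte calls cost one system call, not 4096.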
Write coalescing
- Do not write data that will be overwritten later
- Example: Assume a cache line is updated multiple times
- Think of a local variable updated in a loop, or multiple local variables in the same cache line updated close together in time
- Ideally, cache writes underlying storage after user writes to cache
- Reduces number of requests
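A sketch of the idea (hypothetical struct; error handling elided): repeated updates land in the cached slot, and the underlying storage sees only the final values, in one request, at flush time.

#include <unistd.h>
#include <cstddef>

struct write_cache {
    int fd;                  // underlying file, opened by the caller
    char buf[4096] = {};     // the single cache slot
    bool dirty = false;

    void update(size_t off, char c) {
        buf[off] = c;        // overwrites any earlier update to `off`
        dirty = true;        // no system call here
    }

    void flush() {
        if (dirty) {
            // One request covers every update since the last flush;
            // intermediate values are never written at all
            ssize_t n = write(fd, buf, sizeof(buf));
            (void) n;        // error handling elided in this sketch
            dirty = false;
        }
    }
};

A loop that updates the same byte 1000 times causes one write() at flush time, not 1000.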
Parallel access
- Perform accesses in parallel with other work
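A sketch using a background thread (assuming C++11 <future>; the file name is hypothetical): start a read, do other work while it is in flight, then collect the result.

#include <unistd.h>
#include <fcntl.h>
#include <future>

int main() {
    int fd = open("data.bin", O_RDONLY);   // hypothetical input file
    char buf[4096];

    // Issue the access in the background...
    auto pending = std::async(std::launch::async, [&] {
        return read(fd, buf, sizeof(buf));
    });

    // ...overlap it with other work here...

    ssize_t n = pending.get();   // ...then wait for the access to finish
    (void) n;
    close(fd);
}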
Question
- How might these optimization strategies hurt performance?
- How might these optimizations change semantics (produce different observable results)?
Cache correctness
- Caches aim for transparency
- Result of accesses should be independent of cache
- Cached reads return same data as direct reads
- Cached writes perform same ultimate modifications as direct writes
- A fully transparent cache is called coherent
- Not every cache is coherent!
- Processor cache is coherent
- Buffer cache is coherent
- Standard I/O cache…?
coherence.cc
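A sketch of the kind of experiment such a file might contain (this is not the course's coherence.cc, and the file name is made up): modify a file behind the stdio cache's back and observe a stale read.

#include <cstdio>
#include <unistd.h>
#include <fcntl.h>

int main() {
    // Set up a file containing "AAAA"
    int fd = open("coherence-test.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);
    write(fd, "AAAA", 4);

    FILE* f = fopen("coherence-test.txt", "r");
    int first = fgetc(f);          // stdio prefetches a whole block

    // Modify the file directly, bypassing the stdio cache
    pwrite(fd, "BBBB", 4, 0);

    int second = fgetc(f);         // served from the stale stdio cache
    printf("first %c, second %c\n", first, second);   // likely "A, A"
    fclose(f);
    close(fd);
}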
Exploring performance of file access
- w-osbyte vs. w-osblock
- w-stdiobyte vs. w-stdioblock
Exploring performance of access patterns
- r-osbyte vs. r-osbyterev
- r-stdiobyte vs. r-stdiobyterev
Multi-slot caches
- The stdio cache is a single-slot cache
- Still quite effective for many workloads!
- Many real caches have more than one slot
- Processor cache, buffer cache, …
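For contrast, a sketch of a direct-mapped multi-slot read cache (hypothetical structure and sizes; real processor and buffer caches are considerably richer): the accessed offset selects which slot may hold it.

#include <unistd.h>
#include <cstddef>

struct cache_slot {
    off_t tag = -1;      // aligned file offset cached in this slot (-1 = empty)
    char buf[4096];
};

struct multi_slot_cache {
    int fd;                   // underlying file, opened by the caller
    cache_slot slots[64];     // 64 slots instead of stdio's one

    // Return the byte at file offset `off`, or -1 on error/EOF.
    // Partial reads near end of file are not handled in this sketch.
    int read_byte(off_t off) {
        off_t tag = off - off % 4096;              // align to the slot size
        cache_slot& s = slots[(tag / 4096) % 64];  // direct-mapped: offset picks the slot
        if (s.tag != tag) {                        // miss: fill this slot
            if (pread(fd, s.buf, sizeof(s.buf), tag) <= 0)
                return -1;
            s.tag = tag;
        }
        return (unsigned char) s.buf[off - tag];   // hit: no system call
    }
};

Multiple slots let the cache hold several distinct blocks at once, so alternating accesses to different regions do not evict each other the way they would in a single-slot cache.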