
# Storage 3: Cache optimizations and coherence

## Overview

In this lecture, we describe access costs and cache optimization strategies, then discuss cache coherence.

## Review: Access costs

• C_N: Cost of accessing N units of data at once (N \geq 1)
  • C_N = \infty if it is impossible to access N units at once
• C_N = R + NU often roughly holds, where
  • R is the cost per request, independent of the number of units accessed
  • U is the cost per unit
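The linear model can be sketched in a few lines of Python; the function names and the parameter values in the comments are illustrative, not from the lecture:

```python
import math

def access_cost(n, r, u):
    """C_N = R + N*U: per-request cost r plus n times the per-unit cost u."""
    return r + n * u

def total_cost(total_units, batch, r, u):
    """Total cost of moving total_units using requests of `batch` units each."""
    requests = math.ceil(total_units / batch)
    return requests * access_cost(batch, r, u)

# Larger requests amortize the per-request cost R over more units:
# total_cost(4096, 1, r=100, u=1)    -> 4096 * (100 + 1) = 413696
# total_cost(4096, 4096, r=100, u=1) -> 1 * (100 + 4096) = 4196
```

With these made-up numbers (R = 100, U = 1), one 4096-unit request is roughly 100× cheaper than 4096 single-unit requests.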

## Financial examples

• ATM withdrawal out-of-network
  • $1 fee per transaction
  • Withdrawal limit of $400/day
  • Let C_N represent the fee of withdrawing $N
  • C_N = ?
  • C_N = \begin{cases} 1 & \text{if } N \leq 400 \\ \infty & \text{otherwise} \end{cases}
  • R = 1, U = 0
• Credit card fees
  • $0 fee per transaction
  • 17%+ APR per month
  • Let C_N represent the 1-month cost of a balance of $N
  • C_N = ?
  • C_N = 0.17N
  • R = 0, U = 0.17

## Mail examples

• USPS first-class mail
  • $0.60 for envelopes less than 3.5oz
  • $4.80 + $0.30/oz for packages

## Latency and throughput

• In computer storage, we often care about the delay involved in moving data
• Latency: another name for delay
  • Seconds to complete a request
• Throughput: how many units of data can be moved per second

## Relationship of latency and throughput

• If one unit of data is moved per request, and requests cannot execute in parallel, then throughput = 1/latency
• If per-request delay is much smaller than per-unit delay (R \ll U), then C_N \approx NU
  • Throughput with N-unit requests = N/C_N \approx 1/U
  • Latency = C_1 \approx U
  • Example: main memory access
• If per-unit delay is much smaller than per-request delay (R \gg U), then C_N \approx R
  • Throughput with N-unit requests = N/C_N \approx N/R
  • Latency = C_1 \approx R
  • Examples: system calls, reading/writing a drive
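The two regimes can be checked numerically. This is a sketch with made-up R and U values, not measurements:

```python
def cost(n, r, u):
    """C_N = R + N*U."""
    return r + n * u

def throughput(n, r, u):
    """Units moved per unit time when data is requested n units at a time."""
    return n / cost(n, r, u)

def latency(r, u):
    """Time for a single-unit request: C_1."""
    return cost(1, r, u)

# R << U (e.g. main memory): throughput is capped near 1/U no matter the
# request size, so bigger requests barely help throughput.
#   throughput(64, r=1, u=100)  ->  ~1/100
# R >> U (e.g. a system call): throughput grows with request size (~N/R),
# so bigger requests help a lot.
#   throughput(64, r=1000, u=1) ->  ~64/1000
```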

## Reducing access costs

• Reduce number of requests
  • Avoid redundant requests
  • Do the same work in fewer requests

## Cache optimizations

• Batching (read or write)
• Prefetching (read)
• Write coalescing (write)
• Parallel access (read or write)

## Batching

• Combine multiple requests into one request
• Reduces the number of requests, and therefore the total per-request cost paid
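As an illustration, here is a hypothetical `write_all` helper (not one of the lecture's programs) that moves the same data with a controllable request size; batching means choosing a large `batch`:

```python
import os, tempfile

def write_all(path, data, batch):
    """Write `data` to `path` using os.write requests of `batch` bytes each.
    Returns the number of write requests issued (each one a system call)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    requests = 0
    try:
        for i in range(0, len(data), batch):
            os.write(fd, data[i:i + batch])
            requests += 1
    finally:
        os.close(fd)
    return requests

# write_all(path, b"x" * 512, 1)   issues 512 requests
# write_all(path, b"x" * 512, 512) issues 1 request for the same data
```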

## Prefetching

• Fetch data before it is needed
• Example: Assume access is sequential and read more data than the user requested
• Reduces number of requests
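A minimal sketch of the idea, assuming sequential access: a hypothetical single-slot `PrefetchingReader` that fetches a whole block on each miss, so N single-byte reads cost about N/block_size requests instead of N:

```python
import os, tempfile

class PrefetchingReader:
    """Hypothetical single-slot read cache: on a miss, fetch a whole block
    from the file even though the caller asked for only one byte."""

    def __init__(self, path, block_size=4096):
        self.fd = os.open(path, os.O_RDONLY)
        self.block_size = block_size
        self.buf = b""        # the single cache slot
        self.buf_start = 0    # file offset of buf[0]
        self.pos = 0          # caller's file position
        self.requests = 0     # number of os.read requests issued

    def read_byte(self):
        off = self.pos - self.buf_start
        if not (0 <= off < len(self.buf)):          # miss: prefetch a block
            os.lseek(self.fd, self.pos, os.SEEK_SET)
            self.buf = os.read(self.fd, self.block_size)
            self.buf_start = self.pos
            self.requests += 1
            off = 0
            if not self.buf:
                return b""                          # end of file
        self.pos += 1
        return self.buf[off:off + 1]
```

Reading a 100-byte file one byte at a time with `block_size=10` issues only 10 requests.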

## Write coalescing

• Do not write data that will be overwritten later
• Example: Assume a cache line is updated multiple times
  • Think of a local variable updated in a loop, or multiple local variables in the same cache line updated in close proximity
• Ideally, the cache writes to underlying storage only after the user has finished writing to the cache
• Reduces number of requests
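A sketch of the same idea for one block of a file, using a hypothetical `CoalescingWriter`: updates land in an in-memory buffer, later updates simply overwrite earlier ones there, and underlying storage is written once on flush:

```python
import os, tempfile

class CoalescingWriter:
    """Hypothetical write-back cache for one block of a file: write_at()
    modifies only the in-memory buffer; the file is written on flush()."""

    def __init__(self, path, block_size=4096):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        self.buf = bytearray(block_size)
        self.dirty = False
        self.requests = 0   # os.write requests actually issued

    def write_at(self, offset, data):
        # No os.write here: a later write may overwrite this data anyway.
        self.buf[offset:offset + len(data)] = data
        self.dirty = True

    def flush(self):
        if self.dirty:
            os.lseek(self.fd, 0, os.SEEK_SET)
            os.write(self.fd, bytes(self.buf))  # one request covers all updates
            self.requests += 1
            self.dirty = False
```

One hundred updates to the same byte cost a single write request; only the final value reaches storage.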

## Parallel access

• Perform accesses in parallel with other work

## Question

• How might these optimization strategies hurt performance?
• How might these optimizations change semantics (produce different observable results)?

## Cache correctness

• Caches aim for transparency
  • Result of accesses should be independent of cache
  • Cached writes perform same ultimate modifications as direct writes
• A fully transparent cache is called coherent
• Not every cache is coherent!
  • Processor cache is coherent
  • Buffer cache is coherent
  • Standard I/O cache…?
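One way to explore the question: Python's buffered file objects behave much like the stdio cache, so a small experiment can show a write that is temporarily invisible to a direct read of the same file:

```python
import os, tempfile

# A buffered write goes into a user-level cache, not straight to the file,
# so until the cache is flushed, other readers do not see the data.
path = tempfile.mkstemp()[1]
f = open(path, "wb")                     # buffered, like stdio's fopen
f.write(b"hello")                        # lands in the user-level cache only
before_flush = open(path, "rb").read()   # a direct read sees nothing yet
f.flush()                                # write the cached data back
after_flush = open(path, "rb").read()    # now the data is visible
f.close()
os.remove(path)
```

`before_flush` is empty while `after_flush` contains the data, so for that window the cache was not transparent.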

## Exploring performance of file access

• w-osbyte vs. w-osblock
• w-stdiobyte vs. w-stdioblock

## Exploring performance of access patterns

• r-osbyte vs. r-osbyterev
• r-stdiobyte vs. r-stdiobyterev

## Multi-slot caches

• The stdio cache is a single-slot cache
  • Still quite effective for many workloads!
• Many real caches have more than one slot
  • Processor cache, buffer cache, …
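For contrast with the single-slot stdio cache, here is a sketch of a direct-mapped multi-slot cache; the class name and the fetch-callback interface are illustrative, not from the lecture. Each block number maps to slot `block % nslots`, and a conflicting block evicts whatever occupied its slot:

```python
class DirectMappedCache:
    """Hypothetical direct-mapped cache with nslots slots.
    fetch(block) is called on a miss to load that block's data."""

    def __init__(self, nslots, fetch):
        self.slots = [None] * nslots   # each slot holds (block_number, data)
        self.fetch = fetch
        self.misses = 0

    def read(self, block):
        i = block % len(self.slots)
        slot = self.slots[i]
        if slot is None or slot[0] != block:   # miss: evict the old occupant
            self.misses += 1
            self.slots[i] = (block, self.fetch(block))
        return self.slots[i][1]
```

With multiple slots, alternating between a few blocks can still hit, where a single-slot cache would miss on every switch; blocks that share a slot still conflict.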