
Storage 5: Putting the Wrap on Caching

Today's exercise will mostly be Q&A without a computer -- the goal of the exercise is to make sure you are comfortable with all the key concepts surrounding caching. Written questions should be answered in the survey.

The Memory Hierarchy

Recall that caches exist in both hardware and software and that the combination of hardware and software caches forms a hierarchy.

Question 1: As you move closer to the processor, do you expect caches to be faster or slower? Larger or smaller? Cheaper or more expensive?

Caches become smaller, faster, and more expensive as you move closer to the processor.

As a programmer (even if you are writing an operating system), you have little control over how hardware caches work. Nonetheless, writing cache-friendly code will let you use caches more efficiently, making your programs perform better.

For the following two questions, assume that you have a collection of structures, all of which fit in the L1 cache. You write two programs. Program 1 makes one malloc call, allocating all the structures in an array. Program 2 allocates an array of pointers and then allocates one structure for each element in the array. Both programs iterate over all the structures.

Question 2: Which program do you think will perform better? Why?

Program 1 should perform better. Program 2 incurs an extra memory reference for each item (it has to follow the pointer), and each malloc call allocates extra metadata. So even if malloc happens to place the structures contiguously (which may or may not be true), the combination of data and metadata consumes more memory and therefore incurs more cache misses.
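
Here's a minimal sketch of the two access patterns, condensed into one program (the item type and N are illustrative, not from the exercise):

#include <stdio.h>
#include <stdlib.h>

#define N 100000

typedef struct { long x, y; } item;

int main() {
   long sum = 0;

   // Program 1: one malloc; all items contiguous in memory
   item* a = malloc(N * sizeof(item));
   for (size_t i = 0; i != N; ++i) a[i].x = (long) i;
   for (size_t i = 0; i != N; ++i) sum += a[i].x;    // sequential scan

   // Program 2: array of pointers; one malloc per item
   item** p = malloc(N * sizeof(item*));
   for (size_t i = 0; i != N; ++i) {
      p[i] = malloc(sizeof(item));
      p[i]->x = (long) i;
   }
   for (size_t i = 0; i != N; ++i) sum += p[i]->x;   // extra dereference per item

   printf("%ld\n", sum);

   free(a);
   for (size_t i = 0; i != N; ++i) free(p[i]);
   free(p);
}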

Question 3: List three ways you might organize and/or process your data to make it more cache friendly.

  • Arrange your data structures sequentially and contiguously so that you pack data into cache lines.
  • Arrange structure fields to use as little padding as possible (see the sketch after this list).
  • Arrange data that you are likely to access together so that it is likely to be in the same cache line.
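
To illustrate the second point, here's a sketch of how field order affects padding; the byte counts assume a typical 64-bit ABI where double is 8 bytes and 8-byte aligned:

struct padded {        // 24 bytes on a typical 64-bit ABI
   char flag;          // 1 byte + 7 bytes padding (double needs 8-byte alignment)
   double value;       // 8 bytes
   int count;          // 4 bytes + 4 bytes trailing padding
};

struct compact {       // 16 bytes: same fields, ordered largest-first
   double value;       // 8 bytes
   int count;          // 4 bytes
   char flag;          // 1 byte + 3 bytes trailing padding
};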

Question 4: Which is more likely to be fully associative, a hardware cache or a software cache? Why?

Software caches are more likely to be fully associative, because making hardware more associative requires adding hardware to the system, which is expensive and can sometimes slow things down. In software, it's often worth a few extra instructions to improve the performance of a cache. Refer back to the slide on the memory hierarchy and look at the ratios between different layers in the hierarchy.

Evaluating Caches: Hits and Misses and Access Time, Oh My

At this point, everyone seems quite good at computing hit/miss rates for caches given different replacement policies. Let's review the different kinds of misses we can experience:

Compulsory: These are the misses that occur the first time you ever reference data.

Capacity: These are the misses due to the fact that your cache is not large enough to hold all your data (i.e., your eviction policy gets called into action).

Conflict: These are misses that occur because two or more data items you are accessing need to reside in the same place in your cache.

We use the term thrashing to describe what your program is doing when it suffers from an extraordinarily high number of conflict misses.

Question 5: Do you expect hardware or software caches to incur conflict misses?

Since hardware caches are more likely to be direct-mapped or set-associative and software caches are more likely to be fully associative, we expect hardware caches to incur conflict misses, while software caches frequently will not.

For each of the following questions, select all that apply:

  • A. Make your data smaller
  • B. Pad your data items to fill a cache line
  • C. Make your cache bigger
  • D. Rearrange your data in your address space
  • E. None of the above

Question 6: Which approach will reduce the number of compulsory misses you incur?

Since, by definition, compulsory misses are, well, compulsory/required, you don't have a lot of freedom to reduce the number. However, if you are able to represent your data more compactly (i.e., A), then ideally it will fit in fewer cache lines and therefore reduce the number of misses you take obtaining the data the first time.

Question 7: Which approach will reduce the number of capacity misses you incur?

If our data are smaller (A), they will consume less space, which should reduce capacity misses. Similarly, making the cache bigger (C) will also reduce the number of capacity misses.

Question 8: Which approach will reduce the number of conflict misses you incur?

Changing sizes doesn't necessarily improve conflict misses (although it might: A, C), but sometimes you can rearrange your data so that frequently accessed data doesn't conflict in the cache (D).

So much for misses and hit rates -- another way to evaluate the efficacy of a cache is by computing average access times. Recall that we compute the average access time by adding the total hit time plus the total miss time and dividing by the number of accesses. For example, if we have a hit rate of 75% and a hit takes 4 ns while a miss takes 125 ns, then our average access time is:

(75 * 4 + 25 * 125) / 100 =
(300 + 3125) / 100 =
3425 / 100 =
34.25 ns

These computations are useful both in comparing two caches and in illustrating just how costly misses can be. Let's put this to work!

Question 9: We are comparing two caches. Both caches process a hit in 1 time unit. The larger cache can achieve a 95% hit rate, but has a 100 time unit miss cost. The smaller cache has a 50 time unit miss cost. How high must its hit rate be to outperform (i.e., have a lower average access time than) the larger cache?

To determine the necessary hit rate for the smaller cache, we should express the average access times for both caches and then set those times equal -- whenever the smaller cache has a hit rate greater than the value that makes the expressions equal, the smaller cache will produce a lower average access time.

So, for the larger cache:

.95 * 1 + .05 * 100 

And for the smaller cache:

X * 1 + (1-X) * 50

So, setting them equal, we get:

.95 + 5 = X + 50 - 50X
5.95 = 50 - 49X
49X = 50 - 5.95 = 44.05
X = 44.05/49 ≈ 0.90

So, the smaller cache needs to achieve a hit rate of about 90% for its average access time to equal that of the larger cache; any hit rate above that outperforms the larger cache.
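
If you'd like to check these numbers, here's a minimal sketch in C (the amat helper is our own, not a library function):

#include <stdio.h>

// Average access time: hit_rate of the accesses cost hit_time,
// the remaining (1 - hit_rate) cost miss_time.
double amat(double hit_rate, double hit_time, double miss_time) {
   return hit_rate * hit_time + (1 - hit_rate) * miss_time;
}

int main() {
   printf("larger cache:         %.2f\n", amat(0.95, 1, 100));  // 5.95
   printf("smaller cache at 90%%: %.2f\n", amat(0.90, 1, 50));   // 5.90
   printf("smaller cache at 85%%: %.2f\n", amat(0.85, 1, 50));   // 8.35
}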

(In)consistency and (In)coherence

Recall that there are two ways that we might manage writes in caches:

write-through: writes are applied to data in the cache and to the backing storage before the write is considered complete.

write-back: writes are applied in the cache and propagated to the backing storage at some point in the future.

Practically all software caches are write-back. Most hardware caches are also write-back, although you are far more likely to run across a write-through hardware cache than a write-through software cache.
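
Here's a minimal sketch of the difference between the two policies; cache_slot and backing_write are hypothetical stand-ins, not part of any real cache:

#include <stdbool.h>
#include <string.h>

struct cache_slot {
   unsigned char data[4096];
   bool dirty;              // meaningful only for write-back
};

// Hypothetical stand-in for the slower storage layer.
void backing_write(const unsigned char* buf, size_t n) {
   (void) buf, (void) n;    // a real cache would write to disk/memory here
}

// Write-through: the write isn't complete until backing storage has it.
void write_through(struct cache_slot* s, const unsigned char* buf, size_t n) {
   memcpy(s->data, buf, n);
   backing_write(buf, n);
}

// Write-back: update the cache now; propagate on eviction or sync.
void write_back(struct cache_slot* s, const unsigned char* buf, size_t n) {
   memcpy(s->data, buf, n);
   s->dirty = true;         // remember to call backing_write later
}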

Question 10: If your cache is write-through, why bother caching at all?

Even if we are writing data through to the backing store, subsequent reads might hit in the cache. So if you have a workload with reads and writes, the cache provides a benefit.

In the presence of multiple caches, it is possible for the same data item to be cached in several of them. This introduces the possibility that the data may have different values in the different caches. Most hardware platforms today ensure cache coherence, which means that regardless of where data is cached, all reads will observe the same value of the item. For example, if two processors each have a cache and the memory location 0xABCDEF00 has its contents cached on both processors, when one processor writes a new value to that location, a read on the other processor will see that newly written value. The hardware ensures this behavior, even if the caches are write-back. Systems that make this guarantee are called cache coherent and systems that do not are called cache incoherent (not to mention, difficult to program).

When we have software caches, we frequently use the term cache consistency instead of cache coherence.

Consider two processes, both using standard IO. Both are accessing the same file, but one is reading and one is writing.

Question 11: Is the reading process guaranteed to see everything the writing process has written?

No! Both processes have their own per-process cache, so the writer may be updating its cache, while the reader is reading out of its cache, oblivious to what the writer is doing.

Question 12: Is it possible that the reading process will see what the writing process has written?

Although there is no guarantee, it's certainly possible that the writer will fill its standard IO buffer, causing a write to the operating system. At that point, if a reader needs to refill its cache, it will then request data from the OS and could see data that the writer has written.

For the next set of questions, we will consider different combinations of processes reading and writing using standard IO and direct file system calls. In each case, indicate whether the reader must see the data the writer wrote, might see the data, or will never see the data.

Question 13: Writer uses the write system call; reader uses standard IO.

This time, the writer is writing data to the operating system, but the reader may be reading from its own stdio buffer. So, the reader might see data from the writer, depending on when the reader loads its cache.

Question 14: Writer uses the write system call; reader uses the read system call.

If both processes are reading from/writing to the operating system, the reader must see the data the writer wrote (prior to the reader issuing the read call).

Question 15: Writer uses standard IO; reader uses the read system call.

Once again, the writer is writing into a per-process cache, so even if the reader reads from the OS, it will not see the written data until the writer flushes its cache. So the reader might see data written by the writer.
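
You can watch this happen with a small experiment. This sketch plays both roles in one process (the file name demo.txt is made up); before fflush, the data sits only in the stdio buffer, so a read system call misses it:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
   FILE* f = fopen("demo.txt", "w");
   fputs("hello", f);                    // lands in f's stdio buffer, not the file

   int fd = open("demo.txt", O_RDONLY);
   char buf[16];
   printf("before fflush: read %zd bytes\n", read(fd, buf, sizeof buf));  // 0

   fflush(f);                            // push the stdio buffer to the OS
   printf("after fflush:  read %zd bytes\n", read(fd, buf, sizeof buf));  // 5

   fclose(f);
   close(fd);
   unlink("demo.txt");
}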

A multi-slot cache

Recall the single-slot cache we worked on in class. Let's explore some ideas around turning it into a multi-slot cache.

The single-slot read cache had a file structure that looked like this:

struct io61_file {
   int fd;
   unsigned char cbuf[BUFSIZ];
   off_t tag;      // file offset of first character in cache
   off_t end_tag;  // file offset one past last valid char in cache
   off_t pos_tag;  // file offset of next char to read in cache
};

Let's say that we have a new macro, NSLOTS, and we modify the io61_file structure to support multiple slots like this:

struct io61_slot {
   unsigned char cbuf[BUFSIZ];
   off_t tag;      // file offset of first character in cache
   off_t end_tag;  // file offset one past last valid char in cache
   off_t pos_tag;  // file offset of next char to read in cache
};
struct io61_file {
   int fd;
   struct io61_slot slots[NSLOTS];
};

We're going to follow up on a hint at the end of the single-slot read cache exercise (2017/Storage3X) -- "Ensure that f->tag is always aligned on a multiple of BUFSIZ" -- extending this to the multi-slot case.

Let's pick NSLOTS to be a power of two and implement a direct-mapped cache. Our goal is to figure out how to map particular offsets to the proper slot. We'll start by converting offsets to tags for aligned buffers. Let's say that BUFSIZ is 4096.

Question 16: If buffers are aligned on multiples of 4096, what do you know about the low-order 12 bits of the tag for a buffer?

They should all be 0.

Question 17: Given an offset off, what C expression will convert from that offset to the tag of the appropriate buffer?

We need to construct a mask that zeroes out the low 12 bits. 4096 is 0x1000, so 4096 - 1 produces 0xFFF; if we flip those bits, we get 0xFFFFFFFFFFFFF000 (this trick works because 4096 is a power of two). So:

off & ~(4096 - 1)

or

off & ~(BUFSIZ - 1)

Great, we're almost there. Now let's see what happens if we have multiple slots -- how do we decide in which slot to place a buffer? Let's imagine that NSLOTS is 4 and that we are caching a file that is 16 KB long. We probably want offset 0 to be placed in slot 0; offset 4096 in slot 1; 8192 in slot 2; and 12288 in slot 3. Let's look at these numbers in hex:

    0 = 0x0000
 4096 = 0x1000
 8192 = 0x2000
12288 = 0x3000

That forms a nice pattern!

Question 18: Write an expression that converts from an offset to a slot number.

Let's work this out with real numbers and then come up with a general expression like we did above. The low 12 bits are going to be offsets into a buffer-sized chunk. The next two bits can then be used as an index into our array of slots. So, given an offset, off, we want to extract the 13th and 14th bits. How might we mask out everything except those two bits? How about:

(off & 0x3000)

That will clear all the bits except the two that represent the slot index. But the resulting values are way bigger than the numbers 0-3, so let's either divide by BUFSIZ:

(off & 0x3000) / BUFSIZ

or we could shift:

(off & 0x3000) >> 12
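
More generally, as long as BUFSIZ and NSLOTS are both powers of two (the exercise chose them that way), the same trick works for any slot count:

(off / BUFSIZ) & (NSLOTS - 1)

Dividing by BUFSIZ discards the within-buffer bits, and masking with NSLOTS - 1 keeps just enough of the remaining bits to index the slot array; with BUFSIZ = 4096 and NSLOTS = 4, this is exactly (off & 0x3000) >> 12.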

So, ideally the expressions in questions 17 and 18 get you to a particular buffer in the cache and produce a value to compare against the tag. Let's assume that we find the right buffer. The last thing we need to do is convert the offset we are given to an offset in the buffer.

Question 19: Write an expression that takes an offset in the file and converts it into an offset in a buffer.
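
The solution isn't spelled out here, but following the reasoning from Question 18, the low 12 bits of an offset are its position within a buffer-sized chunk, so we keep them rather than clear them:

off & (BUFSIZ - 1)

Equivalently, once you've found the right slot, off - slot->tag gives the same value, since the tag is BUFSIZ-aligned.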

If you've gotten this far, then you should now have all the tools to implement a multi-slot cache if you wanted to!
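
If you're curious, here's a minimal sketch of how the pieces might fit together, reusing the structures above; io61_fill is a hypothetical helper, and error handling is omitted:

// Hypothetical helper: read the BUFSIZ-aligned block at `tag` into slot `s`.
void io61_fill(struct io61_file* f, struct io61_slot* s, off_t tag);

// Find (and, on a miss, fill) the slot for file offset `off`.
struct io61_slot* io61_find_slot(struct io61_file* f, off_t off) {
   off_t tag = off & ~((off_t) BUFSIZ - 1);         // Question 17: aligned tag
   size_t index = (off / BUFSIZ) & (NSLOTS - 1);    // Question 18: slot number
   struct io61_slot* s = &f->slots[index];
   if (s->tag != tag || off >= s->end_tag) {
      io61_fill(f, s, tag);                         // miss: load the block
   }
   s->pos_tag = off;
   return s;
}

The byte for off then lives at s->cbuf[off & (BUFSIZ - 1)], per Question 19.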