Storage 5: Putting the Wrap on Caching

Today's exercise will mostly be Q&A without a computer -- the goal of the exercise is to make sure you are comfortable with all the key concepts surrounding caching. Written questions should be answered in the survey.

The Memory Hierarchy

Recall that caches exist in both hardware and software and that the combination of hardware and software caches form a hierarchy.

Question 1: As you move closer to the processor, do you expect caches to be faster or slower? Larger or smaller? Cheaper or more expensive?

As a programmer (even if you are writing an operating system), you have little control over how HW caches work. Nonetheless, writing cache friendly code will allow you to use caches more efficiently making programs perform better.

For the following two questions, assume that you have a collection of structures, all of which fit in the L1 cache. You write two programs. Program 1 makes one malloc call, allocating all the structures in an array. Program 2 allocates an array of pointers and then allocates one structure for each element in the array. Both programs iterate over all the structures.

Question 2: Which program do you think will perform better? Why?

Question 3: List three ways you might organize and/or process your data to make it more cache friendly.

Question 4: Which is more likely to be fully associative, a hardware cache or a software cache? Why?

Check your answers here

Evaluating Caches: Hits and Misses and Access Time, Oh My

At this point, everyone seems quite good at computing hit/miss rates for caches given different replacement policy. Let's review the different kinds of misses we can experience:

Compulsory: These are the misses that occur the first time you ever reference data.

Capacity: These are the misses due to the fact that your cache is not large enough to hold all your data (i.e., your eviction policy gets called into action).

Conflict: These are misses that occur because two or more data items you are accessing need to reside in the same place in your cache.

We use the term thrashing to describe what your program is doing when it suffers from an extraordinarily high number of conflict misses.

Question 5: Do you expect hardware or software caches to incur conflict misses?

For each of the following questions, select all that apply:

A Make your data smaller

B Pad your data items to fill a cache line

C Make your cache bigger

D Rearrange your data in your address space

E None of the above

Question 6: Which approach will reduce the number of compulsory misses you incur?

Question 7: Which approach will reduce the number of capacity misses you incur?

Question 8: Which approach will reduce the number of conflict misses you incur?

So much for misses and hit rates -- another way to evaluate the efficacy of a cache is by computing average access times. Recall that we compute the average access time by adding the total hit time plus the total miss time and dividing by the number of accesses. For example, if we have a hit rate of 75% and a hit takes 4ns while a miss takes 125 ns, then our average access time is:

(75 * 5 + 25 * 125) / 100 =
(375 + 3125)/100 = 
3100/100 = 
31

These computations are useful both in comparing two caches, but also in illustrating just how costly misses can be. Let's put this to work!

Question 9: We are comparing two caches. Both caches process a hit in 1 time unit. The larger cache can achieve a 95% hit rate, but has a 100 time unit miss cost. The smaller cache has a 50 time unit miss cost. How high must its hit rate be to outperform (i.e., have a lower average access time than) the larger cache?

(In)consistency and (In)coherence

Recall that there are two ways that we might manage writes in caches:

write-through: writes are applied to data in the cache and to the backing storage before the write is considered complete.

write-back: writes are applied in the cache and propagated to the backing storage at some point in the future.

Practically all software caches are write-back. Most hardware caches are also write-back, although it is far more likely to run across a write-through hardware cache than a write-through software cache.

Question 10. If your cache is write-through, why bother caching at all?

In the presence of multiple caches, it is possible for the same data item to be cached in these multiple caches. This introduces the possibility that the data may have different values in the different caches. Most hardware platforms today ensure cache coherence, which means that regardless of where data is cached, all reads will observe the same value of the item. For example, if two processors each have a cache and the memory location 0xABCDEF00 has its contents cached on both processors, when one processor writes a new value to that location, a read on the other processor will see that newly written value. The hardware ensures this behavior, even if the caches are writeback. Systems that make this guarantee are called cache coherent and systems that do not are called cache incoherent (not to mention, difficult to program).

When we have software caches, we frequently use the term cache consistency instead of cache coherence.

Consider two processes, both using standard IO. Both are accessing the same file, but one is reading and one is writing.

Question 11: Is the reading process guaranteed to see everything the writing process has written?

Question 12: Is it possible that the reading process will see what the writing process has written?

For the next set of questions, we will consider different combinations of processes reading and writing using standard IO and direct file system calls. In each case, indicate whether the reader must see the data the writer wrote, might see the data, or will never see the data.

Question 13: Writer uses the write system call; reader uses standard IO.

Question 14: Writer uses the write system call; reader uses the read system call.

Question 15: Writer uses standard IO; reader uses the read system call.

A multislot cache

Recall the single-slot cache we worked on in class. Let's explore some ideas around turning it into a multi slots cache.

The single-slot read cache had a file structure that looked like this:

struct io61_file {
   int fd;
   unsigned char cbuf[BUFSIZ];
   off_t tag;      // file offset of first character in cache (same as before)
`    off_t end_tag;  // file offset one past last valid char in cache; end_tag - tag == old `csz` `
`    off_t pos_tag;  // file offset of next char to read in cache; pos_tag - tag == old `cpos` `
};

Let's say that we have a new macro, NSLOTS, and we modify the io61_file structure to support multiple slots like this:

struct io61_slot {
   unsigned char cbuf[BUFSIZ];
   off_t tag;      // file offset of first character in cache
   off_t end_tag;  // file offset one past last valid char in cache
   off_t pos_tag;  // file offset of next char to read in cache
};

struct io61_file {
   int fd;
   struct io61_slot slots[NSLOTS];
};

We're going to follow up on a hint at the end of the single slot read cache exercise 2016/Storage3X: Hint: Ensure that f->tag is always aligned on a multiple of BUFSIZ, extending this to the multi-slot case.

Let's pick NSLOTS to be a power of two and implement a direct-mapped cache. Our goal is to figure out how to map particular offsets to the proper slot. We'll start by converting offsets to tags for aligned buffers. Let's say that BUFSIZ is 4096.

Question 16: If buffers are aligned on multiples of 4096, what do you know about the low-order 12 bits of the tag for a buffer?

Question 17: Given an offset off, what C expression will convert from that offset to the tag of the appropriate buffer?

Great, we're almost there. Now let's see what happens if we have multiple slots -- how do we decide which slot to place a buffer? Let's imagine that NSLOTS is 4 and that we are caching a file that is 16 KB long. We probably want offset 0 to be placed in slot 0; offset 4096 in slot 1; 8192 in slot 2, and 12288 in slot 3. Let's look at these numbers in hex:

 4096 = 0x1000
 8192 = 0x2000
12288 = 0x3000

That forms a nice pattern!

Question 18: Write an expression that converts from an offset to a slot number.

So, ideally the expressions in questions 17 and 18 get you to a particular buffer in the cache and produce a value to compare against the tag. Let's assume that we find the right buffer. The last thing we need to do is convert the offset we are given to an offset in the buffer.

Question 19: Write an expression that takes an offset in the file and converts it into an offset in a buffer.

If you've gotten this far, then you should now have all the tools to implement a multi-slot cache if you wanted to!