Ancient CS 61 Content Warning!!!!!1!!!
This is not the current version of the class.
This site was automatically translated from a wiki. The translation may have introduced mistakes (and the content might have been wrong to begin with).

Final Review Questions

Solutions available here (Group 3 solutions on Piazza)

Group 1

A. You are programming a machine with a 16-word cache where the block size (the unit of data transfer between cache and primary memory—i.e., the cache line size) is just 1 word (4 bytes), rather than the typical 64 bytes. Your program accesses the following reference string (address sequence), starting from an empty cache:

 40 44 48 4C 70 74 78 7C 40 44 80 84 88 8C 90 94 98 9C 00 04 08 0C 40 44 48 10 14 18 1C 20

What is the miss rate on this sequence if the cache is a fully associative cache with an LRU eviction policy? How many of the misses are cold misses? Capacity misses? Conflict misses?

B. What is the miss rate on this sequence if the cache is a fully associative cache with a FIFO replacement policy? How many of the misses are cold misses? Capacity misses? Conflict misses?

C. What parameter of this cache can we change to decrease the miss rate, besides increasing the size of the cache?

D. Does each of the following function as a cache in a typical operating system? If yes, indicate what it is a cache for.

  1. Primary memory (RAM)
  2. Buffer cache
  3. Disk
  4. General-purpose registers

E. Typical set-associative caches use the middle bits of the address for the set number. What kind of workload would effectively make use of a set-associative cache that instead used the least-significant (right-most) bits for the set number?


F. True or false?

  1. Batching can improve file system read performance.
  2. Batching can improve file system write performance.
  3. Prefetching can improve file system read performance.
  4. Prefetching can improve file system write performance.

Group 2

You are creating a multi-threaded bank application that supports the following operation:

 /** Transfers `amount` from the account with number `acct1`
     to the account with number `acct2`.
     Precondition: `acct1 < NACCOUNTS && acct2 < NACCOUNTS` */
 void transfer(size_t acct1, size_t acct2, int amount);

Below is your struct definition of an account and declaration of mutexes.

 #define NACCOUNTS 100
 pthread_mutex_t big_lock;
 
 typedef struct {
   int amount;
   pthread_mutex_t lock;
 } account;
 
 account accounts[NACCOUNTS];

You implement transfer in several ways:

 void transfer1(size_t acct1, size_t acct2, int amount) {
   accounts[acct1].amount -= amount;
   accounts[acct2].amount += amount;
 }
 
 void transfer2(size_t acct1, size_t acct2, int amount) {
   pthread_mutex_lock(&big_lock);
   accounts[acct1].amount -= amount;
   accounts[acct2].amount += amount;
   pthread_mutex_unlock(&big_lock);
 }
 
 void transfer3(size_t acct1, size_t acct2, int amount) {
   pthread_mutex_lock(&accounts[acct1].lock);
   pthread_mutex_lock(&accounts[acct2].lock);
   accounts[acct1].amount -= amount;
   accounts[acct2].amount += amount;
   pthread_mutex_unlock(&accounts[acct1].lock);
   pthread_mutex_unlock(&accounts[acct2].lock);
 }
 
 void transfer4(size_t acct1, size_t acct2, int amount) {
   size_t min = (acct1 < acct2) ? acct1 : acct2;
   size_t max = min ^ acct1 ^ acct2;
 
   pthread_mutex_lock(&accounts[min].lock);
   pthread_mutex_lock(&accounts[max].lock);
   accounts[acct1].amount -= amount;
   accounts[acct2].amount += amount;
   pthread_mutex_unlock(&accounts[min].lock);
   pthread_mutex_unlock(&accounts[max].lock);
 }
 
 void transfer5(size_t acct1, size_t acct2, int amount) {
   size_t min = (acct1 < acct2) ? acct1 : acct2;
   size_t max = min ^ acct1 ^ acct2;
 
   pthread_mutex_lock(&accounts[min].lock);
   pthread_mutex_lock(&accounts[max].lock);
   accounts[acct1].amount -= amount;
   accounts[acct2].amount += amount;
   pthread_mutex_unlock(&accounts[max].lock);
   pthread_mutex_unlock(&accounts[min].lock);
 }

Suppose NACCOUNTS is very large and our application has a million threads, all running the same version of transfer. Also suppose we receive a very large number of requests.

A. Which of the above implementations are correctly synchronized (list all)?

B. Which of the above implementation(s) may result in undefined behavior for the amounts in accounts?

C. Which of the above implementation(s) may result in deadlock?

D. Of the implementations that are correctly synchronized, which result(s) in the highest utilization?

E. And the lowest utilization?

Group 3

We expect that this problem is more difficult than the problems you will see on the exam.

A barrier is a kind of synchronization object. A barrier has a single operation,

 void barrier(barrier_t *b);

with the following semantics: In a program with n threads T1, ..., Tn, any call to barrier(b) will block until all threads have called barrier(b). Once all n threads are blocked on barrier(b), they all simultaneously unblock and thereafter execute in any order.

George Washington Carver decides to implement a barrier using pipes and a helper process. Here’s his implementation; he assumes there are no signals:

 #define N 8 /* number of threads */
 
 typedef struct {
   int arrive_pipe;
   int leave_pipe;
 } barrier_t;
 
 void barrier(barrier_t *b) {
   char c = 'X';
   write(b->arrive_pipe, &c, 1);                     /* LINE B1 */
   read(b->leave_pipe, &c, 1); /* blocks */          /* LINE B2 */
 }
 
 void barrier_init(barrier_t *b) {
   int ap[2], lp[2];
   pipe(ap);
   pipe(lp);
   if (fork() == 0) {
     close(ap[1]);
     close(lp[0]);
     
     char buf[N];
     while (1) {
       read(ap[0], buf, N); /* assume this blocks until all N bytes arrive; a real implementation would loop on short reads */
       write(lp[1], buf, N);
     }
   }
   b->arrive_pipe = ap[1];
   b->leave_pipe = lp[0];
   close(ap[0]);
   close(lp[1]); 
 }

However, George observes a bug. Given N threads, with IDs starting at 1, each running the following code (after barrier_init has been called):

 extern int thread_1_passed_first_barrier; /* initialized to 0 */
 void thread_function(barrier_t *b) {
   barrier(b);                                       /* LINE T1 */
   if (pthread_self() == 1) {                        /* LINE T2 */
     /* pthread_self() returns current thread’s ID */
     thread_1_passed_first_barrier = 1;              /* LINE T3 */
   }
   barrier(b);                                       /* LINE T4 */
   assert(thread_1_passed_first_barrier);            /* LINE T5 */
 }

The assertion sometimes fails (for instance, for thread ID 2). This would never happen if barrier were programmed correctly.

A. Give a sequential order for lines of code executed by threads 1 and 2 that would cause an assertion failure. Refer to lines B1-2, T1-5 as identified above. “1B1” refers to thread 1 executing line B1, for example.

B. Fix this problem. You may use pseudocode.

Group 4

A. Imagine we have a struct that looks like this:

 typedef struct foo {
   int x[3];
   int y[3];
 } foo;

In C++, vector is a dynamic array type. It is laid out in memory just like an array. We denote a vector containing elements of type TYPE as vector`<TYPE>`. We can access a vector just like an array: to get the 4th element of a vector named bar, we write bar[3]. We can also get the number of elements in the vector in O(1) time using bar.size().

Imagine we have a vector of type foo, named bar:

 vector<foo> bar;

Imagine we wish to run a function which does something to all the elements in both the x and y arrays in every element in bar. Below follow several code snippets which do this, ranked from worst expected performance to best:

 void snippet1() {
   for (int j = 0; j < 3; ++j) {
     for (size_t i = 0; i < bar.size(); ++i) {
       doSomething(bar[i].x[j]);
       doSomething(bar[i].y[j]);
     }
   }
 }
 
 void snippet2() {
   for (size_t i = 0; i < bar.size(); ++i) {
     for (int j = 0; j < 3; ++j) {
       doSomething(bar[i].x[j]);
       doSomething(bar[i].y[j]);
     }
   }
 }
 
 void snippet3() {
   for (size_t i = 0; i < bar.size(); ++i) {
     for (int j = 0; j < 3; ++j)
       doSomething(bar[i].x[j]);
     for (int j = 0; j < 3; ++j)
       doSomething(bar[i].y[j]);
   }
 }
 
 void snippet4() {
   size_t bar_size = bar.size();
   for (size_t i = 0; i < bar_size; ++i) {
     for (int j = 0; j < 3; ++j)
       doSomething(bar[i].x[j]);
     for (int j = 0; j < 3; ++j)
       doSomething(bar[i].y[j]);
   }
 }

Explain concretely why each is better than the last. Feel free to sketch out memory access patterns to explain your answers. Hint: Use terms like “locality” and “invariant”.

Book Problems

Here’s a selection of practice problems and homework problems from the book that may be useful for you. Parenthesized problems are less relevant for our class. Note that the book includes solutions for all practice problems.

Chapter 5: 5.1, 5.2, 5.3, (5.5), 5.10 A-B, 5.12, 5.13, 5.14

Chapter 6: 6.2, 6.3, 6.4, 6.7, 6.8, 6.9, 6.10, (6.13), 6.18, 6.19, 6.20, 6.21

Chapter 8: 8.1, 8.2, (8.3); Homework Problems 8.11-8.15, 8.19

Chapter 9: 9.1, 9.2, 9.3, 9.7

Chapter 10: 10.1, 10.2, 10.3, (10.5)

Chapter 12: 12.1, (12.2), 12.5, 12.6, 12.7