Synchronization 1: Bounded buffer implementation

Synchronization is a general term for the set of techniques we use to make parallel code correct.

What makes synchronization difficult is the exponential increase in possible behaviors that happens when multiple programs are interacting or, worse, operating on the same data. Coordination is a hard problem! This makes intuitive, or even proverbial, sense: idioms like “too many cooks in the kitchen”, “too many cooks spoil the broth”, and “herding cats” remind us of the difficulty of coordinating parallelism and the importance of synchronization in all aspects of life.

Our study of synchronization covers multithreaded parallelism, the system calls, library functions, and operating system features that allow a single process to use the resources of multiple processors. It also covers networking, the system calls and operating system features that allow programs to communicate with one another over a network. Many programs that perform networking tasks use threads to do it. But we begin with another instance of synchronization, one that we have not necessarily thought of as a synchronization problem: kernel code that handles inter-process communication.

WeensyOS pipes and bounded buffers

The pipe mechanism for IPC is generally implemented with a data structure called a bounded buffer. This data structure is interesting both for pipes and for synchronization, since a full implementation must solve many interesting synchronization challenges.

We start with a simple OS that supports a pipe.

WeensyOS pipe kernel

Our OS supports these system calls:

ssize_t sys_pipewrite(const char\* buf, size_t sz): Write the sz bytes of data starting at buf into the pipe. Return the number of characters written, which can be sz; a number less than sz, indicating a partial write; or -1, indicating that the kernel’s buffer is full and the writer should try again later. Only returns 0 if sz == 0.
ssize_t sys_piperead(char\* buf, size_t sz): Read up to sz bytes of data from the pipe into buf. Return the number of characters read, which can be sz; a number greater than zero and less than sz, indicating a partial read; 0, meaning either that sz == 0 or that the pipe is closed; or -1, indicating that the kernel’s buffer is empty and the reader should try again later.

These are like read and write system calls on Unix, but rather than taking a file descriptor argument, our system calls apply to the OS’s single pipe. (It would be easy to generalize to multiple pipes.) But the kernel’s implementation of these system calls differs from Linux’s in important ways.

Linux, as we’ve seen, has a default pipe buffer of 65536 characters. Our initial pipeos has a pipe buffer of one character!
Linux blocks read and write system calls as appropriate: a read system call will block on an empty pipe buffer, and a write system call will block on a full pipe buffer. Our pipeos system calls never block, requiring programs that call them to poll instead.

We can see both of these features by looking at the kernel’s pipe implementation, which looks like this:

// GLOBALS
// memory representing a pipe buffer
static char pipebuffer[1];
// number of characters currently in pipe buffer
static size_t pb_len;  /* always ≥0 and ≤1 */

` // CASES IN `exception` `
case INT_SYS_PIPEWRITE: {
    const char* buf = (const char*) current->p_registers.reg_rdi;
    size_t sz = current->p_registers.reg_rsi;

    if (!check_memory(kernel_pagetable, (uintptr_t) buf, sz,
                      PTE_P | PTE_U)) {
        // bad memory
        current->p_registers.reg_rax = -1;
    } else if (sz == 0) {
        // nothing to write
        current->p_registers.reg_rax = 0;
    } else if (pb_len == sizeof(pipebuffer)) {
        // no room, try again later
        current->p_registers.reg_rax = -1;
    } else {
        // write 1 character
        pipebuffer[0] = buf[0];
        pb_len = 1;
        current->p_registers.reg_rax = 1;
    }

    break;
}

case INT_SYS_PIPEREAD: {
    char* buf = (char*) current->p_registers.reg_rdi;
    size_t sz = current->p_registers.reg_rsi;

    if (!check_memory(kernel_pagetable, (uintptr_t) buf, sz,
                      PTE_P | PTE_W | PTE_U)) {
        // bad memory
        current->p_registers.reg_rax = -1;
    } else if (sz == 0) {
        // empty read buffer
        current->p_registers.reg_rax = 0;
    } else if (pb_len == 0) {
        // nothing to read, try again later
        current->p_registers.reg_rax = -1;
    } else {
        // read 1 character
        buf[0] = pipebuffer[0];
        pb_len = 0;
        current->p_registers.reg_rax = 1;
    }

    break;
}

User-level programs

The OS runs two programs, p-pipewriter and p-pipereader, that behave as follows.

p-pipewriter repeatedly writes a randomly-chosen newline-terminated message to the pipe, then spins (waits) for about 1.25 seconds.
p-pipereader repeatedly reads a (newline-terminated) line from the pipe and prints it.

On Linux, we’d expect that each message should be transferred using just two system calls: one to write the message and one to read it. WeensyOS will require more. But how many more? A lot more:

16498 syscalls: wrote I am a plumber.
15940 syscalls: read I am a plumber.
32100 syscalls: wrote Hello, friend!
163173 syscalls: read Hello, friend!
48391 syscalls: wrote I am a plumber.
308110 syscalls: read I am a plumber.

It appears to take roughly 1000 pipe system calls per character to write a message, and many more to read a message. That’s many, many system calls to accomplish a single unit of work! Polling is often inefficient, leading to much wasted work.

In the remainder of class, we improved the operating system to reduce the number of pipe system calls per operation. This requires several changes, including making the bounded buffer larger than one character (not really a synchronization-related change) and making system calls block (absolutely a synchronization-related change).

Improvement 1: Scheduling order

The minimum number of system calls required to write an N-character message, using a one-character buffer, is N. The pipewriter is making many more than N system calls per write.

The reason is that the kernel is making a bad scheduling decision. Check out the SYS_PIPEWRITE code. When there’s no room in the pipe, the kernel returns -1 to the calling process—and then runs the calling process again (there’s a run(current) call at the end of exception). Though correct—it is always correct to run any runnable process—this is an inefficient scheduling order. The pipewriter’s sys_pipewrite function will always fail until the pipereader runs to free up some space. In general, when a system call fails on some pipe-like object that’s used for inter-process communication, it often makes sense to run the other process.

Here’s the revised SYS_PIPEWRITE code:

case INT_SYS_PIPEWRITE: {
    const char* buf = (const char*) current->p_registers.reg_rdi;
    size_t sz = current->p_registers.reg_rsi;

    if (!check_memory(kernel_pagetable, (uintptr_t) buf, sz,
                      PTE_P | PTE_U)) {
        // bad memory
        current->p_registers.reg_rax = -1;
    } else if (sz == 0) {
        // nothing to write
        current->p_registers.reg_rax = 0;
    } else if (pb_len == sizeof(pipebuffer)) {
        // no room, try again later
        current->p_registers.reg_rax = -1;

schedule();

    } else {
        // write 1 character
        pipebuffer[0] = buf[0];
        pb_len = 1;
        current->p_registers.reg_rax = 1;
    }

    break;
}

A similar change can be made in SYS_PIPEREAD. The result is many fewer system calls per write and for the first read:

31 syscalls: wrote I am a plumber.
31 syscalls: read I am a plumber.
60 syscalls: wrote Hello, friend!
72422 syscalls: read Hello, friend!
91 syscalls: wrote I am a plumber.
145328 syscalls: read I am a plumber.

Improvement 2: Buffer size

We can’t reduce the number of system calls per write much further without a larger pipe buffer. Linux uses a 65536-character pipe buffer by default. We don’t need that big a buffer; let’s start with a 16-character buffer:

static char pipebuffer[16];
static size_t pb_len;  /* always ≥0 and ≤16 */

But coding this buffer isn’t trivial. We want “sliding window” of at most 16 characters. An initial attempt might work like this, with each read moving any remaining characters down to the beginning of the buffer:

                      pipebuffer           pb_len
1. empty buffer       ................     0
2. write("A", 1)      A...............     1
3. write("Hello", 5)  AHello..........     6
4. read(buf, 2)       ello............     4
5. write("What", 4)   elloWhat........     8
6. write("Abracadabra", 11) [returns 8]
                      elloWhatAbracada    16
7. read(buf, 12)      cada............     4
8. read(buf, 12) [returns 4]
                      ................     0

This move-based implementation is, however, quite expensive. In the worst case, reading N characters will do O(N²) units of work!

A smart bounded buffer implementation can read N characters with the minimum O(N) units of work. What we need is a new variable, the position. Our implementation will call this pb_pos.

Infinite buffer

First off, let’s pretend that pipebuffer is stored in an array of infinite size. An infinite-sized array is a good match for the I/O stream abstraction! However, we’ll store at most 16 characters in the array at a time.

In this model, the pb_pos variable represents the number of characters read so far. Write system calls will add new characters to the end of the pipebuffer and increase pb_len. Read system calls, however, will consume characters from the beginning of the pipe buffer. The read system call will copy data out of the buffer and shrink pb_len. It will also increase pb_pos, thus representing the read. Pictorially, this looks like the following. Since characters disappear from the buffer after being read, we mark the read places with “#”:

                      pipebuffer                       pb_pos  pb_len
1. empty buffer       ........................ ...     0       0
2. write("A", 1)      A....................... ...     0       1
3. write("Hello", 5)  AHello.................. ...     0       6
4. read(buf, 2)       ##ello.................. ...     2       4
5. write("What", 4)   ##elloWhat.............. ...     2       8
6. write("Abracadabra", 11) [returns 8]
                      ##elloWhatAbracada...... ...     2       16
7. read(buf, 12)      ##############cada...... ...     14      4
8. read(buf, 12) [returns 4]
                      ##################...... ...     16      0

The code for such a buffer is not too difficult to write; all we need is infinite memory.

size_t pipebuffer[∞];
size_t pb_pos, pb_len;

case INT_SYS_PIPEWRITE: { ...
    } else /* sz > 0 && pb_len < 16 */ {
        // write characters
        size_t ncopy = sz;
        if (ncopy > 16 - pb_len) {
            ncopy = 16 - pb_len;
        }
        memcpy(&pipebuffer[pb_pos + pb_len], buf, ncopy);
        pb_len += ncopy;
        current->p_registers.reg_rax = ncopy;
    }
    break; ...

case INT_SYS_PIPEREAD: { ...
    } else /* sz > 0 && pb_len > 0 */ {
        // read characters
        size_t ncopy = sz;
        if (ncopy > pb_len) {
            ncopy = pb_len;
        }
        memcpy(buf, &pipebuffer[pb_pos], ncopy);
        pb_pos += ncopy;
        pb_len -= ncopy;
        current->p_registers.reg_rax = ncopy;
    }
    break; ...

Unlike the inefficient implementation above, which touches all characters in the buffer on read, this implementation only touches the number of characters actually read. That’s key to achieving O(N) read performance.

Some things to note: Both functions use all four components of the pipe state (the buffer, the capacity [here, 16 bytes], the position, and the length). Write puts data at the end of the buffer, while read takes data from the beginning. The buffer starts at index pb_pos.

Circular bounded buffer

But of course we don’t have infinite memory. And we don’t need infinite memory—the buffer has at most 16 characters at a time. The insight behind the circular bounded buffer implementation technique is that modular arithmetic can easily model this infinite array. We map the infinite array onto a finite array with 16 characters by storing infinite-array index i at character position i % 16. Since the buffer contains at most 16 contiguous characters at a time, collisions never occur: every valid pipe buffer uses each character position at most once. Note, here, how the active characters are always at different positions:

                      pipebuffer positions (mod 16):
                       0 1 2 3 4 5 6 7 8 9101112131415  0 1 2 3 4 5 6 7 8 ...
1. empty buffer        . . . . . . . . . . . . . . . .  . . . . . . . . . ...
2. write("A", 1)       A . . . . . . . . . . . . . . .  . . . . . . . . . ...
3. write("Hello", 5)   A H e l l o . . . . . . . . . .  . . . . . . . . . ...
4. read(buf, 2)        # # e l l o . . . . . . . . . .  . . . . . . . . . ...
5. write("What", 4)    # # e l l o W h a t . . . . . .  . . . . . . . . . ...
6. write("Abracadabra", 11) [returns 8]
                       # # e l l o W h a t A b r a c a  d a . . . . . . . ...
7. read(buf, 12)       # # # # # # # # # # # # # # c a  d a . . . . . . . ...

However, this technique does make reading and writing more complex, because a single write or read may take up to 2 calls to memcpy. In our running example, write("Abcracadabra") must perform two memcpy calls, one to put Abraca at the end of the buffer and one to put da at the beginning. Similarly, the last read must first read ca from the end of the buffer and then read da from the beginning.

Our circular bounded buffer code ensures that 0 ≤ pb_pos \< 16 at all times. Rather than if statements to distinguish different copy amounts, we use a loop; the result is cleaner. Here’s the write code:

// write characters
size_t pos = 0;
while (pos < sz && pb_len < sizeof(pipebuffer)) {
    size_t pb_index = (pb_pos + pb_len) % sizeof(pipebuffer);
    size_t ncopy = sz - pos;
    if (ncopy > sizeof(pipebuffer) - pb_index) {
        ncopy = sizeof(pipebuffer) - pb_index;
    }
    if (ncopy > sizeof(pipebuffer) - pb_len) {
        ncopy = sizeof(pipebuffer) - pb_len;
    }
    memcpy(&pipebuffer[pb_index], &buf[pos], ncopy);
    pos += ncopy;
    pb_len += ncopy;
}
current->p_registers.reg_rax = pos;

And the read code:

// read characters
size_t pos = 0;
while (pos < sz && pb_len > 0) {
    size_t ncopy = sz - pos;
    if (ncopy > sizeof(pipebuffer) - pb_pos) {
        ncopy = sizeof(pipebuffer) - pb_pos;
    }
    if (ncopy > pb_len) {
        ncopy = pb_len;
    }
    memcpy(&buf[pos], &pipebuffer[pb_pos], ncopy);
    pb_pos = (pb_pos + ncopy) % sizeof(pipebuffer);
    pb_len -= ncopy;
    pos += ncopy;
}
current->p_registers.reg_rax = pos;

Things to note:

Write sets ncopy to min(sz - pos, sizeof(pipebuffer) - pb_index, sizeof(pipebuffer) - pb_len). Make sure you understand the three branches of this min. The first branch, sz - pos, limits the write to the number of characters requested by the user. The second branch, sizeof(pipebuffer) - pb_index, handles circular writes—where a write would fall off the end of the pipebuffer array. The third branch, sizeof(pipebuffer) - pb_len, limits the size of the buffer. Similar logic applies to read.
Make sure you understand what pb_index is doing in the write case, and why the read case doesn’t need an equivalent.

The results:

1 syscalls: wrote I am a plumber.
1 syscalls: read I am a plumber.
2 syscalls: wrote Hello, friend!
80506 syscalls: read Hello, friend!
3 syscalls: wrote I am a plumber.
155472 syscalls: read I am a plumber.
6 syscalls: wrote I'm talking through a pipe.
231383 syscalls: read I'm talking through a pipe.
7 syscalls: wrote I am a plumber.
307154 syscalls: read I am a plumber.

One system call per write! (Except for the fourth message, which is longer than 16 characters.)

Improvement 3: Blocking

But reading still takes many, many system calls. During the pipewriter’s delay, the reader just polls the empty pipe, waiting for data to appear. This is an efficiency problem (or would be, if the OS had anything else to do).

To solve this problem, we need some way to put the reader process to sleep—and then, critically, we must wake it up when data arrives on the pipe.

Real operating systems solve this problem with data structures called wait queues. A wait queue can be associated with any condition on which a process can block. When a process makes a blocking system call, the OS finds the wait queue associated with the blocking condition, and enqueues the blocked process on that queue. When the condition is satisfied, the OS unqueues the blocked processes and lets them run again.

For example, every process has a “wait queue” corresponding to the waitpid system call. If the parent calls blocking waitpid(), the parent will be enqueued on the relevant childrens’ wait queues. When a child dies, the operating system signals its wait queue. This does nothing if the queue is empty. But if the queue is nonempty, then the parent is blocked waiting for the child’s status. The kernel, working on behalf now of the parent, will mark the parent as runnable, remove the parent from all other wait queues, and (eventually) complete its waitpid() system call.

In our OS, though, we don’t need a general wait queue structure. There’s only one process that needs to block—the pipereader. So we can implement the “wait queue” as a single integer: the process (if any) blocked on read.

// PID of process waiting to read
static pid_t pb_waiter;

When a read system call has nothing to read, we enqueue the calling process on the wait queue:

... else if (pb_len == 0) {
    // nothing to read, try again later
    current->p_registers.reg_rax = -1;

pb_waiter = current->p_pid; current->p_state = P_BLOCKED;

    schedule();
} ...

And when a writer successfully writes some data, it must signal the wait queue:

// write characters ...
current->p_registers.reg_rax = pos;

if (pos > 0 && pb_waiter > 0) { processes[pb_waiter].p_state = P_RUNNABLE; pb_waiter = 0; }

The result:

1 syscalls: wrote I am a plumber.
1 syscalls: read I am a plumber.
2 syscalls: wrote Hello, friend!
3 syscalls: read Hello, friend!
3 syscalls: wrote I am a plumber.
5 syscalls: read I am a plumber.
6 syscalls: wrote I'm talking through a pipe.
9 syscalls: read I'm talking through a pipe.
7 syscalls: wrote I am a plumber.
11 syscalls: read I am a plumber.

One system call per write, two per read (for short messages)!

(Comprehension question: What would go wrong if you left off the current-\>p_registers.reg_rax = 1; line from the read system call implementation above?)

Synchronization

Let’s step back for a moment and reconnect with synchronization. WeensyOS kernel synchronization is built on the foundation of interrupt disabling. WeensyOS configures the hardware to temporarily disable interrupts whenever a system call is running. When all system calls are polling system calls, this suffices to implement system call atomicity. But once we introduce blocking, things get more complicated. The kernel must be aware of multiple processes at once—a system call made by one process can affect the state of another. That’s the essence of synchronization.

Improvement 4: Kernel contexts

The pipereader makes at least two system calls per message (after the first). In our simple wakeup code, even though the reader blocks until there’s data in the pipe, the unblocked reader is set up to immediately return -1 to the process. But we can fix this too. The writer can operate on behalf of the reader, effectively switching to the reading process’s context to restart the read system call!

Our implementation of this uses a helper function, complete_read, which is called by both the write and read system call implementations. The writer’s call looks like this:

if (pos > 0 && pb_waiter > 0) {
    processes[pb_waiter].p_state = P_RUNNABLE;

complete_read(&processes[pb_waiter], (char*) processes[pb_waiter].p_registers.reg_rdi, (size_t) processes[pb_waiter].p_registers.reg_rsi);

    pb_waiter = 0;
}

The result is one system call per write or read.

1 syscalls: wrote I am a plumber.
1 syscalls: read I am a plumber.
2 syscalls: wrote Hello, friend!
2 syscalls: read Hello, friend!
3 syscalls: wrote I am a plumber.
3 syscalls: read I am a plumber.
5 syscalls: wrote I'm talking through a pipe.
5 syscalls: read I'm talking through a pipe.
6 syscalls: wrote I am a plumber.
6 syscalls: read I am a plumber.
7 syscalls: wrote Hello, friend!
7 syscalls: read Hello, friend!

Extensions

How could you handle a blocked writer? (You’d need another wait queue.)
How could you handle a pipe that could be read by more than one process? (You’d need a different wait queue structure—an integer would not suffice.)

Multithreaded bounded buffers

We now take the bounded buffer out of the WeensyOS context and move it into a situation with true multiprocessing, and thus, more epic synchronization problems.

Imagine two or more processors—say, multiple cores in a kernel, or multiple threads in a user-level process—that are sharing the same bounded buffer memory: the same pipebuffer, pb_sz (that is,

sizeof(pipebuffer)`), `pb_pos`, and `pb_len`. The `pipebuffer

and pb_sz variables are constant: the address of the buffer memory and the size of the buffer never change. However, the contents of the buffer, and the pb_pos and pb_len variables, change on every operation.

Our goal is to provide operation atomicity for pipe operations. Every read should read a contiguous range of characters from the pipe. Every write should write a contiguous range of characters to the pipe. And, finally, no written characters should be dropped: every successfully-written character can be read.

How can an implementation ensure correctness, including operation atomicity, even though these variables are being accessed by different processors simultaneously?

Synchronization problems

Concretely, imagine a 16-character buffer with the following write implementation:

...
while (pos < sz && pb_len < pb_sz) {
    size_t pb_index = (pb_pos + pb_len) % pb_sz;
    size_t ncopy = /* calculate ncopy */;
    memcpy(&pipebuffer[pb_index], &buf[pos], ncopy);
    pb_len += ncopy;
    pos += ncopy;
}
...

And imagine the following write calls being executed on different processors, starting from an empty buffer:

1. pipewrite("AA", 2);
2. pipewrite("BB", 2);

These operations on this state should lead to one of two outcomes, a buffer containing AABB or a buffer containing BBAA. But without further synchronization, we can get illegal outcomes.

For example, we can get ABBA—an outcome that violates the atomicity constraint on write. Imagine the following series of events:

Initially, pb_pos == 15 and pb_len == 0.
The AA writer runs through the loop once. It copies one character into the buffer, for pb_len == 1 and contents A.
The BB writer runs through the loop once. It copies two characters into the buffer, for pb_len == 3 and contents ABB. This completes the BB write (since all characters were written).
The AA writer runs through the loop a second time. It copies the remaining character into the buffer.

And we can get AA followed by two characters of garbage.

Initially, pb_pos == 0 and pb_len == 0.
The BB writer runs through the loop once, but stops before updating pb_len. That copies two characters into the buffer, for contents AA—but pb_len is still 0.
The AA writer runs through the loop. That overwrites the Bs, leaving for pb_len == 2 and contents AA.
The BB writer completes the loop. This sets pb_len == 4—but the 3rd and 4th characters in the buffer are garbage!

We can even get AA on its own—the BB write disappears altogether. This is because the line pb_len += ncopy is actually implemented in multiple steps.

Initially, pb_pos == 0 and pb_len == 0.
The BB writer runs through the loop once, but stops before updating pb_len. That copies two characters into the buffer, for contents AA—but pb_len is still 0.
The BB writer loads the old value of pb_len (0) into a register.
The AA writer runs through the loop. That overwrites the Bs, leaving for pb_len == 2 and contents AA.
The BB writer completes the loop. It first increments the register holding the old version of pb_len. But that sets the register to 2, not 4, because the BB writer missed the AA writer’s update. It then writes the register out to memory, setting pb_len == 2.

To fix this, we need a new property: mutual exclusion.

Mutual exclusion in general

Mutual exclusion is the foundation of atomicity. Its most general definition is this:

Mutual exclusion holds when no more than one thread of execution can simultaneously execute in a region of code with respect to a region of shared storage.

Here, a “thread of execution” refers to a generic instruction pointer: a user-level thread, a process, or a processor. A “region of code” refers to instructions; we might refer to one or more C functions, in whole or in part, or to machine instructions. Finally, a “region of shared storage” refers abstractly to shared memory accessible to the threads of execution. When memory is isolated (e.g., in isolated processes), mutual exclusion generally becomes less important. But we refer to “storage,” rather than shared memory, because in some synchronization situations, we need mutual exclusion with respect to other objects, such as files or pipe buffers.

When mutual exclusion holds, the restricted region of code is called the critical section. When we’re trying to be precise, we call it the induced critical section, since it is enforced by whatever mechanism was used to ensure mutual exclusion. That is, a critical section is a region of code subject to mutual exclusion. The induced critical section may differ from the minimal sufficient critical section, which is the smallest critical section that suffices for correctness (for instance, the smallest critical section required to enforce pipe operation atomicity).

If mutual exclusion holds, and one thread’s instruction pointer is located in the critical section, then no other thread can enter the critical section until the first thread’s instruction pointer leaves it.

Mutual exclusion can be enforced by many different mechanisms. For example, if a process has only one thread, then its entire code is an induced critical section, since no other thread can enter the process with respect to its memory! (Note that even fork does not violate this, since fork creates a new, isolated process with different memory.) The WeensyOS kernel is a critical section, whose mutual exclusion is enforced by disabling interrupts on kernel entry (and by running WeensyOS only on uniprocessor machines).

Sufficient and insufficient critical sections

What critical sections suffice to enforce pipe operation atomicity?

First, let’s return to the pipewrite operation. Say that, through magic, we were able to enforce mutual exclusion on the starred lines:

    ...
    while (pos < sz && pb_len < pb_sz) {
        size_t pb_index = (pb_pos + pb_len) % pb_sz;
        size_t ncopy = /* calculate ncopy */;
***     memcpy(&pipebuffer[pb_index], &buf[pos], ncopy);  ***
***     pb_len += ncopy;                                  ***
        pos += ncopy;
    }
    ...

That is, when one thread begins executing memcpy, no other threads will execute memcpy until the first thread updates pb_len. Does this induced critical section suffice?

This critical section does prevent some bugs. For instance, it prevents the bug leading to a 2-character AA buffer. Since the highlighted critical section includes the pb_len += ncopy line, it is impossible for multiple threads of execution to update pb_len simultaneously, and every pb_len update will see its most up-to-date value. However, there are other bugs that it does not prevent. For instance, it doesn’t prevent the AA[garbage] bug.

What about this larger critical section?

    while (pos < sz && pb_len < pb_sz) {
***     size_t pb_index = (pb_pos + pb_len) % pb_sz;      ***
***     size_t ncopy = /* calculate ncopy */;             ***
***     memcpy(&pipebuffer[pb_index], &buf[pos], ncopy);  ***
***     pb_len += ncopy;                                  ***
***     pos += ncopy;                                     ***
    }

That prevents AA, and also prevents AA[garbage], but it does not prevent ABBA. To prevent all write/write synchronization errors, we must include the whole while loop in the critical section:

*** while (pos < sz && pb_len < pb_sz) {                  ***
***     size_t pb_index = (pb_pos + pb_len) % pb_sz;      ***
***     size_t ncopy = /* calculate ncopy */;             ***
***     memcpy(&pipebuffer[pb_index], &buf[pos], ncopy);  ***
***     pb_len += ncopy;                                  ***
***     pos += ncopy;                                     ***
*** }                                                     ***

No smaller critical section suffices.

Note that critical sections are not always contiguous ranges of code. For instance, a corresponding range of code in the pipe read operation’s implementation would be part of the same sufficient critical section. This is because the read and write code both access and modify the same shared state (here, the pipe buffer).

Locks

Mutual exclusion is often enforced, and critical sections induced, by synchronization objects, which are libraries designed as building blocks for synchronization.

The most basic synchronization object is the mutual exclusion lock, or the lock for short. A lock is a small piece of state (often an integer) associated with two operations, lock and unlock. The lock has two states, locked and unlocked; it starts in unlocked state. The two operations, lock and unlock, behave like the following:

void lock(lock_type* l) {

// The following executes in one atomic step

    while (l is in locked state) {
        /* do nothing */
    }
    *l := locked;
}

void unlock(lock_type* l) {
    assert(l is in locked state);
    *l := unlocked;
}

Sometimes a third operation, trylock, is provided:

int trylock(lock_type* l) {
    if (l is in locked state) {
        return 0;
    } else {
        *l := locked;
        return 1;
    }
}

Lock operations directly map onto the requirements of mutual exclusion. If:

Every entry into a region of code is guarded by a lock operation on a shared lock;
The body of the region doesn’t modify the lock; and
Every exit from the region is guarded by unlock,

then that region of code obeys mutual exclusion. For instance, we can provide mutual exclusion for our pipe’s minimal critical section like so:

static lock_type pb_lock;

...

lock(&pb_lock);

while (pos < sz && pb_len < pb_sz) {
    size_t pb_index = (pb_pos + pb_len) % pb_sz;
    size_t ncopy = /* calculate ncopy */;
    memcpy(&pipebuffer[pb_index], &buf[pos], ncopy);
    pb_len += ncopy;
    pos += ncopy;
}

unlock(&pb_lock);

This provides mutual exclusion by construction. The only entry to the region is via the lock() line. If any thread gets past the

lock()` line, then every *other* thread must block in `lock()

until the first thread calls unlock().