Shell 2: Race conditions and process communication
Interrupts, polling & blocking
In the context of the kernel, we mostly discussed interrupts (and exceptional control transfer) as a mechanism for enforcing process isolation. This goal doesn’t apply within a single process. But interrupts have other advantages, including performance advantages. For instance, interrupts let a kernel handle hardware events with low latency and low CPU overhead.
Imagine an operating system that’s running two processes, a process that’s actively computing and a process that’s waiting for a disk read to complete. We want the waiting process to wake up as soon as the data is available. How can this be done without interrupts?
- The actively computing process could return control to the kernel every 1ms, so the kernel could check the disk status. But this is relatively high latency—the reading process will have to wait 0.5ms on average to discover the data.
- This latency can be reduced by polling more often. The actively computing process could return control to the kernel every 10µs! But this is high overhead: all the protected control transfers required to transfer control into and out of the kernel take time, and will reduce the work the computing process can complete.
Interrupts, in contrast, let us get both low latency and low overhead. The disk interrupts the CPU when the read completes—even if the computing process is actively computing. This interrupt is processed immediately, and the kernel can, if it likes, run the disk-reading process. This gives low latency and no polling.
Signals
The operating system kernel’s goal is to provide safe abstractions for all useful hardware features. This includes interrupts, and signals are the C abstract machine’s abstraction for hardware interrupts.
A signal is an interrupt that is delivered to a process. Processes can install signal handlers that “catch” the signal. A signal handler is just a function—think `exception` in WeensyOS—but the operating system might call it at any time, including while the process is actively computing, or while the process is blocked in a system call!
Signal handlers are installed with the `sigaction` system call. We provide a wrapper, `set_signal_handler`, to make dealing with `sigaction` easier. Using `sigaction` looks like this:
void signal_handler(int signal_number) {
    // ...
}

int main() {
    set_signal_handler(SIGCHLD, signal_handler);
    // Now the kernel will call `signal_handler` every time one of
    // this process’s children dies.
    // ...
}
But signal handlers run in a really, really weird environment. The kernel might call a handler at any moment, including right in the middle of a `malloc()` call that’s modifying global state! Most application code, and most C library functions, aren’t written to survive being called from signal handlers (the technical term is that they are not reentrant). There is a specific list of “async-signal-safe” library functions you may call from within a signal handler. Most of those functions are wrappers for system calls; for instance, it is safe to call `read` or `write`. (But it is not safe to call `printf`!) On Linux, you can see this list with `man 7 signal`; on Mac, `man 2 sigaction`.
For this reason, most signal handlers do very little work. They might set a flag (a global variable of type `volatile sig_atomic_t`), or modify a global variable, or write something to a file descriptor.
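For example, a handler that records a signal by setting a flag and writing a short message stays within the async-signal-safe rules. (This is a minimal sketch; the flag name and message are made up for illustration.)

#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;

void signal_handler(int signal_number) {
    got_signal = 1;                      // a plain store to a volatile sig_atomic_t is safe
    const char msg[] = "signal received\n";
    (void) write(STDERR_FILENO, msg, sizeof(msg) - 1);   // write() is async-signal-safe
    // printf("signal received\n") here would NOT be safe: printf manipulates
    // shared library state and may allocate memory.
}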
The operating system uses signals to represent any asynchronous, unpredictable event a process might be interested in. Different signals are assigned different mnemonic names. Some of them are:
- SIGCHLD: a child died.
- SIGHUP: the terminal on which this process was running has closed.
- SIGINT: the user pressed Ctrl-C.
- SIGSEGV: a page fault happened that the kernel could not handle.
- SIGWINCH: the terminal window size changed.
- SIGFPE: divide by zero.
A signal has a default disposition, which is what happens to the program if the signal is received without a handler installed. Often the signal will be ignored, as with SIGWINCH and SIGCHLD. Sometimes the program will terminate, as with SIGSEGV. There are two signals that cannot be handled; the one that matters most is SIGKILL, invoked as `kill -9`, which always terminates a process.
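For example, a process can change SIGINT’s disposition from “terminate” to “ignore,” but the same request for SIGKILL is refused. (A minimal sketch using raw `sigaction`; the behavior described is Linux’s.)

#include <signal.h>
#include <stdio.h>

int main() {
    struct sigaction sa;
    sa.sa_handler = SIG_IGN;             // new disposition: ignore the signal
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;

    int r1 = sigaction(SIGINT, &sa, NULL);
    printf("SIGINT -> ignore: %d\n", r1);    // 0: now Ctrl-C does nothing to this process

    int r2 = sigaction(SIGKILL, &sa, NULL);
    printf("SIGKILL -> ignore: %d\n", r2);   // -1 on Linux: SIGKILL cannot be caught or ignored
}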
Signals and blocking system calls
Signals abort many long-running system calls. For instance, if a signal is received during a blocking `read`, the `read` will return -1 with error code EINTR, meaning “interrupted by signal.” The specific list of interruptible system calls is given on Linux in the `man 7 signal` page.
Signal interruption can be a pain, requiring loops like this to robustly read data:
#include <errno.h>      // errno, EINTR
#include <unistd.h>     // read()

ssize_t robust_read(int fd, char* buf, size_t amt) {
    size_t pos = 0;
    while (pos < amt) {
        ssize_t nr = read(fd, buf + pos, amt - pos);
        if (nr > 0) {
            pos += nr;
        } else if (nr == 0) {
            // end of file
            break;
        } else if (nr == -1 && errno == EINTR) {
            /* just try again; signals are not errors! */
        } else {
            // a real error: report it, unless some data was already read
            return pos ? pos : -1;
        }
    }
    return pos;
}
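A hypothetical caller might use it like this (the file name "data.bin" is made up for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    int fd = open("data.bin", O_RDONLY);     // hypothetical input file
    if (fd < 0) {
        perror("open");
        return 1;
    }
    char buf[100];
    ssize_t n = robust_read(fd, buf, sizeof(buf));
    if (n < 0) {
        perror("robust_read");               // a real error, not just EINTR
    } else {
        printf("got %zd bytes\n", n);        // fewer than 100 means EOF came first
    }
    close(fd);
}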
But sometimes signal interruption is critically important! It’s signal interruption that makes signals analogous to hardware interrupts, which in turn lets process code handle asynchronous events with low latency and low overhead.
It looks like we can use signals to solve our blocking problem! Here’s how `waitblock.c` works:
set_signal_handler(SIGCHLD, signal_handler);
pid_t p1 = fork();
if (p1 == 0) {
    // ...
    exit(0);
}
usleep(750000); // will be interrupted by SIGCHLD!
This works most of the time…but, sadly, not all the time.
Race conditions
A race condition is a bug that only manifests sometimes, depending on how events are scheduled. Race conditions are terrible to debug, because they can happen so rarely that they’re nearly impossible to track down.
And sadly, there’s a race condition in `waitblock.c`. We can see that with a loop like this:
for i in `seq 1 100000`; do ./waitblock; done
This should print nothing, but every now and then—on my VM, every couple thousand times—it will print “Child exited after 0.75+ sec”, even though the child exited immediately!
The race condition arises because signals are delivered asynchronously, and because the operating system makes no guarantees about the order in which processes are scheduled. And if the signal is delivered right here:
set_signal_handler(SIGCHLD, signal_handler);
pid_t p1 = fork();
if (p1 == 0) {
    // ...
    exit(0);
}
/* SIGNAL DELIVERED HERE, BEFORE `usleep` CALLED!! */
usleep(750000);
then the `usleep` system call will not be interrupted!
Can we solve this race condition by changing our signal handler? Well, maybe if the signal handler sets a flag when it is called…
static volatile sig_atomic_t flag = 0;

void signal_handler(int signo) {
    flag = 1;
}

int main(...) { ...
    set_signal_handler(SIGCHLD, signal_handler);
    pid_t p1 = fork();
    if (p1 == 0) {
        // ...
        exit(0);
    }
    if (!flag) {
        usleep(750000);
    }
    // ...
And this does, indeed, make the race condition less likely. But it does not fully solve the problem. In particular, the signal could be delivered here:
set_signal_handler(SIGCHLD, signal_handler);
pid_t p1 = fork();
if (p1 == 0) {
    // ...
    exit(0);
}
if (!flag) {
    /* SIGNAL DELIVERED HERE!!! */
    usleep(750000);
}
which recreates the problem we were trying to solve.
Suspenseful change of topic
Interprocess communication and pipes
Interprocess communication is what it sounds like: operating system mechanisms that allow processes to communicate. Interprocess communication punches holes in process isolation, but that’s OK: recall that process isolation means processes are isolated according to OS policy.
Processes can of course interact over the file system, but the operating system also offers special IPC mechanisms. The most important in Unix, and for your pset, is the pipe.
A pipe is a buffered stream abstraction for IPC. A pipe is a stream because it’s not seekable. It’s buffered because the operating system reserves some space in the buffer cache for piped data. Concretely, a pipe is a pair of file descriptors, one of which is open for writing and the other of which is open for reading, where data written to the write end of the pipe may be read from the read end of the pipe.
Pipes are created in Unix using the `pipe` system call, which works like this:
int pipefd[2];
int r = pipe(pipefd);
// Afterwards, if r == 0:
// pipefd[0] >= 0 is the file descriptor for the read end of the pipe
// pipefd[1] >= 0 is the file descriptor for the write end of the pipe
Initially, every pipe is a self-pipe: only the current process has access to the pipe. The `pipedemo.c` program illustrates a self-pipe. Pipes become useful for IPC when they are shared between processes, which happens when a process with a pipe calls `fork`.
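Here’s a minimal self-pipe sketch in the spirit of `pipedemo.c` (the actual program may differ): the process writes a message into its own pipe, then reads the same bytes back out.

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main() {
    int pipefd[2];
    int r = pipe(pipefd);
    assert(r == 0);

    // Write into the write end. A short message fits in the kernel's pipe
    // buffer, so this write won't block.
    const char msg[] = "hello, self!";
    ssize_t nw = write(pipefd[1], msg, strlen(msg));
    assert(nw == (ssize_t) strlen(msg));

    // Read it back from the read end of the same pipe.
    char buf[100];
    ssize_t nr = read(pipefd[0], buf, sizeof(buf));
    printf("read %zd bytes: %.*s\n", nr, (int) nr, buf);

    close(pipefd[0]);
    close(pipefd[1]);
}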
The read end of a pipe returns EOF only when all references to the write end of the pipe are closed. It’s easy to forget to close one end of a pipe, which can lead to some surprising behavior (see `pipedemo2.c`).
There’s also some interesting behavior when a process writes to a pipe whose read end is closed. By default, this delivers the SIGPIPE signal to the writing process, which then dies. (If the writer handles or ignores SIGPIPE, the write instead returns -1 with error code EPIPE.) This prevents writers from speaking endlessly into a void. Again, the behavior only occurs when all references to the read end are closed.
The process of closing all unneeded parts of a pipe is called pipe hygiene.
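Here’s a sketch of that hygiene in a parent-to-child pipe, in the spirit of `pipedemo2.c` (which may differ). The child reads until EOF; it sees EOF only once every reference to the write end, its own copy and the parent’s, has been closed.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    int pipefd[2];
    pipe(pipefd);

    pid_t p = fork();
    if (p == 0) {
        // Child: reads from the pipe until EOF.
        close(pipefd[1]);    // hygiene! if the child kept this write end open,
                             // its own `read` below could never return 0 (EOF)
        char buf[100];
        ssize_t n;
        while ((n = read(pipefd[0], buf, sizeof(buf))) > 0) {
            fwrite(buf, 1, n, stdout);
        }
        close(pipefd[0]);
        exit(0);
    }

    // Parent: writes a message, then closes both of its pipe references.
    close(pipefd[0]);
    write(pipefd[1], "message\n", 8);
    close(pipefd[1]);        // only now does the child's read return EOF
    waitpid(p, NULL, 0);
}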
Pipe manipulation
To understand pipe manipulation in practice, we worked through an example of using pipes in a shell-like context. Here’s the goal.
- We start with a single “shell” process.
- The shell process’s goal is to create two child processes, `a` and `b`, with a pipeline connecting them (`a | b`). Specifically:
  - The `a` child’s stdout should be connected to the write end of the pipe.
  - The `b` child’s stdin should be connected to the read end of the pipe.
  - All other child file descriptors should be connected to whatever the shell had.
We start out with this situation: a single shell process with standard input open to the keyboard, and standard output and error open to the screen. (Really, all three file descriptors are open to the terminal.)
The first system call we must make is `pipe()`, called by the shell. The pipe must be created first, before forking children, because a child can have a file descriptor reference open to the pipe only by inheriting the file descriptor from the parent. That creates this picture:
And here is our goal: a picture like this.
Keeping the goal in mind will help us make the next steps. Here’s one order that works; there are many other orders that work.
- Shell calls `pipe()` (which, here, returns fds {3, 4})
- Shell calls `fork()`, creating proto-`a`
- Shell calls `fork()`, creating proto-`b`
- Shell calls `close(3)` (it doesn’t need the read end of the pipe)
- Shell calls `close(4)` (it doesn’t need the write end of the pipe)
- Proto-`a` calls `close(3)` (it doesn’t need the read end of the pipe)
- Proto-`a` calls `dup2(4, 1)` (this closes the old stdout, and makes stdout a reference to the write end of the pipe)
- Proto-`a` calls `close(4)` (now that stdout is set up, we need to close the old reference to the write end of the pipe)
- Proto-`a` calls `exec("a")`, becoming `a`
- Proto-`b` calls `close(4)`
- Proto-`b` calls `dup2(3, 0)`
- Proto-`b` calls `close(3)`
- Proto-`b` calls `exec("b")`, becoming `b`
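Put together as C code, the steps above might look like this sketch (error checking omitted; the program names "a" and "b" are placeholders, and `execvp` stands in for whatever exec variant the shell uses):

#include <unistd.h>

int main() {
    int pipefd[2];
    pipe(pipefd);                       // e.g., returns fds {3, 4}

    pid_t pa = fork();                  // create proto-`a`
    if (pa == 0) {
        close(pipefd[0]);               // proto-a doesn't need the read end
        dup2(pipefd[1], 1);             // stdout now refers to the write end
        close(pipefd[1]);               // drop the old reference to the write end
        char* args_a[] = {"a", NULL};   // placeholder program
        execvp("a", args_a);            // becomes `a`
        _exit(1);                       // only reached if exec fails
    }

    pid_t pb = fork();                  // create proto-`b`
    if (pb == 0) {
        close(pipefd[1]);               // proto-b doesn't need the write end
        dup2(pipefd[0], 0);             // stdin now refers to the read end
        close(pipefd[0]);               // drop the old reference to the read end
        char* args_b[] = {"b", NULL};   // placeholder program
        execvp("b", args_b);            // becomes `b`
        _exit(1);
    }

    // Shell: close its own references to both ends of the pipe.
    close(pipefd[0]);
    close(pipefd[1]);
    return 0;
}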
This also shows the benefits Unix derives from separate `fork` and `exec` system calls (rather than, say, a combined `start_new_process` system call that started a new process running a fresh copy of a named program). In between `fork` and `exec`, the child processes manipulate file descriptors to set up their desired environments. The `fork`-`exec` separation lets shell code perform these manipulations! Although proto-`a` and proto-`b` are independent, isolated processes—they are not the main shell—they are still running the same program (the same instructions) as the main shell. That means that the shell’s programmer can control their actions (by, for example, writing code in an `if (child_pid == 0)` block), and can use that control to set up exactly the right environment for the `a` and `b` programs to run.
This dance introduced a new system call, `dup2(oldfd, newfd)`, which makes `newfd` point to the same file structure as `oldfd`.
Solving the race condition
OK, so we know how pipes work. How can we use pipes to solve the race condition from before?
At root, the race condition issue is that process code cannot guarantee atomicity except through system calls. Signal handlers can’t provide atomicity on their own—we can’t delay a signal handler until the `usleep` call begins. But what we can do is add a system call to the signal handler. Then the kernel provides atomicity!
When the `waitblocksafe2.c` process begins, it opens a self-pipe. It then installs signal handlers for SIGCHLD and SIGALRM, an “alarm clock” signal that goes off after a given number of seconds. Each signal handler writes a byte to the self-pipe. And to detect either timeout or child death, it suffices to read from the pipe!
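Here’s a sketch of the idea; the real `waitblocksafe2.c` may differ in details, the one-second timeout and child body are placeholders, and the `set_signal_handler` wrapper from earlier in the notes is assumed to be declared in a course header.

#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int signalpipe[2];               // the self-pipe

void signal_handler(int signal_number) {
    // write() is async-signal-safe; one byte means "something happened"
    (void) write(signalpipe[1], "!", 1);
}

int main() {
    pipe(signalpipe);
    set_signal_handler(SIGCHLD, signal_handler);   // child exited
    set_signal_handler(SIGALRM, signal_handler);   // timeout expired

    pid_t p1 = fork();
    if (p1 == 0) {
        // ... child work (placeholder) ...
        exit(0);
    }

    alarm(1);                           // deliver SIGALRM after ~1 second

    // Block until either signal fires. Even if SIGCHLD arrived before this
    // read began, its byte is already sitting in the pipe, so the read
    // returns immediately: the kernel's pipe buffer provides the atomicity
    // that the flag variable could not. (If the read is instead interrupted
    // by a signal, that's fine too: either way, we know a signal arrived.)
    char c;
    (void) read(signalpipe[0], &c, 1);

    waitpid(p1, NULL, 0);
}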