Synchronization 1: Signals, race condition, threads

Note on zombie processes

This is relevant to the problem set. When a process forks a child, the child eventually exits and will have a exit status. The child process then enters a "zombie state": the process no longer exits, but the kernel still keeps around its pid and its exit status, waiting for waitpid() to be called on the process. Zombie processes consume kernel resources and we should avoid having zombies lying around whenever possible.

The trick for avoiding zombie processes is to call waitpid() at least once for each child. Invoking waitpid() with -1 as the pid argument will check on an exit status of an arbitrary child.

Signals

Recall that one very important functionality of an operating system is to provide user-accessible abstractions of the underlying hardware features that are inconvenient or unsafe for a user process to access directly. UNIX-like systems abstract interrupts (as hardware feature) to user-space signals.

Comparisons between interrupts and signals:

Interrupts	Signals
Can happen at any time, can be blocked	Can happen any any time, can be blocked
Protected control transfer, saves machine state as part of kernel state	Previous machine state was saved by kernel to ptable
Kernel invokes interrupt handler in kernel mode on the kernel stack	Kernel invokes signal handler in user mode on a signal stack

We use the following system calls to manipulate signals:

sigaction(sig, handler, oldhandler): installs a signal handler
kill(pid, sig): sends a signal

We use sigaction() to install a signal handler, which should be a very simple function that runs when the signal gets delivered. Since the signal can arrive at any time between 2 instructions, the signal handler should really be kept at minimum complexity and it definitely should not call user space library functions like malloc(). Imagine that a signal may arrive in between 2 instructions in the malloc() routine itself, and the internal state and data structure used by malloc() may not be consistent. Typical things we do in a signal handler is setting a flag variable or checking the value of some other variable, and we leave the complex task to be triggered by these flag variables in the main process.

For a list of functions that can be safely used in signal handlers, see manual page man 7 signal-safety.

Solving the race condition...

Recall from the last lecture where we were trying to implement the timed wait. Our plan was to have a signal (SIGCHLD) unblock the usleep() system call in the parent if the child exited within 0.75 seconds. Our solution has a condition: there is small window during which if the signal gets delivered, the parent would miss the effect of the signal and overlook the fact that the child has already exited. Before the end of the last lecture we hinted that the problem can be reliably solved using a pipe.

... with pipe

Take a look at code in shell4/timedwait-selfpipe.cc for how we could do it. The relevant portion of the code is pasted below.

int signalpipe[2];

void signal_handler(int signal) {
    (void) signal;
    ssize_t r = write(signalpipe[1], "!", 1);
    assert(r == 1);
}

int main(int, char** argv) {
    ...

    // Wait for 0.75 sec, or until a byte is written to `signalpipe`,
    // whichever happens first
    struct timeval timeout = { 0, 750000 };
    fd_set fds;
    FD_SET(signalpipe[0], &fds);
    r = select(signalpipe[0] + 1, &fds, nullptr, nullptr, &timeout);

    ...
}

We simply write one character into the pipe in the signal handler (it is fine to invoke system calls in signal handlers), and then we use the select() system call in the parent to wait for the read end of the pipe to become ready, with a 0.75-second timeout value.

Some notes on select(). The select() system call is very commonly seen in even-driven programs (like a server). It has the following signature:
select (
    nfds,     // largest fd value to monitor + 1
    read_set,
    write_set,
    exp_set,
    timeout
)
read_set and write_set are fd_set objects.

The system call will return at the earliest of:

Some fd in read_set becomes readable (meaning that read(fd) will not block);

Some fd in write_set becomes writable (write(fd) will not block);

Timeout elapses;

A signal is delivered;

This is indeed one of the few correct ways to implement timed wait, eliminating all potential race conditions.

It may be surprising that we need to involve so many machinery to solve this seemingly simple problem. The race condition we discussed in this example is a very common occurrence at all levels of concurrent systems design. It is so important that there is a name for it, and it's called the sleep-wakeup race condition. As we learn later in this unit, there are synchronization mechanisms designed specifically to address sleep-wakeup race conditions.

... with `signalfd`

There is a system call, signalfd(), that automatically builds a signal-pipe-like vehicle for us to handle these race conditions. Instead of explicitly creating a pipe with 2 ends and define a signal handler that explicitly writes to the pipe, we simply tell the operating to create a file descriptor from which information about a signal can be read once a signal arrives. Code for doing this is in shell4/timedwait-signalfd.cc. Note that there is no definition of a signal handler any more. The relevant code is shown below:

int main(int, char** argv) {
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGCHLD);
    int r = sigprocmask(SIG_BLOCK, &mask, nullptr);
    assert(r == 0);
    int sigfd = signalfd(-1, &mask, SFD_CLOEXEC);
    assert(sigfd >= 0);

    ...

    // Wait for 0.75 sec, or until something is written to `sigfd`,
    // whichever happens first
    struct timeval timeout = { 0, 750000 };
    fd_set fds;
    FD_SET(sigfd, &fds);
    r = select(sigfd + 1, &fds, nullptr, nullptr, &timeout);

    ...
}

Threads

Modern day computers usually have multiple processors (CPUs) built in. The operating system provides a user-level abstraction of multiple processors call threads. Here is the comparison table of threads (as an abstraction) and multiple processors (as a hardware feature):

Multiple processors	Threads
One primary memory	One virtual address space
Multiple sets of registers and logical computations	Multiple sets of registers, stacks, and logical computations

A process can contain multiple threads. All threads within the same process share the same virtual address space and file descriptor table. Each thread will however have its own set of registers and stack. The processes we have looked at so far all have a single thread running. For multi-threaded processes the kernel will need to store a set of registers for each thread, rather than for each process, to maintain the full process state.

Let's look at how to use threads using our example program in incr-basic.cc:

void threadfunc(unsigned* x) {
    // This is a correct way to increment a shared variable!
    // ... OR IS IT?!?!?!?!?!?!??!?!
    for (int i = 0; i != 10000000; ++i) {
        *x += 1;
    }
}

int main() {
    std::thread th[4];
    unsigned n = 0;
    for (int i = 0; i != 4; ++i) {
        th[i] = std::thread(threadfunc, &n);
    }
    for (int i = 0; i != 4; ++i) {
        th[i].join();
    }
    printf("%u\n", n);
}

In this code we run the function threadfunc() in parallel in 4 threads. The std:🧵:join() function makes the main thread block until the thread upon which the join() is called finishes execution. So the final value of n will be printed by the main thread after all 4 threads finish.

In each thread we increment a shared variable 10 million times. There are 4 threads incrementing in total and the variable starts at zero. What should the final value of the variable be after all incrementing threads finish?

40 million seems like a reasonable answer, but by running the program we observe that sometimes it prints out values like 30 million or 20 million. What's going on?

First we noticed that the compiler can optimize away the loop in threadfunc(). It recognizes that the loop is simply incrementing the shared variable by an aggregate of 10 million, so it will transform the loop into a single addl instruction with immediate value 10 million.

Secondly, there is a race condition in the addl instruction itself! Up until this point in the course, we've been thinking x86 instructions as being indivisible and atomic. In fact, they are not, and their lack of atomicity shows up in a multi-processor environment.

Inside the processor hardware, the addl $10000000, (%rdi) is actually implemented as 3 separate "micro-op" instructions:

movl (%rdi), %temp (load)
addl $10000000, %temp (add)
movl %temp, (%rdi) (store)

Imagine 2 threads executing this addl instruction at the same time (concurrently). Each thread loads the same value of (%rdi) from memory, then adds 10 million to it in their own separate temporary registers, and then write the same value back to (%rdi) in memory. The last write to memory by each thread will overwrite each other with the same value, and one increment by 10 million will essentially be lost.

This is the behavior of running 2 increments on the same variable in x86 assembly. In the C/C++ abstract machine, accessing shared memory from different threads without proper synchronization is undefined behavior, unless all accesses are reads.

Re-running this experiment with compiler optimizations turned off (so the loop does not get optimized away), we will get a result like 15285711, where roughly 2/3 of the updates are lost.

There are 2 ways to synchronize shared memory accesses in C++. We will describe a low-level approach, using C++'s std::atomic library, first, before introducing a higher level and more general way of performing synchronization.

Atomics

incr-atomic.cc implements synchronized shared-memory access using C++ atomics. Relevant code in threadfunc() is shown below.

void threadfunc(std::atomic<unsigned>* x) {
    for (int i = 0; i != 10000000; ++i) {
        x->fetch_add(1);
        // `*x += 1` and `(*x)++` also work!
    }
}

C++'s atomics library implements atomic additions using an x86's atomic instruction. When we use objdump to inspect the assembly of threadfunc(), we see an lock addl ... instruction instead of just addl .... The lock prefix of the addl instruction asks the processor the hold on to the cache line with the shared variable (or in Intel terms, lock the memory bus) until the entire addl instruction has completed.

C++ atomics and lock-prefixed instructions only work with word-sized variables that do not span multiple cache lines. A more general way to perform synchronized access to arbitrary data in memory is called a mutex (short for mutual exclusion), and we will cover it more in the next lecture.