Note on zombie processes
This is relevant to the problem set. When a process forks a child, the child eventually exits and has an exit status. The child process then enters a "zombie state": the process no longer exists as a running program, but the kernel still keeps around its pid and its exit status, waiting for `waitpid()` to be called on the process. Zombie processes consume kernel resources, so we should avoid leaving zombies lying around whenever possible.
The trick for avoiding zombie processes is to call `waitpid()` at least once for each child. Invoking `waitpid()` with -1 as the pid argument retrieves the exit status of an arbitrary child.
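For instance, here is a minimal sketch (not from the lecture code; the exit statuses are made up for illustration) of a parent that forks several children and reaps every one of them with `waitpid(-1, ...)`, so no zombies are left behind:

```cpp
#include <cassert>
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    // Fork three children; each exits immediately with a distinct status.
    for (int i = 0; i != 3; ++i) {
        pid_t p = fork();
        assert(p >= 0);
        if (p == 0) {
            _exit(i);   // child: exit with status `i`
        }
    }
    // Reap each child. Passing -1 means "any child"; without these calls
    // the exited children would remain zombies until the parent exits.
    for (int i = 0; i != 3; ++i) {
        int status;
        pid_t exited = waitpid(-1, &status, 0);
        assert(exited > 0);
        if (WIFEXITED(status)) {
            printf("child %d exited with status %d\n",
                   (int) exited, WEXITSTATUS(status));
        }
    }
}
```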
Signals
Recall that one very important function of an operating system is to provide user-accessible abstractions of underlying hardware features that are inconvenient or unsafe for a user process to access directly. UNIX-like systems abstract interrupts (a hardware feature) as user-space signals.
Comparisons between interrupts and signals:
Interrupts | Signals |
---|---|
Can happen at any time, can be blocked | Can happen at any time, can be blocked |
Protected control transfer, saves machine state as part of kernel state | Previous machine state was saved by kernel to ptable |
Kernel invokes interrupt handler in kernel mode on the kernel stack | Kernel invokes signal handler in user mode on a signal stack |
We use the following system calls to manipulate signals:
- `sigaction(sig, handler, oldhandler)`: installs a signal handler
- `kill(pid, sig)`: sends a signal
We use `sigaction()` to install a signal handler, which should be a very simple function that runs when the signal gets delivered. Since a signal can arrive at any time, between any 2 instructions, the signal handler should be kept at minimum complexity, and it definitely should not call user-space library functions like `malloc()`. Imagine that a signal arrives between 2 instructions in the `malloc()` routine itself: the internal state and data structures used by `malloc()` may not be consistent at that point, so calling `malloc()` again from the handler could corrupt them. Typical things we do in a signal handler are setting a flag variable or checking the value of some other variable; we leave the complex work to be triggered by these flag variables in the main program.
For a list of functions that can be safely used in signal handlers, see the manual page `man 7 signal-safety`.
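As an illustration of the set-a-flag idiom, here is a minimal sketch (the `SIGUSR1` choice and the `got_signal` flag are ours, not from the lecture code) that installs a handler with the POSIX `sigaction()` interface and does all the real work in `main()`:

```cpp
#include <cassert>
#include <cstdio>
#include <signal.h>
#include <unistd.h>

// `volatile sig_atomic_t` is the one type guaranteed safe to write
// from a signal handler and read from the main program.
static volatile sig_atomic_t got_signal = 0;

void handler(int) {
    got_signal = 1;   // just set a flag; the real work happens in main()
}

int main() {
    struct sigaction sa = {};
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    int r = sigaction(SIGUSR1, &sa, nullptr);
    assert(r == 0);

    printf("send me SIGUSR1: kill -USR1 %d\n", (int) getpid());
    while (!got_signal) {
        // Note: checking the flag and then sleeping in pause() has exactly
        // the sleep-wakeup race discussed below; it's tolerable in a demo.
        pause();   // block until some signal is delivered
    }
    printf("got the signal; doing the complex work here\n");
}
```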
Solving the race condition...
Recall from the last lecture where we were trying to implement a timed wait. Our plan was to have a signal (SIGCHLD) interrupt the `usleep()` system call in the parent if the child exited within 0.75 seconds. Our solution has a race condition: there is a small window during which, if the signal gets delivered, the parent misses the effect of the signal and overlooks the fact that the child has already exited. At the end of the last lecture we hinted that the problem can be reliably solved using a pipe.
... with a pipe
Take a look at the code in `shell4/timedwait-selfpipe.cc` to see how we could do it. The relevant portion of the code is pasted below.
```cpp
int signalpipe[2];

void signal_handler(int signal) {
    (void) signal;
    ssize_t r = write(signalpipe[1], "!", 1);
    assert(r == 1);
}

int main(int, char** argv) {
    ...
    // Wait for 0.75 sec, or until a byte is written to `signalpipe`,
    // whichever happens first
    struct timeval timeout = { 0, 750000 };
    fd_set fds;
    FD_ZERO(&fds);   // clear the set before adding descriptors
    FD_SET(signalpipe[0], &fds);
    r = select(signalpipe[0] + 1, &fds, nullptr, nullptr, &timeout);
    ...
}
```
We simply write one character into the pipe in the signal handler (it is fine to invoke async-signal-safe system calls like `write()` in signal handlers), and then we use the `select()` system call in the parent to wait for the read end of the pipe to become ready, with a 0.75-second timeout.
Some notes on `select()`. The `select()` system call is very commonly seen in event-driven programs (like a server). It has the following signature:

```cpp
select(nfds,        // largest fd value to monitor, + 1
       read_set,    // fds to watch for readability
       write_set,   // fds to watch for writability
       exp_set,     // fds to watch for exceptional conditions
       timeout)
```

`read_set` and `write_set` are `fd_set` objects. The system call will return at the earliest of:

- Some fd in `read_set` becomes readable (meaning that `read(fd)` will not block);
- Some fd in `write_set` becomes writable (`write(fd)` will not block);
- The timeout elapses;
- A signal is delivered.
This is indeed one of the few correct ways to implement a timed wait, eliminating all potential race conditions.

It may be surprising that we need to involve so much machinery to solve this seemingly simple problem. The race condition we discussed in this example is a very common occurrence at all levels of concurrent systems design. It is so important that it has a name: the sleep-wakeup race condition. As we will learn later in this unit, there are synchronization mechanisms designed specifically to address sleep-wakeup race conditions.
... with signalfd
There is a system call, `signalfd()`, that automatically builds a signal-pipe-like vehicle for us to handle these race conditions. Instead of explicitly creating a pipe with 2 ends and defining a signal handler that explicitly writes to the pipe, we simply tell the operating system to create a file descriptor from which information about a signal can be read once the signal arrives. Code for doing this is in `shell4/timedwait-signalfd.cc`. Note that there is no signal handler definition any more. The relevant code is shown below:
```cpp
int main(int, char** argv) {
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGCHLD);
    int r = sigprocmask(SIG_BLOCK, &mask, nullptr);
    assert(r == 0);
    int sigfd = signalfd(-1, &mask, SFD_CLOEXEC);
    assert(sigfd >= 0);
    ...
    // Wait for 0.75 sec, or until something is written to `sigfd`,
    // whichever happens first
    struct timeval timeout = { 0, 750000 };
    fd_set fds;
    FD_ZERO(&fds);   // clear the set before adding descriptors
    FD_SET(sigfd, &fds);
    r = select(sigfd + 1, &fds, nullptr, nullptr, &timeout);
    ...
}
```
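Once `select()` reports that `sigfd` is readable, the pending signal's details can be consumed by reading a `struct signalfd_siginfo` record from it. A sketch of what that might look like (the helper function is hypothetical, standing in for one of the elided `...` regions above):

```cpp
#include <cassert>
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: called after select() says `sigfd` is readable.
// Reads one signalfd_siginfo record and reaps the child if it was SIGCHLD.
void handle_sigfd_ready(int sigfd) {
    struct signalfd_siginfo ssi;
    ssize_t nr = read(sigfd, &ssi, sizeof(ssi));
    assert(nr == (ssize_t) sizeof(ssi));   // signalfd reads whole records
    if (ssi.ssi_signo == SIGCHLD) {
        waitpid(-1, nullptr, WNOHANG);     // reap the exited child
    }
}
```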
Threads
Modern computers usually have multiple processors (CPUs) built in. The operating system provides a user-level abstraction of multiple processors called threads. Here is a comparison table of threads (as an abstraction) and multiple processors (as a hardware feature):
Multiple processors | Threads |
---|---|
One primary memory | One virtual address space |
Multiple sets of registers and logical computations | Multiple sets of registers, stacks, and logical computations |
A process can contain multiple threads. All threads within the same process share the same virtual address space and file descriptor table. Each thread, however, has its own set of registers and its own stack. The processes we have looked at so far all have a single thread running. For multi-threaded processes the kernel needs to store a set of registers for each thread, rather than for each process, to maintain the full process state.
Let's look at how to use threads with our example program in `incr-basic.cc`:
```cpp
#include <cstdio>
#include <thread>

void threadfunc(unsigned* x) {
    // This is a correct way to increment a shared variable!
    // ... OR IS IT?!?!?!?!?!?!??!?!
    for (int i = 0; i != 10000000; ++i) {
        *x += 1;
    }
}

int main() {
    std::thread th[4];
    unsigned n = 0;
    for (int i = 0; i != 4; ++i) {
        th[i] = std::thread(threadfunc, &n);
    }
    for (int i = 0; i != 4; ++i) {
        th[i].join();
    }
    printf("%u\n", n);
}
```
In this code we run the function `threadfunc()` in parallel in 4 threads. The `std::thread::join()` function makes the main thread block until the thread on which `join()` is called finishes execution. So the final value of `n` is printed by the main thread after all 4 threads finish.
In each thread we increment a shared variable 10 million times. There are 4 threads incrementing in total and the variable starts at zero. What should the final value of the variable be after all incrementing threads finish?
40 million seems like a reasonable answer, but by running the program we observe that sometimes it prints out values like 30 million or 20 million. What's going on?
First, we notice that the compiler can optimize away the loop in `threadfunc()`. It recognizes that the loop simply increments the shared variable by an aggregate of 10 million, so it transforms the loop into a single `addl` instruction with immediate value 10 million.
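Schematically, the optimized `threadfunc()` compiles down to something like the following (an illustrative sketch, not verbatim compiler output):

```asm
threadfunc(unsigned*):
    addl    $10000000, (%rdi)   # the whole loop collapsed into one add
    ret
```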
Second, there is a race condition in the `addl` instruction itself! Up until this point in the course, we have been thinking of x86 instructions as indivisible and atomic. In fact they are not, and their lack of atomicity shows up in a multi-processor environment.
Inside the processor hardware, `addl $10000000, (%rdi)` is actually implemented as 3 separate "micro-op" instructions:

1. `movl (%rdi), %temp` (load)
2. `addl $10000000, %temp` (add)
3. `movl %temp, (%rdi)` (store)
Imagine 2 threads executing this `addl` instruction at the same time (concurrently). Each thread loads the same value of `(%rdi)` from memory, adds 10 million to it in its own separate temporary register, and then writes the same value back to `(%rdi)` in memory. The two writes to memory overwrite each other with the same value, and one increment by 10 million is essentially lost.
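One interleaving that loses an update, assuming the shared variable starts at 0 (a schematic trace using the micro-ops above):

```
Thread 1                                Thread 2
movl (%rdi), %temp     # reads 0
                                        movl (%rdi), %temp     # also reads 0
addl $10000000, %temp
                                        addl $10000000, %temp
movl %temp, (%rdi)     # writes 10M
                                        movl %temp, (%rdi)     # writes 10M again
```

Both threads store 10 million, so the final value is 10 million rather than 20 million: Thread 1's increment is lost.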
This is the behavior of running 2 increments on the same variable in x86 assembly. In the C/C++ abstract machine, accessing shared memory from different threads without proper synchronization is undefined behavior, unless all accesses are reads.
Re-running this experiment with compiler optimizations turned off (so the loop does not get optimized away), we will get a result like 15285711, where roughly 2/3 of the updates are lost.
There are 2 ways to synchronize shared-memory accesses in C++. We will first describe a low-level approach using C++'s `std::atomic` library, before introducing a higher-level and more general way of performing synchronization.
Atomics
`incr-atomic.cc` implements synchronized shared-memory access using C++ atomics. The relevant code in `threadfunc()` is shown below.
```cpp
#include <atomic>

void threadfunc(std::atomic<unsigned>* x) {
    for (int i = 0; i != 10000000; ++i) {
        x->fetch_add(1);
        // `*x += 1` and `(*x)++` also work!
    }
}
```
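For completeness, here is a minimal driver in the style of `incr-basic.cc` for this atomic version (a sketch; the actual `incr-atomic.cc` may differ in details):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

void threadfunc(std::atomic<unsigned>* x) {
    for (int i = 0; i != 10000000; ++i) {
        x->fetch_add(1);   // atomic read-modify-write: no lost updates
    }
}

int main() {
    std::thread th[4];
    std::atomic<unsigned> n{0};
    for (int i = 0; i != 4; ++i) {
        th[i] = std::thread(threadfunc, &n);
    }
    for (int i = 0; i != 4; ++i) {
        th[i].join();
    }
    printf("%u\n", n.load());   // now reliably prints 40000000
}
```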
C++'s atomics library implements atomic additions using x86's atomic instructions. When we use `objdump` to inspect the assembly of `threadfunc()`, we see a `lock addl ...` instruction instead of just `addl ...`. The `lock` prefix of the `addl` instruction asks the processor to hold on to the cache line containing the shared variable (or, in Intel terms, to lock the memory bus) until the entire `addl` instruction has completed.
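Schematically, the difference in the disassembly looks like this (a sketch, not verbatim `objdump` output):

```asm
addl      $1, (%rdi)    # plain increment: 3 micro-ops, racy
lock addl $1, (%rdi)    # atomic increment: cache line held until done
```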
C++ atomics and `lock`-prefixed instructions only work with word-sized variables that do not span multiple cache lines. A more general way to perform synchronized access to arbitrary data in memory is a mutex (short for mutual exclusion), which we will cover in more detail in the next lecture.