Waiting, Signals, and Race Conditions (Oh my!)

We've learned how to use waitpid to allow a parent to wait on its child, and we've experimented with implementing that. In the videos, we've introduced signals, which lets you coordinate among processes, and we've learned how pipe work, which provides another mechanism to allow one process to wait on another. Today, we'll construct somewhat more complex situations that involve waiting. Doing so will introduce race conditions, which we'll explain and then try to solve. This will lead very naturally into the select system call and our next unit on Concurrency and Synchronization.

Learning Objectives

Write signal handlers

Write code that uses signal handlers

Define race condition

Identify possible race conditions in code

Develop solutions that avoid race conditions

Develop some intuition for thinking about what happens when multiple things are happening simultaneously

Getting Started

Today's code is in the l17 directory of the cs61-exercises repository.

The Scenario

In class, we've experimented with waiting for children. In the video, we examined how to do this using either blocking or polling. In general, blocking is more efficient. So, how might you solve the following dilemma: I would like to wait for my child processes, but my child processes are unruly and sometimes they do naughty things -- like looping infinitely or taking too long to run. I would prefer to only wait up to 1 second for my children to exit. How might I do this?

The Alarm Clock

Let's start by writing programs with signal handlers.

Take a look at alarm1.c and alarm2.c. Both of these programs handle sleeping for at most one second. However, they are vastly different in terms of which one might let you wait for a child while you limit yourself to a 1 second wait.

1. Which program might you be able to modify to support the waiting for a child?

Waiting for Children in the Presence of an alarm clock

We've implemented a program that combines one of the alarm clocks with waiting for a child. Take a look at waitalarm.c. Read the program, predict what will happen, and only after you've made a prediction, run it a few times. Now, test your understanding of the program by changing one number to make the program print out a different message.

2. What number did you change?

Note: you might have noticed that we are using something called nfork instead of fork. This function is implemented in the file helpers.h. If you take a look, what you'll see is that we nondeterministically put the parent to sleep after it forks. Here's why: the operating system makes absolutely no guarantee about the order in which things will run. However, empirically, we observe that Linux seems to always run the parent first. You should never ever depend on this behavior and by nondeterministically adding a sleep, we make sure that you get different process orderings.

Using Signals instead of Blocking Waitpid

In the last example, we used a signal to tell us when our time limit elapsed; and we used a blocking system call to wait for the child to terminate. Let's try to flip that -- can we use a signal to tell us when a child terminates and block for our time limit? Let's try it!

Examine the file waitblock.c. This has all the pieces you need to implement this alternate solution. You will need to A) write a signal handler for the SIGCHLD signal and then B) register that signal handler. Your code will be quite similar to the code in alarm2.c, but you should try to write it on your own. Test your code by changing the value of the child's sleep so that you make sure that you trigger both the "parent waited too long" condition as well as the "child exited" condition.

All is not well

It turns out that this solutions isn't quite a solution at all. It has a bug called a race condition. This means that the correctness of your code depends on the precise order in which processes get scheduled. And you should never, ever, ever depend on such ordering.

Can you find the race condition? We've used nfork again to try to make it happen, but to be honest, it's difficult to make this happen reliably. Note: One reason synchronization and timing problems are so difficult to find is because they are so hard to reproduce!

3. Go through the code and at each step, ask yourself, "If a specific event happened at this instance, would the program run as intended?" When you find a case where the answer is no, you've found a race condition. Check with a member of the teaching staff to confirm that you've found the race condition!

Can we get rid of the race condition?

One would hope so!

Sisyphus suggests that we shouldn't sleep. Instead, we should use the non-blocking version of waitpid to check when the process has exited. He wrote his solution in the program waittimeout.c. Review this code.

4. What do you think of Sisyphus' solution? What problem does it exhibit? Can you think of a way to demonstrate this problem?

Can you modify Sisyphus' solution so that it trades off CPU utilization and how responsive it is to a child's exit? Try it and experiment with different points in that tradeoff space.

Based on the numbers you got above, about how long does it take to make a waitpid call on the appliance?

Better than Sisyphus

Hermes watched the video on signals and remembers that it's possible to block signals. He thinks that he can fix the waitblock.c race condition by turning signals on and off at the right time.

5. Do you agree or disagree with him?

Atalanta has an even better idea! What if we just do a waitpid with WNOHANG at the place in our program where we have the race condition.

6. Will this solve the problem? Can you identify the fundamental problem that continues to plague us here?

Wrapping Up

You have just entered the realm of concurrency and the challenge of getting processes running at the same time to coordinate correctly. These sorts of problems arise all the time in computer systems -- learning how to deal with concurrency correctly is vital! Next week we'll begin digging into this area in earnest!

Please take a moment to complete this survey.