Waiting, Signals, and Race Conditions (Oh my!)
We've learned how to use waitpid to allow a parent to wait on its child,
and we've experimented with implementing that. In the videos, we've
introduced signals, which lets you coordinate among processes, and we've
learned how pipe work, which provides another mechanism to allow one
process to wait on another. Today, we'll construct somewhat more complex
situations that involve waiting. Doing so will introduce race
conditions, which we'll explain and then try to solve. This will lead
very naturally into the select
system call and our next unit on
Concurrency and Synchronization.
Learning Objectives
- Write signal handlers
- Write code that uses signal handlers
- Define race condition
- Identify possible race conditions in code
- Develop solutions that avoid race conditions
- Develop some intuition for thinking about what happens when multiple things are happening simultaneously
Getting Started
Today's code is in the l17
directory of the cs61-exercises
repository.
The Scenario
In class, we've experimented with waiting for children. In the video, we examined how to do this using either blocking or polling. In general, blocking is more efficient. So, how might you solve the following dilemma: I would like to wait for my child processes, but my child processes are unruly and sometimes they do naughty things -- like looping infinitely or taking too long to run. I would prefer to only wait up to 1 second for my children to exit. How might I do this?
The Alarm Clock
Let's start by writing programs with signal handlers.
Take a look at alarm1.c
and alarm2.c
. Both of these programs handle
sleeping for at most one second. However, they are vastly different in
terms of which one might let you wait for a child while you limit
yourself to a 1 second wait.
1. Which program might you be able to modify to support the waiting for a child?
Waiting for Children in the Presence of an alarm clock
We've implemented a program that combines one of the alarm clocks with
waiting for a child. Take a look at waitalarm.c
. Read the program,
predict what will happen, and only after you've made a prediction, run
it a few times. Now, test your understanding of the program by changing
one number to make the program print out a different message.
2. What number did you change?
Note: you might have noticed that we are using something called
nfork
instead of fork
. This function is implemented in the file
helpers.h
. If you take a look, what you'll see is that we
nondeterministically put the parent to sleep after it forks. Here's why:
the operating system makes absolutely no guarantee about the order in
which things will run. However, empirically, we observe that Linux seems
to always run the parent first. You should never ever depend on this
behavior and by nondeterministically adding a sleep, we make sure that
you get different process orderings.
Using Signals instead of Blocking Waitpid
In the last example, we used a signal to tell us when our time limit elapsed; and we used a blocking system call to wait for the child to terminate. Let's try to flip that -- can we use a signal to tell us when a child terminates and block for our time limit? Let's try it!
Examine the file waitblock.c
. This has all the pieces you need to
implement this alternate solution. You will need to A) write a signal
handler for the SIGCHLD signal and then B) register that signal handler.
Your code will be quite similar to the code in alarm2.c
, but you
should try to write it on your own. Test your code by changing the value
of the child's sleep so that you make sure that you trigger both the
"parent waited too long" condition as well as the "child exited"
condition.
All is not well
It turns out that this solutions isn't quite a solution at all. It has a bug called a race condition. This means that the correctness of your code depends on the precise order in which processes get scheduled. And you should never, ever, ever depend on such ordering.
Can you find the race condition? We've used nfork
again to try to make
it happen, but to be honest, it's difficult to make this happen
reliably. Note: One reason synchronization and timing problems are
so difficult to find is because they are so hard to reproduce!
3. Go through the code and at each step, ask yourself, "If a specific event happened at this instance, would the program run as intended?" When you find a case where the answer is no, you've found a race condition. Check with a member of the teaching staff to confirm that you've found the race condition!
Can we get rid of the race condition?
One would hope so!
Sisyphus suggests that we shouldn't sleep. Instead, we should use the
non-blocking version of waitpid to check when the process has exited. He
wrote his solution in the program waittimeout.c
. Review this code.
4. What do you think of Sisyphus' solution? What problem does it exhibit? Can you think of a way to demonstrate this problem?
Can you modify Sisyphus' solution so that it trades off CPU utilization and how responsive it is to a child's exit? Try it and experiment with different points in that tradeoff space.
Based on the numbers you got above, about how long does it take to make a waitpid call on the appliance?
Better than Sisyphus
Hermes watched the video on signals and remembers that it's possible to
block signals. He thinks that he can fix the waitblock.c
race
condition by turning signals on and off at the right time.
5. Do you agree or disagree with him?
Atalanta has an even better idea! What if we just do a waitpid with WNOHANG at the place in our program where we have the race condition.
6. Will this solve the problem? Can you identify the fundamental problem that continues to plague us here?
Wrapping Up
You have just entered the realm of concurrency and the challenge of getting processes running at the same time to coordinate correctly. These sorts of problems arise all the time in computer systems -- learning how to deal with concurrency correctly is vital! Next week we'll begin digging into this area in earnest!
Please take a moment to complete this survey.