12/1/16: Implementing synchronization

Slides

Atomic instructions

This section concerns the behavior of an x86-64 processor, not the C abstract machine.

Many instructions have atomic effect
- Loading an aligned 8-byte value (or 1, 2, 4-byte value)
- Storing an aligned 8-byte value (or 1, 2, 4-byte value)
- Thus, when multiple threads simultaneously store 8-byte values at the same address, any load of that address will see a value that was stored—not, like, some crazy combination of values
- (Note: this only applies to aligned loads and stores—if a load or store crosses a cache-line boundary, a load can see a crazy combination of values!)
But many instructions do not have atomic effect!
- Consider incq (%rax), which increments the 64-bit integer value whose address is %rax
- That instruction logically comprises a load (uint64_t tmp = \*(%rax);), an arithmetic operation (tmp = tmp + 1;), and a store (\*(%rax) = tmp;)
- The load and the store each have atomic effect (assuming an aligned address)—but the instruction as a whole does not!
- Say a memory location holds zero initially, and then two threads each call incq on that location 100,000 times each. If incq was atomic, the final value stored in that location would always be 200,000. But in real processors, the final result might be much less!
- x86-64 instructions that combine reads and writes have non-atomic effects by default.
So how can we build a lock?
- Mutual exclusion locks effectively require reading and writing the same variable in one atomic step!
- Although smart people have figured out lock implementations that appear to work without atomic read/write instructions (see Lamport’s bakery algorithm), those algorithms do not work on modern machines!
The solution is atomic instructions.
- x86-64 supports special instructions that implement reading and writing in one atomic step.
- This is usually indicated by a lock prefix on the instruction.
- For example, lock incq (%rax) does have atomic effect.
C compilers give programmers access to atomic instructions through special functions.
- For example, the __sync_fetch_and_add builtin function compiles down to the x86-64 instruction lock add. It atomically reads a location, adds a number to the location, and returns the old value stored at that location. It works for any suitable type, but is most commonly used for integers. TYPE __sync_fetch_and_add(TYPE* object, TYPE value) { // The following is implemented in one atomic step: TYPE tmp = *object; *object += value; return tmp; }
- You can use __sync_fetch_and_add to build a lock as follows: typedef int mutex_t; void mutex_lock(mutex_t* m) { while (__sync_fetch_and_add(m, 1) != 0) __sync_fetch_and_add(m, -1); }
- More __sync builtins
- This support is relatively new and is still changing. Newer releases of GCC discourage __sync functions in favor of __atomic functions; you don’t need to understand the difference.
The most fundamental atomic instruction is compare-and-swap, also known as cmpxchg (compare-exchange).
- Using compare-and-swap, you can implement any other atomic function (though not necessarily as efficiently!). TYPE cmpxchg(TYPE* object, TYPE expected, TYPE desired) { // The following is implemented in one atomic step: TYPE actual = *object; if (actual == expected) *object = desired; return actual; }
- Compare-and-swap can, for example, implement “atomic multiply by 7 and add 2”. Here’s how: uint64_t atomic_mul7_add2(uint64_t* object) { while (1) { uint64_t expected = *object; if (cmpxchg(object, expected, expected * 7 + 2) == expected) return; } }
- Compare-and-swap is almost always used in a loop, because the swap might fail. In this it resembles the condition-variable wait operation.
- In GCC, this is called __sync_val_compare_and_swap or (though the interface is slightly different) __atomic_compare_exchange.