12/1/16: Implementing synchronization
Atomic instructions
This section concerns the behavior of an x86-64 processor, not the C abstract machine.
- Many instructions have atomic effect
- Loading an aligned 8-byte value (or 1, 2, 4-byte value)
- Storing an aligned 8-byte value (or 1, 2, 4-byte value)
- Thus, when multiple threads simultaneously store 8-byte values at the same address, any load of that address will see a value that was stored—not, like, some crazy combination of values
- (Note: this only applies to aligned loads and stores—if a load or store crosses a cache-line boundary, a load can see a crazy combination of values!)
- But many instructions do not have atomic effect!
- Consider
incq (%rax)
, which increments the 64-bit integer value whose address is%rax
- That instruction logically comprises a load (
uint64_t tmp = \*(%rax);
), an arithmetic operation (tmp = tmp + 1;
), and a store (\*(%rax) = tmp;
) - The load and the store each have atomic effect (assuming an aligned address)—but the instruction as a whole does not!
- Say a memory location holds zero initially, and then two threads
each call
incq
on that location 100,000 times each. Ifincq
was atomic, the final value stored in that location would always be 200,000. But in real processors, the final result might be much less! - x86-64 instructions that combine reads and writes have non-atomic effects by default.
- Consider
- So how can we build a lock?
- Mutual exclusion locks effectively require reading and writing the same variable in one atomic step!
- Although smart people have figured out lock implementations that appear to work without atomic read/write instructions (see Lamport’s bakery algorithm), those algorithms do not work on modern machines!
- The solution is atomic instructions.
- x86-64 supports special instructions that implement reading and writing in one atomic step.
- This is usually indicated by a
lock
prefix on the instruction. - For example,
lock incq (%rax)
does have atomic effect.
- C compilers give programmers access to atomic instructions through
special functions.
- For example, the
__sync_fetch_and_add
builtin function compiles down to the x86-64 instructionlock add
. It atomically reads a location, adds a number to the location, and returns the old value stored at that location. It works for any suitable type, but is most commonly used for integers. TYPE __sync_fetch_and_add(TYPE* object, TYPE value) { // The following is implemented in one atomic step: TYPE tmp = *object; *object += value; return tmp; } - You can use
__sync_fetch_and_add
to build a lock as follows: typedef int mutex_t; void mutex_lock(mutex_t* m) { while (__sync_fetch_and_add(m, 1) != 0) __sync_fetch_and_add(m, -1); } - More
__sync
builtins - This support is relatively new and is still changing. Newer releases
of GCC discourage
__sync
functions in favor of__atomic
functions; you don’t need to understand the difference.
- For example, the
- The most fundamental atomic instruction is compare-and-swap, also
known as
cmpxchg
(compare-exchange).- Using compare-and-swap, you can implement any other atomic function (though not necessarily as efficiently!). TYPE cmpxchg(TYPE* object, TYPE expected, TYPE desired) { // The following is implemented in one atomic step: TYPE actual = *object; if (actual == expected) *object = desired; return actual; }
- Compare-and-swap can, for example, implement “atomic multiply by 7 and add 2”. Here’s how: uint64_t atomic_mul7_add2(uint64_t* object) { while (1) { uint64_t expected = *object; if (cmpxchg(object, expected, expected * 7 + 2) == expected) return; } }
- Compare-and-swap is almost always used in a loop, because the swap might fail. In this it resembles the condition-variable wait operation.
- In GCC, this is called
__sync_val_compare_and_swap
or (though the interface is slightly different)__atomic_compare_exchange
.