it is a pragmatic concession to the fact that *everyone* is pretty bad at modeling
atomics. At the very least, we can benefit from existing tooling and research around
C.

Trying to fully explain the model in this book is fairly hopeless. It's defined
in terms of madness-inducing causality graphs that require a full book to properly
understand in a practical way. If you want all the nitty-gritty details, you
should check out [C's specification][C11-model]. Still, we'll try to cover the
basics and some of the problems Rust developers face.

The C11 memory model is fundamentally about trying to bridge the gap between
the semantics we want, the optimizations compilers want, and the inconsistent
chaos our hardware wants. *We* would like to just write programs and have them
do exactly what we said but, you know, *fast*. Wouldn't that be great?

# Compiler Reordering

Compilers fundamentally want to be able to do all sorts of crazy transformations
to reduce data dependencies and eliminate dead code. In particular, they may
radically change the actual order of events, or make events never occur! If we
write something like

```rust,ignore
x = 1;
y = 3;
x = 2;
```

The compiler may conclude that it would *really* be best if your program did

```rust,ignore
x = 2;
y = 3;
```

This has inverted the order of events *and* completely eliminated one event.
From a single-threaded perspective this is completely unobservable: after all
the statements have executed we are in exactly the same state. But if our
program is multi-threaded, we may have been relying on `x` to *actually* be
assigned to 1 before `y` was assigned. We would *really* like the compiler to
be able to make these kinds of optimizations, because they can seriously
improve performance. On the other hand, we'd really like to be able to depend
on our program *doing the thing we said*.

# Hardware Reordering

On the other hand, even if the compiler totally understood what we wanted and
respected our wishes, our *hardware* might instead get us in trouble. Trouble
comes from CPUs in the form of memory hierarchies. There is indeed a global
shared memory space somewhere in your hardware, but from the perspective of
each CPU core it is *so very far away* and *so very slow*. Each CPU would
rather work with its local cache of the data and only go through all the
*anguish* of talking to shared memory when it doesn't actually have that
memory in cache.

After all, that's the whole *point* of the cache, right? If every read from the
cache had to run back to shared memory to double check that it hadn't changed,
what would the point be? The end result is that the hardware doesn't guarantee
that events that occur in the same order on *one* thread, occur in the same
order on *another* thread. To guarantee this, we must issue special instructions
to the CPU telling it to be a bit less smart.

For instance, say we convince the compiler to emit this logic:

```text
initial state: x = 0, y = 1

THREAD 1        THREAD 2
y = 3;          if x == 1 {
x = 1;              y *= 2;
                }
```

Ideally this program has 2 possible final states:

* `y = 3`: (thread 2 did the check before thread 1 completed)
* `y = 6`: (thread 2 did the check after thread 1 completed)

However there's a third potential state that the hardware enables:

* `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`)
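
For concreteness, here's a minimal sketch of how this example might look with
Rust's atomics, using the weakest (`Relaxed`) ordering discussed later in this
section so that no cross-thread ordering is requested. The statics `X` and `Y`
and the load/store framing of `y *= 2` are our own scaffolding, not part of the
original example:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

static X: AtomicU32 = AtomicU32::new(0);
static Y: AtomicU32 = AtomicU32::new(1);

fn main() {
    let t1 = thread::spawn(|| {
        Y.store(3, Ordering::Relaxed);
        X.store(1, Ordering::Relaxed);
    });
    let t2 = thread::spawn(|| {
        if X.load(Ordering::Relaxed) == 1 {
            // With Relaxed, this thread may observe the store to X
            // *without* observing the earlier store to Y...
            let y = Y.load(Ordering::Relaxed);
            // ...and then clobber thread 1's `y = 3` with `y = 2`.
            Y.store(y * 2, Ordering::Relaxed);
        }
    });
    t1.join().unwrap();
    t2.join().unwrap();
    println!("y = {}", Y.load(Ordering::Relaxed));
}
```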

It's worth noting that different kinds of CPU provide different guarantees. It
is common to separate hardware into two categories: strongly-ordered and
weakly-ordered. Most notably x86/64 provides strong ordering guarantees, while
ARM provides weak ordering guarantees. This has two consequences for
concurrent programming:

* Asking for stronger guarantees on strongly-ordered hardware may be cheap or
  even *free* because they already provide strong guarantees unconditionally.
  Weaker guarantees may only yield performance wins on weakly-ordered hardware.

* Asking for guarantees that are *too* weak on strongly-ordered hardware
  is more likely to *happen* to work, even though your program is strictly
  incorrect. If possible, concurrent algorithms should be tested on
  weakly-ordered hardware.

# Data Accesses

The C11 memory model attempts to bridge the gap by allowing us to talk about
the *causality* of our program. Generally, this is by establishing a
*happens before* relationship between parts of the program and the threads
that are running them. This gives the hardware and compiler room to optimize the
program more aggressively where a strict happens-before relationship isn't
established, but forces them to be more careful where one *is* established.
The way we communicate these relationships is through *data accesses* and
*atomic accesses*.

Data accesses are the bread-and-butter of the programming world. They are
fundamentally unsynchronized and compilers are free to aggressively optimize
them. In particular, data accesses are free to be reordered by the compiler
on the assumption that the program is single-threaded. The hardware is also free
to propagate the changes made in data accesses to other threads
as lazily and inconsistently as it wants. Most critically, data accesses are
how data races happen. Data accesses are very friendly to the hardware and
compiler, but as we've seen they offer *awful* semantics to try to
write synchronized code with.

Atomic accesses are how we tell the hardware and compiler that our program is
multi-threaded. Each atomic access can be marked with
an *ordering* that specifies what kind of relationship it establishes with
other accesses. In practice, this boils down to telling the compiler and hardware
certain things they *can't* do. For the compiler, this largely revolves
around re-ordering of instructions. For the hardware, this largely revolves
around how writes are propagated to other threads. The set of orderings Rust
exposes are:

* Sequentially Consistent (SeqCst)
* Release
* Acquire
* Relaxed

(Note: We explicitly do not expose the C11 *consume* ordering)
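
In code, these are the variants of `std::sync::atomic::Ordering`. As a quick
sketch of where the ordering parameter actually appears (the values stored here
are arbitrary):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    let a = AtomicUsize::new(0);
    a.store(1, Ordering::SeqCst);      // strongest, always correct
    a.store(2, Ordering::Release);     // a store with release semantics
    let _ = a.load(Ordering::Acquire); // a load with acquire semantics
    let _ = a.load(Ordering::Relaxed); // weakest, atomicity only
}
```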

TODO: give simple "basic" explanation of these
TODO: negative reasoning vs positive reasoning?
TODO: implementing Arc example (why does Drop need the trailing barrier?)

# Sequentially Consistent

Sequentially Consistent is the most powerful of all, implying the restrictions
of all other orderings. A Sequentially Consistent operation *cannot*
be reordered: all accesses on one thread that happen before and after it *stay*
before and after it. A program that has sequential consistency has the very nice
property that there is a single global execution of the program's instructions
that all threads agree on. This execution is also particularly nice to reason
about: it's just an interleaving of each thread's individual executions.
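
Revisiting the earlier hardware-reordering example with `SeqCst` (a sketch
using the same scaffolding as before), the troublesome `y = 2` outcome is ruled
out: if thread 2 observes `x == 1`, it is guaranteed to also observe `y == 3`:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

static X: AtomicU32 = AtomicU32::new(0);
static Y: AtomicU32 = AtomicU32::new(1);

fn main() {
    let t1 = thread::spawn(|| {
        Y.store(3, Ordering::SeqCst);
        X.store(1, Ordering::SeqCst);
    });
    let t2 = thread::spawn(|| {
        if X.load(Ordering::SeqCst) == 1 {
            // SeqCst gives a single global order all threads agree on:
            // seeing X = 1 implies the earlier Y = 3 store is also visible.
            let y = Y.load(Ordering::SeqCst);
            Y.store(y * 2, Ordering::SeqCst);
        }
    });
    t1.join().unwrap();
    t2.join().unwrap();
    // Only `y = 3` or `y = 6` are possible now.
    println!("y = {}", Y.load(Ordering::SeqCst));
}
```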

The relative developer-friendliness of sequential consistency doesn't come for
free. Even on strongly-ordered platforms, sequential consistency involves
emitting memory fences.

In practice, sequential consistency is rarely necessary for program correctness.
However, sequential consistency is definitely the right choice if you're not
confident about the other memory orders. Having your program run a bit slower
than it needs to is certainly better than it running incorrectly! It's also
completely trivial to downgrade to a weaker consistency later.

# Acquire-Release

Acquire and Release are largely intended to be paired. Their names hint at
their use case: they're perfectly suited for acquiring and releasing locks,
and ensuring that critical sections don't overlap.

An acquire access ensures that every access after it *stays* after it. However,
operations that occur before an acquire are free to be reordered to occur after
it.

A release access ensures that every access before it *stays* before it. However,
operations that occur after a release are free to be reordered to occur before
it.

Basic use of release-acquire is simple: you acquire a location of memory to
begin the critical section, and then release that location to end it. If
thread A releases a location of memory and thread B acquires that location of
memory, this establishes that A's critical section *happened before* B's
critical section. All accesses that happened before the release will be observed
by anything that happens after the acquire.
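
As an illustrative sketch (not a production lock), a minimal spinlock shows the
pairing: the `Acquire` on a successful lock attempt synchronizes with the
`Release` of the previous holder, so critical sections can't overlap:

```rust
use std::sync::{Arc, atomic::{AtomicBool, Ordering}};
use std::thread;

fn main() {
    let lock = Arc::new(AtomicBool::new(false)); // false means unlocked
    let mut handles = Vec::new();
    for _ in 0..2 {
        let lock = Arc::clone(&lock);
        handles.push(thread::spawn(move || {
            // Acquire the lock: spin until we swap false -> true.
            while lock
                .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_err()
            {}
            // ... critical section: everything here happens-after the
            // previous holder's Release store ...
            lock.store(false, Ordering::Release); // release the lock
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
}
```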

On strongly-ordered platforms most accesses have release or acquire semantics,
making release and acquire often totally free. This is not the case on
weakly-ordered platforms.

# Relaxed

Relaxed accesses are the absolute weakest. They can be freely re-ordered and
provide no happens-before relationship. Still, relaxed operations *are*
atomic, which is valuable. Relaxed operations are appropriate for things that
you definitely want to happen, but don't particularly care about much else. For
instance, incrementing a counter can be relaxed if you're not using the
counter to synchronize any other accesses.
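
For example, a shared event counter can use `Relaxed` increments, as long as
nothing else is synchronized through the counter's value (a minimal sketch):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..1000 {
                    // Atomic, so no increments are lost; Relaxed, so it
                    // imposes no ordering on surrounding accesses.
                    COUNTER.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap(); // join() also makes the increments visible here
    }
    assert_eq!(COUNTER.load(Ordering::Relaxed), 4000);
}
```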

There's rarely a benefit in making an operation relaxed on strongly-ordered
platforms, since they usually provide release-acquire semantics anyway. However,
relaxed operations can be cheaper on weakly-ordered platforms.

TODO: implementing Arc example (why does Drop need the trailing barrier?)

[C11-model]: https://en.cppreference.com/w/c/atomic/memory_order
[C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf