From 37228b9d7d56baf82861c70243736ce6b19197a2 Mon Sep 17 00:00:00 2001
From: Alexis Beingessner
Date: Tue, 7 Jul 2015 21:19:04 -0700
Subject: [PATCH] flesh out atomics

---
 atomics.md | 214 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 197 insertions(+), 17 deletions(-)

diff --git a/atomics.md b/atomics.md
index e13fb01..9bafb76 100644
--- a/atomics.md
+++ b/atomics.md
@@ -7,27 +7,138 @@ it is a pragmatic concession to the fact that *everyone* is pretty bad at modeling
 atomics. At very least, we can benefit from existing tooling and research around C.

-Trying to fully explain the model is fairly hopeless. If you want all the
-nitty-gritty details, you should check out [C's specification][C11-model].
-Still, we'll try to cover the basics and some of the problems Rust developers
-face.
+Trying to fully explain the model in this book is fairly hopeless. It's defined
+in terms of madness-inducing causality graphs that require a full book to properly
+understand in a practical way. If you want all the nitty-gritty details, you
+should check out [C's specification][C11-model]. Still, we'll try to cover the
+basics and some of the problems Rust developers face.

-The C11 memory model is fundamentally about trying to bridge the gap between C's
-single-threaded semantics, common compiler optimizations, and hardware peculiarities
-in the face of a multi-threaded environment. It does this by splitting memory
-accesses into two worlds: data accesses, and atomic accesses.
+The C11 memory model is fundamentally about trying to bridge the gap between
+the semantics we want, the optimizations compilers want, and the inconsistent
+chaos our hardware wants. *We* would like to just write programs and have them
+do exactly what we said, but, you know, *fast*. Wouldn't that be great?
+
+
+
+
+# Compiler Reordering
+
+Compilers fundamentally want to be able to do all sorts of crazy transformations
+to reduce data dependencies and eliminate dead code. In particular, they may
+radically change the actual order of events, or make events never occur! If we
+write something like
+
+```rust,ignore
+x = 1;
+y = 3;
+x = 2;
+```
+
+The compiler may conclude that it would *really* be best if your program did
+
+```rust,ignore
+x = 2;
+y = 3;
+```
+
+This has inverted the order of events *and* completely eliminated one event. From
+a single-threaded perspective this is completely unobservable: after all the
+statements have executed we are in exactly the same state. But if our program is
+multi-threaded, we may have been relying on `x` to *actually* be assigned to 1 before
+`y` was assigned. We would *really* like the compiler to be able to make these kinds
+of optimizations, because they can seriously improve performance. On the other hand,
+we'd really like to be able to depend on our program *doing the thing we said*.
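+
+To make this concrete, here is one way the sequence above might look in
+compilable Rust; the `writes` function and its parameters are invented purely
+for illustration. Because `&mut` guarantees exclusive access, the compiler may
+assume no other thread can observe the intermediate state:
+
+```rust
+// Hypothetical example. From the compiler's single-threaded point of view the
+// first store to `*x` is dead, so this may well be compiled as if it were
+// just `*y = 3; *x = 2;` (in either order).
+fn writes(x: &mut u32, y: &mut u32) {
+    *x = 1;
+    *y = 3;
+    *x = 2;
+}
+
+fn main() {
+    let (mut x, mut y) = (0, 0);
+    writes(&mut x, &mut y);
+    assert_eq!((x, y), (2, 3));
+}
+```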
+
+
+
+
+# Hardware Reordering
+
+On the other hand, even if the compiler totally understood what we wanted and
+respected our wishes, our *hardware* might instead get us in trouble. Trouble comes
+from CPUs in the form of memory hierarchies. There is indeed a global shared memory
+space somewhere in your hardware, but from the perspective of each CPU core it is
+*so very far away* and *so very slow*. Each CPU would rather work with its local
+cache of the data and go through all the *anguish* of talking to shared
+memory *only* when it doesn't actually have that memory in cache.
+
+After all, that's the whole *point* of the cache, right? If every read from the
+cache had to run back to shared memory to double check that it hadn't changed,
+what would the point be? The end result is that the hardware doesn't guarantee
+that events that occur in the same order on *one* thread, occur in the same order
+on *another* thread. To guarantee this, we must issue special instructions to
+the CPU telling it to be a bit less smart.
+
+For instance, say we convince the compiler to emit this logic:
+
+```text
+initial state: x = 0, y = 1
+
+THREAD 1        THREAD 2
+y = 3;          if x == 1 {
+x = 1;              y *= 2;
+                }
+```
+
+Ideally this program has 2 possible final states:
+
+* `y = 3`: (thread 2 did the check before thread 1 completed)
+* `y = 6`: (thread 2 did the check after thread 1 completed)
+
+However there's a third potential state that the hardware enables:
+
+* `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`)
+
+It's worth noting that different kinds of CPU provide different guarantees. It
+is common to separate hardware into two categories: strongly-ordered and
+weakly-ordered. Most notably x86/64 provides strong ordering guarantees, while
+ARM provides weak ordering guarantees. This has two consequences for
+concurrent programming:
+
+* Asking for stronger guarantees on strongly-ordered hardware may be cheap or
+  even *free* because they already provide strong guarantees unconditionally.
+  Weaker guarantees may only yield performance wins on weakly-ordered hardware.
+
+* Asking for guarantees that are *too* weak on strongly-ordered hardware
+  is more likely to *happen* to work, even though your program is strictly
+  incorrect. If possible, concurrent algorithms should be tested on
+  weakly-ordered hardware.
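+
+Jumping ahead a little, the example above can be sketched with Rust's atomics.
+In this sketch both locations are made atomic (the statics `X` and `Y` are
+invented for illustration), the store to `x` uses `Release`, and the check of
+`x` uses `Acquire`; these orderings are explained below. With that pairing,
+neither the compiler nor the hardware is allowed to produce the `y = 2`
+outcome:
+
+```rust
+use std::sync::atomic::{AtomicUsize, Ordering};
+use std::thread;
+
+static X: AtomicUsize = AtomicUsize::new(0);
+static Y: AtomicUsize = AtomicUsize::new(1);
+
+fn main() {
+    let t1 = thread::spawn(|| {
+        Y.store(3, Ordering::Relaxed);
+        // Release: everything written before this store is visible to any
+        // thread that Acquire-loads the 1 published here.
+        X.store(1, Ordering::Release);
+    });
+    let t2 = thread::spawn(|| {
+        // Acquire: if we observe the 1, we also observe the store of 3 to Y.
+        if X.load(Ordering::Acquire) == 1 {
+            let y = Y.load(Ordering::Relaxed);
+            Y.store(y * 2, Ordering::Relaxed);
+        }
+    });
+    t1.join().unwrap();
+    t2.join().unwrap();
+    let y = Y.load(Ordering::Relaxed);
+    assert!(y == 3 || y == 6); // y == 2 is impossible with these orderings
+}
+```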
+
+
+
+
+# Data Accesses
+
+The C11 memory model attempts to bridge the gap by allowing us to talk about
+the *causality* of our program. Generally, this is by establishing a
+*happens before* relationship between parts of the program and the threads
+that are running them. This gives the hardware and compiler room to optimize the
+program more aggressively where a strict happens-before relationship isn't
+established, but forces them to be more careful where one *is* established.
+The way we communicate these relationships is through *data accesses* and
+*atomic accesses*.

 Data accesses are the bread-and-butter of the programming world. They are
 fundamentally unsynchronized and compilers are free to aggressively optimize
-them. In particular data accesses are free to be reordered by the compiler
+them. In particular, data accesses are free to be reordered by the compiler
 on the assumption that the program is single-threaded. The hardware is also free
-to propagate the changes made in data accesses as lazily and inconsistently as
-it wants to other threads. Mostly critically, data accesses are where we get data
-races. These are pretty clearly awful semantics to try to write a multi-threaded
-program with.
+to propagate the changes made in data accesses to other threads
+as lazily and inconsistently as it wants. Most critically, data accesses are
+how data races happen. Data accesses are very friendly to the hardware and
+compiler, but as we've seen they offer *awful* semantics to try to
+write synchronized code with.

-Atomic accesses are the answer to this. Each atomic access can be marked with
-an *ordering*. The set of orderings Rust exposes are:
+Atomic accesses are how we tell the hardware and compiler that our program is
+multi-threaded. Each atomic access can be marked with
+an *ordering* that specifies what kind of relationship it establishes with
+other accesses. In practice, this boils down to telling the compiler and hardware
+certain things they *can't* do. For the compiler, this largely revolves
+around re-ordering of instructions. For the hardware, this largely revolves
+around how writes are propagated to other threads. The set of orderings Rust
+exposes are:

 * Sequentially Consistent (SeqCst)
 * Release
@@ -36,11 +147,80 @@ an *ordering*. The set of orderings Rust exposes are:

 (Note: We explicitly do not expose the C11 *consume* ordering)

-TODO: give simple "basic" explanation of these
-TODO: implementing Arc example (why does Drop need the trailing barrier?)
+TODO: negative reasoning vs positive reasoning?
+
+
+
+
+# Sequentially Consistent
+
+Sequentially Consistent is the most powerful of all, implying the restrictions
+of all other orderings. A Sequentially Consistent operation *cannot*
+be reordered: all accesses on one thread that happen before and after it *stay*
+before and after it. A program that has sequential consistency has the very nice
+property that there is a single global execution of the program's instructions
+that all threads agree on. This execution is also particularly nice to reason
+about: it's just an interleaving of each thread's individual executions.
+
+The relative developer-friendliness of sequential consistency doesn't come for
+free. Even on strongly-ordered platforms, sequential consistency involves
+emitting memory fences.
+
+In practice, sequential consistency is rarely necessary for program correctness.
+However sequential consistency is definitely the right choice if you're not
+confident about the other memory orders. Having your program run a bit slower
+than it needs to is certainly better than it running incorrectly! It's also
+completely trivial to downgrade to a weaker consistency later.
+
+
+
+
+# Acquire-Release
+
+Acquire and Release are largely intended to be paired. Their names hint at
+their use case: they're perfectly suited for acquiring and releasing locks,
+and ensuring that critical sections don't overlap.
+
+An acquire access ensures that every access after it *stays* after it. However
+operations that occur before an acquire are free to be reordered to occur after
+it.
+
+A release access ensures that every access before it *stays* before it. However
+operations that occur after a release are free to be reordered to occur before
+it.
+
+Basic use of release-acquire is simple: you acquire a location of memory to
+begin the critical section, and then release that location to end it. If
+thread A releases a location of memory and thread B acquires that location of
+memory, this establishes that A's critical section *happened before* B's
+critical section. All accesses that happened before the release will be observed
+by anything that happens after the acquire.
+
+On strongly-ordered platforms most accesses have release or acquire semantics,
+making release and acquire often totally free. This is not the case on
+weakly-ordered platforms.
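+
+As a concrete sketch of that lock-shaped use case, here is a toy spinlock
+guarding a single counter. The `SpinLock` type is invented for this example
+rather than taken from std: taking the lock uses `Acquire`, and releasing it
+uses `Release`, which is exactly the pairing described above.
+
+```rust
+use std::cell::UnsafeCell;
+use std::sync::atomic::{AtomicBool, Ordering};
+use std::thread;
+
+// Toy spinlock: no poisoning, no backoff, just enough to show the orderings.
+struct SpinLock {
+    locked: AtomicBool,
+    value: UnsafeCell<u32>,
+}
+
+// Safety: `value` is only touched while `locked` is held.
+unsafe impl Sync for SpinLock {}
+
+impl SpinLock {
+    const fn new(v: u32) -> Self {
+        SpinLock { locked: AtomicBool::new(false), value: UnsafeCell::new(v) }
+    }
+
+    fn with<R>(&self, f: impl FnOnce(&mut u32) -> R) -> R {
+        // Acquire: once we own the lock we see every write the previous
+        // owner made before it released.
+        while self.locked.swap(true, Ordering::Acquire) {
+            std::hint::spin_loop();
+        }
+        let result = f(unsafe { &mut *self.value.get() });
+        // Release: our writes are visible to whoever acquires the lock next.
+        self.locked.store(false, Ordering::Release);
+        result
+    }
+}
+
+static COUNTER: SpinLock = SpinLock::new(0);
+
+fn main() {
+    let handles: Vec<_> = (0..4)
+        .map(|_| thread::spawn(|| {
+            for _ in 0..1000 {
+                COUNTER.with(|n| *n += 1);
+            }
+        }))
+        .collect();
+    for handle in handles {
+        handle.join().unwrap();
+    }
+    assert_eq!(COUNTER.with(|n| *n), 4000);
+}
+```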
+
+
+
+
+# Relaxed
+
+Relaxed accesses are the absolute weakest. They can be freely re-ordered and
+provide no happens-before relationship. Still, relaxed operations *are* atomic,
+which is valuable. Relaxed operations are appropriate for things that you
+definitely want to happen, but don't particularly care about much else. For
+instance, incrementing a counter can be relaxed if you're not using the
+counter to synchronize any other accesses.
+
+There's rarely a benefit in making an operation relaxed on strongly-ordered
+platforms, since they usually provide release-acquire semantics anyway. However
+relaxed operations can be cheaper on weakly-ordered platforms.
+
+
+
+
+
+TODO: implementing Arc example (why does Drop need the trailing barrier?)

 [C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf
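+
+The TODO above asks why `Drop` needs a trailing barrier. As a rough sketch of
+the pattern in question (using a made-up `MyArc`, not the real `std::sync::Arc`
+source): the `Release` on the decrement ensures each handle's uses of the data
+happen before the count drops, and the trailing `Acquire` fence ensures the
+thread that sees the count hit zero also sees all of those uses before it
+frees the data.
+
+```rust
+use std::ops::Deref;
+use std::ptr::NonNull;
+use std::sync::atomic::{self, AtomicUsize, Ordering};
+
+// A stripped-down Arc, just enough to show the orderings in Drop.
+struct MyArc<T> {
+    ptr: NonNull<Inner<T>>,
+}
+
+struct Inner<T> {
+    count: AtomicUsize,
+    data: T,
+}
+
+unsafe impl<T: Send + Sync> Send for MyArc<T> {}
+unsafe impl<T: Send + Sync> Sync for MyArc<T> {}
+
+impl<T> MyArc<T> {
+    fn new(data: T) -> Self {
+        let inner = Box::new(Inner { count: AtomicUsize::new(1), data });
+        MyArc { ptr: NonNull::new(Box::into_raw(inner)).unwrap() }
+    }
+
+    fn inner(&self) -> &Inner<T> {
+        unsafe { self.ptr.as_ref() }
+    }
+}
+
+impl<T> Clone for MyArc<T> {
+    fn clone(&self) -> Self {
+        // Relaxed is enough here: making a new handle doesn't need to
+        // synchronize with anything, the count just has to be updated atomically.
+        self.inner().count.fetch_add(1, Ordering::Relaxed);
+        MyArc { ptr: self.ptr }
+    }
+}
+
+impl<T> Deref for MyArc<T> {
+    type Target = T;
+    fn deref(&self) -> &T {
+        &self.inner().data
+    }
+}
+
+impl<T> Drop for MyArc<T> {
+    fn drop(&mut self) {
+        // Release: all of this handle's uses of `data` happen before the
+        // decrement becomes visible to other handles.
+        if self.inner().count.fetch_sub(1, Ordering::Release) != 1 {
+            return;
+        }
+        // Acquire (the trailing barrier): the thread that frees the data must
+        // see every other handle's uses of it first.
+        atomic::fence(Ordering::Acquire);
+        unsafe { drop(Box::from_raw(self.ptr.as_ptr())) };
+    }
+}
+
+fn main() {
+    let a = MyArc::new(42);
+    let b = a.clone();
+    let t = std::thread::spawn(move || *b);
+    drop(a);
+    assert_eq!(t.join().unwrap(), 42);
+}
+```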