# Atomics

Rust pretty blatantly just inherits the memory model for atomics from C++20. This is not
due to this model being particularly excellent or easy to understand. Indeed,
this model is quite complex and known to have [several flaws][C11-busted].
Rather, it is a pragmatic concession to the fact that *everyone* is pretty bad
at modeling atomics. At the very least, we can benefit from existing tooling and
research around the C/C++ memory model.

(You'll often see this model referred to as "C/C++11" or just "C11". C just copies
the C++ memory model; and C++11 was the first version of the model but it has
received some bugfixes since then.)

Trying to fully explain the model in this book is fairly hopeless. It's defined
in terms of madness-inducing causality graphs that require a full book to
properly understand in a practical way. If you want all the nitty-gritty
details, you should check out the [C++ specification][C++-model].
Still, we'll try to cover the basics and some of the problems Rust developers
face.

The C++ memory model is fundamentally about trying to bridge the gap between the
semantics we want, the optimizations compilers want, and the inconsistent chaos
our hardware wants. *We* would like to just write programs and have them do
exactly what we said but, you know, fast. Wouldn't that be great?

## Compiler Reordering

Compilers fundamentally want to be able to do all sorts of complicated
transformations to reduce data dependencies and eliminate dead code. In
particular, they may radically change the actual order of events, or make events
never occur! If we write something like:

<!-- ignore: simplified code -->
```rust,ignore
x = 1;
y = 3;
x = 2;
```

The compiler may conclude that it would be best if your program did:

<!-- ignore: simplified code -->
```rust,ignore
x = 2;
y = 3;
```

This has inverted the order of events and completely eliminated one event.
From a single-threaded perspective this is completely unobservable: after all
the statements have executed we are in exactly the same state. But if our
program is multi-threaded, we may have been relying on `x` to actually be
assigned to 1 before `y` was assigned. We would like the compiler to be
able to make these kinds of optimizations, because they can seriously improve
performance. On the other hand, we'd also like to be able to depend on our
program *doing the thing we said*.

## Hardware Reordering

On the other hand, even if the compiler totally understood what we wanted and
respected our wishes, our hardware might instead get us in trouble. Trouble
comes from CPUs in the form of memory hierarchies. There is indeed a global
shared memory space somewhere in your hardware, but from the perspective of each
CPU core it is *so very far away* and *so very slow*. Each CPU would rather work
with its local cache of the data and only go through the anguish of talking to
shared memory when it doesn't actually have that memory in cache.

After all, that's the whole point of the cache, right? If every read from the
cache had to run back to shared memory to double check that it hadn't changed,
what would the point be? The end result is that the hardware doesn't guarantee
that events that occur in some order on *one* thread occur in the same
order on *another* thread. To guarantee this, we must issue special instructions
to the CPU telling it to be a bit less smart.

For instance, say we convince the compiler to emit this logic:

```text
initial state: x = 0, y = 1

THREAD 1        THREAD 2
y = 3;          if x == 1 {
x = 1;              y *= 2;
                }
```

Ideally this program has 2 possible final states:

* `y = 3`: (thread 2 did the check before thread 1 completed)
* `y = 6`: (thread 2 did the check after thread 1 completed)

However there's a third potential state that the hardware enables:

* `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`)
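
To make this concrete, here's a sketch of that program in real Rust, using
`Relaxed` atomics so that nothing beyond atomicity itself is requested; the
static names and thread structure are our own framing, not part of the example
above:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

static X: AtomicU32 = AtomicU32::new(0);
static Y: AtomicU32 = AtomicU32::new(1);

fn main() {
    let t1 = thread::spawn(|| {
        // THREAD 1: the hardware is free to make these stores visible
        // to other threads in either order.
        Y.store(3, Ordering::Relaxed);
        X.store(1, Ordering::Relaxed);
    });
    let t2 = thread::spawn(|| {
        // THREAD 2: may observe `x == 1` while still seeing the stale
        // `y == 1`, permitting the surprising `y = 2` outcome.
        if X.load(Ordering::Relaxed) == 1 {
            let y = Y.load(Ordering::Relaxed);
            Y.store(y * 2, Ordering::Relaxed);
        }
    });
    t1.join().unwrap();
    t2.join().unwrap();
    println!("y = {}", Y.load(Ordering::Relaxed));
}
```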

It's worth noting that different kinds of CPU provide different guarantees. It
is common to separate hardware into two categories: strongly-ordered and weakly-ordered.
Most notably x86/64 provides strong ordering guarantees, while ARM
provides weak ordering guarantees. This has two consequences for concurrent
programming:

* Asking for stronger guarantees on strongly-ordered hardware may be cheap or
  even free because they already provide strong guarantees unconditionally.
  Weaker guarantees may only yield performance wins on weakly-ordered hardware.

* Asking for guarantees that are too weak on strongly-ordered hardware is
  more likely to *happen* to work, even though your program is strictly
  incorrect. If possible, concurrent algorithms should be tested on
  weakly-ordered hardware.

## Data Accesses

The C++ memory model attempts to bridge the gap by allowing us to talk about the
*causality* of our program. Generally, this is by establishing a *happens
before* relationship between parts of the program and the threads that are
running them. This gives the hardware and compiler room to optimize the program
more aggressively where a strict happens-before relationship isn't established,
but forces them to be more careful where one is established. The way we
communicate these relationships is through *data accesses* and *atomic
accesses*.

Data accesses are the bread-and-butter of the programming world. They are
fundamentally unsynchronized and compilers are free to aggressively optimize
them. In particular, data accesses are free to be reordered by the compiler on
the assumption that the program is single-threaded. The hardware is also free to
propagate the changes made in data accesses to other threads as lazily and
inconsistently as it wants. Most critically, data accesses are how data races
happen. Data accesses are very friendly to the hardware and compiler, but as
we've seen they offer *awful* semantics to try to write synchronized code with.
Actually, that's too weak.

**It is literally impossible to write correct synchronized code using only data
accesses.**
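
To see why, consider a hedged sketch of the kind of code this statement rules
out. The `DATA`/`READY` names are hypothetical, and the program is *wrong*: the
race on `READY` is undefined behavior.

<!-- ignore: demonstrates undefined behavior -->
```rust,ignore
static mut DATA: u32 = 0;
static mut READY: bool = false;

// Thread 1:
unsafe {
    DATA = 42;
    READY = true; // a plain store: may be reordered before the DATA write
}

// Thread 2:
unsafe {
    // A plain load: the compiler may hoist it out of the loop and spin forever.
    while !READY {}
    // Even if we get here, nothing forces the write to DATA to be visible yet.
    println!("{}", DATA);
}
```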

Atomic accesses are how we tell the hardware and compiler that our program is
multi-threaded. Each atomic access can be marked with an *ordering* that
specifies what kind of relationship it establishes with other accesses. In
practice, this boils down to telling the compiler and hardware certain things
they *can't* do. For the compiler, this largely revolves around reordering of
instructions. For the hardware, this largely revolves around how writes are
propagated to other threads. The set of orderings Rust exposes is:

* Sequentially Consistent (SeqCst)
* Release
* Acquire
* Relaxed

(Note: We explicitly do not expose the C++ *consume* ordering)
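
As a minimal illustration of where these orderings show up, every operation on
the types in `std::sync::atomic` takes its ordering as an explicit argument;
the variable below is our own example, not part of any particular API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    let counter = AtomicUsize::new(0);

    counter.store(1, Ordering::SeqCst);         // sequentially consistent store
    counter.store(2, Ordering::Release);        // release store
    let _now = counter.load(Ordering::Acquire); // acquire load
    counter.fetch_add(1, Ordering::Relaxed);    // relaxed read-modify-write
}
```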

TODO: negative reasoning vs positive reasoning? TODO: "can't forget to
synchronize"

## Sequentially Consistent

Sequentially Consistent is the most powerful of all, implying the restrictions
of all other orderings. Intuitively, a sequentially consistent operation
cannot be reordered: all accesses on one thread that happen before and after a
SeqCst access stay before and after it. A data-race-free program that uses
only sequentially consistent atomics and data accesses has the very nice
property that there is a single global execution of the program's instructions
that all threads agree on. This execution is also particularly nice to reason
about: it's just an interleaving of each thread's individual executions. This
does not hold if you start using the weaker atomic orderings.
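
For instance, here's a sketch of the classic "store buffering" shape (the
statics and the final assertion are our own illustration): because all four
atomic operations below take part in one global order, at least one of the
loads must observe the other thread's store.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static A: AtomicBool = AtomicBool::new(false);
static B: AtomicBool = AtomicBool::new(false);

fn main() {
    let t1 = thread::spawn(|| {
        A.store(true, Ordering::SeqCst);
        B.load(Ordering::SeqCst)
    });
    let t2 = thread::spawn(|| {
        B.store(true, Ordering::SeqCst);
        A.load(Ordering::SeqCst)
    });
    let (saw_b, saw_a) = (t1.join().unwrap(), t2.join().unwrap());

    // In the single global order, whichever store comes first is visible
    // to the other thread's load, so both loads returning `false` is
    // impossible. Under weaker orderings, it isn't.
    assert!(saw_a || saw_b);
}
```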

The relative developer-friendliness of sequential consistency doesn't come for
free. Even on strongly-ordered platforms sequential consistency involves
emitting memory fences.

In practice, sequential consistency is rarely necessary for program correctness.
However sequential consistency is definitely the right choice if you're not
confident about the other memory orders. Having your program run a bit slower
than it needs to is certainly better than it running incorrectly! It's also
mechanically trivial to downgrade atomic operations to have a weaker
consistency later on. Just change `SeqCst` to `Relaxed` and you're done! Of
course, proving that this transformation is *correct* is a whole other matter.

## Acquire-Release

Acquire and Release are largely intended to be paired. Their names hint at their
use case: they're perfectly suited for acquiring and releasing locks, and
ensuring that critical sections don't overlap.

Intuitively, an acquire access ensures that every access after it stays after
it. However operations that occur before an acquire are free to be reordered to
occur after it. Similarly, a release access ensures that every access before it
stays before it. However operations that occur after a release are free to be
reordered to occur before it.

When thread A releases a location in memory and then thread B subsequently
acquires *the same* location in memory, causality is established. Every write
(including non-atomic and relaxed atomic writes) that happened before A's
release will be observed by B after its acquisition. However no causality is
established with any other threads. Similarly, no causality is established
if A and B access *different* locations in memory.
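
A minimal sketch of this message-passing pattern (the `DATA` and `READY`
statics are our own example): the release store to `READY` carries the earlier
relaxed write to `DATA` along with it, so whoever acquires the flag also sees
the data.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed); // happens before the release...
        READY.store(true, Ordering::Release);
    });
    let consumer = thread::spawn(|| {
        // Spin until we acquire the flag the producer released.
        while !READY.load(Ordering::Acquire) {}
        // ...so this read is guaranteed to observe 42.
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}
```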

Basic use of release-acquire is therefore simple: you acquire a location of
memory to begin the critical section, and then release that location to end it.
For instance, a simple spinlock might look like:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

fn main() {
    let lock = Arc::new(AtomicBool::new(false)); // value answers "am I locked?"

    // ... distribute lock to threads somehow ...

    // Try to acquire the lock by setting it to true
    while lock
        .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {}
    // broke out of the loop, so we successfully acquired the lock!

    // ... scary data accesses ...

    // ok we're done, release the lock
    lock.store(false, Ordering::Release);
}
```

On strongly-ordered platforms most accesses have release or acquire semantics,
making release and acquire often totally free. This is not the case on
weakly-ordered platforms.

## Relaxed

Relaxed accesses are the absolute weakest. They can be freely reordered and
provide no happens-before relationship. Even so, relaxed operations are still
atomic. That is, they don't count as data accesses, and any read-modify-write
operations done to them occur atomically. Relaxed operations are appropriate for
things that you definitely want to happen, but don't particularly otherwise care
about. For instance, incrementing a counter can be safely done by multiple
threads using a relaxed `fetch_add` if you're not using the counter to
synchronize any other accesses.
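
A minimal sketch of such a counter (the structure here is our own): no
increments are lost, even though the operations order nothing else.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn main() {
    let counter = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..1000 {
                    // Atomic, so no increments are lost, but it establishes
                    // no ordering with any surrounding accesses.
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    // `join` synchronizes, so this load observes all 4000 increments.
    assert_eq!(counter.load(Ordering::Relaxed), 4000);
}
```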

There's rarely a benefit in making an operation relaxed on strongly-ordered
platforms, since they usually provide release-acquire semantics anyway. However
relaxed operations can be cheaper on weakly-ordered platforms.

[C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf
[C++-model]: https://en.cppreference.com/w/cpp/atomic/memory_order