From a26eab47d283b7a9a66a38ab7b6a432cbfbb85e6 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Fri, 5 Aug 2022 12:14:36 +0100 Subject: [PATCH] =?UTF-8?q?Write=20the=20=E2=80=9CRelaxed=E2=80=9D=20secti?= =?UTF-8?q?on?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- src/atomics/relaxed.md | 390 ++++++++++++++++++++++++++++++++++++++++- 1 file changed, 385 insertions(+), 5 deletions(-) diff --git a/src/atomics/relaxed.md b/src/atomics/relaxed.md index f20b69a..4c4c4c6 100644 --- a/src/atomics/relaxed.md +++ b/src/atomics/relaxed.md @@ -28,11 +28,11 @@ Thread 1 data Thread 2 └────┘ ``` -Let’s try to figure out where the line in Thread 2’s access joins up. The rules -from before don’t help us much unfortunately since there are no arrows -connecting that operation to anything, so we can’t immediately rule anything -out. As a result, we end up facing a situation we haven’t faced before: there is -_more than one_ potential value for Thread 2 to read. +Unfortunately, the rules from before don’t help us in finding out where Thread +2’s line joins up to, since there are no arrows connecting that operation to +anything and therefore we can’t immediately rule any values out. As a result, we +end up facing a situation we haven’t faced before: there is _more than one_ +potential value for Thread 2 to read. And this is where we encounter the big limitation with unsynchronized data accesses: the price we pay for their speed and optimization capability is that @@ -40,4 +40,384 @@ this situation is considered **Undefined Behavior**. For an unsynchronized read to be acceptable, there has to be _exactly one_ potential value for it to read, and when there are multiple like in this situation it is considered a data race. +So what can we do about this? Well, two things need to be changed. First of all, +Thread 1 has to use an atomic store instead of an unsynchronized write, and +secondly Thread 2 has to use an atomic load instead of an unsynchronized read. +You’ll also notice that all the atomic functions accept one (and sometimes two) +parameters of `atomic::Ordering`s — we’ll explore the details of the differences +between them later, but for now we’ll use `Relaxed` because it is by far the +simplest of the lot. + +```rust +# use std::sync::atomic::{self, AtomicU32}; +// Initial state +let data = AtomicU32::new(0); +// Thread 1: +data.store(1, atomic::Ordering::Relaxed); +// Thread 2: +data.load(atomic::Ordering::Relaxed); +``` + +The use of the atomic store provides one additional ability in comparison to an +unsynchronized store, and that is that there is no “in-between” state between +the old and new values — instead, it immediately updates, resulting in a diagram +that look a bit more like this: + +```text +Thread 1 data +╭───────╮ ┌────┐ +│ = 1 ├─┐ │ 0 │ +╰───────╯ │ └────┘ + └─┬────┐ + │ 1 │ + └────┘ +``` + +We have now established a _modification order_ for `data`: a total, ordered list +of distinct, separated values that it takes over its lifetime. + +On the loading side, we also obtain one additional ability: when there are +multiple possible values to choose from in the modification order, instead of it +triggering UB, exactly one (but it is unspecified which) value is chosen. 
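
If you want to see this in action, here is a minimal, self-contained sketch of
my own (not part of the example above) that actually runs the two threads using
`std::thread::scope`; depending on how the execution plays out, the load is
allowed to observe either the old `0` or the new `1`:

```rust
use std::sync::atomic::{self, AtomicU32};
use std::thread;

fn main() {
    let data = AtomicU32::new(0);
    thread::scope(|s| {
        // Thread 1: atomically store the new value.
        s.spawn(|| data.store(1, atomic::Ordering::Relaxed));
        // Thread 2: atomically load; this is permitted to print either 0 or 1.
        s.spawn(|| println!("{}", data.load(atomic::Ordering::Relaxed)));
    });
}
```
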
This +means that there are now _two_ potential executions of our program, with no way +for us to control which one occurs: + +```text + Possible Execution 1 ┃ Possible Execution 2 + ┃ +Thread 1 data Thread 2 ┃ Thread 1 data Thread 2 +╭───────╮ ┌────┐ ╭───────╮ ┃ ╭───────╮ ┌────┐ ╭───────╮ +│ store ├─┐ │ 0 ├───┤ load │ ┃ │ store ├─┐ │ 0 │ ┌─┤ load │ +╰───────╯ │ └────┘ ╰───────╯ ┃ ╰───────╯ │ └────┘ │ ╰───────╯ + └─┬────┐ ┃ └─┬────┐ │ + │ 1 │ ┃ │ 1 ├─┘ + └────┘ ┃ └────┘ +``` + +Note that **both sides must be atomic to avoid the data race**: if only the +writing side used atomic operations, the reading side would still have multiple +values to choose from (UB), and if only the reading side used atomic operations +it could end up reading the garbage data “in-between” `0` and `1` (also UB). + +> **NOTE:** This description of why both sides are needed to be atomic +> operations, while neat and intuitive, is not strictly correct: in reality the +> answer is simply “because the spec says so”. However, it is isomorphic to the +> real rules, so it can aid in understanding. + +## Read-modify-write operations + +Loads and stores are pretty neat in avoiding data races, but you can’t get very +far with them. For example, suppose you wanted to implement a global shared +counter that can be used to assign unique IDs to objects. Naïvely, you might try +to write code like this: + +```rust +# use std::sync::atomic::{self, AtomicU64}; +static COUNTER: AtomicU64 = AtomicU64::new(0); +pub fn get_id() -> u64 { + let value = COUNTER.load(atomic::Ordering::Relaxed); + COUNTER.store(value + 1, atomic::Ordering::Relaxed); + value +} +``` + +But then calling that function from multiple threads opens you up to an +execution like below that results in two threads obtaining the same ID: + +```text +Thread 1 COUNTER Thread 2 +╭───────╮ ┌───┐ ╭───────╮ +│ load ├───┤ 0 ├───┤ load │ +╰───╥───╯ └───┘ ╰────╥──╯ +╭───⇓───╮ ┌─┬───┐ ╭────⇓──╮ +│ store ├─┘ │ 1 │ ┌─┤ store │ +╰───────╯ └───┘ │ ╰───────╯ + ┌───┬─┘ + │ 1 │ + └───┘ +``` + +Technically, I believe it is _possible_ to implement this kind of thing with +just loads and stores, if you try hard enough and use several atomics. But +luckily, you don’t have to because there also exists another kind of operation, +the read-modify-write, which is specifically suited to this purpose. + +A read-modify-write operation (shortened to RMW) is a special kind of atomic +operation that reads, changes and writes back a value _in one step_. This means +that there are guaranteed to exist no other values in the modification order in +between the read and the write; it happens as a single operation. I would also +like to point out that this is true of **all** atomic orderings, since a common +misconception is that the `Relaxed` ordering somehow negates this guarantee. + +There are many different RMW operations to choose from, but the one most +appropriate for this use case is `fetch_add`, which adds a number to the atomic, +as well as returns the old value. So our code can be rewritten as this: + +```rust +# use std::sync::atomic::{self, AtomicU64}; +static COUNTER: AtomicU64 = AtomicU64::new(0); +pub fn get_id() -> u64 { + COUNTER.fetch_add(1, atomic::Ordering::Relaxed) +} +``` + +And then, no matter how many threads there are, that race condition from earlier +can never occur. 
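
As a quick sanity check, here is a test sketch of my own (the thread and ID
counts are arbitrary) that hammers `get_id` from several threads and asserts
that no ID is ever handed out twice:

```rust
# use std::sync::atomic::{self, AtomicU64};
# static COUNTER: AtomicU64 = AtomicU64::new(0);
# pub fn get_id() -> u64 { COUNTER.fetch_add(1, atomic::Ordering::Relaxed) }
use std::collections::HashSet;
use std::thread;

fn main() {
    // Spawn eight threads, each of which takes a thousand IDs.
    let handles: Vec<_> = (0..8)
        .map(|_| {
            thread::spawn(|| (0..1000).map(|_| get_id()).collect::<Vec<u64>>())
        })
        .collect();

    // Collect every ID that was handed out and check that they are all unique.
    let mut seen = HashSet::new();
    for handle in handles {
        for id in handle.join().unwrap() {
            assert!(seen.insert(id), "duplicate ID: {id}");
        }
    }
}
```
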
Executions will have to look more like this:

```text
  Thread 1      COUNTER      Thread 2
╭───────────╮    ┌───┐    ╭───────────╮
│ fetch_add ├─┐  │ 0 │  ┌─┤ fetch_add │
╰───────────╯ │  └───┘  │ ╰───────────╯
              └─┬───┐   │
                │ 1 │   │
                └───┘   │
                  ┌───┬─┘
                  │ 2 │
                  └───┘
```

There is one problem with this code, however: if `get_id()` is called over
18 446 744 073 709 551 615 times, the counter will overflow and it will start
generating duplicate IDs. Of course, this won’t feasibly happen, but it can be
problematic if you need to _prove_ that it can’t happen (e.g. for safety
purposes) or you’re using a smaller integer type like `u32`.

So we’re going to modify this function so that instead of returning a plain
`u64` it returns an `Option<u64>`, where `None` is used to indicate that an
overflow occurred and no more IDs could be generated. Additionally, it’s not
enough to just return `None` once, because if there are multiple threads
involved, only one of them would ever see that single `None` — instead, it
needs to continue to return `None` _until the end of time_ (or, well, this
execution of the program).

That means we have to do away with `fetch_add`, because `fetch_add` always
wraps around on overflow and there’s no `checked_fetch_add` equivalent. We’ll
return to our racy algorithm for a minute, this time thinking more about what
went wrong. The steps look something like this:

1. Load a value of the atomic
1. Perform the checked add, propagating `None`
1. Store the new value back into the atomic

The problem here is that the store does not necessarily occur directly after
the load in the atomic’s modification order, and that leads to the races. What
we need is some way to say, “add this new value to the modification order, but
_only if_ it occurs directly after the value we loaded”. And luckily for us,
there exists a function that does exactly\* this: `compare_exchange`.

`compare_exchange` is a bit like a store, but instead of unconditionally
storing the value, it will first check the previous value in the modification
order to see whether it is what we expect, and if not it will simply tell us
that and not make any changes. It is an RMW operation, so all of this happens
fully atomically — there is no chance for a race condition.

> \* It’s not quite the same, because `compare_exchange` can suffer from ABA
> problems in which it will see a later value in the modification order that
> just happened to be the same and succeed. However, in this code values can
> never be reused so we don’t have to worry about that.

In our case, we can simply replace the store with a compare exchange of the old
value and itself plus one (returning `None` instead if the addition overflowed,
to prevent overflowing the atomic). Should the `compare_exchange` fail, we know
that some other thread inserted a value in the modification order after the
value we loaded. This isn’t really a problem — we can just try again and again
until we succeed, and `compare_exchange` is even nice enough to give us the
updated value so we don’t have to load again. Also note that after we’ve
updated our value of the atomic, we’re guaranteed to never see the old value
again, by the arrow rules from the previous chapter.
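
Before plugging it into our ID generator, here is a tiny standalone
illustration of my own of what the `compare_exchange` API looks like on its
own, showing both the success and the failure case:

```rust
# use std::sync::atomic::{self, AtomicU64};
let x = AtomicU64::new(5);

// We expect 5 and want to write 6. The latest value really is 5, so the
// exchange succeeds and hands us back the value we replaced.
let res =
    x.compare_exchange(5, 6, atomic::Ordering::Relaxed, atomic::Ordering::Relaxed);
assert_eq!(res, Ok(5));

// We expect 5 again, but the value is now 6: nothing is written, and we are
// told what the value actually was.
let res =
    x.compare_exchange(5, 7, atomic::Ordering::Relaxed, atomic::Ordering::Relaxed);
assert_eq!(res, Err(6));
assert_eq!(x.load(atomic::Ordering::Relaxed), 6);
```
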
So here’s how it looks with these changes applied:

```rust
# use std::sync::atomic::{self, AtomicU64};
static COUNTER: AtomicU64 = AtomicU64::new(0);
pub fn get_id() -> Option<u64> {
    // Load the counter’s initial value from some place in the modification
    // order (it doesn’t matter where, because the compare exchange makes sure
    // that our new value appears directly after it).
    let mut value = COUNTER.load(atomic::Ordering::Relaxed);
    loop {
        // Attempt to add one to the atomic.
        let res = COUNTER.compare_exchange(
            value,
            value.checked_add(1)?,
            atomic::Ordering::Relaxed,
            atomic::Ordering::Relaxed,
        );
        // Check what happened…
        match res {
            // If there was no value in between the value we loaded and our
            // newly written value in the modification order, the compare
            // exchange succeeded and so we are done.
            Ok(_) => break,

            // Otherwise, there was a value in between and so we need to retry
            // the addition and continue looping.
            Err(updated_value) => value = updated_value,
        }
    }
    Some(value)
}
```

This `compare_exchange` loop enables the algorithm to succeed even under
contention; it will simply try again (and again and again). In the below
execution, Thread 1 loses the race to store its value of `1` to the counter,
but that’s okay because it will just add `1` to the `1`, making `2`, and retry
the compare exchange with that, eventually resulting in a unique ID.

```text
Thread 1   COUNTER   Thread 2
╭───────╮   ┌───┐   ╭───────╮
│ load  ├───┤ 0 ├───┤ load  │
╰───╥───╯   └───┘   ╰───╥───╯
╭───⇓───╮  ┌───┬─┐  ╭───⇓───╮
│  cas  ├──┤ 1 │ └──┤  cas  │
╰───╥───╯  └───┘    ╰───────╯
╭───⇓───╮   ┌─┬───┐
│  cas  ├───┘ │ 2 │
╰───────╯     └───┘
```

> `compare_exchange` is abbreviated to CAS here (which stands for
> compare-and-swap), since that is the more general name for the operation. It
> is not to be confused with `compare_and_swap`, a deprecated method on Rust
> atomics that performs the same task as `compare_exchange` but has an inferior
> design in some ways.

There are two additional improvements we can make here. First, because our
algorithm occurs in a loop, it is actually perfectly fine for the CAS to fail
even when there wasn’t a value inserted in the modification order in between,
since we’ll just run it again. This allows us to switch out our call to
`compare_exchange` for a call to the weaker `compare_exchange_weak`, which,
unlike the former function, is allowed to _spuriously_ (i.e. randomly, from the
programmer’s perspective) fail. This often results in better performance on
architectures like ARM, since their `compare_exchange` is really just a loop
around the underlying `compare_exchange_weak`. x86\_64 however will see no
difference in performance.

The second improvement is that this pattern is so common that the standard
library even provides a helper function for it, called `fetch_update`. It
implements the boilerplate `load`-`loop`-`match` parts for us, so all we have
to do is provide the closure that calls `checked_add(1)` and it will all just
work. This leads us to our final code for this example:

```rust
# use std::sync::atomic::{self, AtomicU64};
static COUNTER: AtomicU64 = AtomicU64::new(0);
pub fn get_id() -> Option<u64> {
    COUNTER.fetch_update(
        atomic::Ordering::Relaxed,
        atomic::Ordering::Relaxed,
        |value| value.checked_add(1),
    )
    .ok()
}
```

These CAS loops are the absolute bread and butter of concurrent programming;
they’re everywhere, and essential to know about.
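
For instance, here is a sketch of my own of a “multiply” RMW, a hypothetical
`fetch_mul` helper that the standard library does not provide, built out of a
`compare_exchange_weak` loop in exactly the same shape as `get_id` above:

```rust
# use std::sync::atomic::{self, AtomicU64};
// Atomically multiplies `x` by `factor`, returning the previous value.
// (Hypothetical helper, not a method the standard library provides.)
fn fetch_mul(x: &AtomicU64, factor: u64) -> u64 {
    let mut value = x.load(atomic::Ordering::Relaxed);
    loop {
        match x.compare_exchange_weak(
            value,
            value.wrapping_mul(factor),
            atomic::Ordering::Relaxed,
            atomic::Ordering::Relaxed,
        ) {
            // Our product went into the modification order directly after the
            // value we loaded, so we are done.
            Ok(previous) => return previous,
            // Another thread got there first (or the CAS failed spuriously);
            // retry with the value we were handed back.
            Err(updated) => value = updated,
        }
    }
}

fn main() {
    let x = AtomicU64::new(3);
    assert_eq!(fetch_mul(&x, 7), 3);
    assert_eq!(x.load(atomic::Ordering::Relaxed), 21);
}
```
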
Every other RMW +operation on atomics can (and often is, if the hardware doesn’t have a more +efficient implementation) be implemented via a CAS loop. This is why CAS is seen +as the canonical example of an RMW — it’s pretty much the most fundamental +operation you can get on atomics. + +I’d also like to briefly bring attention to the atomic orderings used in this +section. They were mostly glossed over, but we were exclusively using `Relaxed`, +and that’s because for something as simple as a global ID counter, _you never +need more than `Relaxed`_. The more complex cases which we’ll look at later +definitely do need stronger orderings, but as a general rule, if: + +- you only have one atomic, and +- you have no other related pieces of data + +`Relaxed` is more than sufficient. + ## “Out-of-thin-air” values + +One peculiar consequence of the semantics of `Relaxed` operations is that it is +theoretically possible for values to come into existence “out-of-thin-air” +(commonly abbreviated to OOTA) — that is, a value could appear despite not ever +being calculated anywhere in code. In particular, consider this setup: + +```rust +# use std::sync::atomic::{self, AtomicU32}; +let x = AtomicU32::new(0); +let y = AtomicU32::new(0); + +// Thread 1: +let r1 = y.load(atomic::Ordering::Relaxed); +x.store(r1, atomic::Ordering::Relaxed); + +// Thread 2: +let r2 = x.load(atomic::Ordering::Relaxed); +y.store(r2, atomic::Ordering::Relaxed); +``` + +When starting to draw a diagram for a possible execution of this program, we +have to first lay out the basic facts that we know: +- `x` and `y` both start out as zero +- Thread 1 performs a load of `y` followed by a store of `x` +- Thread 2 performs a load of `x` followed by a store of `y` +- Each of `x` and `y` take on exactly two values in their lifetime + +Then we can start to construct boxes: + +```text +Thread 1 x y Thread 2 +╭───────╮ ┌───┐ ┌───┐ ╭───────╮ +│ load ├─┐ │ 0 │ │ 0 │ ┌─┤ load │ +╰───╥───╯ │ └───┘ └───┘ │ ╰───╥───╯ + ║ │ ?───────────┘ ║ +╭───⇓───╮ └───────────? ╭───⇓───╮ +│ store ├───┬───┐ ┌───┬───┤ store │ +╰───────╯ │ ? │ │ ? │ ╰───────╯ + └───┘ └───┘ +``` + +At this point, if either of those lines were to connect to the higher box then +the execution would be simple: that thread would forward the value to its lower +box, which the other thread would then either read, or load the same value +(zero) from the box above it, and we’d end up with zero in both atomics. But +what if they were to connect downwards? Then we’d end up with an execution that +looks like this: + +```text +Thread 1 x y Thread 2 +╭───────╮ ┌───┐ ┌───┐ ╭───────╮ +│ load ├─┐ │ 0 │ │ 0 │ ┌─┤ load │ +╰───╥───╯ │ └───┘ └───┘ │ ╰───╥───╯ + ║ │ ┌───────────┘ ║ +╭───⇓───╮ └───┼───────┐ ╭───⇓───╮ +│ store ├───┬─┴─┐ ┌─┴─┬───┤ store │ +╰───────╯ │ ? │ │ ? │ ╰───────╯ + └───┘ └───┘ +``` + +But hang on — it’s not fully resolved yet, we still haven’t put in a value in +those lower question marks. So what value should it be? Well, the second value +of `x` is just copied from from the second value of `y`, so we just have to find +the value of that — but the second value of `y` is itself copied from the second +value of `x`! This means that we can actually put any value we like in that box, +including `0` or `42`, and the logic will check out perfectly fine — meaning if +this program were to execute in this fashion, it would end up reading a value +produced out of thin air! 
Now, if we were to strictly follow the rules we’ve laid out thus far, then this
would be a totally valid thing to happen. But luckily, the authors of the C++
specification have recognized this as a problem, and as such refined the
semantics of `Relaxed` to implement a thorough, logically sound, mathematically
proven formal model that prevents it, one that’s just too complex and technical
to explain here—

> No “out-of-thin-air” values can be computed that circularly depend on their
> own computations.

Just kidding. It turns out that this is a *really* difficult problem to solve,
and to my knowledge there is still no known formal way to express how to
prevent it. So in the specification they just kind of hand-wave and say that it
shouldn’t happen, and that the above program must always give zero in both
atomics, despite the theoretical execution that could result in something else.
Well, it generally works in practice, so I can’t complain — it’s just a very
interesting detail to know about.