pull/378/merge
Sabrina Jewson 1 month ago committed by GitHub
commit d9e07f1b03

@@ -31,5 +31,8 @@ git-repository-url = "https://github.com/rust-lang/nomicon"
"./arc-layout.html" = "./arc-mutex/arc-layout.html"
"./arc.html" = "./arc-mutex/arc.html"
# Atomics chapter
"./atomics.html" = "./atomics/atomics.html"
[rust]
edition = "2024"

@@ -41,7 +41,12 @@
* [Concurrency](concurrency.md)
* [Races](races.md)
* [Send and Sync](send-and-sync.md)
* [Atomics](atomics.md)
* [Atomics](./atomics/atomics.md)
* [Multithreaded Execution](./atomics/multithread.md)
* [Relaxed](./atomics/relaxed.md)
* [Acquire and Release](./atomics/acquire-release.md)
* [SeqCst](./atomics/seqcst.md)
* [Fences](./atomics/fences.md)
* [Implementing Vec](./vec/vec.md)
* [Layout](./vec/vec-layout.md)
* [Allocating](./vec/vec-alloc.md)

@@ -28,7 +28,7 @@ happens-before relationship but is atomic. When `Drop`ping the Arc, however,
we'll need to atomically synchronize when decrementing the reference count. This
is described more in [the section on the `Drop` implementation for
`Arc`](arc-drop.md). For more information on atomic relationships and Relaxed
ordering, see [the section on atomics](../atomics.md).
ordering, see [the section on atomics](../atomics/atomics.md).
Thus, the code becomes this:

@@ -1,239 +0,0 @@
# Atomics
Rust pretty blatantly just inherits the memory model for atomics from C++20. This is not
due to this model being particularly excellent or easy to understand. Indeed,
this model is quite complex and known to have [several flaws][C11-busted].
Rather, it is a pragmatic concession to the fact that *everyone* is pretty bad
at modeling atomics. At the very least, we can benefit from existing tooling and
research around the C/C++ memory model.
(You'll often see this model referred to as "C/C++11" or just "C11". C just copies
the C++ memory model; and C++11 was the first version of the model but it has
received some bugfixes since then.)
Trying to fully explain the model in this book is fairly hopeless. It's defined
in terms of madness-inducing causality graphs that require a full book to
properly understand in a practical way. If you want all the nitty-gritty
details, you should check out the [C++ specification][C++-model].
Still, we'll try to cover the basics and some of the problems Rust developers
face.
The C++ memory model is fundamentally about trying to bridge the gap between the
semantics we want, the optimizations compilers want, and the inconsistent chaos
our hardware wants. *We* would like to just write programs and have them do
exactly what we said but, you know, fast. Wouldn't that be great?
## Compiler Reordering
Compilers fundamentally want to be able to do all sorts of complicated
transformations to reduce data dependencies and eliminate dead code. In
particular, they may radically change the actual order of events, or make events
never occur! If we write something like:
<!-- ignore: simplified code -->
```rust,ignore
x = 1;
y = 3;
x = 2;
```
The compiler may conclude that it would be best if your program did:
<!-- ignore: simplified code -->
```rust,ignore
x = 2;
y = 3;
```
This has inverted the order of events and completely eliminated one event.
From a single-threaded perspective this is completely unobservable: after all
the statements have executed we are in exactly the same state. But if our
program is multi-threaded, we may have been relying on `x` to actually be
assigned to 1 before `y` was assigned. We would like the compiler to be
able to make these kinds of optimizations, because they can seriously improve
performance. On the other hand, we'd also like to be able to depend on our
program *doing the thing we said*.
## Hardware Reordering
On the other hand, even if the compiler totally understood what we wanted and
respected our wishes, our hardware might instead get us in trouble. Trouble
comes from CPUs in the form of memory hierarchies. There is indeed a global
shared memory space somewhere in your hardware, but from the perspective of each
CPU core it is *so very far away* and *so very slow*. Each CPU would rather work
with its local cache of the data and only go through all the anguish of
talking to shared memory when it doesn't actually have that memory in
cache.
After all, that's the whole point of the cache, right? If every read from the
cache had to run back to shared memory to double check that it hadn't changed,
what would the point be? The end result is that the hardware doesn't guarantee
that events that occur in some order on *one* thread, occur in the same
order on *another* thread. To guarantee this, we must issue special instructions
to the CPU telling it to be a bit less smart.
For instance, say we convince the compiler to emit this logic:
```text
initial state: x = 0, y = 1
THREAD 1 THREAD 2
y = 3; if x == 1 {
x = 1; y *= 2;
}
```
Ideally this program has 2 possible final states:
* `y = 3`: (thread 2 did the check before thread 1 completed)
* `y = 6`: (thread 2 did the check after thread 1 completed)
However there's a third potential state that the hardware enables:
* `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`)
It's worth noting that different kinds of CPU provide different guarantees. It
is common to separate hardware into two categories: strongly-ordered and weakly-ordered.
Most notably x86/64 provides strong ordering guarantees, while ARM
provides weak ordering guarantees. This has two consequences for concurrent
programming:
* Asking for stronger guarantees on strongly-ordered hardware may be cheap or
even free because they already provide strong guarantees unconditionally.
Weaker guarantees may only yield performance wins on weakly-ordered hardware.
* Asking for guarantees that are too weak on strongly-ordered hardware is
more likely to *happen* to work, even though your program is strictly
incorrect. If possible, concurrent algorithms should be tested on
weakly-ordered hardware.
## Data Accesses
The C++ memory model attempts to bridge the gap by allowing us to talk about the
*causality* of our program. Generally, this is by establishing a *happens
before* relationship between parts of the program and the threads that are
running them. This gives the hardware and compiler room to optimize the program
more aggressively where a strict happens-before relationship isn't established,
but forces them to be more careful where one is established. The way we
communicate these relationships is through *data accesses* and *atomic
accesses*.
Data accesses are the bread-and-butter of the programming world. They are
fundamentally unsynchronized and compilers are free to aggressively optimize
them. In particular, data accesses are free to be reordered by the compiler on
the assumption that the program is single-threaded. The hardware is also free to
propagate the changes made in data accesses to other threads as lazily and
inconsistently as it wants. Most critically, data accesses are how data races
happen. Data accesses are very friendly to the hardware and compiler, but as
we've seen they offer *awful* semantics to try to write synchronized code with.
Actually, that's too weak.
**It is literally impossible to write correct synchronized code using only data
accesses.**
Atomic accesses are how we tell the hardware and compiler that our program is
multi-threaded. Each atomic access can be marked with an *ordering* that
specifies what kind of relationship it establishes with other accesses. In
practice, this boils down to telling the compiler and hardware certain things
they *can't* do. For the compiler, this largely revolves around re-ordering of
instructions. For the hardware, this largely revolves around how writes are
propagated to other threads. The set of orderings Rust exposes are:
* Sequentially Consistent (SeqCst)
* Release
* Acquire
* Relaxed
(Note: We explicitly do not expose the C++ *consume* ordering)
TODO: negative reasoning vs positive reasoning? TODO: "can't forget to
synchronize"
## Sequentially Consistent
Sequentially Consistent is the most powerful of all, implying the restrictions
of all other orderings. Intuitively, a sequentially consistent operation
cannot be reordered: all accesses on one thread that happen before and after a
SeqCst access stay before and after it. A data-race-free program that uses
only sequentially consistent atomics and data accesses has the very nice
property that there is a single global execution of the program's instructions
that all threads agree on. This execution is also particularly nice to reason
about: it's just an interleaving of each thread's individual executions. This
does not hold if you start using the weaker atomic orderings.
The relative developer-friendliness of sequential consistency doesn't come for
free. Even on strongly-ordered platforms sequential consistency involves
emitting memory fences.
In practice, sequential consistency is rarely necessary for program correctness.
However sequential consistency is definitely the right choice if you're not
confident about the other memory orders. Having your program run a bit slower
than it needs to is certainly better than it running incorrectly! It's also
mechanically trivial to downgrade atomic operations to have a weaker
consistency later on. Just change `SeqCst` to `Relaxed` and you're done! Of
course, proving that this transformation is *correct* is a whole other matter.
## Acquire-Release
Acquire and Release are largely intended to be paired. Their names hint at their
use case: they're perfectly suited for acquiring and releasing locks, and
ensuring that critical sections don't overlap.
Intuitively, an acquire access ensures that every access after it stays after
it. However operations that occur before an acquire are free to be reordered to
occur after it. Similarly, a release access ensures that every access before it
stays before it. However operations that occur after a release are free to be
reordered to occur before it.
When thread A releases a location in memory and then thread B subsequently
acquires *the same* location in memory, causality is established. Every write
(including non-atomic and relaxed atomic writes) that happened before A's
release will be observed by B after its acquisition. However no causality is
established with any other threads. Similarly, no causality is established
if A and B access *different* locations in memory.
Basic use of release-acquire is therefore simple: you acquire a location of
memory to begin the critical section, and then release that location to end it.
For instance, a simple spinlock might look like:
```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
fn main() {
let lock = Arc::new(AtomicBool::new(false)); // value answers "am I locked?"
// ... distribute lock to threads somehow ...
// Try to acquire the lock by setting it to true
while lock.compare_and_swap(false, true, Ordering::Acquire) { }
// broke out of the loop, so we successfully acquired the lock!
// ... scary data accesses ...
// ok we're done, release the lock
lock.store(false, Ordering::Release);
}
```
On strongly-ordered platforms most accesses have release or acquire semantics,
making release and acquire often totally free. This is not the case on
weakly-ordered platforms.
## Relaxed
Relaxed accesses are the absolute weakest. They can be freely re-ordered and
provide no happens-before relationship. Still, relaxed operations are
atomic. That is, they don't count as data accesses and any read-modify-write
operations done to them occur atomically. Relaxed operations are appropriate for
things that you definitely want to happen, but don't particularly otherwise care
about. For instance, incrementing a counter can be safely done by multiple
threads using a relaxed `fetch_add` if you're not using the counter to
synchronize any other accesses.
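For instance, a minimal sketch of such a counter (the thread and iteration
counts here are arbitrary):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..1000 {
                    // We only care that every increment eventually lands;
                    // nothing else is synchronized through COUNTER.
                    COUNTER.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for handle in handles {
        // Joining a thread synchronizes with it, so after all the joins the
        // main thread is guaranteed to observe every increment.
        handle.join().unwrap();
    }

    assert_eq!(COUNTER.load(Ordering::Relaxed), 4000);
}
```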
There's rarely a benefit in making an operation relaxed on strongly-ordered
platforms, since they usually provide release-acquire semantics anyway. However
relaxed operations can be cheaper on weakly-ordered platforms.
[C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf
[C++-model]: https://en.cppreference.com/w/cpp/atomic/memory_order

@@ -0,0 +1,354 @@
# Acquire and Release
Next, we're going to try and implement one of the simplest concurrent utilities
possible — a mutex, but without support for waiting (since that's not really
related to what we're doing now). It will hold both an atomic flag that
indicates whether it is locked or not, and the protected data itself. In code
this translates to:
```rust
use std::cell::UnsafeCell;
use std::sync::atomic::AtomicBool;
pub struct Mutex<T> {
locked: AtomicBool,
data: UnsafeCell<T>,
}
impl<T> Mutex<T> {
pub const fn new(data: T) -> Self {
Self {
locked: AtomicBool::new(false),
data: UnsafeCell::new(data),
}
}
}
```
Now for the lock function. We need to use an RMW here, since we need to both
check whether it is locked and lock it if it isn't in a single atomic step; this
can be most simply done with a `compare_exchange` (unlike before, it doesn't
need to be in a loop this time). For the ordering, we'll just use `Relaxed`
since we don't know of any others yet.
```rust
# use std::cell::UnsafeCell;
# use std::sync::atomic::{self, AtomicBool};
# pub struct Mutex<T> {
# locked: AtomicBool,
# data: UnsafeCell<T>,
# }
impl<T> Mutex<T> {
pub fn lock(&self) -> Option<Guard<'_, T>> {
match self.locked.compare_exchange(
false,
true,
atomic::Ordering::Relaxed,
atomic::Ordering::Relaxed,
) {
Ok(_) => Some(Guard(self)),
Err(_) => None,
}
}
}
pub struct Guard<'mutex, T>(&'mutex Mutex<T>);
// Deref impl omitted…
```
We also need to implement `Drop` for `Guard` to make sure the lock on the mutex
is released once the guard is destroyed. Again we're just using the `Relaxed`
ordering.
```rust
# use std::cell::UnsafeCell;
# use std::sync::atomic::{self, AtomicBool};
# pub struct Mutex<T> {
# locked: AtomicBool,
# data: UnsafeCell<T>,
# }
# pub struct Guard<'mutex, T>(&'mutex Mutex<T>);
impl<T> Drop for Guard<'_, T> {
fn drop(&mut self) {
self.0.locked.store(false, atomic::Ordering::Relaxed);
}
}
```
Great! In normal operation, then, this primitive should allow unique access
to the mutex's data to be transferred across different threads. Usual usage
could look like this:
```rust,ignore
// Initial state
let mutex = Mutex::new(0);
// Thread 1
if let Some(guard) = mutex.lock() {
*guard += 1;
}
// Thread 2
if let Some(guard) = mutex.lock() {
println!("{}", *guard);
}
```
Now, there are many possible executions of this code. For example, Thread 2 (the
reader thread) could lock the mutex first, and Thread 1 (the writer thread)
could fail to lock it:
```text
Thread 1 locked data Thread 2
╭───────╮ ┌────────┐ ┌───┐ ╭───────╮
│ cas ├─┐ │ false │ │ 0 ├╌┐ ┌─┤ cas │
╰───────╯ │ └────────┘ └───┘ ┊ │ ╰───╥───╯
│ ┌────────┬───────┼─┘ ╭───⇓───╮
└─┤ true │ └╌╌╌┤ guard │
└────────┘ ╰───╥───╯
┌────────┬─────────┐ ╭───⇓───╮
│ false │ └─┤ store │
└────────┘ ╰───────╯
```
Or potentially Thread _1_ could lock the mutex first, and Thread _2_ could fail
to lock it:
```text
Thread 1 locked data Thread 2
╭───────╮ ┌────────┐ ┌───┐ ╭───────╮
│ cas ├─┐ │ false │ ┌─│ 0 │───┤ cas │
╰───╥───╯ │ └────────┘ │┌┼╌╌╌┤ ╰───────╯
╭───⇓───╮ └─┬────────┐ │├┼╌╌╌┤
│ += 1; ├╌┐ │ true ├─┘┊│ 1 │
╰───╥───╯ ┊ └────────┘ ┊└───┘
╭───⇓───╮ └╌╌╌╌╌╌╌╌╌╌╌╌╌┘
│ store ├───┬────────┐
╰───────╯ │ false │
└────────┘
```
But the interesting case comes in when Thread 1 successfully locks and unlocks
the mutex, and then Thread 2 locks it. Let's draw that one out too:
```text
Thread 1 locked data Thread 2
╭───────╮ ┌────────┐ ┌───┐ ╭───────╮
│ cas ├─┐ │ false │ │ 0 │ ┌───┤ cas │
╰───╥───╯ │ └────────┘ ┌┼╌╌╌┤ │ ╰───╥───╯
╭───⇓───╮ └─┬────────┐ ├┼╌╌╌┤ │ ╭───⇓───╮
│ += 1; ├╌┐ │ true │ ┊│ 1 │ │ ?╌┤ guard │
╰───╥───╯ ┊ └────────┘ ┊└───┘ │ ╰───╥───╯
╭───⇓───╮ └╌╌╌╌╌╌╌╌╌╌╌╌╌┘ │ ╭───⇓───╮
│ store ├───┬────────┐ │ ┌─┤ store │
╰───────╯ │ false │ │ │ ╰───────╯
└────────┘ │ │
┌────────┬─────────┘ │
│ true │ │
└────────┘ │
┌────────┬───────────┘
│ false │
└────────┘
```
Look at the second operation Thread 2 performs (the read of `data`), for which
we haven't yet joined the line. Where should it connect to? Well actually, it
has multiple options…wait, we've seen this before! It's a data race!
That's not good. Last time the solution was to use atomics instead — but in this
case that doesn't seem to be enough, since even if atomics were used it still
would have the _option_ of reading `0` instead of `1`, and really if we want our
mutex to be sane, it should only be able to read `1`.
So it seems that what we _want_ is to be able to apply the coherence rules from
before to completely rule out zero from the set of the possible values — if we
were able to draw a large arrow from Thread 1's `+= 1;` to Thread 2's
`guard`, then we could trivially use the rule to rule out `0` as a value
that could be read.
This is where the `Acquire` and `Release` orderings come in. Informally put, a
_release store_ will cause an arrow instead of a line to be drawn from the
operation to the destination; and similarly an _acquire load_ will cause an
arrow to be drawn from the destination to the operation. To give a useless
example that illustrates this, for the given program:
```rust
# use std::sync::atomic::{self, AtomicU32};
// Initial state
let a = AtomicU32::new(0);
// Thread 1
a.store(1, atomic::Ordering::Release);
// Thread 2
a.load(atomic::Ordering::Acquire);
```
The two possible executions look like this:
```text
Possible Execution 1 ┃ Possible Execution 2
Thread 1 a Thread 2 ┃ Thread 1 a Thread 2
╭───────╮ ┌───┐ ╭──────╮ ┃ ╭───────╮ ┌───┐ ╭──────╮
│ store ├─┐ │ 0 │ ┌─→ load │ ┃ │ store ├─┐ │ 0 ├───→ load │
╰───────╯ │ └───┘ │ ╰──────╯ ┃ ╰───────╯ │ └───┘ ╰──────╯
└─↘───┐ │ ┃ └─↘───┐
│ 1 ├─┘ ┃ │ 1 │
└───┘ ┃ └───┘
```
These arrows are a new kind of arrow we haven't seen yet; they are known as
_happens-before_ (or happens-after) relations and are represented as thin arrows
(→) on these diagrams. They are weaker than the _sequenced-before_
double-arrows (⇒) that occur inside a single thread, but can still be used with
the coherence rules to determine which values of a memory location are valid to
read.
When a happens-before arrow stores a data value to an atomic (via a release
operation) which is then loaded by another happens-before arrow (via an acquire
operation), we say that the release operation _synchronizes-with_ the acquire
operation, which in doing so establishes that the release operation
_happens-before_ the acquire operation. Therefore, we can say that in the first
possible execution, Thread 1's `store` synchronizes-with Thread 2's `load`,
which causes that `store` and everything sequenced-before it to happen-before
the `load` and everything sequenced-after it.
> More formally, we can say that A happens-before B if any of the following
> conditions are true:
> 1. A is sequenced-before B (i.e. A occurs before B on the same thread)
> 2. A synchronizes-with B (i.e. A is a `Release` operation and B is an
> `Acquire` operation that reads the value written by A)
> 3. A happens-before X, and X happens-before B (transitivity)
There is one more rule required for these to be useful, and that is _release
sequences_: after a release store is performed on an atomic, happens-before
arrows will connect together each subsequent value of the atomic as long as the
new value is caused by an RMW and not just a plain store (this means any
subsequent normal store, no matter the ordering, will end the sequence).
> In the C++11 memory model, any subsequent store by the same thread that
> performed the original `Release` store would also contribute to the release
> sequence. However, this was removed in C++20 for simplicity and better
> optimizations and so **must not** be relied upon.
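To make the rule concrete, here is a small, hypothetical example of a release
sequence at work (the statics, values, and thread split are my own, not taken
from the chapter):

```rust
# use std::sync::atomic::{self, AtomicU32};
static DATA: AtomicU32 = AtomicU32::new(0);
static FLAG: AtomicU32 = AtomicU32::new(0);

// Thread 1
DATA.store(42, atomic::Ordering::Relaxed);
FLAG.store(1, atomic::Ordering::Release); // heads a release sequence

// Thread 2
FLAG.fetch_add(1, atomic::Ordering::Relaxed); // an RMW, so the sequence continues

// Thread 3
if FLAG.load(atomic::Ordering::Acquire) == 2 {
    // Reading a value written by the RMW still synchronizes-with Thread 1's
    // release store, so this load is guaranteed to see 42.
    assert_eq!(DATA.load(atomic::Ordering::Relaxed), 42);
}
```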
With those rules in mind, converting Thread 1's second store to use a `Release`
ordering as well as converting Thread 2's CAS to use an `Acquire` ordering
allows us to effectively draw that arrow we needed before:
```text
Thread 1 locked data Thread 2
╭───────╮ ┌───────┐ ┌───┐ ╭───────╮
│ cas ├─┐ │ false │ │ 0 │ ┌───→ cas │
╰───╥───╯ │ └───────┘ ┌┼╌╌╌┤ │ ╰───╥───╯
╭───⇓───╮ └─┬───────┐ ├┼╌╌╌┤ │ ╭───⇓───╮
│ += 1; ├╌┐ │ true │ ┊│ 1 ├╌│╌╌╌┤ guard │
╰───╥───╯ ┊ └───────┘ ┊└───┘ │ ╰───╥───╯
╭───⇓───╮ └╌╌╌╌╌╌╌╌╌╌╌╌┘ │ ╭───⇓───╮
│ store ├───↘───────┐ │ ┌─┤ store │
╰───────╯ │ false │ │ │ ╰───────╯
└───┬───┘ │ │
┌───↓───┬─────────┘ │
│ true │ │
└───────┘ │
┌───────┬───────────┘
│ false │
└───────┘
```
We can now trace back along the reverse direction of arrows from the `guard`
bubble to the `+= 1` bubble; we have established that Thread 2's load
happens-after the `+= 1` side effect, because Thread 2's CAS synchronizes-with
Thread 1's store. This both avoids the data race and gives the guarantee that
`1` will always be read by Thread 2 (as long as it locks after Thread 1, of
course).
However, that is not the only execution of the program possible. Even with this
setup, there is another execution that can also cause UB: if Thread 2 locks the
mutex before Thread 1 does.
```text
Thread 1 locked data Thread 2
╭───────╮ ┌───────┐ ┌───┐ ╭───────╮
│ cas ├───┐ │ false │┌──│ 0 │────→ cas │
╰───╥───╯ │ └───────┘│ ┌┼╌╌╌┤ ╰───╥───╯
╭───⇓───╮ │ ┌───────┬┘ ├┼╌╌╌┤ ╭───⇓───╮
│ += 1; ├╌┐ │ │ true │ ┊│ 1 │ ?╌┤ guard │
╰───╥───╯ ┊ │ └───────┘ ┊└───┘ ╰───╥───╯
╭───⇓───╮ └╌│╌╌╌╌╌╌╌╌╌╌╌╌┘ ╭───⇓───╮
│ store ├─┐ │ ┌───────┬────────────┤ store │
╰───────╯ │ │ │ false │ ╰───────╯
│ │ └───────┘
│ └─┬───────┐
│ │ true │
│ └───────┘
└───↘───────┐
│ false │
└───────┘
```
Once again `guard` has multiple options for values to read. This one's a bit
more counterintuitive than the previous one, since it requires “travelling
forward in time” to understand why the `1` is even there in the first place —
but since the abstract machine has no concept of time, it's just as valid UB as
any other.
Luckily, we've already solved this problem once, so it's easy to solve again: just
like before, we'll have the CAS become acquire and the store become release, and
then we can use the second coherence rule from before to follow _forward_ the
arrow from the `guard` bubble all the way to the `+= 1;`, determining that it is
only possible for that read to see `0` as its value, as in the execution below.
```text
Thread 1 locked data Thread 2
╭───────╮ ┌───────┐ ┌───┐ ╭───────╮
│ cas ←───┐ │ false │┌──│ 0 ├╌┐──→ cas │
╰───╥───╯ │ └───────┘│ ┌┼╌╌╌┤ ┊ ╰───╥───╯
╭───⇓───╮ │ ┌───────┬┘ ├┼╌╌╌┤ ┊ ╭───⇓───╮
│ += 1; ├╌┐ │ │ true │ ┊│ 1 │ └─╌┤ guard │
╰───╥───╯ ┊ │ └───────┘ ┊└───┘ ╰───╥───╯
╭───⇓───╮ └╌│╌╌╌╌╌╌╌╌╌╌╌╌┘ ╭───⇓───╮
│ store ├─┐ │ ┌───────↙────────────┤ store │
╰───────╯ │ │ │ false │ ╰───────╯
│ │ └───┬───┘
│ └─┬───↓───┐
│ │ true │
│ └───────┘
└───↘───────┐
│ false │
└───────┘
```
This leads us to the proper memory orderings for any mutex (and other locks like
RW locks too, even): use `Acquire` to lock it, and `Release` to unlock it. So
let's go back and update our original mutex definition with this knowledge.
But wait, `compare_exchange` takes two ordering parameters, not just one! That's
right — it also takes a second one to apply when the exchange fails (in our case,
when the mutex is already locked). But we don't need an `Acquire` here, since in
that case we won't be reading from the `data` value anyway, so we'll just stick
with `Relaxed`.
```rust,ignore
impl<T> Mutex<T> {
pub fn lock(&self) -> Option<Guard<'_, T>> {
match self.locked.compare_exchange(
false,
true,
atomic::Ordering::Acquire,
atomic::Ordering::Relaxed,
) {
Ok(_) => Some(Guard(self)),
Err(_) => None,
}
}
}
impl<T> Drop for Guard<'_, T> {
fn drop(&mut self) {
self.0.locked.store(false, atomic::Ordering::Release);
}
}
```
Note that similarly to how atomic operations only make sense when paired with
other atomic operations on the same locations, `Acquire` only makes sense when
paired with `Release` and vice versa. That is, both an `Acquire` with no
corresponding `Release` and a `Release` with no corresponding `Acquire` are
useless, since the arrows will be unable to connect to anything.

@@ -0,0 +1,125 @@
# Atomics
Rust pretty blatantly just inherits the memory model for atomics from C++20. This is not
due to this model being particularly excellent or easy to understand. Indeed,
this model is quite complex and known to have [several flaws][C11-busted].
Rather, it is a pragmatic concession to the fact that *everyone* is pretty bad
at modeling atomics. At the very least, we can benefit from existing tooling and
research around the C/C++ memory model.
(You'll often see this model referred to as "C/C++11" or just "C11". C just copies
the C++ memory model; and C++11 was the first version of the model but it has
received some bugfixes since then.)
Trying to fully explain the model in this book is fairly hopeless. It's defined
in terms of madness-inducing causality graphs that require a full book to
properly understand in a practical way. If you want all the nitty-gritty
details, you should check out the [C++ specification][C++-model] —
note that Rust atomics correspond to C++'s `atomic_ref`, since Rust allows
accessing atomics via non-atomic operations when it is safe to do so.
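As a small illustration of that last point, `get_mut` is one of the ways Rust
lets you touch an atomic non-atomically when you have exclusive access to it:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

let mut x = AtomicU32::new(0);

// While we hold `&mut x`, no other thread can be accessing it, so a plain,
// non-atomic write is allowed.
*x.get_mut() = 1;

// Shared access still goes through the atomic API.
assert_eq!(x.load(Ordering::Relaxed), 1);
```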
In this section we aim to give an informal overview of the topic to cover the
common problems that Rust developers face.
## Motivation
The C++ memory model is very large and confusing with lots of seemingly
arbitrary design decisions. To understand the motivation behind this, it can
help to look at what got us in this situation in the first place. There are
three main factors at play here:
1. Users of the language, who want fast, cross-platform code;
2. compilers, who want to optimize code to make it fast;
3. and the hardware, which is ready to unleash a wrath of inconsistent chaos on
your program at a moment's notice.
The memory model is fundamentally about trying to bridge the gap between these
three, allowing users to write the algorithms they want while the compiler and
hardware perform the arcane magic necessary to make them run fast.
### Compiler Reordering
Compilers fundamentally want to be able to do all sorts of complicated
transformations to reduce data dependencies and eliminate dead code. In
particular, they may radically change the actual order of events, or make events
never occur! If we write something like:
<!-- ignore: simplified code -->
```rust,ignore
x = 1;
y = 3;
x = 2;
```
The compiler may conclude that it would be best if your program did:
<!-- ignore: simplified code -->
```rust,ignore
x = 2;
y = 3;
```
This has inverted the order of events and completely eliminated one event.
From a single-threaded perspective this is completely unobservable: after all
the statements have executed we are in exactly the same state. But if our
program is multi-threaded, we may have been relying on `x` to actually be
assigned to 1 before `y` was assigned. We would like the compiler to be
able to make these kinds of optimizations, because they can seriously improve
performance. On the other hand, we'd also like to be able to depend on our
program *doing the thing we said*.
### Hardware Reordering
On the other hand, even if the compiler totally understood what we wanted and
respected our wishes, our hardware might instead get us in trouble. Trouble
comes from CPUs in the form of memory hierarchies. There is indeed a global
shared memory space somewhere in your hardware, but from the perspective of each
CPU core it is *so very far away* and *so very slow*. Each CPU would rather work
with its local cache of the data and only go through all the anguish of
talking to shared memory when it doesn't actually have that memory in
cache.
After all, that's the whole point of the cache, right? If every read from the
cache had to run back to shared memory to double check that it hadn't changed,
what would the point be? The end result is that the hardware doesn't guarantee
that events that occur in some order on *one* thread, occur in the same
order on *another* thread. To guarantee this, we must issue special instructions
to the CPU telling it to be a bit less smart.
For instance, say we convince the compiler to emit this logic:
```text
initial state: x = 0, y = 1
THREAD 1 THREAD 2
y = 3; if x == 1 {
x = 1; y *= 2;
}
```
Ideally this program has 2 possible final states:
* `y = 3`: (thread 2 did the check before thread 1 completed)
* `y = 6`: (thread 2 did the check after thread 1 completed)
However there's a third potential state that the hardware enables:
* `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`)
It's worth noting that different kinds of CPU provide different guarantees. It
is common to separate hardware into two categories: strongly-ordered and
weakly-ordered, where strongly-ordered hardware implements weak orderings like
`Relaxed` using strong orderings like `Acquire`, while weakly-ordered hardware
makes use of the optimization potential that weak orderings like `Relaxed` give.
Most notably, x86/64 provides strong ordering guarantees, while ARM provides
weak ordering guarantees. This has two consequences for concurrent programming:
* Asking for stronger guarantees on strongly-ordered hardware may be cheap or
even free because they already provide strong guarantees unconditionally.
Weaker guarantees may only yield performance wins on weakly-ordered hardware.
* Asking for guarantees that are too weak on strongly-ordered hardware is
more likely to *happen* to work, even though your program is strictly
incorrect. If possible, concurrent algorithms should be tested on
weakly-ordered hardware.
[C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf
[C++-model]: https://en.cppreference.com/w/cpp/atomic/memory_order

@@ -0,0 +1,257 @@
# Fences
As well as loads, stores, and RMWs, there is one more kind of atomic operation
to be aware of: fences. Fences can be triggered by the
[`core::sync::atomic::fence`] function, which accepts a single ordering
parameter and returns nothing. They don't do anything on their own, but can be
thought of as events that strengthen the ordering of nearby atomic operations.
## Acquire fences
The most common kind of fence is an _acquire fence_, which can be triggered in
three different ways:
1. `atomic::fence(atomic::Ordering::Acquire)`
1. `atomic::fence(atomic::Ordering::AcqRel)`
1. `atomic::fence(atomic::Ordering::SeqCst)`
An acquire fence retroactively makes every single non-`Acquire` operation that
was sequenced-before it act like an `Acquire` operation that occurred at the
fence — in other words, it causes every prior `Release`d value that was
previously loaded on the thread to synchronize-with the fence. For example, the
following code:
```rust
# use std::sync::atomic::{self, AtomicU32};
static X: AtomicU32 = AtomicU32::new(0);
// t_1
X.store(1, atomic::Ordering::Release);
// t_2
let value = X.load(atomic::Ordering::Relaxed);
atomic::fence(atomic::Ordering::Acquire);
```
Can result in two possible executions:
```text
Possible Execution 1 ┃ Possible Execution 2
t_1 X t_2 ┃ t_1 X t_2
╭───────╮ ┌───┐ ╭───────╮ ┃ ╭───────╮ ┌───┐ ╭───────╮
│ store ├─┐ │ 0 │ ┌─┤ load │ ┃ │ store ├─┐ │ 0 ├───┤ load │
╰───────╯ │ └───┘ │ ╰───╥───╯ ┃ ╰───────╯ │ └───┘ ╰───╥───╯
└─↘───┐ │ ╭───⇓───╮ ┃ └─↘───┐ ╭───⇓───╮
│ 1 ├─┘┌→ fence │ ┃ │ 1 │ │ fence │
└───┴──┘╰───────╯ ┃ └───┘ ╰───────╯
```
In the first execution, `t_1`'s store synchronizes-with and therefore
happens-before `t_2`'s fence due to the prior load, but note that it does _not_
happen-before `t_2`'s load.
Acquire fences work on any number of atomics, and on release sequences too. A
more complex example is as follows:
```rust
# use std::sync::atomic::{self, AtomicU32};
static X: AtomicU32 = AtomicU32::new(0);
static Y: AtomicU32 = AtomicU32::new(0);
// t_1
X.store(1, atomic::Ordering::Release);
X.fetch_add(1, atomic::Ordering::Relaxed);
// t_2
Y.store(1, atomic::Ordering::Release);
// t_3
let x = X.load(atomic::Ordering::Relaxed);
let y = Y.load(atomic::Ordering::Relaxed);
atomic::fence(atomic::Ordering::Acquire);
```
This can result in an execution like so:
```text
t_1 X t_3 Y t_2
╭───────╮ ┌───┐ ╭───────╮ ┌───┐ ╭───────╮
│ store ├─┐ │ 0 │ ┌─┤ load │ │ 0 │ ┌─┤ store │
╰───╥───╯ │ └───┘ │ ╰───╥───╯ └───┘ │ ╰───────╯
╭───⇓───╮ └─↘───┐ │ ╭───⇓───╮ ┌───↙─┘
│ rmw ├─┐ │ 1 │ │ │ load ├───┤ 1 │
╰───────╯ │ └─┬─┘ │ ╰───╥───╯ ┌─┴───┘
└─┬─↓─┐ │ ╭───⇓───╮ │
│ 2 ├─┘┌→ fence ←─┘
└───┴──┘╰───────╯
```
There are two common scenarios in which acquire fences are used:
1. When an `Acquire` ordering is only necessary when a specific value is loaded.
For example, you may only wish to acquire when an `initialized` boolean is
`true`, since otherwise you won't be reading the shared state at all. In
this case, you can load with a `Relaxed` ordering and then issue an
`Acquire` fence afterward only if that condition is met, which can aid in
performance sometimes (since the acquire operation is avoided when
`initialized == false`); a sketch of this pattern follows the list.
2. When several `Acquire` operations on different locations need to be performed
in a row, but individually each operation doesn't need `Acquire` ordering;
it is often faster to perform all the loads as `Relaxed` first and use a
single `Acquire` fence at the end than it is to make each one separately use
`Acquire`.
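A sketch of the first scenario might look like this (the `INITIALIZED` flag and
the surrounding logic are hypothetical):

```rust
# use std::sync::atomic::{self, AtomicBool};
static INITIALIZED: AtomicBool = AtomicBool::new(false);

// Cheap check first: no acquire semantics needed if we bail out here.
if INITIALIZED.load(atomic::Ordering::Relaxed) {
    // Only pay for the acquire on the path that will actually read the
    // shared state published before `INITIALIZED` was set to `true`.
    atomic::fence(atomic::Ordering::Acquire);
    // ... read the shared state ...
}
```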
## Release fences
Release fences are the natural complement to acquire fences, and they similarly
can be triggered in three different ways:
1. `atomic::fence(atomic::Ordering::Release)`
1. `atomic::fence(atomic::Ordering::AcqRel)`
1. `atomic::fence(atomic::Ordering::SeqCst)`
Release fences convert every subsequent atomic access in the same thread into a
release operation that has its arrow starting from the fence — in other words,
every `Acquire` operation that sees a value that was written by the fence's
thread after the release fence will synchronize-with the release fence. For
example, the following code:
```rust
# use std::sync::atomic::{self, AtomicU32};
static X: AtomicU32 = AtomicU32::new(0);
// t_1
atomic::fence(atomic::Ordering::Release);
X.store(1, atomic::Ordering::Relaxed);
// t_2
X.load(atomic::Ordering::Acquire);
```
Can result in this execution:
```text
t_1 X t_2
╭───────╮ ┌───┐ ╭───────╮
│ fence ├─┐ │ 0 │ ┌─→ load │
╰───╥───╯ │ └───┘ │ ╰───────╯
╭───⇓───╮ └─↘───┐ │
│ store ├───┤ 1 ├─┘
╰───────╯ └───┘
```
As well as it being possible for a release fence to synchronize-with an acquire
load (fence-atomic synchronization) and a release store to synchronize-with an
acquire fence (atomic-fence synchronization), it is also possible for release
fences to synchronize with acquire fences (fence-fence synchronization). In this
code snippet, only fences and `Relaxed` operations are used to establish a
happens-before relation (in some executions):
```rust
# use std::sync::atomic::{self, AtomicU32};
static X: AtomicU32 = AtomicU32::new(0);
// t_1
atomic::fence(atomic::Ordering::Release);
X.store(1, atomic::Ordering::Relaxed);
// t_2
X.load(atomic::Ordering::Relaxed);
atomic::fence(atomic::Ordering::Acquire);
```
The execution with the relation looks like this:
```text
t_1 X t_2
╭───────╮ ┌───┐ ╭───────╮
│ fence ├─┐ │ 0 │ ┌─┤ load │
╰───╥───╯ │ └───┘ │ ╰───╥───╯
╭───⇓───╮ └─↘───┐ │ ╭───⇓───╮
│ store ├───┤ 1 ├─┘┌→ fence │
╰───────╯ └───┴──┘╰───────╯
```
Like with acquire fences, release fences can be used to optimize over a series
of atomic stores that don't individually need to be `Release`, since in some
conditions and on some architectures it's faster to put a single release fence
at the start and use `Relaxed` from that point on than it is to use `Release`
every time.
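A sketch of that pattern, with illustrative statics (whether it actually wins
anything depends on the target):

```rust
# use std::sync::atomic::{self, AtomicU32};
static A: AtomicU32 = AtomicU32::new(0);
static B: AtomicU32 = AtomicU32::new(0);
static C: AtomicU32 = AtomicU32::new(0);

// A single release fence up front...
atomic::fence(atomic::Ordering::Release);
// ...means these stores can all be Relaxed: an acquire operation that reads
// one of the written values will synchronize-with the fence instead.
A.store(1, atomic::Ordering::Relaxed);
B.store(2, atomic::Ordering::Relaxed);
C.store(3, atomic::Ordering::Relaxed);
```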
## `AcqRel` fences
`AcqRel` fences are just the combined behaviour of an `Acquire` fence and a
`Release` fence in one operation. There isn't much special to note about them,
other than that they behave more like an acquire fence followed by a release
fence than the other way around, which is useful to know in situations like the
following:
```text
t_1 X t_2 Y t_3
╭───────╮ ┌───┐ ╭───────╮ ┌───┐ ╭───────╮
│ A │ │ 0 │ ┌─┤ load │ │ 0 │ ┌─→ load │
╰───╥───╯ └───┘ │ ╰───╥───╯ └───┘ │ ╰───╥───╯
╭───⇓───╮ ┌─↘───┐ │ ╭───⇓───╮┌──↘───┐ │ ╭───⇓───╮
│ store ├─┘ │ 1 ├─┘┌→ fence ├┘┌─┤ 1 ├─┘ │ B │
╰───────╯ └───┴──┘╰───╥───╯ │ └───┘ ╰───────╯
╭───⇓───╮ │
│ store ├─┘
╰───────╯
```
Here, A happens-before B, which is solely due to the `AcqRel` fence's
ability to “carry over” happens-before relations within itself.
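As a rough code rendering of that diagram (with `A` and `B` standing in for
whatever work needs to be ordered, and the orderings chosen to match the arrows
shown):

```rust
# use std::sync::atomic::{self, AtomicU32};
static X: AtomicU32 = AtomicU32::new(0);
static Y: AtomicU32 = AtomicU32::new(0);

// t_1
// ... A ...
X.store(1, atomic::Ordering::Release);

// t_2
let x = X.load(atomic::Ordering::Relaxed);
// Acquire side picks up t_1's release store (if the load saw it);
// release side publishes to whoever acquires the store below.
atomic::fence(atomic::Ordering::AcqRel);
Y.store(1, atomic::Ordering::Relaxed);

// t_3
let y = Y.load(atomic::Ordering::Acquire);
// ... B ...
```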
## `SeqCst` fences
`SeqCst` fences are the strongest kind of fence. They first of all inherit the
behaviour from an `AcqRel` fence, meaning they have both acquire and release
semantics at the same time, but being `SeqCst` operations they also participate
in _S_. Just as with all other `SeqCst` operations, their placement in _S_ is
primarily determined by strongly happens-before relations (including the
[mixed-`SeqCst` caveat] that comes with it), which then gives additional
guarantees to your code.
Namely, the power of `SeqCst` fences can be summarized in three points:
* Everything that happens-before a `SeqCst` fence is not coherence-ordered-after
any `SeqCst` operation that the fence precedes in _S_.
* Everything that happens-after a `SeqCst` fence is not coherence-ordered-before
any `SeqCst` operation that the fence succeeds in _S_.
* Everything that happens-before a `SeqCst` fence X is not
coherence-ordered-after anything that happens-after another `SeqCst` fence
Y, if X precedes Y in _S_.
> In C++11, the above three statements were similar, except they only talked
> about what was sequenced-before and sequenced-after the `SeqCst` fences; C++20
> strengthened this to also include happens-before, because in practice this
> theoretical optimization was not being exploited by anybody. However do note
> that as of the time of writing, [Miri only implements the old, weaker
> semantics][miri scfix] and so you may see false positives when testing with
> it.
The “motivating use-case” for `SeqCst` demonstrated in the `SeqCst` chapter can
also be rewritten to use exclusively `SeqCst` fences and `Relaxed` operations,
by inserting fences in between the operations in the two threads:
```text
a static X static Y b
╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮
│ store X ├─┐ │ false │ │ false │ ┌─┤ store Y │
╰────╥────╯ │ └───────┘ └───────┘ │ ╰────╥────╯
╭────⇓────╮ └─┬───────┐ ┌───────┬─┘ ╭────⇓────╮
*fence* │ │ true │ │ true │ │ *fence*
╰────╥────╯ └───────┘ └───────┘ ╰────╥────╯
╭────⇓────╮ ╭────⇓────╮
│ load Y ├─? ?─┤ load X │
╰─────────╯ ╰─────────╯
```
There are two executions to consider here, depending on which way round the
fences appear in _S_. Should `a`'s fence appear first, the fence-fence `SeqCst`
guarantee tells us that `b`'s load of `X` is not coherence-ordered-after `a`'s
store of `X`, which forbids `b`'s load of `X` from seeing the value `false`. The
same logic can be applied should the fences appear the other way around, proving
that at least one thread must load `true` in the end.
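Written out as code, that diagram corresponds to something like the following
sketch, with `a` and `b` as the two threads:

```rust
# use std::sync::atomic::{self, AtomicBool};
static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

// a
X.store(true, atomic::Ordering::Relaxed);
atomic::fence(atomic::Ordering::SeqCst);
let y = Y.load(atomic::Ordering::Relaxed);

// b
Y.store(true, atomic::Ordering::Relaxed);
atomic::fence(atomic::Ordering::SeqCst);
let x = X.load(atomic::Ordering::Relaxed);

// In every allowed execution, at least one of `x` and `y` is `true`.
```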
[`core::sync::atomic::fence`]: https://doc.rust-lang.org/stable/core/sync/atomic/fn.fence.html
[mixed-`SeqCst` caveat]: seqcst.md#the-mixed-seqcst-special-case
[miri scfix]: https://github.com/rust-lang/miri/issues/2301

@@ -0,0 +1,291 @@
# Multithreaded Execution
When you write Rust code to run on your computer, it may surprise you but you're
not actually writing Rust code to run on your computer — instead, you're writing
Rust code to run on the _abstract machine_ (or AM for short). The abstract
machine, to be contrasted with the physical machine, is an abstract
representation of a theoretical computer: it doesn't actually exist _per se_,
but the combination of a compiler, target architecture and target operating
system is capable of emulating a subset of its possible behaviours.
The Abstract Machine has a few properties that are essential to understand:
1. It is architecture- and OS-independent. The Abstract Machine doesn't care
whether you're on x86_64 or iOS or a Nintendo 3DS, the rules are the same
for everyone. This enables you to write code without having to think about
what the underlying system does or how it does it; as long as you obey the
Abstract Machine's rules, you know you'll be fine.
1. It is the lowest common denominator of all supported computer systems. This
means it is allowed to result in executions no sane computer would actually
generate in real life. It is also purposefully built with forward
compatibility in mind, giving compilers the opportunity to make better and
more aggressive optimizations in the future. As a result, it can be quite
hard to test code, especially if you're on a system that exploits fewer of
the AM's allowed semantics, so it is highly recommended to utilize tools
that intentionally produce these executions like [Loom] and [Miri].
1. Its model is highly formalized and not representative of what goes on
underneath. Because C++ needs to be defined by a formal specification and
not just hand-wavy rules about “this is what is allowed and this is what
isn't”, the Abstract Machine defines things in a very mathematical and,
well, _abstract_, way; instead of saying things like “the compiler is
allowed to do X” it will find a way to define the system such that the
compiler's ability to do X simply follows as a natural consequence. This
makes it very elegant and keeps the mathematicians happy, but you should
keep in mind that this is not how computers actually function, it is merely
a representation of it.
With that out of the way, let's look into how the C++20 Abstract Machine is
actually defined.
The first important thing to understand is that **the abstract machine has no
concept of time**. You might expect there to be a single global ordering of
events across the program where each happens at the same time or one after the
other, but under the abstract model no such ordering exists; instead, a possible
execution of the program must be treated as a single event that happens
instantaneously. There is never any such thing as “now”, or a “latest value”,
and using that terminology will only lead you to more confusion. Of course, in
reality there does exist a concept of time, but you must keep in mind that
you're not programming for the hardware, you're programming for the AM.
However, while no global ordering of operations exists _between_ threads, there
does exist a single total ordering _within_ each thread, which is known as its
_sequence_. For example, given this simple Rust program:
```rust
println!("A");
println!("B");
```
its sequence during one possible execution can be visualized like so:
```text
╭───────────────╮
│ println!("A") │
╰───────╥───────╯
╭───────⇓───────╮
│ println!("B") │
╰───────────────╯
```
That double arrow in between the two boxes (`⇒`) represents that the second
statement is _sequenced-after_ the first (and similarly the first statement is
_sequenced-before_ the second). This is the strongest kind of ordering guarantee
between any two operations, and only comes about when those two operations
happen one after the other and on the same thread.
If we add a second thread to the mix:
```rust
// Thread 1:
println!("A");
println!("B");
// Thread 2:
eprintln!("01");
eprintln!("02");
```
it will simply coexist in parallel, with each thread getting its own independent
sequence:
```text
Thread 1 Thread 2
╭───────────────╮ ╭─────────────────╮
│ println!("A") │ │ eprintln!("01") │
╰───────╥───────╯ ╰────────╥────────╯
╭───────⇓───────╮ ╭────────⇓────────╮
│ println!("B") │ │ eprintln!("02") │
╰───────────────╯ ╰─────────────────╯
```
We can say that the prints of `A` and `B` are _unsequenced_ with regard to the
prints of `01` and `02` that occur in the second thread, since they have no
sequenced-before arrows connecting the boxes together.
Note that these diagrams are **not** a representation of multiple things that
_could_ happen at runtime — instead, this diagram describes exactly what _did_
happen when the program ran once. This distinction is key, because it highlights
that even the lowest-level representation of a program's execution does not have
a global ordering between threads; those two disconnected chains are all there
is.
Now let's make things more interesting by introducing some shared data, and have
both threads read it.
```rust
// Initial state
let data = 0;
// Thread 1:
println!("{data}");
// Thread 2:
eprintln!("{data}");
```
Each memory location, similarly to threads, can be shown as another column on
our diagram, but holding values instead of instructions, and each access (read
or write) manifests as a line from the instruction that performed the access to
the associated value in the column. So this code can produce (and is in fact
guaranteed to produce) the following execution:
```text
Thread 1 data Thread 2
╭──────╮ ┌────┐ ╭──────╮
│ data ├╌╌╌╌┤ 0 ├╌╌╌╌┤ data │
╰──────╯ └────┘ ╰──────╯
```
That is, both threads read the same value of `0` from `data`, and the two
operations are unsequenced — they have no relative ordering between them.
That's reads done, so we'll look at the other kind of data access next: writes.
We'll also return to a single thread for now, just to keep things simple.
```rust
let mut data = 0;
data = 1;
```
Here, we have a single variable that the main thread writes to once — this means
that in its lifetime, it holds two values, first `0`, and then `1`.
Diagrammatically, this code's execution can be represented like so:
```text
Thread 1 data
╭───────╮ ┌────┐
│ = 1 ├╌╌╌┐ │ 0 │
╰───────╯ ├╌╌╌┼╌╌╌╌┤
└╌╌╌┼╌╌╌╌┤
│ 1 │
└────┘
```
Note the use of dashed padding in between the values of `data`'s column. Those
spaces won't ever contain a value, but they're used to represent an
unsynchronized (non-atomic) write — it is garbage data and attempting to read it
would result in a data race.
Now let's put all of our knowledge thus far together, and make a program that
both reads _and_ writes data — woah, scary!
```rust
let mut data = 0;
data = 1;
println!("{data}");
data = 2;
```
Working out executions of code like this is rather like solving a Sudoku puzzle:
you must first lay out all the facts that you know, and then fill in the blanks
with logical reasoning. The initial information we've been given is both the
initial value of `data` and the sequential order of Thread 1; we also know that
over its lifetime, `data` takes on a total of three different values that were
caused by two different non-atomic writes. This allows us to start drawing out
some boxes:
```text
Thread 1 data
╭───────╮ ┌────┐
│ = 1 ├╌? │ 0 │
╰───╥───╯ ?╌┼╌╌╌╌┤
╭───⇓───╮ ?╌┼╌╌╌╌┤
│ data ├╌? │ ? │
╰───╥───╯ ?╌┼╌╌╌╌┤
╭───⇓───╮ ?╌┼╌╌╌╌┤
│ = 2 ├╌? │ ? │
╰───────╯ └────┘
```
We know all of those lines need to be joined _somewhere_, but we don't quite
know _where_ yet. This is where we need to bring in our first rule, a rule that
universally governs all accesses to every location in memory:
> From the point at which the access occurs, find every other point that can be
> reached by following the reverse direction of arrows, then for each one of
> those, take a single step across every line that connects to the relevant
> memory location. **It is not allowed for the access to read or write any value
> that appears above any one of these points**.
In our case, there are two potential executions: one, where the first write
corresponds to the first value in `data`, and two, where the first write
corresponds to the second value in `data`. Considering the second case for a
moment, it would also force the second write to correspond to the first
value in `data`. Therefore its diagram would look something like this:
```text
Thread 1 data
╭───────╮ ┌────┐
│ = 1 ├╌╌┐ │ 0 │
╰───╥───╯ ┊ ┌╌╌┼╌╌╌╌┤
╭───⇓───╮ ┊ ├╌╌┼╌╌╌╌┤
│ data ├╌?┊ ┊ │ 2 │
╰───╥───╯ ├╌┼╌╌┼╌╌╌╌┤
╭───⇓───╮ └╌┼╌╌┼╌╌╌╌┤
│ = 2 ├╌╌╌╌┘ │ 1 │
╰───────╯ └────┘
```
However, that second line breaks the rule we just established! Following up the
arrows from the third operation in Thread 1, we reach the first operation, and
from there we can take a single step to reach the space in between the `2` and
the `1`, which excludes the third access from writing any value above that point
— including the `2` that it is currently writing!
So evidently, this execution is no good. We can therefore conclude that the only
possible execution of this program is the other one, in which the `1` appears
above the `2`:
```text
Thread 1 data
╭───────╮ ┌────┐
│ = 1 ├╌╌┐ │ 0 │
╰───╥───╯ ├╌╌┼╌╌╌╌┤
╭───⇓───╮ └╌╌┼╌╌╌╌┤
│ data ├╌? │ 1 │
╰───╥───╯ ┌╌╌┼╌╌╌╌┤
╭───⇓───╮ ├╌╌┼╌╌╌╌┤
│ = 2 ├╌╌┘ │ 2 │
╰───────╯ └────┘
```
Now to sort out the read operation in the middle. We can use the same rule as
before to trace up to the first write and rule out us reading either the `0`
value or the garbage that exists between it and `1`, but how do we choose
between the `1` and the `2`? Well, as it turns out there is a complement to the
rule we already defined which gives us the exact answer we need:
> From the point at which the access occurs, find every other point that can be
> reached by following the _forward_ direction of arrows, then for each one of
> those, take a single step across every line that connects to the relevant
> memory location. **It is not allowed for the access to read or write any value
> that appears below any one of these points**.
Using this rule, we can follow the arrow downwards and then across and finally
rule out `2` as well as the garbage before it. This leaves us with exactly _one_
value that the read operation can return, and exactly one possible execution
guaranteed by the Abstract Machine:
```text
Thread 1 data
╭───────╮ ┌────┐
│ = 1 ├╌╌┐ │ 0 │
╰───╥───╯ ├╌╌┼╌╌╌╌┤
╭───⇓───╮ └╌╌┼╌╌╌╌┤
│ data ├╌╌╌╌╌┤ 1 │
╰───╥───╯ ┌╌╌┼╌╌╌╌┤
╭───⇓───╮ ├╌╌┼╌╌╌╌┤
│ = 2 ├╌╌┘ │ 2 │
╰───────╯ └────┘
```
These two rules combined make up the more generalized rule known as _coherence_,
which is put in place to guarantee that a thread will never see a value earlier
than the last one it read, or later than one it will write in the future. Coherence
is basically required for any program to act in a sane way, so luckily the C++20
standard guarantees it as one of its most fundamental principles.
You might be thinking that all this has been the longest, most convoluted
explanation ever of the most basic intuitive semantics of programming — and
you'd be absolutely right. But it's essential to grasp these fundamentals,
because once you have this model in mind, the extension into multiple threads
and the complicated semantics of real atomics becomes completely natural.
[Loom]: https://docs.rs/loom
[Miri]: https://github.com/rust-lang/miri

@@ -0,0 +1,452 @@
# Relaxed
Now we've got single-threaded mutation semantics out of the way, we can try
reintroducing a second thread. We'll have one thread perform a write to the
memory location, and a second thread read from it, like so:
```rust
// Initial state
let mut data = 0;
// Thread 1:
data = 1;
// Thread 2:
println!("{data}");
```
Of course, any Rust programmer will immediately tell you that this code doesn't
compile, and indeed it definitely does not, and for good reason. But suspend
your disbelief for a moment, and imagine what would happen if it did. Let's draw
a diagram, leaving out the reading lines for now:
```text
Thread 1 data Thread 2
╭───────╮ ┌────┐ ╭───────╮
│ = 1 ├╌┐ │ 0 │ ?╌┤ data │
╰───────╯ ├╌┼╌╌╌╌┤ ╰───────╯
└╌┼╌╌╌╌┤
│ 1 │
└────┘
```
Unfortunately, coherence doesn't help us in finding out where Thread 2's line
joins up to, since there are no arrows connecting that operation to anything and
therefore we can't immediately rule any values out. As a result, we end up
facing a situation we haven't faced before: there is _more than one_ potential
value for Thread 2 to read.
And this is where we encounter the big limitation with unsynchronized data
accesses: the price we pay for their speed and optimization capability is that
this situation is considered **Undefined Behavior**. For an unsynchronized read
to be acceptable, there has to be _exactly one_ potential value for it to read,
and when there are multiple like in this situation it is considered a data race.
So what can we do about this? Well, two things need to be changed. First of all,
Thread 1 has to use an atomic store instead of an unsynchronized write, and
secondly Thread 2 has to use an atomic load instead of an unsynchronized read.
You'll also notice that all the atomic functions accept one (and sometimes two)
parameters of `atomic::Ordering`s — we'll explore the details of the differences
between them later, but for now we'll use `Relaxed` because it is by far the
simplest of the lot.
```rust
# use std::sync::atomic::{self, AtomicU32};
// Initial state
let data = AtomicU32::new(0);
// Thread 1:
data.store(1, atomic::Ordering::Relaxed);
// Thread 2:
data.load(atomic::Ordering::Relaxed);
```
The use of the atomic store provides one additional ability in comparison to an
unsynchronized store, and that is that there is no “in-between” state between
the old and new values — instead, it immediately updates, resulting in a diagram
that looks a bit more like this:
```text
Thread 1 data
╭───────╮ ┌────┐
│ = 1 ├─┐ │ 0 │
╰───────╯ │ └────┘
└─┬────┐
│ 1 │
└────┘
```
We have now established a _modification order_ for `data`: a total, ordered list
of distinct, separated values that it takes over its lifetime.
On the loading side, we also obtain one additional ability: when there are
multiple possible values to choose from in the modification order, instead of it
triggering UB, exactly one (but it is unspecified which) value is chosen. This
means that there are now _two_ potential executions of our program, with no way
for us to control which one occurs:
```text
Possible Execution 1 ┃ Possible Execution 2
Thread 1 data Thread 2 ┃ Thread 1 data Thread 2
╭───────╮ ┌────┐ ╭───────╮ ┃ ╭───────╮ ┌────┐ ╭───────╮
│ store ├─┐ │ 0 ├───┤ load │ ┃ │ store ├─┐ │ 0 │ ┌─┤ load │
╰───────╯ │ └────┘ ╰───────╯ ┃ ╰───────╯ │ └────┘ │ ╰───────╯
└─┬────┐ ┃ └─┬────┐ │
│ 1 │ ┃ │ 1 ├─┘
└────┘ ┃ └────┘
```
Note that **both sides must be atomic to avoid the data race**: if only the
writing side used atomic operations, the reading side would still have multiple
values to choose from (UB), and if only the reading side used atomic operations
it could end up reading the garbage data “in-between” `0` and `1` (also UB).
> **NOTE:** This description of why both sides need to be atomic
> operations, while neat and intuitive, is not strictly correct: in reality the
> answer is simply “because the spec says so”. However, it is functionally
> equivalent to the real rules, so it can aid in understanding.
## Read-modify-write operations
Loads and stores are pretty neat in avoiding data races, but you can't get very
far with them. For example, suppose you wanted to implement a global shared
counter that can be used to assign unique IDs to objects. Naïvely, you might try
to write code like this:
```rust
# use std::sync::atomic::{self, AtomicU64};
static COUNTER: AtomicU64 = AtomicU64::new(0);
pub fn get_id() -> u64 {
let value = COUNTER.load(atomic::Ordering::Relaxed);
COUNTER.store(value + 1, atomic::Ordering::Relaxed);
value
}
```
But then calling that function from multiple threads opens you up to an
execution like below that results in two threads obtaining the same ID (note
that the duplication of `1` in the modification order is intentional; even if
two values are the same, they always get separate entries in the order if they
were caused by different accesses):
```text
Thread 1 COUNTER Thread 2
╭───────╮ ┌───┐ ╭───────╮
│ load ├───┤ 0 ├───┤ load │
╰───╥───╯ └───┘ ╰────╥──╯
╭───⇓───╮ ┌─┬───┐ ╭────⇓──╮
│ store ├─┘ │ 1 │ ┌─┤ store │
╰───────╯ └───┘ │ ╰───────╯
┌───┬─┘
│ 1 │
└───┘
```
This is known as a **race condition** — a logic error in a program caused by a
specific unintended execution of concurrent code. Note that this is distinct
from a _data race_: while a data race is caused by two threads performing
unsynchronized operations at the same time and is always undefined behaviour,
race conditions are totally OK and defined behaviour from the AM's perspective,
but are only harmful because the programmer didn't expect them to be possible. You
can think of the distinction between the two as analogous to the difference
between indexing out-of-bounds and indexing in-bounds, but to the wrong element:
both are bugs, but only one is universally a bug, and the other is merely a
logic problem.
Technically, I believe it is _possible_ to solve this problem with just loads
and stores, if you try hard enough and use several atomics. But luckily, you
don't have to because there also exists another kind of operation, the
read-modify-write, which is specifically suited to this purpose.
A read-modify-write operation (shortened to RMW) is a special kind of atomic
operation that reads, changes, and writes back a value _in one step_. This
means that there are guaranteed to be no other values in the modification
order in between the read and the write; it happens as a single operation.
Note that this is true for **all** atomic orderings — a common misconception
is that the `Relaxed` ordering somehow negates this guarantee.
> Another common confusion about RMWs is that they are guaranteed to “see the
> latest value” of an atomic, which I believe came from a misinterpretation of
> the C++ specification and was later spread by rumour. Of course, this makes no
> sense, since atomics have no latest value due to the lack of the concept of
> time. The original statement in the specification was actually just specifying
> that atomic RMWs are atomic: they only consider the directly previous value in
> the modification order and not any value before it, and gave no additional
> guarantee.
There are many different RMW operations to choose from, but the one most
appropriate for this use case is `fetch_add`, which adds a number to the
atomic and returns the old value. So our code can be rewritten as this:
```rust
# use std::sync::atomic::{self, AtomicU64};
static COUNTER: AtomicU64 = AtomicU64::new(0);
pub fn get_id() -> u64 {
COUNTER.fetch_add(1, atomic::Ordering::Relaxed)
}
```
And then, no matter how many threads there are, that race condition from earlier
can never occur. Executions will have to look more like this:
```text
Thread 1 COUNTER Thread 2
╭───────────╮ ┌───┐ ╭───────────╮
│ fetch_add ├─┐ │ 0 │ ┌─┤ fetch_add │
╰───────────╯ │ └───┘ │ ╰───────────╯
└─┬───┐ │
│ 1 │ │
└───┘ │
┌───┬─┘
│ 2 │
└───┘
```
There is one problem with this code, however: if `get_id()` is called more
than 18446744073709551615 (`u64::MAX`) times, the counter will overflow and it
will start generating duplicate IDs. Of course, this won’t feasibly happen,
but it can be problematic if you need to _prove_ that it can’t happen (e.g.
for safety purposes) or if you’re using a smaller integer type like `u32`.
So we’re going to modify this function so that instead of returning a plain
`u64` it returns an `Option<u64>`, where `None` indicates that an overflow
occurred and no more IDs can be generated. Additionally, it’s not enough to
return `None` just once, because other threads will not see that result if it
only occurs on a single thread — instead, it needs to keep returning `None`
_until the end of time_ (or, well, until the end of this execution of the
program).
That means we have to do away with `fetch_add`, because `fetch_add` always
wraps on overflow and there’s no `checked_fetch_add` equivalent. Let’s return
to our racy algorithm for a minute, this time thinking more about what went
wrong. The steps look something like this (sketched in code after the list):
1. Load the current value of the atomic
1. Perform the checked add, propagating `None`
1. Store the new value back into the atomic
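As a rough sketch (still racy, just spelling out the three steps — this is not
the final code), that looks like:

```rust
# use std::sync::atomic::{self, AtomicU64};
static COUNTER: AtomicU64 = AtomicU64::new(0);

pub fn get_id() -> Option<u64> {
    // Step 1: load the current value of the atomic.
    let value = COUNTER.load(atomic::Ordering::Relaxed);
    // Step 2: perform the checked add, propagating `None`.
    let new_value = value.checked_add(1)?;
    // Step 3: store the new value back. Another thread may have already
    // appended to the modification order in between, so IDs can repeat.
    COUNTER.store(new_value, atomic::Ordering::Relaxed);
    Some(value)
}
```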
The problem here is that the store does not necessarily occur directly after
the load in the atomic’s modification order, which is what leads to the race.
What we need is some way to say, “add this new value to the modification
order, but _only if_ it occurs directly after the value we loaded”. And
luckily for us, there exists a function that does exactly\* this:
`compare_exchange`.
`compare_exchange` is a bit like a store, but instead of unconditionally storing
the value, it will first check the value directly before the `compare_exchange`
in the modification order to see whether it is what we expect, and if not it
will simply tell us that and not make any changes. It is an RMW operation, so
all of this happens fully atomically — there is no chance for a race condition.
> \* It’s not quite the same, because `compare_exchange` can suffer from ABA
> problems, in which it sees a later value in the modification order that just
> happens to be the same and succeeds. For example, if the modification order
> contained `1, 2, 1` and a thread loaded the first `1`,
> `compare_exchange(1, 3)` could succeed in replacing either the first or second
> `1`, giving either `1, 3, 2, 1` or `1, 2, 1, 3`.
>
> For some algorithms this is problematic and needs to be taken into account
> with additional checks; for us, however, values are never reused, so we don’t
> have to worry about it.
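Before using it for the counter, here is a small single-threaded sketch of
`compare_exchange` in isolation (just an illustration of its result type, not
part of the example):

```rust
# use std::sync::atomic::{self, AtomicU64};
let x = AtomicU64::new(0);

// The value directly before this operation in the modification order is 0,
// which is what we expected, so 1 is written and the old value is returned.
let first = x.compare_exchange(0, 1, atomic::Ordering::Relaxed, atomic::Ordering::Relaxed);
assert_eq!(first, Ok(0));

// Now the previous value is 1, not 0, so nothing is written and the value
// that was actually found comes back in the `Err`.
let second = x.compare_exchange(0, 2, atomic::Ordering::Relaxed, atomic::Ordering::Relaxed);
assert_eq!(second, Err(1));
```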
In our case, we can simply replace the store with a compare exchange of the old
value and itself plus one (returning `None` instead if the addition overflowed,
to prevent overflowing the atomic). Should the `compare_exchange` fail, we know
that some other thread inserted a value in the modification order after the
value we loaded. This isn’t really a problem — we can just try again and again
until we succeed, and `compare_exchange` is even nice enough to give us the
updated value so we don’t have to load it again. Also note that after we’ve
updated our value of the atomic, we’re guaranteed never to see the old value
again, by the coherence rules from the previous chapter.
So here’s how it looks with these changes applied:
```rust
# use std::sync::atomic::{self, AtomicU64};
static COUNTER: AtomicU64 = AtomicU64::new(0);
pub fn get_id() -> Option<u64> {
    // Load the counter’s initial value from some place in the modification
    // order (it doesn’t matter where, because the compare exchange makes sure
    // that our new value appears directly after it).
let mut value = COUNTER.load(atomic::Ordering::Relaxed);
loop {
// Attempt to add one to the atomic.
let res = COUNTER.compare_exchange(
value,
value.checked_add(1)?,
atomic::Ordering::Relaxed,
atomic::Ordering::Relaxed,
);
// Check what happened…
match res {
// If there was no value in between the value we loaded and our
// newly written value in the modification order, the compare
            // exchange succeeded and so we are done.
Ok(_) => break,
// Otherwise, there was a value in between and so we need to retry
// the addition and continue looping.
Err(updated_value) => value = updated_value,
}
}
Some(value)
}
```
This `compare_exchange` loop enables the algorithm to succeed even under
contention; it will simply try again (and again and again). In the execution
below, Thread 1 loses the race to store its value of `1` to the counter, but
that’s okay because it will just add `1` to the `1`, making `2`, and retry the
compare exchange with that, eventually resulting in a unique ID.
```text
Thread 1 COUNTER Thread 2
╭───────╮ ┌───┐ ╭───────╮
│ load ├───┤ 0 ├───┤ load │
╰───╥───╯ └───┘ ╰───╥───╯
╭───⇓───╮ ┌───┬─┐ ╭───⇓───╮
│ cas ├───┤ 1 │ └─┤ cas │
╰───╥───╯ └───┘ ╰───────╯
╭───⇓───╮ ┌─┬───┐
│ cas ├─┘ │ 2 │
╰───────╯ └───┘
```
> `compare_exchange` is abbreviated to CAS here (which stands for
> compare-and-swap), since that is the more general name for the operation. It
> is not to be confused with `compare_and_swap`, a deprecated method on Rust
> atomics that performs the same task as `compare_exchange` but has an inferior
> design in some ways.
There are two additional improvements we can make here. First, because our
algorithm occurs in a loop, it is actually perfectly fine for the CAS to fail
even when there wasn’t a value inserted in the modification order in between,
since we’ll just run it again. This allows us to switch out our call to
`compare_exchange` with a call to the weaker `compare_exchange_weak`, which,
unlike the former function, is allowed to fail _spuriously_ (i.e. randomly,
from the programmer’s perspective). This often results in better performance
on architectures like ARM, since their `compare_exchange` is really just a
loop around the underlying `compare_exchange_weak`. On x86\_64, however, there
will be no difference in performance.
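Applied to our counter, that first improvement is a one-word change; here is a
sketch of the resulting function (identical to the version above except for
the weaker operation):

```rust
# use std::sync::atomic::{self, AtomicU64};
static COUNTER: AtomicU64 = AtomicU64::new(0);

pub fn get_id() -> Option<u64> {
    let mut value = COUNTER.load(atomic::Ordering::Relaxed);
    loop {
        let res = COUNTER.compare_exchange_weak(
            value,
            value.checked_add(1)?,
            atomic::Ordering::Relaxed,
            atomic::Ordering::Relaxed,
        );
        match res {
            Ok(_) => break,
            // On failure — spurious or not — we are handed the value that was
            // actually found, and simply try again with it.
            Err(updated_value) => value = updated_value,
        }
    }
    Some(value)
}
```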
The second improvement is that this pattern is so common that the standard
library even provides a helper function for it, called `fetch_update`. It
implements the boilerplate `load`-`loop`-`match` parts for us, so all we have to
do is provide the closure that calls `checked_add(1)` and it will all just work.
This leads us to our final code for this example:
```rust
# use std::sync::atomic::{self, AtomicU64};
static COUNTER: AtomicU64 = AtomicU64::new(0);
pub fn get_id() -> Option<u64> {
COUNTER.fetch_update(
atomic::Ordering::Relaxed,
atomic::Ordering::Relaxed,
|value| value.checked_add(1),
)
.ok()
}
```
These CAS loops are the absolute bread and butter of concurrent programming;
they’re absolutely everywhere and essential to know about. Every other RMW
operation on atomics can be (and, if the hardware doesn’t have a more efficient
implementation, often is) implemented via a CAS loop. This is why CAS is seen
as the canonical example of an RMW — it’s pretty much the most fundamental
operation you can get on atomics.
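As an illustration of that claim, here is a rough sketch of how a
`fetch_add`-like operation could be built out of a CAS loop
(`fetch_add_via_cas` is a made-up name; the real `fetch_add` will typically
compile to a dedicated instruction where the hardware has one):

```rust
# use std::sync::atomic::{self, AtomicU64};
fn fetch_add_via_cas(atomic: &AtomicU64, n: u64) -> u64 {
    let mut value = atomic.load(atomic::Ordering::Relaxed);
    loop {
        // Try to place `value + n` directly after `value` in the modification
        // order; if something else got there first, retry with the new value.
        match atomic.compare_exchange_weak(
            value,
            value.wrapping_add(n),
            atomic::Ordering::Relaxed,
            atomic::Ordering::Relaxed,
        ) {
            Ok(old) => return old,
            Err(updated) => value = updated,
        }
    }
}

assert_eq!(fetch_add_via_cas(&AtomicU64::new(5), 3), 5);
```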
I’d also like to briefly draw attention to the atomic orderings used in this
section. They were mostly glossed over, but we were exclusively using `Relaxed`,
and that’s because for something as simple as a global ID counter, _you never
need more than `Relaxed`_. The more complex cases which we’ll look at later
definitely do need stronger orderings, but as a general rule, if:
- you only have one atomic, and
- you have no other related pieces of data,

then `Relaxed` is more than sufficient.
## “Out-of-thin-air” values
One peculiar consequence of the semantics of `Relaxed` operations is that it is
theoretically possible for values to come into existence “out-of-thin-air”
(commonly abbreviated to OOTA) — that is, a value could appear despite not ever
being calculated anywhere in code. In particular, consider this setup:
```rust
# use std::sync::atomic::{self, AtomicU32};
let x = AtomicU32::new(0);
let y = AtomicU32::new(0);
// Thread 1:
let r1 = y.load(atomic::Ordering::Relaxed);
x.store(r1, atomic::Ordering::Relaxed);
// Thread 2:
let r2 = x.load(atomic::Ordering::Relaxed);
y.store(r2, atomic::Ordering::Relaxed);
```
When starting to draw a diagram for a possible execution of this program, we
have to first lay out the basic facts that we know:
- `x` and `y` both start out as zero
- Thread 1 performs a load of `y` followed by a store of `x`
- Thread 2 performs a load of `x` followed by a store of `y`
- Each of `x` and `y` takes on exactly two values in its lifetime
Then we can start to construct boxes:
```text
Thread 1 x y Thread 2
╭───────╮ ┌───┐ ┌───┐ ╭───────╮
│ load ├─┐ │ 0 │ │ 0 │ ┌─┤ load │
╰───╥───╯ │ └───┘ └───┘ │ ╰───╥───╯
║ │ ?───────────┘ ║
╭───⇓───╮ └───────────? ╭───⇓───╮
│ store ├───┬───┐ ┌───┬───┤ store │
╰───────╯ │ ? │ │ ? │ ╰───────╯
└───┘ └───┘
```
At this point, if either of those lines were to connect to the higher box then
the execution would be simple: that thread would forward the value to its lower
box, and the other thread would then either read it or load the same value
(zero) from the box above, and we’d end up with zero in both atomics. But what
if both were to connect downwards? Then we’d end up with an execution that
looks like this:
```text
Thread 1 x y Thread 2
╭───────╮ ┌───┐ ┌───┐ ╭───────╮
│ load ├─┐ │ 0 │ │ 0 │ ┌─┤ load │
╰───╥───╯ │ └───┘ └───┘ │ ╰───╥───╯
║ │ ┌───────────┘ ║
╭───⇓───╮ └───┼───────┐ ╭───⇓───╮
│ store ├───┬─┴─┐ ┌─┴─┬───┤ store │
╰───────╯ │ ? │ │ ? │ ╰───────╯
└───┘ └───┘
```
But hang on — it’s not fully resolved yet: we still haven’t put a value in
those lower question marks. So what value should it be? Well, the second value
of `x` is just copied from the second value of `y`, so we just have to find
the value of that — but the second value of `y` is itself copied from the
second value of `x`! This means that we can actually put any value we like in
that box, including `0` or `42`, and the logic will check out perfectly fine —
meaning that if this program were to execute in this fashion, it would end up
reading a value produced out of thin air!
Now, if we were to strictly follow the rules we’ve laid out thus far, this
would be a totally valid thing to happen. But luckily, the authors of the C++
specification recognized this as a problem, and so refined the semantics of
`Relaxed` to implement a thorough, logically sound, mathematically proven
formal model that prevents it, one that’s just too complex and technical to
explain here—
> No “out-of-thin-air” values can be computed that circularly depend on their
> own computations.
Just kidding. It turns out that it’s a *really* difficult problem to solve, and
to my knowledge there is still no known formal way to express how to prevent
it. So in the specification they just kind of hand-wave and say that it
shouldn’t happen, and that the above program must always give zero in both
atomics, despite the theoretical execution that could result in something else.
Well, it generally works in practice, so I can’t complain — it’s just a very
interesting detail to know about.
# SeqCst
`SeqCst` is probably the most interesting ordering, because it is simultaneously
the simplest and the most complex atomic memory ordering in existence. It’s
simple, because if you only use `SeqCst` everywhere then you can kind of maybe
pretend the Abstract Machine has a concept of time; phrases like “latest value”
make sense, the program can be thought of as a set of interleaving steps, there
is a universal “now” and “before” — and wouldn’t that be nice? But it’s also the
most complex, because as soon as you look under the hood you realize just how
incredibly convoluted and hard to follow the actual rules behind it are, and it
gets really ugly really fast as soon as you try to mix it with any other
ordering.
To understand `SeqCst`, we first have to understand the problem it exists to
solve. A simple example used to show where weaker orderings produce
counterintuitive results is this:
```rust
# use std::sync::atomic::{self, AtomicBool};
use std::thread;
// Set this to Relaxed, Acquire, Release, or AcqRel — it doesn’t matter, the result is
// the same (modulo panics caused by attempting acquire stores or release
// loads).
const ORDERING: atomic::Ordering = atomic::Ordering::Relaxed;
static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);
let a = thread::spawn(|| { X.store(true, ORDERING); Y.load(ORDERING) });
let b = thread::spawn(|| { Y.store(true, ORDERING); X.load(ORDERING) });
let a = a.join().unwrap();
let b = b.join().unwrap();
# return;
// This assert is allowed to fail.
assert!(a || b);
```
The basic setup of this code, for all of its possible executions, looks like
this:
```text
a static X static Y b
╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮
│ store X ├─┐ │ false │ │ false │ ┌─┤ store Y │
╰────╥────╯ │ └───────┘ └───────┘ │ ╰────╥────╯
╭────⇓────╮ └─┬───────┐ ┌───────┬─┘ ╭────⇓────╮
│ load Y ├─? │ true │ │ true │ ?─┤ load X │
╰─────────╯ └───────┘ └───────┘ ╰─────────╯
```
In other words, `a` and `b` are guaranteed to store `true` into `X` and `Y`
respectively, and then attempt to load from the other thread’s atomic. The
question now is: is it possible for them _both_ to load `false`?
And looking at this diagram, there’s absolutely no reason why not. There isn’t
even a single arrow connecting the left- and right-hand sides so far, so the
loads have no coherence-based restrictions on which values they are allowed to
pick, and we could end up with an execution like this:
```text
a static X static Y b
╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮
│ store X ├┐ │ false ├─┐┌┤ false │ ┌┤ store Y │
╰────╥────╯│ └───────┘┌─┘└───────┘ │╰────╥────╯
║ │ ┌─────────┘└───────────┐│ ║
╭────⇓────╮└─│┬───────┐ ┌───────┬─│┘╭────⇓────╮
│ load Y ├──┘│ true │ │ true │ └─┤ load X │
╰─────────╯ └───────┘ └───────┘ ╰─────────╯
```
This results in a failed assert. The execution is possible because the model
of separate modification orders means that there is no agreed-upon order in
which `X` and `Y` changed, and so each thread is allowed to “see” either
order. However, some algorithms will require a globally agreed-upon ordering,
and this is where `SeqCst` can come in useful.
This ordering, first and foremost, inherits the guarantees from all the other
orderings — it is an acquire operation for loads, a release operation for
stores, and an acquire-release operation for RMWs. In addition to this, it
gives some guarantees unique to `SeqCst` about what values it is allowed to
load. Note that these guarantees are not about preventing data races: they
only apply to other `SeqCst` operations rather than to all data accesses, so
unless some unrelated code triggers a data race under an unexpected condition,
`SeqCst` can only save you from race conditions.
## S
`SeqCst` is fundamentally about _S_, which is the global ordering of all
`SeqCst` operations in an execution of the program. It is consistent between
every atomic and every thread, and all stores, fences and RMWs that use a
sequentially consistent ordering have a place in it (but no other operations
do). It is in contrast to modification orders, which are similarly total but
only scoped to a single atomic rather than the whole program.
Other than an edge case involving `SeqCst` mixed with weaker orderings (detailed
later on), _S_ is primarily controlled by the happens-before relations in a
program: this means that if an action _A_ happens-before an action _B_, it is
also guaranteed to appear before _B_ in _S_. Other than that restriction, _S_ is
unspecified and will be chosen arbitrarily during execution.
Once a particular _S_ has been established, every atomic’s modification order
is then guaranteed to be consistent with it, so a `SeqCst` load will never see
a value that has been overwritten by a write that occurred before it in _S_,
or a value that has been written by a write that occurred after it in _S_
(note that a `Relaxed`/`Acquire` load, however, might, since there is no
“before” or “after” as it is not in _S_ in the first place).
More formally, this guarantee can be described with _coherence orderings_, a
relation that expresses which of two operations appears before the other in an
atomic’s modification order. It is said that an operation _A_ is
_coherence-ordered-before_ another operation _B_ if any of the following
conditions are met:
1. _A_ is a store or RMW, _B_ is a store or RMW, and _A_ appears before _B_ in
the modification order.
1. _A_ is a store or RMW, _B_ is a load, and _B_ reads the value stored by _A_.
1. _A_ is a load, _B_ is a store or RMW, and _A_ takes its value from a place in
the modification order that appears before _B_.
1. _A_ is coherence-ordered-before a different operation _X_, and _X_ is
coherence-ordered-before _B_ (the basic transitivity property).
The following diagram gives examples for the main three rules (in each case _A_
is coherence-ordered-before _B_):
```text
Rule 1 ┃ Rule 2 ┃ Rule 3
┃ ┃
╭───╮ ┌─┬───┐ ╭───╮ ┃ ╭───╮ ┌─┬───┐ ╭───╮ ┃ ╭───╮ ┌───┐ ╭───╮
│ A ├─┘ │ │ ┌─┤ B │ ┃ │ A ├─┘ │ ├───┤ B │ ┃ │ A ├───┤ │ ┌─┤ B │
╰───╯ └───┘ │ ╰───╯ ┃ ╰───╯ └───┘ ╰───╯ ┃ ╰───╯ └───┘ │ ╰───╯
┌───┬─┘ ┃ ┃ ┌───┬─┘
│ │ ┃ ┃ │ │
└───┘ ┃ ┃ └───┘
```
The only important thing to note is that for two loads of the same value in the
modification order, neither is coherence-ordered-before the other, as in the
following example where _A_ has no coherence ordering relation to _B_:
```text
╭───╮ ┌───┐ ╭───╮
│ A ├───┤ ├───┤ B │
╰───╯ └───┘ ╰───╯
```
Because of this, “_A_ is coherence-ordered-before _B_” is subtly different from
“_A_ is not coherence-ordered-after _B_”; only the latter phrase includes the
above situation, and it is synonymous with “either _A_ is coherence-ordered-before
_B_, or _A_ and _B_ are both loads and see the same value in the atomic’s
modification order”. “Not coherence-ordered-after” is generally a more useful
relation than “coherence-ordered-before”, so it’s important to understand what
it means.
With this terminology applied, we can give a more precise definition of
`SeqCst`’s guarantee: for two `SeqCst` operations _A_ and _B_ on the same
atomic, where _A_ precedes _B_ in _S_, _A_ is not coherence-ordered-after _B_.
Effectively, this one rule ensures that _S_’s order “propagates”
throughout all the atomics of the program — you can imagine each operation in
_S_ as storing a snapshot of the world, so that every subsequent operation is
consistent with it.
## Applying `SeqCst`
So, looking back at our program, let’s consider how we could use `SeqCst` to
make that execution invalid. As a refresher, here’s the framework for every
possible execution of the program:
```text
a static X static Y b
╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮
│ store X ├─┐ │ false │ │ false │ ┌─┤ store Y │
╰────╥────╯ │ └───────┘ └───────┘ │ ╰────╥────╯
╭────⇓────╮ └─┬───────┐ ┌───────┬─┘ ╭────⇓────╮
│ load Y ├─? │ true │ │ true │ ?─┤ load X │
╰─────────╯ └───────┘ └───────┘ ╰─────────╯
```
First of all, both the final loads (`a`’s and `b`’s second operations) need to
become `SeqCst`, because they need to be aware of the total ordering that
determines whether `X` or `Y` becomes `true` first. And secondly, we need to
establish that ordering in the first place. This is done by making sure that
there is always one operation in _S_ that both sees one of the atomics as
`true` and precedes both final loads in _S_, so that the coherence-ordering
guarantee will apply (the final loads themselves don’t work for this: although
they “know” that their corresponding atomic is `true`, they don’t interact
with it directly, so _S_ doesn’t care). For this, we must make both stores use
the `SeqCst` ordering.
This leaves us with the correct version of the above program, which is
guaranteed to never panic:
```rust
# use std::sync::atomic::{self, AtomicBool};
use std::thread;
const ORDERING: atomic::Ordering = atomic::Ordering::SeqCst;
static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);
let a = thread::spawn(|| { X.store(true, ORDERING); Y.load(ORDERING) });
let b = thread::spawn(|| { Y.store(true, ORDERING); X.load(ORDERING) });
let a = a.join().unwrap();
let b = b.join().unwrap();
# return;
// This assert is **not** allowed to fail.
assert!(a || b);
```
As there are four `SeqCst` operations, with a partial order between the two
pairs (caused by the sequenced-before relation), there are six possible
executions of this program:
- All of `a`’s operations precede `b`’s operations:
  1. `a` stores `true` into `X`
  1. `a` loads `Y` (gives `false`)
  1. `b` stores `true` into `Y`
  1. `b` loads `X` (required to give `true`)
- All of `b`’s operations precede `a`’s operations:
  1. `b` stores `true` into `Y`
  1. `b` loads `X` (gives `false`)
  1. `a` stores `true` into `X`
  1. `a` loads `Y` (required to give `true`)
- The stores precede the loads,
  `a`’s store precedes `b`’s and `a`’s load precedes `b`’s:
  1. `a` stores `true` into `X`
  1. `b` stores `true` into `Y`
  1. `a` loads `Y` (required to give `true`)
  1. `b` loads `X` (required to give `true`)
- The stores precede the loads,
  `a`’s store precedes `b`’s and `b`’s load precedes `a`’s:
  1. `a` stores `true` into `X`
  1. `b` stores `true` into `Y`
  1. `b` loads `X` (required to give `true`)
  1. `a` loads `Y` (required to give `true`)
- The stores precede the loads,
  `b`’s store precedes `a`’s and `a`’s load precedes `b`’s:
  1. `b` stores `true` into `Y`
  1. `a` stores `true` into `X`
  1. `a` loads `Y` (required to give `true`)
  1. `b` loads `X` (required to give `true`)
- The stores precede the loads,
  `b`’s store precedes `a`’s and `b`’s load precedes `a`’s:
  1. `b` stores `true` into `Y`
  1. `a` stores `true` into `X`
  1. `b` loads `X` (required to give `true`)
  1. `a` loads `Y` (required to give `true`)
In all the places where the load was required to give `true`, that requirement
came from a store of `true` to the same atomic preceding the load in _S_ —
otherwise, the load would be coherence-ordered-before a store that precedes it
in _S_, and that is impossible.
## The mixed-`SeqCst` special case
As I’ve been alluding to for a while, I wasn’t being totally truthful when I
said that _S_ is consistent with happens-before relations — in reality, it is
only consistent with _strongly happens-before_ relations, which is a
subtly-defined subset of happens-before relations. In particular, it excludes
two situations:
1. The `SeqCst` operation A synchronizes-with an `Acquire` or `AcqRel` operation
B which is sequenced-before another `SeqCst` operation C. Here, despite the
fact that A happens-before C, A does not _strongly_ happen-before C and so is
not guaranteed to precede C in _S_.
2. The `SeqCst` operation A is sequenced-before the `Release` or `AcqRel`
operation B, which synchronizes-with another `SeqCst` operation C. Similarly,
despite the fact that A happens-before C, A might not precede C in _S_.
The first situation is illustrated below, with `SeqCst` accesses represented
with asterisks:
```text
t_1 x t_2
╭─────╮ ┌─↘───┐ ╭─────╮
*A* ├─┘ │ 1 ├───→ B │
╰─────╯ └───┘ ╰──╥──╯
╭──⇓──╮
*C*
╰─────╯
```
A happens-before, but does not strongly happen-before, C — and anything
sequenced-after C will get the same treatment (unless more synchronization is
used). This means that C is actually allowed to _precede_ A in _S_, despite
conceptually occurring after it. However, anything sequenced-before A _will_
strongly happen-before C, because there is then a sequenced-before edge on
each side of the synchronization.
But this is all highly theoretical at the moment, so let’s construct an
example to show how that rule can actually affect the execution of code. If C
were to precede A in _S_ (and they are not both loads), that would mean C is
always coherence-ordered-before A. Let’s say, then, that C loads from `x` (the
atomic that A has to access): if C preceded A in _S_, it could load the value
that came before A:
```text
t_1 x t_2
╭─────╮ ┌───┐ ╭─────╮
*A* ├─┐ │ 0 ├─┐┌→ B │
╰─────╯ │ └───┘ ││╰──╥──╯
└─↘───┐┌─┘╭──⇓──╮
│ 1 ├┘└─→ *C*
└───┘ ╰─────╯
```
Ah wait, no — that doesn’t work, because regular coherence still mandates that
`1` is the only value that can be loaded. In fact, once `1` is loaded, _S_’s
required consistency with coherence orderings means that A _is_ required to
precede C in _S_ after all.
So somehow, to observe this difference, we need a _different_ `SeqCst`
operation — let’s call it E — to be the one that loads from `x`, where C is
guaranteed to precede E in _S_ (so we can observe the “weird” state in between
C and A) but C also doesn’t happen-before E (to avoid coherence getting in the
way). To do that, all we have to do is have C appear before a `SeqCst`
operation D in the modification order of another atomic, but have D be a store
so as to avoid C synchronizing with it; our desired load E can then simply be
sequenced-after D (this carries over the “precedes in _S_” guarantee, but does
not restore the happens-after relation to C, since that was already dropped by
having D be a store).
In diagram form, that looks like this:
```text
t_1 x t_2 helper t_3
╭─────╮ ┌───┐ ╭─────╮ ┌─────┐ ╭─────╮
*A* ├─┐ │ 0 ├┐┌─→ B │ ┌─┤ 0 │ ┌─┤ *D*
╰─────╯ │ └───┘││ ╰──╥──╯ │ └─────┘ │ ╰──╥──╯
│ └│────║────│─────────│┐ ║
└─↘───┐ │ ╭──⇓──╮ │ ┌─────↙─┘│╭──⇓──╮
│ 1 ├─┘ │ *C* ←─┘ │ 1 │ └→ *E*
└───┘ ╰─────╯ └─────┘ ╰─────╯
S = C → D → E → A
```
C is guaranteed to precede D in _S_, and D is guaranteed to precede E, but
because this exception means that A is _not_ guaranteed to precede C, A can
perfectly well come at the end, resulting in the surprising but totally valid
outcome of E loading `0` from `x`. In code, this can be expressed as the
following program _not_ being guaranteed to panic:
```rust
# use std::sync::atomic::{AtomicU8, Ordering::{Acquire, SeqCst}};
# return;
static X: AtomicU8 = AtomicU8::new(0);
static HELPER: AtomicU8 = AtomicU8::new(0);
// thread_1
X.store(1, SeqCst); // A
// thread_2
assert_eq!(X.load(Acquire), 1); // B
assert_eq!(HELPER.load(SeqCst), 0); // C
// thread_3
HELPER.store(1, SeqCst); // D
assert_eq!(X.load(SeqCst), 0); // E
```
The second situation listed above has very similar consequences. Its abstract
form is the following execution in which A is not guaranteed to precede C in
_S_, despite A happening-before C:
```text
t_1 x t_2
╭─────╮ ┌─↘───┐ ╭─────╮
*A* │ │ │ 0 ├───→ *C*
╰──╥──╯ │ └───┘ ╰─────╯
╭──⇓──╮ │
│ B ├─┘
╰─────╯
```
Similarly to before, we can’t just have A access `x` to show why A not
necessarily preceding C in _S_ matters; instead, we have to introduce a second
atomic and a third thread to break the happens-before chain first. And finally,
a single relaxed load F at the end is added just to prove that the weird
execution actually happened (leaving `x` as 2 instead of 1).
```text
t_3 helper t_1 x t_2
╭─────╮ ┌─────┐ ╭─────╮ ┌───┐ ╭─────╮
*D* ├┐┌─┤ 0 │ ┌─┤ *A* │ │ 0 │ ┌─→ *C*
╰──╥──╯││ └─────┘ │ ╰──╥──╯ └───┘ │ ╰──╥──╯
║ └│─────────│────║─────┐ │ ║
╭──⇓──╮ │ ┌─────↙─┘ ╭──⇓──╮ ┌─↘───┐ │ ╭──⇓──╮
*E* ←─┘ │ 1 │ │ B ├─┘││ 1 ├─┘┌┤ F │
╰─────╯ └─────┘ ╰─────╯ │└───┘ │╰─────╯
└↘───┐ │
│ 2 ├──┘
└───┘
S = C → D → E → A
```
This execution mandates both C preceding A in _S_ and A happening-before C,
something that is only possible through these two mixed-`SeqCst` special
exceptions. It can be expressed in code as well:
```rust
# use std::sync::atomic::{AtomicU8, Ordering::{Release, Relaxed, SeqCst}};
# return;
static X: AtomicU8 = AtomicU8::new(0);
static HELPER: AtomicU8 = AtomicU8::new(0);
// thread_3
X.store(2, SeqCst); // D
assert_eq!(HELPER.load(SeqCst), 0); // E
// thread_1
HELPER.store(1, SeqCst); // A
X.store(1, Release); // B
// thread_2
assert_eq!(X.load(SeqCst), 1); // C
assert_eq!(X.load(Relaxed), 2); // F
```
If this seems ridiculously specific and obscure, that’s because it is.
Originally, back in C++11, this special case didn’t exist — but then six years
later it was discovered that in practice atomics on Power, Nvidia GPUs, and
sometimes ARMv7 _would_ have this special case, and fixing the implementations
would make atomics significantly slower. So instead, in C++20 they simply
encoded it into the specification.
Generally, however, this rule is so complex that it’s best to just avoid it
entirely by never mixing `SeqCst` and non-`SeqCst` operations on a single
atomic in the first place.