% Atomics

Rust pretty blatantly just inherits C11's memory model for atomics. This is not
due to this model being particularly excellent or easy to understand. Indeed,
this model is quite complex and known to have [several flaws][C11-busted].
Rather, it is a pragmatic concession to the fact that *everyone* is pretty bad
at modeling atomics. At the very least, we can benefit from existing tooling and
research around
C.

Trying to fully explain the model in this book is fairly hopeless. It's defined
in terms of madness-inducing causality graphs that require a full book to
properly understand in a practical way. If you want all the nitty-gritty
details, you should check out [C's specification (Section 7.17)][C11-model].
Still, we'll try to cover the basics and some of the problems Rust developers
face.

The C11 memory model is fundamentally about trying to bridge the gap between the
semantics we want, the optimizations compilers want, and the inconsistent chaos
our hardware wants. *We* would like to just write programs and have them do
exactly what we said but, you know, *fast*. Wouldn't that be great?


# Compiler Reordering

Compilers fundamentally want to be able to do all sorts of crazy transformations
to reduce data dependencies and eliminate dead code. In particular, they may
radically change the actual order of events, or make events never occur! If we
write something like

```rust,ignore
x = 1;
y = 3;
x = 2;
```

The compiler may conclude that it would *really* be best if your program did

```rust,ignore
x = 2;
y = 3;
```

This has inverted the order of events *and* completely eliminated one event.
From a single-threaded perspective this is completely unobservable: after all
the statements have executed we are in exactly the same state. But if our
program is multi-threaded, we may have been relying on `x` to *actually* be
assigned to 1 before `y` was assigned. We would *really* like the compiler to be
able to make these kinds of optimizations, because they can seriously improve
performance. On the other hand, we'd really like to be able to depend on our
program *doing the thing we said*.


# Hardware Reordering

On the other hand, even if the compiler totally understood what we wanted and
respected our wishes, our *hardware* might instead get us in trouble. Trouble
comes from CPUs in the form of memory hierarchies. There is indeed a global
shared memory space somewhere in your hardware, but from the perspective of each
CPU core it is *so very far away* and *so very slow*. Each CPU would rather work
with its local cache of the data and only go through all the *anguish* of
talking to shared memory when it doesn't actually have that memory in
cache.

After all, that's the whole *point* of the cache, right? If every read from the
cache had to run back to shared memory to double check that it hadn't changed,
what would the point be? The end result is that the hardware doesn't guarantee
that events that occur in some order on *one* thread are observed in the same
order on *another* thread. To guarantee this, we must issue special instructions
to the CPU telling it to be a bit less smart.

For instance, say we convince the compiler to emit this logic:

```text
initial state: x = 0, y = 1

THREAD 1        THREAD 2
y = 3;          if x == 1 {
x = 1;              y *= 2;
                }
```

Ideally this program has 2 possible final states:

* `y = 3`: (thread 2 did the check before thread 1 completed)
* `y = 6`: (thread 2 did the check after thread 1 completed)

However there's a third potential state that the hardware enables:

* `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`)
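
As a rough sketch of how the unwanted third state can be ruled out, the example above can be written with Rust atomics, using the Release and Acquire orderings described later in this chapter. The helper name `run_once` and the choice of `AtomicUsize` are assumptions made for illustration only.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Runs the two-thread example once and returns the final value of `y`.
/// (`run_once` is a hypothetical helper, not part of any library.)
fn run_once() -> usize {
    let x = Arc::new(AtomicUsize::new(0));
    let y = Arc::new(AtomicUsize::new(1));

    let t1 = {
        let (x, y) = (Arc::clone(&x), Arc::clone(&y));
        thread::spawn(move || {
            y.store(3, Ordering::Relaxed);
            // Release: the write to `y` above stays before this store.
            x.store(1, Ordering::Release);
        })
    };

    let t2 = {
        let (x, y) = (Arc::clone(&x), Arc::clone(&y));
        thread::spawn(move || {
            // Acquire: if we observe `x == 1`, we also observe `y == 3`.
            if x.load(Ordering::Acquire) == 1 {
                let seen = y.load(Ordering::Relaxed);
                y.store(seen * 2, Ordering::Relaxed);
            }
        })
    };

    t1.join().unwrap();
    t2.join().unwrap();
    y.load(Ordering::Relaxed)
}

fn main() {
    for _ in 0..1000 {
        let y = run_once();
        // Only the two "ideal" final states should be reachable.
        assert!(y == 3 || y == 6, "unexpected final state: y = {}", y);
    }
}
```

With this pairing, thread 2 only touches `y` after it has observed thread 1's store to `x`, so it cannot double a stale `y = 1`.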

It's worth noting that different kinds of CPU provide different guarantees. It
is common to separate hardware into two categories: strongly-ordered and
weakly-ordered. Most notably x86/64 provides strong ordering guarantees, while
ARM provides weak ordering guarantees. This has two consequences for concurrent
programming:

* Asking for stronger guarantees on strongly-ordered hardware may be cheap or
  even *free* because it already provides strong guarantees unconditionally.
  Weaker guarantees may only yield performance wins on weakly-ordered hardware.

* Asking for guarantees that are *too* weak on strongly-ordered hardware is
  more likely to *happen* to work, even though your program is strictly
  incorrect. If possible, concurrent algorithms should be tested on
  weakly-ordered hardware.


# Data Accesses

The C11 memory model attempts to bridge the gap by allowing us to talk about the
*causality* of our program. Generally, this is by establishing *happens-before*
relationships between parts of the program and the threads that are
running them. This gives the hardware and compiler room to optimize the program
more aggressively where a strict happens-before relationship isn't established,
but forces them to be more careful where one *is* established. The way we
communicate these relationships is through *data accesses* and *atomic
accesses*.

Data accesses are the bread-and-butter of the programming world. They are
fundamentally unsynchronized and compilers are free to aggressively optimize
them. In particular, data accesses are free to be reordered by the compiler on
the assumption that the program is single-threaded. The hardware is also free to
propagate the changes made in data accesses to other threads as lazily and
inconsistently as it wants. Most critically, data accesses are how data races
happen. Data accesses are very friendly to the hardware and compiler, but as
we've seen they offer *awful* semantics to try to write synchronized code with.
Actually, that's too weak. *It is literally impossible to write correct
synchronized code using only data accesses*.

Atomic accesses are how we tell the hardware and compiler that our program is
multi-threaded. Each atomic access can be marked with an *ordering* that
specifies what kind of relationship it establishes with other accesses. In
practice, this boils down to telling the compiler and hardware certain things
they *can't* do. For the compiler, this largely revolves around re-ordering of
instructions. For the hardware, this largely revolves around how writes are
propagated to other threads. The orderings Rust exposes are:

* Sequentially Consistent (SeqCst)
* Release
* Acquire
* Relaxed

(Note: We explicitly do not expose the C11 *consume* ordering)

TODO: negative reasoning vs positive reasoning? TODO: "can't forget to
synchronize"


# Sequentially Consistent

Sequentially Consistent is the most powerful of all, implying the restrictions
of all other orderings. Intuitively, a sequentially consistent operation
*cannot* be reordered: all accesses on one thread that happen before and after a
SeqCst access *stay* before and after it. A data-race-free program that uses
only sequentially consistent atomics and data accesses has the very nice
property that there is a single global execution of the program's instructions
that all threads agree on. This execution is also particularly nice to reason
about: it's just an interleaving of each thread's individual executions. This
*does not* hold if you start using the weaker atomic orderings.
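
A small sketch of what this global order buys us is the classic "store buffering" test: each thread sets its own flag and then reads the other's. The function name `store_buffer_once` is made up for this illustration. Under `SeqCst`, every execution is some interleaving of the two threads, so at least one of them must see the other's store.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Returns what each thread observed in the *other* thread's flag.
/// (`store_buffer_once` is a hypothetical helper for this sketch.)
fn store_buffer_once() -> (bool, bool) {
    let a = Arc::new(AtomicBool::new(false));
    let b = Arc::new(AtomicBool::new(false));

    let t1 = {
        let (a, b) = (Arc::clone(&a), Arc::clone(&b));
        thread::spawn(move || {
            a.store(true, Ordering::SeqCst);
            b.load(Ordering::SeqCst)
        })
    };
    let t2 = {
        let (a, b) = (Arc::clone(&a), Arc::clone(&b));
        thread::spawn(move || {
            b.store(true, Ordering::SeqCst);
            a.load(Ordering::SeqCst)
        })
    };

    (t1.join().unwrap(), t2.join().unwrap())
}

fn main() {
    for _ in 0..1000 {
        let (saw_b, saw_a) = store_buffer_once();
        // In any interleaving, whichever load runs last sees `true`.
        assert!(saw_a || saw_b);
    }
}
```

If the four accesses were weakened to Release stores and Acquire loads, the outcome where both threads read `false` would be permitted, which is exactly the kind of guarantee that is lost with the weaker orderings.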

The relative developer-friendliness of sequential consistency doesn't come for
free. Even on strongly-ordered platforms sequential consistency involves
emitting memory fences.

In practice, sequential consistency is rarely necessary for program correctness.
However sequential consistency is definitely the right choice if you're not
confident about the other memory orders. Having your program run a bit slower
than it needs to is certainly better than it running incorrectly! It's also
*mechanically* trivial to downgrade atomic operations to have a weaker
consistency later on. Just change `SeqCst` to e.g. `Relaxed` and you're done! Of
course, proving that this transformation is *correct* is a whole other matter.


# Acquire-Release

Acquire and Release are largely intended to be paired. Their names hint at their
use case: they're perfectly suited for acquiring and releasing locks, and
ensuring that critical sections don't overlap.

Intuitively, an acquire access ensures that every access after it *stays* after
it. However operations that occur before an acquire are free to be reordered to
occur after it. Similarly, a release access ensures that every access before it
*stays* before it. However operations that occur after a release are free to be
reordered to occur before it.

When thread A releases a location in memory and then thread B subsequently
acquires *the same* location in memory, causality is established. Every write
that happened *before* A's release will be observed by B *after* its acquisition.
However no causality is established with any other threads. Similarly, no
causality is established if A and B access *different* locations in memory.
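
The following is a minimal sketch of that release/acquire handoff, assuming a hypothetical helper `pass_message` and two shared atomics named `data` and `ready`. The payload is written with a `Relaxed` store, and only the flag carries the synchronization.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Sends `value` from a producer thread to a consumer thread and returns
/// what the consumer read. (`pass_message` is a hypothetical helper.)
fn pass_message(value: u64) -> u64 {
    let data = Arc::new(AtomicU64::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let producer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            data.store(value, Ordering::Relaxed);
            // Thread A "releases" the `ready` location.
            ready.store(true, Ordering::Release);
        })
    };

    let consumer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            // Thread B "acquires" the same location, spinning until set.
            while !ready.load(Ordering::Acquire) {
                hint::spin_loop();
            }
            // Every write before the release is visible from here on.
            data.load(Ordering::Relaxed)
        })
    };

    producer.join().unwrap();
    consumer.join().unwrap()
}

fn main() {
    assert_eq!(pass_message(42), 42);
}
```

Note that the guarantee comes from both threads using the *same* `ready` location; a third thread that never reads `ready` would get no such promise about `data`.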

Basic use of release-acquire is therefore simple: you acquire a location of
memory to begin the critical section, and then release that location to end it.
For instance, a simple spinlock might look like:

```rust
use std::hint;
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

fn main() {
    let lock = Arc::new(AtomicBool::new(false)); // value answers "am I locked?"

    // ... distribute lock to threads somehow ...

    // Try to acquire the lock by setting it to true. `compare_exchange_weak`
    // replaces the deprecated `compare_and_swap`; it may fail spuriously,
    // which is fine inside a retry loop.
    while lock
        .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        hint::spin_loop();
    }
    // broke out of the loop, so we successfully acquired the lock!

    // ... scary data accesses ...

    // ok we're done, release the lock
    lock.store(false, Ordering::Release);
}
```
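
To see such a lock actually protect something, here is a hedged sketch in which several threads share it. The helper `locked_increments` is hypothetical, and in this sketch `false` means "unlocked". The critical section deliberately splits the increment into a separate load and store, which would lose updates if two threads were ever inside it at once.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawns `threads` threads that each bump a shared counter `iters` times
/// while holding a spinlock. (`locked_increments` is a hypothetical helper.)
fn locked_increments(threads: u64, iters: u64) -> u64 {
    let lock = Arc::new(AtomicBool::new(false)); // false = unlocked
    let counter = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let (lock, counter) = (Arc::clone(&lock), Arc::clone(&counter));
            thread::spawn(move || {
                for _ in 0..iters {
                    // Acquire the lock.
                    while lock
                        .compare_exchange_weak(
                            false,
                            true,
                            Ordering::Acquire,
                            Ordering::Relaxed,
                        )
                        .is_err()
                    {
                        hint::spin_loop();
                    }

                    // Critical section: a non-atomic read-modify-write.
                    let seen = counter.load(Ordering::Relaxed);
                    counter.store(seen + 1, Ordering::Relaxed);

                    // Release the lock.
                    lock.store(false, Ordering::Release);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(locked_increments(4, 1_000), 4_000);
}
```

Because each release of the lock is paired with the next acquire, every thread sees the previous holder's write to `counter`, so no increment is lost.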

On strongly-ordered platforms most accesses have release or acquire semantics,
making release and acquire often totally free. This is not the case on
weakly-ordered platforms.


# Relaxed

Relaxed accesses are the absolute weakest. They can be freely re-ordered and
provide no happens-before relationship. However, relaxed operations *are* still
atomic. That is, they don't count as data accesses and any read-modify-write
operations done to them occur atomically. Relaxed operations are appropriate for
things that you definitely want to happen, but don't particularly otherwise care
about. For instance, incrementing a counter can be safely done by multiple
threads using a relaxed `fetch_add` if you're not using the counter to
synchronize any other accesses.
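
A minimal sketch of such a counter follows; the helper name `count_events` and the thread counts are assumptions for illustration. The final read happens after every thread has been joined, and it is the join, not the counter, that makes the total visible.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Has `threads` threads each increment a shared counter `per_thread`
/// times and returns the total. (`count_events` is a hypothetical helper.)
fn count_events(threads: usize, per_thread: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // Atomic read-modify-write: no increment is lost, but
                    // no ordering is imposed on any other memory access.
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(count_events(4, 10_000), 40_000);
}
```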

There's rarely a benefit in making an operation relaxed on strongly-ordered
platforms, since they usually provide release-acquire semantics anyway. However
relaxed operations can be cheaper on weakly-ordered platforms.


[C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf
[C11-model]: http://www.open-std.org/jtc1/sc22/wg14/www/standards.html#9899