mirror of https://github.com/rust-lang/nomicon
Merge b139a3ce58
into b4448fa406
commit
d9e07f1b03
@ -1,239 +0,0 @@
|
||||
# Atomics
|
||||
|
||||
Rust pretty blatantly just inherits the memory model for atomics from C++20. This is not
|
||||
due to this model being particularly excellent or easy to understand. Indeed,
|
||||
this model is quite complex and known to have [several flaws][C11-busted].
|
||||
Rather, it is a pragmatic concession to the fact that *everyone* is pretty bad
|
||||
at modeling atomics. At very least, we can benefit from existing tooling and
|
||||
research around the C/C++ memory model.
|
||||
(You'll often see this model referred to as "C/C++11" or just "C11". C just copies
|
||||
the C++ memory model; and C++11 was the first version of the model but it has
|
||||
received some bugfixes since then.)
|
||||
|
||||
Trying to fully explain the model in this book is fairly hopeless. It's defined
|
||||
in terms of madness-inducing causality graphs that require a full book to
|
||||
properly understand in a practical way. If you want all the nitty-gritty
|
||||
details, you should check out the [C++ specification][C++-model].
|
||||
Still, we'll try to cover the basics and some of the problems Rust developers
|
||||
face.
|
||||
|
||||
The C++ memory model is fundamentally about trying to bridge the gap between the
|
||||
semantics we want, the optimizations compilers want, and the inconsistent chaos
|
||||
our hardware wants. *We* would like to just write programs and have them do
|
||||
exactly what we said but, you know, fast. Wouldn't that be great?
|
||||
|
||||
## Compiler Reordering
|
||||
|
||||
Compilers fundamentally want to be able to do all sorts of complicated
|
||||
transformations to reduce data dependencies and eliminate dead code. In
|
||||
particular, they may radically change the actual order of events, or make events
|
||||
never occur! If we write something like:
|
||||
|
||||
<!-- ignore: simplified code -->
|
||||
```rust,ignore
|
||||
x = 1;
|
||||
y = 3;
|
||||
x = 2;
|
||||
```
|
||||
|
||||
The compiler may conclude that it would be best if your program did:
|
||||
|
||||
<!-- ignore: simplified code -->
|
||||
```rust,ignore
|
||||
x = 2;
|
||||
y = 3;
|
||||
```
|
||||
|
||||
This has inverted the order of events and completely eliminated one event.
|
||||
From a single-threaded perspective this is completely unobservable: after all
|
||||
the statements have executed we are in exactly the same state. But if our
|
||||
program is multi-threaded, we may have been relying on `x` to actually be
|
||||
assigned to 1 before `y` was assigned. We would like the compiler to be
|
||||
able to make these kinds of optimizations, because they can seriously improve
|
||||
performance. On the other hand, we'd also like to be able to depend on our
|
||||
program *doing the thing we said*.
|
||||
|
||||
## Hardware Reordering
|
||||
|
||||
On the other hand, even if the compiler totally understood what we wanted and
|
||||
respected our wishes, our hardware might instead get us in trouble. Trouble
|
||||
comes from CPUs in the form of memory hierarchies. There is indeed a global
|
||||
shared memory space somewhere in your hardware, but from the perspective of each
|
||||
CPU core it is *so very far away* and *so very slow*. Each CPU would rather work
|
||||
with its local cache of the data and only go through all the anguish of
|
||||
talking to shared memory only when it doesn't actually have that memory in
|
||||
cache.
|
||||
|
||||
After all, that's the whole point of the cache, right? If every read from the
|
||||
cache had to run back to shared memory to double check that it hadn't changed,
|
||||
what would the point be? The end result is that the hardware doesn't guarantee
|
||||
that events that occur in some order on *one* thread, occur in the same
|
||||
order on *another* thread. To guarantee this, we must issue special instructions
|
||||
to the CPU telling it to be a bit less smart.
|
||||
|
||||
For instance, say we convince the compiler to emit this logic:
|
||||
|
||||
```text
|
||||
initial state: x = 0, y = 1
|
||||
|
||||
THREAD 1 THREAD 2
|
||||
y = 3; if x == 1 {
|
||||
x = 1; y *= 2;
|
||||
}
|
||||
```
|
||||
|
||||
Ideally this program has 2 possible final states:
|
||||
|
||||
* `y = 3`: (thread 2 did the check before thread 1 completed)
|
||||
* `y = 6`: (thread 2 did the check after thread 1 completed)
|
||||
|
||||
However there's a third potential state that the hardware enables:
|
||||
|
||||
* `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`)
|
||||
|
||||
It's worth noting that different kinds of CPU provide different guarantees. It
|
||||
is common to separate hardware into two categories: strongly-ordered and weakly-ordered.
|
||||
Most notably x86/64 provides strong ordering guarantees, while ARM
|
||||
provides weak ordering guarantees. This has two consequences for concurrent
|
||||
programming:
|
||||
|
||||
* Asking for stronger guarantees on strongly-ordered hardware may be cheap or
|
||||
even free because they already provide strong guarantees unconditionally.
|
||||
Weaker guarantees may only yield performance wins on weakly-ordered hardware.
|
||||
|
||||
* Asking for guarantees that are too weak on strongly-ordered hardware is
|
||||
more likely to *happen* to work, even though your program is strictly
|
||||
incorrect. If possible, concurrent algorithms should be tested on
|
||||
weakly-ordered hardware.
|
||||
|
||||
## Data Accesses
|
||||
|
||||
The C++ memory model attempts to bridge the gap by allowing us to talk about the
|
||||
*causality* of our program. Generally, this is by establishing a *happens
|
||||
before* relationship between parts of the program and the threads that are
|
||||
running them. This gives the hardware and compiler room to optimize the program
|
||||
more aggressively where a strict happens-before relationship isn't established,
|
||||
but forces them to be more careful where one is established. The way we
|
||||
communicate these relationships are through *data accesses* and *atomic
|
||||
accesses*.
|
||||
|
||||
Data accesses are the bread-and-butter of the programming world. They are
|
||||
fundamentally unsynchronized and compilers are free to aggressively optimize
|
||||
them. In particular, data accesses are free to be reordered by the compiler on
|
||||
the assumption that the program is single-threaded. The hardware is also free to
|
||||
propagate the changes made in data accesses to other threads as lazily and
|
||||
inconsistently as it wants. Most critically, data accesses are how data races
|
||||
happen. Data accesses are very friendly to the hardware and compiler, but as
|
||||
we've seen they offer *awful* semantics to try to write synchronized code with.
|
||||
Actually, that's too weak.
|
||||
|
||||
**It is literally impossible to write correct synchronized code using only data
|
||||
accesses.**
|
||||
|
||||
Atomic accesses are how we tell the hardware and compiler that our program is
|
||||
multi-threaded. Each atomic access can be marked with an *ordering* that
|
||||
specifies what kind of relationship it establishes with other accesses. In
|
||||
practice, this boils down to telling the compiler and hardware certain things
|
||||
they *can't* do. For the compiler, this largely revolves around re-ordering of
|
||||
instructions. For the hardware, this largely revolves around how writes are
|
||||
propagated to other threads. The set of orderings Rust exposes are:
|
||||
|
||||
* Sequentially Consistent (SeqCst)
|
||||
* Release
|
||||
* Acquire
|
||||
* Relaxed
|
||||
|
||||
(Note: We explicitly do not expose the C++ *consume* ordering)
|
||||
|
||||
TODO: negative reasoning vs positive reasoning? TODO: "can't forget to
|
||||
synchronize"
|
||||
|
||||
## Sequentially Consistent
|
||||
|
||||
Sequentially Consistent is the most powerful of all, implying the restrictions
|
||||
of all other orderings. Intuitively, a sequentially consistent operation
|
||||
cannot be reordered: all accesses on one thread that happen before and after a
|
||||
SeqCst access stay before and after it. A data-race-free program that uses
|
||||
only sequentially consistent atomics and data accesses has the very nice
|
||||
property that there is a single global execution of the program's instructions
|
||||
that all threads agree on. This execution is also particularly nice to reason
|
||||
about: it's just an interleaving of each thread's individual executions. This
|
||||
does not hold if you start using the weaker atomic orderings.
|
||||
|
||||
The relative developer-friendliness of sequential consistency doesn't come for
|
||||
free. Even on strongly-ordered platforms sequential consistency involves
|
||||
emitting memory fences.
|
||||
|
||||
In practice, sequential consistency is rarely necessary for program correctness.
|
||||
However sequential consistency is definitely the right choice if you're not
|
||||
confident about the other memory orders. Having your program run a bit slower
|
||||
than it needs to is certainly better than it running incorrectly! It's also
|
||||
mechanically trivial to downgrade atomic operations to have a weaker
|
||||
consistency later on. Just change `SeqCst` to `Relaxed` and you're done! Of
|
||||
course, proving that this transformation is *correct* is a whole other matter.
|
||||
|
||||
## Acquire-Release
|
||||
|
||||
Acquire and Release are largely intended to be paired. Their names hint at their
|
||||
use case: they're perfectly suited for acquiring and releasing locks, and
|
||||
ensuring that critical sections don't overlap.
|
||||
|
||||
Intuitively, an acquire access ensures that every access after it stays after
|
||||
it. However operations that occur before an acquire are free to be reordered to
|
||||
occur after it. Similarly, a release access ensures that every access before it
|
||||
stays before it. However operations that occur after a release are free to be
|
||||
reordered to occur before it.
|
||||
|
||||
When thread A releases a location in memory and then thread B subsequently
|
||||
acquires *the same* location in memory, causality is established. Every write
|
||||
(including non-atomic and relaxed atomic writes) that happened before A's
|
||||
release will be observed by B after its acquisition. However no causality is
|
||||
established with any other threads. Similarly, no causality is established
|
||||
if A and B access *different* locations in memory.
|
||||
|
||||
Basic use of release-acquire is therefore simple: you acquire a location of
|
||||
memory to begin the critical section, and then release that location to end it.
|
||||
For instance, a simple spinlock might look like:
|
||||
|
||||
```rust
|
||||
use std::sync::Arc;
|
||||
use std::sync::atomic::{AtomicBool, Ordering};
|
||||
use std::thread;
|
||||
|
||||
fn main() {
|
||||
let lock = Arc::new(AtomicBool::new(false)); // value answers "am I locked?"
|
||||
|
||||
// ... distribute lock to threads somehow ...
|
||||
|
||||
// Try to acquire the lock by setting it to true
|
||||
while lock.compare_and_swap(false, true, Ordering::Acquire) { }
|
||||
// broke out of the loop, so we successfully acquired the lock!
|
||||
|
||||
// ... scary data accesses ...
|
||||
|
||||
// ok we're done, release the lock
|
||||
lock.store(false, Ordering::Release);
|
||||
}
|
||||
```
|
||||
|
||||
On strongly-ordered platforms most accesses have release or acquire semantics,
|
||||
making release and acquire often totally free. This is not the case on
|
||||
weakly-ordered platforms.
|
||||
|
||||
## Relaxed
|
||||
|
||||
Relaxed accesses are the absolute weakest. They can be freely re-ordered and
|
||||
provide no happens-before relationship. Still, relaxed operations are still
|
||||
atomic. That is, they don't count as data accesses and any read-modify-write
|
||||
operations done to them occur atomically. Relaxed operations are appropriate for
|
||||
things that you definitely want to happen, but don't particularly otherwise care
|
||||
about. For instance, incrementing a counter can be safely done by multiple
|
||||
threads using a relaxed `fetch_add` if you're not using the counter to
|
||||
synchronize any other accesses.
|
||||
|
||||
There's rarely a benefit in making an operation relaxed on strongly-ordered
|
||||
platforms, since they usually provide release-acquire semantics anyway. However
|
||||
relaxed operations can be cheaper on weakly-ordered platforms.
|
||||
|
||||
[C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf
|
||||
[C++-model]: https://en.cppreference.com/w/cpp/atomic/memory_order
|
Loading…
Reference in new issue