*(Diagram from the previous section: Thread 1 performs an unsynchronized write of `1` to `data`, whose value goes from `0` to `1`, while Thread 2 performs an unsynchronized read that no arrow connects to either value.)*

Unfortunately, the rules from before don't help us in finding out where Thread
2's line joins up to, since there are no arrows connecting that operation to
anything and therefore we can't immediately rule any values out. As a result, we
end up facing a situation we haven't faced before: there is _more than one_
potential value for Thread 2 to read.

And this is where we encounter the big limitation with unsynchronized data
accesses: the price we pay for their speed and optimization capability is that
this situation is considered **Undefined Behavior**. For an unsynchronized read
to be acceptable, there has to be _exactly one_ potential value for it to read;
when there are multiple, as in this situation, it is considered a data race.

So what can we do about this? Well, two things need to be changed. First of all,
Thread 1 has to use an atomic store instead of an unsynchronized write, and
secondly Thread 2 has to use an atomic load instead of an unsynchronized read.

You'll also notice that all the atomic functions accept one (and sometimes two)
parameters of type `atomic::Ordering` — we'll explore the details of the
differences between them later, but for now we'll use `Relaxed` because it is by
far the simplest of the lot.

```rust
# use std::sync::atomic::{self, AtomicU32};
// Initial state
let data = AtomicU32::new(0);
// Thread 1:
data.store(1, atomic::Ordering::Relaxed);
// Thread 2:
data.load(atomic::Ordering::Relaxed);
```
The use of the atomic store provides one additional ability in comparison to an
unsynchronized store: there is no “in-between” state between the old and new
values — instead, the value updates immediately, resulting in a diagram that
looks a bit more like this:
```text
Thread 1      data
╭───────╮    ┌────┐
│  = 1  ├─┐  │  0 │
╰───────╯ │  └────┘
          └─┬────┐
            │  1 │
            └────┘
```
We have now established a _modification order_ for `data`: a total, ordered list
of distinct, separated values that it takes over its lifetime.

On the loading side, we also obtain one additional ability: when there are
multiple possible values to choose from in the modification order, instead of
triggering UB, exactly one value (it is unspecified which) is chosen. This means
that there are now _two_ potential executions of our program, with no way for us
to control which one occurs:
```text
       Possible Execution 1        ┃         Possible Execution 2
Thread 1      data     Thread 2    ┃   Thread 1      data     Thread 2
╭───────╮    ┌────┐    ╭───────╮   ┃   ╭───────╮    ┌────┐    ╭───────╮
│ store ├─┐  │  0 ├────┤ load  │   ┃   │ store ├─┐  │  0 │  ┌─┤ load  │
╰───────╯ │  └────┘    ╰───────╯   ┃   ╰───────╯ │  └────┘  │ ╰───────╯
          └─┬────┐                 ┃             └─┬────┐   │
            │  1 │                 ┃               │  1 ├───┘
            └────┘                 ┃               └────┘
```
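
If you want to see this nondeterminism in running code, here is a minimal
sketch; the use of `std::thread::scope` to set up the two threads is my own
scaffolding rather than anything prescribed above. Whichever execution occurs,
the load can only ever observe `0` or `1`.

```rust
# use std::sync::atomic::{self, AtomicU32};
# use std::thread;
// Initial state
let data = AtomicU32::new(0);

thread::scope(|s| {
    // Thread 1: atomically store 1.
    s.spawn(|| data.store(1, atomic::Ordering::Relaxed));

    // Thread 2: atomically load. Depending on which execution occurs, this is
    // allowed to observe either 0 or 1, but never anything else and never UB.
    s.spawn(|| {
        let value = data.load(atomic::Ordering::Relaxed);
        assert!(value == 0 || value == 1);
    });
});
```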
Note that **both sides must be atomic to avoid the data race**: if only the
writing side used atomic operations, the reading side would still have multiple
values to choose from (UB), and if only the reading side used atomic operations
it could end up reading the garbage data “in-between” `0` and `1` (also UB).
> **NOTE:** This description of why both sides need to be atomic operations,
> while neat and intuitive, is not strictly correct: in reality the answer is
> simply “because the spec says so”. However, it is isomorphic to the real
> rules, so it can aid in understanding.

## Read-modify-write operations
Loads and stores are pretty neat for avoiding data races, but you can't get very
far with them. For example, suppose you wanted to implement a global shared
counter that can be used to assign unique IDs to objects. Naïvely, you might try
to write code like this:
```rust
# use std::sync::atomic::{self, AtomicU64};
static COUNTER: AtomicU64 = AtomicU64::new(0);
pub fn get_id() -> u64 {
    let value = COUNTER.load(atomic::Ordering::Relaxed);
    COUNTER.store(value + 1, atomic::Ordering::Relaxed);
    value
}
```
But then calling that function from multiple threads opens you up to an
execution like the one below, which results in two threads obtaining the same
ID:
```text
Thread 1     COUNTER     Thread 2
╭───────╮     ┌───┐      ╭───────╮
│ load  ├─────┤ 0 ├──────┤ load  │
╰───╥───╯     └───┘      ╰───╥───╯
╭───⇓───╮   ┌─┬───┐      ╭───⇓───╮
│ store ├───┘ │ 1 │    ┌─┤ store │
╰───────╯     └───┘    │ ╰───────╯
                 ┌───┬─┘
                 │ 1 │
                 └───┘
```
Technically, I believe it is _possible_ to implement this kind of thing with
just loads and stores, if you try hard enough and use several atomics. But
luckily, you don't have to, because there also exists another kind of operation,
the read-modify-write, which is specifically suited to this purpose.
A read-modify-write operation (shortened to RMW) is a special kind of atomic
operation that reads, changes and writes back a value _in one step_. This means
that there are guaranteed to exist no other values in the modification order in
between the read and the write; it happens as a single operation. I would also
like to point out that this is true of **all** atomic orderings, since a common
misconception is that the `Relaxed` ordering somehow negates this guarantee.
There are many different RMW operations to choose from, but the one most
appropriate for this use case is `fetch_add`, which adds a number to the atomic
and returns the old value. So our code can be rewritten like this:
```rust
# use std::sync::atomic::{self, AtomicU64};
static COUNTER: AtomicU64 = AtomicU64::new(0);
pub fn get_id() -> u64 {
    COUNTER.fetch_add(1, atomic::Ordering::Relaxed)
}
```
And then, no matter how many threads there are, that race condition from earlier
can never occur. Executions will have to look more like this:
```text
  Thread 1       COUNTER       Thread 2
╭───────────╮     ┌───┐     ╭───────────╮
│ fetch_add ├─┐   │ 0 │   ┌─┤ fetch_add │
╰───────────╯ │   └───┘   │ ╰───────────╯
              └─┬───┐     │
                │ 1 │     │
                └───┘     │
                  ┌───┬───┘
                  │ 2 │
                  └───┘
```
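
As a quick sanity check of that claim, here is a sketch that hammers the counter
from several threads at once (the `std::thread::scope` harness and the thread
and ID counts are arbitrary choices of mine). Because each `fetch_add` occupies
exactly one slot in the modification order, every ID comes back distinct and no
increment is lost.

```rust
# use std::sync::atomic::{self, AtomicU64};
# use std::thread;
static COUNTER: AtomicU64 = AtomicU64::new(0);

// Four threads, each asking for three IDs.
let mut ids = Vec::new();
thread::scope(|s| {
    let handles: Vec<_> = (0..4)
        .map(|_| {
            s.spawn(|| {
                let mut mine = Vec::new();
                for _ in 0..3 {
                    mine.push(COUNTER.fetch_add(1, atomic::Ordering::Relaxed));
                }
                mine
            })
        })
        .collect();
    for handle in handles {
        ids.extend(handle.join().unwrap());
    }
});

// Every increment landed in the modification order exactly once, so all twelve
// IDs are distinct and none were lost.
ids.sort_unstable();
assert_eq!(ids, (0..12).collect::<Vec<u64>>());
assert_eq!(COUNTER.load(atomic::Ordering::Relaxed), 12);
```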
There is one problem with this code, however: if `get_id()` is called more than
18446744073709551615 times, the counter will overflow and it will start
generating duplicate IDs. Of course, this won't feasibly happen, but it can be
problematic if you need to _prove_ that it can't happen (e.g. for safety
purposes) or you're using a smaller integer type like `u32`.
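
To make the wrap-around concrete, here is a tiny sketch using `AtomicU8`, chosen
purely because its range is small enough to reason about; `AtomicU64` behaves
the same way. `fetch_add` does not fail on overflow, it silently wraps, which is
exactly how duplicate IDs would eventually appear.

```rust
# use std::sync::atomic::{self, AtomicU8};
// A counter that has already handed out every ID up to its maximum.
let counter = AtomicU8::new(u8::MAX);

// `fetch_add` wraps around on overflow instead of reporting an error…
assert_eq!(counter.fetch_add(1, atomic::Ordering::Relaxed), u8::MAX);
assert_eq!(counter.load(atomic::Ordering::Relaxed), 0);

// …so the next ID handed out is 0 all over again: a duplicate.
assert_eq!(counter.fetch_add(1, atomic::Ordering::Relaxed), 0);
```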
So we're going to modify this function so that instead of returning a plain
`u64` it returns an `Option<u64>`, where `None` indicates that an overflow
occurred and no more IDs could be generated. Additionally, it's not enough to
just return `None` once, because with multiple threads involved only one of them
would ever see that result — instead, it needs to continue to return `None`
_until the end of time_ (or, well, until the end of this execution of the
program).
That means we have to do away with `fetch_add`, because `fetch_add` always wraps
around on overflow and there's no `checked_fetch_add` equivalent. We'll return
to our racy algorithm for a minute, this time thinking more about what went
wrong. The steps look something like this:
1. Load a value of the atomic
1. Perform the checked add, propagating `None`
1. Store the new value in the atomic
The problem here is that the store does not necessarily occur directly after the
load in the atomic's modification order, and that is what leads to the races.
What we need is some way to say, “add this new value to the modification order,
but _only if_ it occurs directly after the value we loaded”. And luckily for us,
there exists a function that does exactly\* this: `compare_exchange`.
`compare_exchange` is a bit like a store, but instead of unconditionally storing
the value, it will first check the previous value in the modification order to
see whether it is what we expect, and if not it will simply tell us that and not
make any changes. It is an RMW operation, so all of this happens fully
atomically — there is no chance for a race condition.
> \* It's not quite the same, because `compare_exchange` can suffer from ABA
> problems, in which it sees a later value in the modification order that just
> happened to be the same and succeeds. However, in this code values can never
> be reused, so we don't have to worry about that.
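
Before wiring it into the ID generator, here is a minimal single-threaded sketch
of `compare_exchange`'s contract (both orderings are `Relaxed`, matching the
rest of this section): it only writes when the current value matches what we
expected, and on failure it hands back the value it actually found.

```rust
# use std::sync::atomic::{self, AtomicU64};
let value = AtomicU64::new(0);

// We expect 0 and want to write 1. The expectation matches, so the exchange
// succeeds and returns the previous value.
let first =
    value.compare_exchange(0, 1, atomic::Ordering::Relaxed, atomic::Ordering::Relaxed);
assert_eq!(first, Ok(0));

// We expect 0 again, but the value is now 1: the exchange fails, nothing is
// written, and we are told what the value actually was.
let second =
    value.compare_exchange(0, 2, atomic::Ordering::Relaxed, atomic::Ordering::Relaxed);
assert_eq!(second, Err(1));
assert_eq!(value.load(atomic::Ordering::Relaxed), 1);
```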
In our case, we can simply replace the store with a compare exchange of the old
value and itself plus one (returning `None` instead if the addition overflowed,
to prevent overflowing the atomic). Should the `compare_exchange` fail, we know
that some other thread inserted a value in the modification order after the
value we loaded. This isn't really a problem — we can just try again and again
until we succeed, and `compare_exchange` is even nice enough to give us the
updated value so we don't have to load again. Also note that after we've updated
our value of the atomic, we're guaranteed to never see the old value again, by
the arrow rules from the previous chapter.
So here's how it looks with these changes applied:
```rust
# use std::sync::atomic::{self, AtomicU64};
static COUNTER: AtomicU64 = AtomicU64::new(0);
pub fn get_id() -> Option<u64> {
    // Load the counter's initial value from some place in the modification
    // order (it doesn't matter where, because the compare exchange makes sure
    // that our new value appears directly after it).
    let mut value = COUNTER.load(atomic::Ordering::Relaxed);

    loop {
        // Attempt to add one to the atomic.
        let res = COUNTER.compare_exchange(
            value,
            value.checked_add(1)?,
            atomic::Ordering::Relaxed,
            atomic::Ordering::Relaxed,
        );

        // Check what happened…
        match res {
            // If there was no value in between the value we loaded and our
            // newly written value in the modification order, the compare
            // exchange succeeded and so we are done.
            Ok(_) => break,
            // Otherwise, there was a value in between and so we need to retry
            // the addition and continue looping.
            Err(updated_value) => value = updated_value,
        }
    }

    Some(value)
}
```
This `compare_exchange` loop enables the algorithm to succeed even under
contention; it will simply try again (and again and again). In the execution
below, Thread 1 is beaten to storing its value of `1` in the counter, but that's
okay because it will just add `1` to the `1`, making `2`, and retry the compare
exchange with that, eventually resulting in a unique ID.
```text
Thread 1     COUNTER     Thread 2
╭───────╮     ┌───┐      ╭───────╮
│ load  ├─────┤ 0 ├──────┤ load  │
╰───╥───╯     └───┘      ╰───╥───╯
╭───⇓───╮   ┌───┬─┐      ╭───⇓───╮
│  cas  ├───┤ 1 │ └──────┤  cas  │
╰───╥───╯   └───┘        ╰───────╯
╭───⇓───╮ ┌─┬───┐
│  cas  ├─┘ │ 2 │
╰───────╯   └───┘
```
> `compare_exchange` is abbreviated to CAS here (which stands for
> compare-and-swap), since that is the more general name for the operation. It
> is not to be confused with `compare_and_swap`, a deprecated method on Rust
> atomics that performs the same task as `compare_exchange` but has an inferior
> design in some ways.
There are two additional improvements we can make here. First, because our
algorithm occurs in a loop, it is actually perfectly fine for the CAS to fail
even when there wasn't a value inserted in the modification order in between,
since we'll just run it again. This allows us to switch out our call to
`compare_exchange` with a call to the weaker `compare_exchange_weak`, which,
unlike the former function, is allowed to _spuriously_ (i.e. randomly, from the
programmer's perspective) fail. This often results in better performance on
architectures like ARM, since their `compare_exchange` is really just a loop
around the underlying `compare_exchange_weak`. x86\_64 however will see no
difference in performance.
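
As a sketch of that first improvement, the loop from earlier needs nothing more
than the method swapped out, since a spurious failure still returns the latest
value and simply sends us around the loop one more time:

```rust
# use std::sync::atomic::{self, AtomicU64};
static COUNTER: AtomicU64 = AtomicU64::new(0);

pub fn get_id() -> Option<u64> {
    let mut value = COUNTER.load(atomic::Ordering::Relaxed);

    loop {
        // `compare_exchange_weak` is allowed to fail spuriously, but a failure
        // (spurious or not) just hands us the current value and we retry.
        match COUNTER.compare_exchange_weak(
            value,
            value.checked_add(1)?,
            atomic::Ordering::Relaxed,
            atomic::Ordering::Relaxed,
        ) {
            Ok(_) => break,
            Err(updated_value) => value = updated_value,
        }
    }

    Some(value)
}
```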
The second improvement is that this pattern is so common that the standard
library even provides a helper function for it, called `fetch_update`. It
implements the boilerplate `load`-`loop`-`match` parts for us, so all we have to
do is provide the closure that calls `checked_add(1)` and it will all just work.
This leads us to our final code for this example:
```rust
# use std::sync::atomic::{self, AtomicU64};
static COUNTER: AtomicU64 = AtomicU64::new(0);
pub fn get_id() -> Option<u64> {
    COUNTER.fetch_update(
        atomic::Ordering::Relaxed,
        atomic::Ordering::Relaxed,
        |value| value.checked_add(1),
    )
    .ok()
}
```
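
As a quick usage check (single-threaded here only to keep the output
deterministic, and with the counter forced to `u64::MAX` to simulate
exhaustion), consecutive calls hand out consecutive IDs until the addition would
overflow, after which every call returns `None`:

```rust
# use std::sync::atomic::{self, AtomicU64};
# static COUNTER: AtomicU64 = AtomicU64::new(0);
# pub fn get_id() -> Option<u64> {
#     COUNTER.fetch_update(
#         atomic::Ordering::Relaxed,
#         atomic::Ordering::Relaxed,
#         |value| value.checked_add(1),
#     )
#     .ok()
# }
assert_eq!(get_id(), Some(0));
assert_eq!(get_id(), Some(1));
assert_eq!(get_id(), Some(2));

// Once the counter reaches `u64::MAX`, `checked_add` returns `None`, the
// update never happens, and every later call keeps returning `None`.
COUNTER.store(u64::MAX, atomic::Ordering::Relaxed);
assert_eq!(get_id(), None);
assert_eq!(get_id(), None);
```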
These CAS loops are the absolute bread and butter of concurrent programming;
they're absolutely everywhere and essential to know about. Every other RMW
operation on atomics can be (and often is, if the hardware doesn't have a more
efficient implementation) implemented via a CAS loop. This is why CAS is seen as
the canonical example of an RMW — it's pretty much the most fundamental
operation you can get on atomics.
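
For instance, here is a sketch of how a `fetch_add`-shaped operation could be
written as a CAS loop. The helper name `cas_fetch_add` is made up for
illustration; the real `fetch_add` usually maps to a dedicated instruction, so
this only shows that the pattern is expressible:

```rust
# use std::sync::atomic::{self, AtomicU64};
// A wrapping add implemented as a CAS loop, mirroring what `fetch_add` does.
fn cas_fetch_add(atomic: &AtomicU64, amount: u64) -> u64 {
    let mut value = atomic.load(atomic::Ordering::Relaxed);
    loop {
        match atomic.compare_exchange_weak(
            value,
            value.wrapping_add(amount),
            atomic::Ordering::Relaxed,
            atomic::Ordering::Relaxed,
        ) {
            // Success: return the old value, just like `fetch_add` does.
            Ok(previous) => return previous,
            // Another thread got in first (or the CAS failed spuriously):
            // retry with the value we were handed back.
            Err(updated) => value = updated,
        }
    }
}

let counter = AtomicU64::new(5);
assert_eq!(cas_fetch_add(&counter, 3), 5);
assert_eq!(counter.load(atomic::Ordering::Relaxed), 8);
```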
I'd also like to briefly bring attention to the atomic orderings used in this
section. They were mostly glossed over, but we were exclusively using `Relaxed`,
and that's because for something as simple as a global ID counter, _you never
need more than `Relaxed`_. The more complex cases which we'll look at later
definitely do need stronger orderings, but as a general rule, if:
- you only have one atomic, and
- you have no other related pieces of data

then `Relaxed` is more than sufficient.

## “Out-of-thin-air” values
One peculiar consequence of the semantics of `Relaxed` operations is that it is
theoretically possible for values to come into existence “out-of-thin-air”
(commonly abbreviated to OOTA) — that is, a value could appear despite not ever
being calculated anywhere in code. In particular, consider this setup:
```rust
# use std::sync::atomic::{self, AtomicU32};
let x = AtomicU32::new(0);
let y = AtomicU32::new(0);
// Thread 1:
let r1 = y.load(atomic::Ordering::Relaxed);
x.store(r1, atomic::Ordering::Relaxed);
// Thread 2:
let r2 = x.load(atomic::Ordering::Relaxed);
y.store(r2, atomic::Ordering::Relaxed);
```
When starting to draw a diagram for a possible execution of this program, we
have to first lay out the basic facts that we know:
- `x` and `y` both start out as zero
- Thread 1 performs a load of `y` followed by a store of `x`
- Thread 2 performs a load of `x` followed by a store of `y`
- Each of `x` and `y` takes on exactly two values over its lifetime

Then we can start to construct boxes:
```text
Thread 1       x         y       Thread 2
╭───────╮    ┌───┐     ┌───┐    ╭───────╮
│ load  ├─┐  │ 0 │     │ 0 │  ┌─┤ load  │
╰───╥───╯ │  └───┘     └───┘  │ ╰───╥───╯
    ║     │    ?──────────────┘     ║
╭───⇓───╮ └──────────────?      ╭───⇓───╮
│ store ├────┬───┐     ┌───┬────┤ store │
╰───────╯    │ ? │     │ ? │    ╰───────╯
             └───┘     └───┘
```
At this point, if either of those lines were to connect to the higher box then
the execution would be simple: that thread would forward the value to its lower
box, and the other thread would then either read that value or load the same
value (zero) from the box above it, and we'd end up with zero in both atomics.
But what if they were both to connect downwards? Then we'd end up with an
execution that looks like this:
```text
Thread 1       x         y       Thread 2
╭───────╮    ┌───┐     ┌───┐    ╭───────╮
│ load  ├─┐  │ 0 │     │ 0 │  ┌─┤ load  │
╰───╥───╯ │  └───┘     └───┘  │ ╰───╥───╯
    ║     │    ┌──────────────┘     ║
╭───⇓───╮ └────┼─────────┐      ╭───⇓───╮
│ store ├────┬─┴─┐     ┌─┴─┬────┤ store │
╰───────╯    │ ? │     │ ? │    ╰───────╯
             └───┘     └───┘
```
But hang on — it's not fully resolved yet; we still haven't put a value in those
lower question marks. So what value should it be? Well, the second value of `x`
is just copied from the second value of `y`, so we just have to find the value
of that — but the second value of `y` is itself copied from the second value of
`x`! This means that we can actually put any value we like in that box,
including `0` or `42`, and the logic will check out perfectly fine — meaning if
this program were to execute in this fashion, it would end up reading a value
produced out of thin air!
Now, if we were to strictly follow the rules we've laid out thus far, then this
would be a totally valid thing to happen. But luckily, the authors of the C++
specification have recognized this as a problem, and as such refined the
semantics of `Relaxed` to implement a thorough, logically sound, mathematically
proven formal model that prevents it, which is just too complex and technical to
explain here—
> No “out-of-thin-air” values can be computed that circularly depend on their
> own computations.
Just kidding. It turns out that it's a *really* difficult problem to solve, and
to my knowledge even now there is no known formal way to express how to prevent
it. So in the specification they just kind of hand-wave and say that it
shouldn't happen, and that the above program must always give zero in both
atomics, despite the theoretical execution that could result in something else.
Well, it generally works in practice so I can't complain — it's just a very
interesting detail to know about.
