Nines of safety: Terence Tao’s proposed unit of measurement of risk
I recently came across Terence Tao’s post proposing a risk measure called “nines of safety”. I think it’s a very interesting proposal, and given that many EA forum users think a lot about risks and probabilities, I’m curious to hear what opinions other people have.
Below I’ll briefly summarise my understanding of the idea, and ask some specific questions about how this might be related to EA. For more details, I highly recommend reading the original post.
The proposal
While we often use percentages to describe probabilities and proportions, it can be hard to tell whether a given percentage is “good” or “bad”. For instance, having a 60% chance of success seems risky for a medical operation, but a result of 60% to 40% would be a landslide victory in a two-party election.
Part of the difficulty is that percentages can be used in a multitude of ways, with different interpretations in different scenarios. Tao proposes a unit for measuring the risk of, and safety from, an event with a really bad outcome (e.g. a global pandemic). The trouble with percentages in this scenario is that the probability of getting a good outcome needs to be really high in order for us to be comfortable. For instance, 90% odds of successfully completing a potentially life-threatening medical operation might seem a bit risky. 99% odds would probably feel quite a bit better, and at 99.9% odds of success we might start feeling reasonably safe (in general, this depends on how bad the negative outcomes are, the counterfactual of not doing the operation, etc.).
Writing out all of these 9s seems a bit clumsy, and the measure that Tao proposes addresses this: it’s called “nines of safety”, and it informally measures how many leading 9s there are in the probability of success (the nines of risk would be the same measure, but applied to the probability of failure). So in the previous example:
- 90% success = 1 nine of safety
- 99% success = 2 nines of safety
- 99.9% success = 3 nines of safety
We can formalise this in terms of the base-10 logarithm: if $p$ is the probability of success, then

$$\text{nines of safety} = -\log_{10}(1 - p),$$

which allows us to extend “nines of safety” so that it isn’t just a whole number. We can write a table to convert between the probabilities of success, failure, and the nines of safety:
| Probability of success | Probability of failure | Nines of safety |
|---|---|---|
| 0% | 100% | 0.0 |
| 50% | 50% | 0.3 |
| 75% | 25% | 0.6 |
| 80% | 20% | 0.7 |
| 90% | 10% | 1.0 |
| 95% | 5% | 1.3 |
| 97.5% | 2.5% | 1.6 |
| 98% | 2% | 1.7 |
| 99% | 1% | 2.0 |
| 99.5% | 0.5% | 2.3 |
| 99.75% | 0.25% | 2.6 |
| 99.8% | 0.2% | 2.7 |
| 99.9% | 0.1% | 3.0 |
| 99.95% | 0.05% | 3.3 |
| 99.975% | 0.025% | 3.6 |
| 99.98% | 0.02% | 3.7 |
| 99.99% | 0.01% | 4.0 |
| 100% | 0% | infinite |
Note that the nines of safety are rounded to 1 decimal place, because in practice probability estimates are likely to be quite uncertain, and extra decimal places may not be particularly significant.
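To make the conversion concrete, here’s a minimal Python sketch of the formula above (my own illustration, not code from Tao’s post):

```python
import math

def nines_of_safety(p_success: float) -> float:
    """Nines of safety: -log10 of the failure probability."""
    p_failure = 1.0 - p_success
    if p_failure <= 0.0:
        return math.inf  # 100% success gives infinitely many nines
    return -math.log10(p_failure)

# Reproduce a few rows of the table above:
for p in [0.5, 0.9, 0.99, 0.999, 0.9999]:
    print(f"{p:.2%} success -> {nines_of_safety(p):.1f} nines of safety")
```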
In general (as in the aforementioned example of the medical operation), the number of nines of safety depends on several factors, such as the number of people exposed and the duration of exposure. We might also need to consider repeated exposures, which can be quite complicated: depending on the task, individual exposures may not necessarily be independent of each other.
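As a toy illustration of how repeated exposures compound, here is a sketch that assumes exposures are independent (which, as just noted, may not hold in practice):

```python
import math

def nines_after_n_exposures(nines_single: float, n: int) -> float:
    """Nines of safety after n independent exposures, each with
    failure probability 10 ** (-nines_single)."""
    p_fail_single = 10 ** (-nines_single)
    p_success_all = (1.0 - p_fail_single) ** n
    return -math.log10(1.0 - p_success_all)

# A 3-nine activity (99.9% safe per exposure), repeated daily for a year,
# leaves only about 0.5 nines of safety (~69% chance of no failure):
print(nines_after_n_exposures(3.0, 365))
```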
Why?
This potentially has several benefits:

- Easier mental arithmetic: due to the properties of logarithms, adding nines of safety corresponds to multiplying failure probabilities, which makes mental calculation easier (especially if we’re dealing with relative risks, e.g. “What are the odds of catching COVID in a vaccinated group relative to an unvaccinated control group?”)
  - It also makes it easier to convert from individual risk to group risk (see this comment, and the sketch after this list)
- “Apples-to-apples” comparisons: since percentages are interpreted very differently depending on context, having a measure devoted solely to measuring risks can be quite helpful
- Finer characterisation of high odds of success: e.g. a small change in percentage from 99% to 99.9% odds of success adds a full nine of safety
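To illustrate the arithmetic in the first two bullets, here is a rough sketch (my own illustration, assuming independent risks and the small-probability approximation):

```python
import math

def add_nines(nines_before: float, risk_reduction_factor: float) -> float:
    """A mitigation that divides the failure probability by
    risk_reduction_factor adds log10(risk_reduction_factor) nines."""
    return nines_before + math.log10(risk_reduction_factor)

def group_nines(individual_nines: float, group_size: int) -> float:
    """Approximate nines of safety for a whole group: for small risks,
    the group's failure probability is roughly group_size times the
    individual failure probability (union bound)."""
    return individual_nines - math.log10(group_size)

print(add_nines(2.0, 10))      # 3.0: a 10x risk reduction adds one nine
print(group_nines(6.0, 1000))  # 3.0: 6 individual nines leave ~3 nines for 1,000 people
```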
I think it’s also worth mentioning that a similar idea is already used in some fields, like reliability engineering and assessing the purity of substances. In the post, Tao nicely summarises how nines of safety would be used:
“In summary, when debating the value of a given risk mitigation measure, the correct question to ask is not quite “Is it certain to work?” or “Can it fail?”, but rather “How many extra nines of safety does it add?””
One possible objection is that expected value calculations already account for “low-probability, high-impact” scenarios. A counterargument is that expected value requires estimating both the probability and the impact, which introduces more uncertainty than the nines of safety alone (which depend only on the probability). Overall though, I’m unsure how useful nines of safety might be compared to expected value in different cause areas.
Relation to EA
I guess an obvious question would be something like: “How many nines of safety are there for different problems in your field?” As an example, if we convert the risks from The Precipice, we get:
| Existential catastrophe via | Chance within next 100 years | Nines of safety |
|---|---|---|
| Asteroid or comet impact | ~1 in 1,000,000 | 6.0 |
| Supervolcanic eruption | ~1 in 10,000 | 4.0 |
| Stellar explosion | ~1 in 1,000,000,000 | 9.0 |
| Total natural risk | ~1 in 10,000 | 4.0 |
| Nuclear war | ~1 in 1,000 | 3.0 |
| Climate change | ~1 in 1,000 | 3.0 |
| Other environmental damage | ~1 in 1,000 | 3.0 |
| “Naturally” arising pandemics | ~1 in 10,000 | 4.0 |
| Engineered pandemics | ~1 in 30 | 1.5 |
| Unaligned AI | ~1 in 10 | 1.0 |
| Unforeseen anthropogenic risks | ~1 in 30 | 1.5 |
| Other anthropogenic risks | ~1 in 50 | 1.7 |
| Total anthropogenic risk | ~1 in 6 | 0.8 |
| Total existential risk | ~1 in 6 | 0.8 |
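For “1 in N” odds like these, the conversion is simply log10(N). A quick sketch reproducing a few rows (my own illustration):

```python
import math

# "1 in N" chance of catastrophe -> nines of safety = log10(N)
ord_estimates = {  # a few of Ord's estimates from the table above
    "Asteroid or comet impact": 1_000_000,
    "Engineered pandemics": 30,
    "Unaligned AI": 10,
    "Total existential risk": 6,
}
for risk, n in ord_estimates.items():
    print(f"{risk}: {math.log10(n):.1f} nines of safety")
```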
If you think that these numbers are severely underestimating the X-risks, then perhaps you have a case against nines of safety being a particularly useful measure.
In general, I’m curious about several things:

- What do you think about this proposal in general?
- What are some examples of things in EA where using nines of safety might be a good idea? I’m especially interested in examples where there is disagreement about how likely an intervention is to work (e.g. researcher A believes intervention X is better than intervention Y, but researcher B believes otherwise).
- Does “nines of safety” seem useful relative to techniques we already use, like expected values?
- What are some arguments against nines of safety?
I like that it frames safety as a noun, not just an adjective. “We’re 99% safe” vs. “we have two nines of safety.” For some reason, it hits differently when safety feels like a tangible product you can make or buy, rather than an intangible perception or description.
My worry is that this proposal is meant to address lay people’s innumeracy, yet it requires a lengthy explanation. I suspect there are many people who don’t understand that an earthquake registering a 6 on the Richter scale is 10x more powerful than one registering a 5.
Another alternative is just to say “this intervention would make us 10x safer,” rather than “this intervention gives us an extra nine of safety.”
So this proposal seems to me to have a tradeoff between the psychological impact of nounification and the potential confusion of a logarithmic scale. I don’t see any risk of harm, but I think that it is probably best used in contexts where you can expect the audience to know and feel comfortable with the log scale.
As a follow up, the more common proposal for this issue is to switch to ratios. For example, rather than saying you have a 99.99999% chance (7 nines) of not dying from a lightning strike this year, say that only 1 in 10,000,000 people die from lightning strikes per year.
I think this is harder when we’re discussing global risks and unprecedented risks. It’s hard to conceptualize humanity going extinct in 1 in 6 earths-this-century (Toby Ord’s guess). Easier to think of a 17% chance. Maybe percentages work best for one-off risks, and ratios work better when we have a base rate to work with?
I think these are all good points, thanks for sharing!
To push back a bit on the point about lay people’s innumeracy: doesn’t expected value also need a somewhat lengthy explanation? In addition, I think a common mistake is to conflate EV and averages, so should we have similar concerns about EV as well?
Maybe a counterargument to this would be that “nines of safety” has obvious alternatives (e.g. ratios, as you point out), but perhaps it’s harder to do this for EV?
In general, it’s best to use knowledge that’s common to your audience when possible. If that’s not possible, then you have to find the right balance of precision, brevity, and familiarity. The appropriate balance will heavily depend on the audience and topic.
My practice, when writing informally, is to notice when I’m about to use a jargon term, and then search my knowledge of colloquial speech to see if there’s a common term or phrase that captures this jargon term. If so, I tend to use it.
Here are two examples of sentences from the EA forum containing the phrase “expected value,” and how I might rephrase them in more colloquial speech. I won’t link to the source, because that would be a little tedious, but credit for the sentences goes to the authors, and you can find the source by searching for the sentence itself.
1.
“Here, the option with the greatest expected value is donating to the speculative research (at least on certain theories of value—more on those in a moment).”
->
“Here, speculative research is the best option because of its massive upside potential, at least depending on what we care about...”
2.
“My previous model, in which I took expected value estimates and adjusted them based on my intuition, was clearly inadequate.”
->
“Before, I estimated the costs and benefits and then adjusted those estimates intuitively, which definitely wasn’t good enough.”
I think what you’re reaching for are the Weber-Fechner laws, which point out that human perception seems to operate on a log scale. The Wikipedia article on the topic illustrates this.
However, my read on the Richter scale is that even if you’re right that people routinely perceive a 1-point jump on the scale as a less-than-10x jump in shaking, this is an effect of the choice of scale, not a cause. But I don’t concede that premise; as I say, I think it’s likely to be more complex.
It does say why a log scale was chosen:
“First, to span the wide range of possible values, Richter adopted Gutenberg’s suggestion of a logarithmic scale, where each step represents a tenfold increase of magnitude, similar to the magnitude scale used by astronomers for star brightness.”
If I remember correctly (from ‘The Precipice’) ‘Unaligned AI ~1 in 50 1.7’ should actually be ‘Unaligned AI ~1 in 10 1’.
Thanks for pointing this out! Should be fixed now
It’s already in use for electric grid availability.
I’ve heard this for e.g. server uptime as well.
“Writing out all of these 9s seems a bit clumsy”
Personally, I don’t see it as more clumsy/awkward/inconvenient than trying to learn and accurately use terms like “three nines.” And then you get to situations with a non-integer number of nines (e.g., 0.83 nines): trying to convert that to percentages seems like a pain/intuition block, especially given that most people aren’t familiar with this system. On that point, I would strongly echo AllAmericanBreakfast’s points: the Richter scale seems like a great illustration, since my impression is that most (lay) people do not accurately understand those numbers in terms of logarithms, and so they seem even less likely to understand this nines system.
Ultimately, I imagine there probably are some mathematically-oriented justifications for using this system, but I think the key deficiency here is lay and intuitive understanding, and my impression is that this system does the opposite of helping with that. At the very least, it would likely be far more effective to use better language with existing systems (e.g., saying that the risk has tripled from 0.1% to 0.3% instead of saying the safety has decreased from 99.9% to 99.7%) and/or to teach people to better understand the existing systems, rather than introducing a new one.
Yes please. This is a great idea and I would want us to move towards a culture where this is more common. Even better if we can use logarithmic odds instead, but I understand that is a harder sell.
Talking about probabilities makes sense for repeated events where we care about the proportion of outcomes. This is not the case for existential risk.
Also, I am going to be pedantic and point out that Tao’s example about the election is misleading. The percentage is not the chance of winning the election! It is the polling result. The implicit probability being discussed is the chance of the election outcome given the polling, which is a far more extreme probability, depending on how representative the poll is and how close the results are.
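To put rough numbers on that last point (a toy calculation with made-up poll parameters, not from Tao’s post): suppose a poll of 1,000 voters shows 60% support, and consider only sampling error under a normal approximation.

```python
import math

# Toy example: a poll of n voters shows 60% support for candidate A.
# This ignores turnout, polling bias, and opinion shifts before election day.
n, p_hat = 1000, 0.60
se = math.sqrt(p_hat * (1 - p_hat) / n)      # standard error of the poll
z = (p_hat - 0.5) / se                       # distance from a 50/50 race
p_wrong = 0.5 * math.erfc(z / math.sqrt(2))  # P(true support is below 50%)
print(f"z ~= {z:.1f}, P(poll leader actually trails) ~= {p_wrong:.1e}")
print(f"~{-math.log10(p_wrong):.0f} nines of safety on 'A is really ahead'")
```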
The way scientists and engineers deal with these issues of scale, when not using a log scale, is with unit choice. In our lab, we talk about “microns” when discussing the micro scale, and “nanometers” when discussing the nanoscale. This lets us keep our numbers conveniently sized for discussion. It has nothing to do with the felt size of a nanometer versus a micrometer. It has everything to do with the convenience and precision of technical discussion among colleagues.
Log scales are designed by and for scientists for similar purposes.
When we communicate that a substance is dangerously acidic, we typically do that with big red warning letters and pictures indicating the danger. When we indicate that a vinegar or a citrus fruit is tart (also a function of acidity), we do it by comparing with a familiar taste, or use a vivid verbal description. Log scales are nowhere to be found.
This means the same thing it means for any other use of a log scale: the value can be a very small number or a very, very big number.
The Richter scale measures the amplitude of waves recorded by seismographs. For human-perceptible earthquakes, the amplitude of these waves has spanned 9.6 orders of magnitude since we started using seismographs.
Given this, the Richter scale does not have the widest range of any log scale, if that’s what you mean. The pH scale, for example, has a range of 15 orders of magnitude.
If you are communicating with a lay audience, and want to give them a sense of the expected damage from an earthquake, or how it’ll physically feel, probably the best way to do that is with a verbal description.
If you are communicating with a scientific audience, you want to use a measured seismograph value. You could write out all those zeroes to express a high wave amplitude, but it’s convenient to use the log scale.
This is not historically accurate. You can find a pretty good account of the development of the Richter scale on Wikipedia. It was developed to replace felt-experience-based assessments, such as the Rossi-Forel scale, which had some nice vivid descriptions.
Felt experience still was used to choose the zero point.
If we think about property damage, many buildings are going to be built to withstand an earthquake up to a certain magnitude. Below this, damage, measured in dollars or in lost lives, may be less than 10x per point on the Richter scale. Above this, damage may suddenly jump by much more. It’s all about the relationship between historical trends in earthquake magnitude and investment in engineering to resist future earthquakes.
In my experience, log scales are convenient for scientists, because we’re less prone to error. If I need a solution to be at pH 1, that’s easy to remember. If I had to convert that to absolute H+ concentration, I’d be prone to dropping a zero somewhere. But if you don’t understand log scales, or are using them as a subjective guide rather than a scientific instrument, I think they’re less helpful—as evidenced by the fact that they don’t get used for, say, measuring wealth or population levels, which are the areas where lay audiences routinely encounter large numbers.
For another interesting history of the move from subjective assessments to more uniform observations, check out the development of the Beaufort Scale!
Nines of unsafety, for the pessimists. So two nines of unsafety would be a 99% chance of doom.
For starters, you haven’t given any examples of laws here. You’ve only given examples of units. And from your wording, I’m not sure if you understand the difference.
For example, when you say “distance is additive,” I’m not sure what you mean. Distance is a scalar, not a law, and laws involving distance may use all kinds of arithmetic transformations of it. For example, Newton’s law of universal gravitation relates the force of attraction between two bodies to the inverse square of the distance between them.
Not only have you not been clear enough with your language, you have also not supplied evidence for your claims. By evidence, I mean either historical examples about how various unit types were developed, hypotheticals about our ability to distinguish levels of magnitude with our senses, or real-world examples of how we communicate expectations about sensory experience to the lay public. I’ve given you all of these forms of evidence, and you haven’t responded to them.
For these reasons, I don’t understand you, and am no longer interested in talking with you about this subject.