Use resilience, instead of imprecision, to communicate uncertainty

BLUF: Suppose you want to estimate some important X (e.g. risk of great power conflict this century, total compute in 2050). If your best guess for X is 0.37, but you're very uncertain, you still shouldn't replace it with an imprecise approximation (e.g. "roughly 0.4", "fairly unlikely"), as this removes information. It is better to offer your precise estimate, alongside some estimate of its resilience, either subjectively ("0.37, but if I thought about it for an hour I'd expect to go up or down by a factor of 2") or objectively ("0.37, but I think the standard error of my guess is ~0.1").

'False precision'

Imprecision often has a laudable motivation: to avoid misleading your audience into relying on your figures more than they should. If 1 in 7 of my patients recover with a new treatment, I shouldn't just report this proportion, without elaboration, to 5 significant figures (14.286%).

I think a similar rationale is often applied to subjective estimates (forecasting is most salient in my mind). If I say something like "I think there's a 12% chance of the UN declaring a famine in South Sudan this year", this could imply my guess is accurate to the nearest percent. If I made this guess off the top of my head, I do not want to suggest such a strong warranty, and others might accuse me of immodest overconfidence ("Sure, Nostradamus - 12% exactly"). Rounding off to a round number ("10%"), or giving just a verbal statement ("pretty unlikely"), seems both more reasonable and more defensible, as this makes it clearer I'm guessing.

In praise of uncertain precision

One downside of this is that natural language has a limited repertoire for communicating degrees of uncertainty. Sometimes 'round numbers' are not meant as approximations: I might mean "10%" to be exactly 10% rather than roughly 10%. Verbal riders (e.g. roughly X, around X, X or so, etc.) are ambiguous: does "roughly 1000" mean one is uncertain about the last three digits, or the first, or about how many digits there are in total? Qualitative statements are similar: people vary widely in their interpretation of words like 'unlikely', 'almost certain', and so on.

The greatest downside, though, is the loss of precision itself: you lose half the information if you round percentages to the nearest 10%. If, as is often the case in EA-land, one is constructing some estimate by 'multiplying through' various subjective judgements, there can also be significant 'error carried forward' (cf. premature rounding). If I'm assessing the value of famine prevention efforts in South Sudan, rounding the status quo risk from 12% to 10% infects downstream work with a 1/6th directional error.
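To see the 'error carried forward' point concretely, here is a minimal sketch in Python. Every figure in it (deaths, effect size, cost) is invented purely for illustration, and the toy cost-effectiveness model is my construction, not anything from the post; the point is only that a relative error in a multiplicative input survives intact in the output.

```python
# Illustrative only: hypothetical figures for a famine-prevention estimate.
p_famine_precise = 0.12   # unrounded subjective estimate
p_famine_rounded = 0.10   # rounded "for honesty"

deaths_if_famine = 50_000   # hypothetical scale of harm
prevention_effect = 0.30    # hypothetical fraction of deaths averted
cost = 10_000_000           # hypothetical programme cost (USD)

def deaths_averted_per_dollar(p_famine):
    # Toy model: famine probability enters multiplicatively.
    return p_famine * deaths_if_famine * prevention_effect / cost

precise = deaths_averted_per_dollar(p_famine_precise)
rounded = deaths_averted_per_dollar(p_famine_rounded)

# The ~1/6 relative error in the input reappears, undiminished, in the output.
print(f"precise: {precise:.6f}, rounded: {rounded:.6f}")
print(f"relative error: {(precise - rounded) / precise:.1%}")   # ~16.7%
```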

There are two natural replies one can make. Both are mistaken.

High precision is exactly worthless

First, one can deny the more precise estimate is any more accurate than the less precise one. Although maybe superforecasters could expect 'rounding to the nearest 10%' to harm their accuracy, others who think the same of themselves are just kidding themselves, so nothing is lost. One may also have in mind some of Tetlock's remarks that 'rounding off' mediocre forecasters doesn't harm their scores, as opposed to the best.

I don't think this is right. Combining the two relevant papers (1, 2), you see that everyone, even mediocre forecasters, has significantly worse Brier scores if you round them into seven bins. Non-superforecasters do not see a significant loss if rounded to the nearest 0.1. Superforecasters do see a significant loss at 0.1, but not at the finer granularity of 0.05.

Type 2 error (i.e. rounding in fact leads to worse accuracy, but we do not detect it statistically), rather than the returns to precision falling to zero, seems a much better explanation. In principle:

  • If a measure has signal (and in aggregate everyone was predicting better than chance), shaving bits off it should reduce that signal; it also definitely shouldn't increase it, setting the upper bound of how much rounding can reliably help at zero.

  • Trailing bits of estimates can be informative even if discrimination between them is unreliable. It's highly unlikely superforecasters can reliably discriminate (e.g.) p=0.42 versus p=0.43, yet their unreliable discrimination can still tend towards the truth ((X-1)% forecasts happen less frequently than X% forecasts, even if one lacks the data to demonstrate this for any particular X). Superforecaster calibration curves, although good, are imperfect, yet I aver the transformation to perfect calibration would be order-preserving rather than 'stair-stepping'.

  • Rounding (i.e. undersampling) would only help if we really did have small-n discrete values for our estimates across the number line, and we knew variation below this 'estimative Nyquist limit' was uninformative jitter.

  • Yet the idea that we have small-n discrete values (often equidistant on the probability axis, and shared between people), and that increased forecasting skill leads n to increase, is implausible. That we just have some estimation error (the degree of which is lower for better forecasters) has much better face validity. Yet if there is no threshold below which variation is uninformative, taking the central estimate (rather than adding some arbitrary displacement to move it to the nearest 'round number') should fare better, as the sketch after this list illustrates.

  • Even on the n-bin model, intermediate values can be naturally parsed as estimation anti-aliasing when one remains unsure which bin to go with (e.g. "Maybe it's 10%, but maybe it's 15%; I'm not sure, but maybe more likely 10% than 15%, so I'll say 12%"). Aliasing them again should do worse.
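Here is the toy simulation referenced above. It assumes the estimation-error picture (true probabilities drawn uniformly, reports equal to truth plus Gaussian noise), which is my construction rather than anything from the cited papers, and checks whether snapping reports to the nearest 0.1 helps or hurts Brier scores:

```python
# Toy model: does rounding a noisy-but-central estimate to a 'round number'
# improve or worsen Brier scores? All parameters are invented for illustration.
import random

random.seed(0)

def brier(reports, outcomes):
    """Mean squared error between probabilistic reports and 0/1 outcomes."""
    return sum((r - o) ** 2 for r, o in zip(reports, outcomes)) / len(reports)

def clamp(p):
    return min(1.0, max(0.0, p))

def snap(p, width=0.1):
    """Round a probability to the nearest multiple of `width`."""
    return clamp(round(p / width) * width)

N = 200_000
truths = [random.random() for _ in range(N)]
outcomes = [1 if random.random() < t else 0 for t in truths]

for noise in (0.02, 0.08):  # a better and a worse forecaster
    reports = [clamp(t + random.gauss(0, noise)) for t in truths]
    raw = brier(reports, outcomes)
    binned = brier([snap(r) for r in reports], outcomes)
    print(f"noise {noise}: Brier as reported {raw:.4f}, "
          f"rounded to nearest 0.1 {binned:.4f} (penalty {binned - raw:+.4f})")

# In this toy model the rounded reports score (slightly) worse for both
# forecasters: quantising the central estimate adds error rather than removing it.
```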

In practice:

  • The effect sizes for 'costs to rounding' increase both with the degree of rounding (you tank Brier scores more with 3 bins than 7) and with underlying performance (i.e. you tank superforecaster scores more with 7 bins than untrained forecasters' scores). This lines up well with type 2 error: I predict even untrained forecasters are numerically worse with 0.1 (or 0.05) rounding, but as their accuracy wasn't great to start with, this small decrement won't pass hypothesis testing (whereas rounding superforecasters to the same granularity generates a larger, and so detectable, penalty).

  • Superforecasters themselves are prone to offering intermediate values. If they really only have 0.05 bins (e.g. events they say are 12% likely are really 10% likely, events they say are 13% likely are really 15% likely), this habit worsens their performance. Further, this habit would be one of the few things they do worse than typical forecasters: a typical forecaster jittering over 20 bins when they only have 10 levels is out by a factor of two; a 'superforecaster', jittering over percentages when they only have twenty levels, is out by a factor of five.

The rounding/granularity assessments are best seen as approximate tests of accuracy. The error processes which would result in rounding being no worse (or an improvement) labour under very adverse priors, and 'not shown to be statistically worse' should not convince us of them.

Precision is essentially (although not precisely) pointless

Second, one may assert that the accuracy benefit of precision is greater than zero, but less than any non-trivial value. For typical forecasters, the cost of rounding into seven bins is a barely perceptible percent or so of Brier score. If (e.g.) whether famine prevention efforts are a good candidate intervention proves sensitive to whether we use a subjective estimate of 12% or round it to 10%, this 'bottom line' seems too volatile to take seriously. So rounding is practically non-inferior with respect to accuracy, and the benefits noted before tilt the balance of considerations in its favour.

Yet this reply neglects issues around value of information (q.v.). If I'm a program officer weighing up whether to recommend famine prevention efforts in South Sudan, and I find my evaluation is very sensitive to this 'off the top of my head' guess at how likely famine is on the status quo, said guess looks like an important thing to develop further if I want to improve my overall estimate.

Suppose this cannot be done: say for some reason I need to make a decision right now, or, despite careful further study, I remain just as uncertain as I was before. In these cases I should decide on the basis of my unrounded estimate: my decision is better in expectation (if only fractionally) if I base it on (in expectation) fractionally more accurate estimates.

Thus I take precision, even when uncertain (or very uncertain), to be generally beneficial. It would be good if there were some 'best of both worlds' way to concisely communicate uncertainty without sacrificing precision. I have a suggestion.

Resilience

One underlying challenge is that natural language poorly distinguishes between aleatoric and epistemic uncertainty. I am uncertain (in the aleatoric sense) whether a coin will land heads, but I'm fairly sure the chance is close to 50% (coins tend to be approximately fair). I am also uncertain whether local time is before noon in [place I've never heard of before], but this uncertainty is essentially inside my own head. I might initially guess 50% (modulo steers like 'sounds more like a place in this region of the planet'), but expect this guess to shift to ~0 or ~1 after several seconds of internet enquiry.

This distinction can get murky (e.g. isn't all the uncertainty about whether there will be a famine 'inside our heads'?), but the moral that we want to communicate our degree of epistemic uncertainty remains. Some folks already do this by giving a qualitative 'epistemic status'. We can do the same thing, somewhat more quantitatively, by guessing how resilient our guesses are.

There are a couple of ways I try to do this:

Give a standard error or credible interval: "I think the area of the Mediterranean Sea is 300k square kilometers, but I expect to be off by an order of magnitude"; "I think Alice is 165 cm tall (95% CI: 130-190)". I think this works best when we expect to get access to the 'true value', or where there's a clear core of non-epistemic uncertainty even a perfect (human) cognizer would have to grapple with.

Give an expected error/CI relative to some better estimator: either a counterpart of yours ("I think there's a 12% chance of a famine in South Sudan this year, but if I spent another 5 hours on this I'd expect to move by 6%"), or a hypothetical one ("12%, but my 95% CI for what a superforecaster median would be is [0%-45%]"). This works better when one does not expect to get access to the 'true value' ("What was the 'right' ex ante probability Trump would win the 2016 election?").

With either, one preserves precision and communicates a better sense of uncertainty (i.e. how uncertain, rather than merely that one is uncertain), at a modest cost in verbiage. Another minor benefit is that many of these can be tracked for calibration purposes: the first method is all but a calibration exercise; for the latter, one can review how well one predicted what one's more thoughtful self would think.
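As a sketch of what such tracking could look like (the record format and every entry below are invented for illustration), one can log each estimate with its stated 95% interval and the value one later settles on, whether a measured truth or a post-research guess, then check interval coverage:

```python
# Toy calibration log for resilience-qualified estimates; all entries invented.
# Each record: (estimate, lower and upper bound of a stated 95% interval,
# and the value later settled on).
records = [
    (0.37, 0.20, 0.60, 0.45),                  # a probability revisited after more thought
    (165,  130,  190,  172),                   # a height in cm, later measured
    (300_000, 30_000, 3_000_000, 2_500_000),   # Mediterranean area in km^2, looked up
]

hits = sum(1 for _, lo, hi, resolved in records if lo <= resolved <= hi)
coverage = hits / len(records)

# Well-calibrated 95% intervals should cover ~95% of resolved values in the
# long run; persistently lower coverage means the stated resilience is too high.
print(f"interval coverage: {coverage:.0%} over {len(records)} estimates")
```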

Conclusion

All that said, sometimes precision has little value: some very rough sense of uncertainty around a rough estimate is good enough, and careful elaboration is a waste of time. "I'm going into town, I think I'll be back in around 13 minutes, but with an hour to think more about it I'd expect my guess would change by 3 minutes on average" seems overkill versus "Going to town, back in a quarter of an hour-ish", as typically the marginal benefit of my friend believing "13 [10-16]" versus (say) "15 [10-25]" is minimal.

Yet not always; some numbers are much more important than others, and worth traversing a very long way along a concision/precision efficient frontier. "How many COVID-19 deaths will be averted if we adopt costly policy X versus less costly variant X′?" is the sort of question where one basically wants as much precision as possible (e.g. you'd probably want to be a lot more verbose about the spread, or just give the distribution with subjective error bars, rather than only a standard error for the mean, etc.).


In these important cases, one is hamstrung if one only has 'quick and dirty' ways to communicate uncertainty in one's arsenal: our powers of judgement are feeble enough without saddling them with lossy and ambiguous communication too. Important cases are also the ones EA-land is often interested in.