Use resilience, instead of imprecision, to communicate uncertainty
BLUF: Suppose you want to estimate some important X (e.g. risk of great power conflict this century, total compute in 2050). If your best guess for X is 0.37, but you're very uncertain, you still shouldn't replace it with an imprecise approximation (e.g. "roughly 0.4", "fairly unlikely"), as this removes information. It is better to offer your precise estimate, alongside some estimate of its resilience, either subjectively ("0.37, but if I thought about it for an hour I'd expect to go up or down by a factor of 2"), or objectively ("0.37, but I think the standard error for my guess is ~0.1").
"False precision"
Imprecision often has a laudable motivation: to avoid misleading your audience into relying on your figures more than they should. If 1 in 7 of my patients recover with a new treatment, I shouldn't just report this proportion, without elaboration, to 5 significant figures (14.286%).
I think a similar rationale is often applied to subjective estimates (forecasting is the case most salient in my mind). If I say something like "I think there's a 12% chance of the UN declaring a famine in South Sudan this year", this could imply my guess is accurate to the nearest percent. If I made this guess off the top of my head, I do not want to suggest such a strong warranty, and others might accuse me of immodest overconfidence ("Sure, Nostradamus: 12% exactly"). Rounding off to a round number ("10%"), or just giving a verbal statement ("pretty unlikely"), seems both more reasonable and more defensible, as it makes clearer that I'm guessing.
In praise of uncertain precision
One downside of this is that natural language has a limited repertoire for communicating degrees of uncertainty. Sometimes "round numbers" are not meant as approximations: I might mean "10%" to be exactly 10% rather than roughly 10%. Verbal riders (e.g. roughly X, around X, X or so, etc.) are ambiguous: does "roughly 1000" mean one is uncertain about the last three digits, or the first, or about how many digits there are in total? Qualitative statements are similar: people vary widely in their interpretation of words like "unlikely", "almost certain", and so on.
The greatest downside, though, is the loss of precision: you lose half the information if you round percentages to the nearest ten percent. If, as is often the case in EA-land, one is constructing some estimate by "multiplying through" various subjective judgements, there can also be significant "error carried forward" (cf. premature rounding). If I'm assessing the value of famine prevention efforts in South Sudan, rounding the status quo risk from 12% down to 10% infects downstream work with a 1/6th directional error.
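To make the arithmetic explicit, here is a minimal sketch in Python (the bit-counting framing is my gloss on "half the information", and the figures are just those of the example above):

```python
import math

# Rounding percentage points (100 levels) to multiples of ten percent
# (10 levels) halves the information carried, measured in bits.
bits_percent = math.log2(100)  # ~6.64 bits
bits_tenths = math.log2(10)    # ~3.32 bits
print(bits_tenths / bits_percent)  # 0.5

# Rounding a 12% status quo risk down to 10% understates anything that
# multiplies through it by (0.12 - 0.10) / 0.12, i.e. one sixth.
print((0.12 - 0.10) / 0.12)  # ~0.167
```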
There are two natural replies one can make. Both are mistaken.
High precision is exactly worthless
First, one can deny that the more precise estimate is any more accurate than the less precise one. Although maybe superforecasters could expect "rounding to the nearest 10%" to harm their accuracy, others thinking the same are just kidding themselves, so nothing is lost. One may also have in mind some of Tetlock's remarks that "rounding off" mediocre forecasters doesn't harm their scores, whereas it does harm the best.
I don't think this is right. Combining the two relevant papers (1, 2), you see that everyone, even mediocre forecasters, gets significantly worse Brier scores when rounded into seven bins. Non-superforecasters do not see a significant loss if rounded to the nearest 0.1. Superforecasters do see a significant loss at 0.1, but not if you round more tightly to 0.05.
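As a rough sketch of the kind of comparison those papers make (the forecasts and resolutions below are invented, and the equally spaced grid is my assumption about how "seven bins" might be operationalised):

```python
import numpy as np

def round_to_bins(p, n_bins):
    """Snap each probability to the nearest of n_bins equally spaced values on [0, 1]."""
    grid = np.linspace(0, 1, n_bins)
    return grid[np.abs(p[:, None] - grid[None, :]).argmin(axis=1)]

def brier(p, outcomes):
    """Mean squared error between forecasts and binary outcomes (lower is better)."""
    return np.mean((p - outcomes) ** 2)

# Invented forecasts and resolutions, purely for illustration.
forecasts = np.array([0.12, 0.37, 0.63, 0.88, 0.05, 0.55])
outcomes = np.array([0, 0, 1, 1, 0, 1])

print(brier(forecasts, outcomes))                     # original granularity
print(brier(round_to_bins(forecasts, 7), outcomes))   # coarsened to seven bins
print(brier(round_to_bins(forecasts, 11), outcomes))  # nearest 0.1
```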
Type 2 error (i.e. rounding in fact leads to worse accuracy, but we do not detect it statistically), rather than the returns to precision falling to zero, seems a much better explanation. In principle:
If a measure has signal (and in aggregate everyone was predicting better than chance), shaving bits off it should reduce that signal; it certainly shouldn't increase it, which sets the upper bound on whether rounding reliably helps or harms at zero.
Trailing bits of estimates can be informative even if discrimination between them is unreliable. It's highly unlikely superforecasters can reliably discriminate (e.g.) p=0.42 versus p=0.43, yet their unreliable discrimination can still tend towards the truth (X-1% forecasts happen less frequently than X% forecasts, even if one lacks the data to demonstrate this for any particular X). Superforecaster calibration curves, although good, are imperfect, yet I aver the transformation to perfect calibration would be order-preserving rather than "stair-stepping".
Rounding (i.e. undersampling) would only help if we really did have small-n discrete values for our estimates across the number line, and we knew that variation below this "estimative Nyquist limit" was uninformative jitter.
Yet the idea that we have small-n discrete values (often equidistant on the probability axis, and shared between people), and that increased forecasting skill increases n, is implausible. That we simply have some estimation error (the degree of which is lower for better forecasters) has much better face validity. And if there is no scale threshold below which variation is uninformative, taking the central estimate (rather than adding some arbitrary displacement to move it to the nearest "round number") should fare better.
Even on the n-bin model, intermediate values can be naturally parsed as estimation anti-aliasing when one remains unsure which bin to go with (e.g. "Maybe it's 10%, but maybe it's 15%; I'm not sure, but 10% seems a bit more likely than 15%, so I'll say 12%"). Aliasing them again should do worse (a toy simulation below illustrates this).
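The toy simulation is entirely my own construction: it assumes an informative but imperfect forecaster whose estimates are the true probability plus zero-mean noise, with no privileged grid underneath.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# True event probabilities, seen by the forecaster through noise.
true_p = rng.uniform(0.01, 0.99, n)
estimates = np.clip(true_p + rng.normal(0, 0.05, n), 0.001, 0.999)
outcomes = (rng.uniform(size=n) < true_p).astype(float)

def brier(p):
    return np.mean((p - outcomes) ** 2)

def snap(p, step):
    # Alias estimates onto a grid: round to the nearest multiple of `step`.
    return np.round(p / step) * step

print(brier(estimates))              # unrounded estimates
print(brier(snap(estimates, 0.05)))  # nearest 0.05
print(brier(snap(estimates, 0.10)))  # nearest 0.10
# On this model the rounded scores come out slightly worse: the trailing
# digits still carry signal, so discarding them adds quantisation error
# rather than removing jitter.
```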
In practice:
The effect sizes for the "costs of rounding" increase both with the degree of rounding (you tank Brier scores more with 3 bins than with 7) and with underlying performance (i.e. you tank superforecaster scores more with 7 bins than untrained forecasters' scores). This lines up well with Type 2 error: I predict even untrained forecasters are numerically worse with 0.1 (or 0.05) rounding, but as their accuracy wasn't great to start with, this small decrement won't pass hypothesis testing (while rounding superforecasters to the same granularity generates a larger, and so detectable, penalty).
Superforecasters themselves are prone to offering intermediate values. If they really only have 0.05 bins (e.g. events they say are 12% likely are really 10% likely, events they say are 13% likely are really 15% likely), this habit worsens their performance. Further, this habit would be one of the few things they do worse than typical forecasters: a typical forecaster jittering over 20 bins when they only have 10 levels is out by a factor of two; a "superforecaster" jittering over percentages when they only have twenty levels is out by a factor of five.
The rounding/granularity assessments are best seen as approximate tests of accuracy. The error processes which would result in rounding being no worse (or an improvement) labour under very adverse priors, and "not shown to be statistically worse" should not convince us of them.
Precision is essentially (although not precisely) pointless
Second, one may assert that the accuracy benefit of precision is greater than zero, but less than any non-trivial value. For typical forecasters, the cost of rounding into seven bins is a barely perceptible percent or so of Brier score. If (e.g.) whether famine prevention efforts are a good candidate intervention proves sensitive to whether we use a subjective estimate of 12% or round it to 10%, this "bottom line" seems too volatile to take seriously. So rounding is practically non-inferior with respect to accuracy, and the benefits noted before tilt the balance of considerations in its favour.
Yet this reply conflates issues around value of information (q.v.). If I'm a program officer weighing up whether to recommend famine prevention efforts in South Sudan, and I find my evaluation is very sensitive to this "off the top of my head" guess at how likely famine is on the status quo, that guess looks like an important thing to develop further if I want to improve my overall estimate.
Suppose this cannot be done: say for some reason I need to make a decision right now, or, despite careful further study, I remain just as uncertain as I was before. In these cases I should decide on the basis of my unrounded estimate: my decision is better in expectation (if only fractionally) if I base it on (in expectation) fractionally more accurate estimates.
Thus I take precision, even when uncertain (or very uncertain), to be generally beneficial. It would be good if there were some "best of both worlds" way to concisely communicate uncertainty without sacrificing precision. I have a suggestion.
Resilience
One underlying challenge is that natural language poorly distinguishes between aleatoric and epistemic uncertainty. I am uncertain (in the aleatoric sense) whether a coin will land heads, but I'm fairly sure the likelihood is close to 50% (coins tend to be approximately fair). I am also uncertain whether the local time is before noon in [place I've never heard of before], but this uncertainty is essentially inside my own head. I might initially guess 50% (modulo steers like "sounds more like a place in this region of the planet"), but expect this guess to shift to ~0 or ~1 after several seconds of internet enquiry.
This distinction can get murky (e.g. isn't all the uncertainty about whether there will be a famine "inside our heads"?), but the moral that we want to communicate our degree of epistemic uncertainty remains. Some folks already do this by giving a qualitative "epistemic status". We can do the same thing, somewhat more quantitatively, by guessing how resilient our guesses are.
There are a couple of ways I try to do this:
Give a standard error or credible interval: "I think the area of the Mediterranean Sea is 300k square kilometers, but I expect to be off by an order of magnitude"; "I think Alice is 165 cm tall (95% CI: 130-190)". This works best when we expect to get access to the "true value", or where there is a clear core of non-epistemic uncertainty even a perfect (human) cognizer would have to grapple with.
Give an expected error/CI relative to some better estimator, either a counterpart of yours ("I think there's a 12% chance of a famine in South Sudan this year, but if I spent another 5 hours on this I'd expect to move by 6%") or a hypothetical one ("12%, but my 95% CI for what a superforecaster median would be is [0%-45%]"). This works better when one does not expect to get access to the "true value" ("What was the 'right' ex ante probability of Trump winning the 2016 election?").
With either, one preserves precision and communicates a better sense of uncertainty (i.e. how uncertain one is, rather than merely that one is uncertain), at a modest cost in verbiage. Another minor benefit is that many of these statements can be tracked for calibration purposes: the first method is all but a calibration exercise; for the latter, one can review how well one predicted what one's more thoughtful self would think.
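For the second method, the bookkeeping could look something like the following hypothetical sketch (the record format and all figures are invented for illustration):

```python
# Each record: the initial estimate, the absolute shift predicted after
# further work, and the estimate actually reached after that work.
records = [
    {"initial": 0.12, "predicted_shift": 0.06, "revised": 0.19},
    {"initial": 0.37, "predicted_shift": 0.10, "revised": 0.33},
    {"initial": 0.70, "predicted_shift": 0.05, "revised": 0.68},
]

# Compare the shifts you predicted with the shifts that actually happened;
# persistently over- or under-shooting suggests your resilience estimates
# themselves need recalibrating.
for r in records:
    actual_shift = abs(r["revised"] - r["initial"])
    print(f"predicted ±{r['predicted_shift']:.2f}, actual {actual_shift:.2f}")

mean_ratio = sum(abs(r["revised"] - r["initial"]) / r["predicted_shift"]
                 for r in records) / len(records)
print(f"mean actual/predicted shift ratio: {mean_ratio:.2f}")  # ~1 is well calibrated
```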
Conclusion
All that said, sometimes precision has little value: some very rough sense of uncertainty around a rough estimate is good enough, and careful elaboration is a waste of time. "I'm going into town, I think I'll be back in around 13 minutes, but with an hour to think more about it I'd expect my guess to change by 3 minutes on average" seems overkill versus "Going to town, back in a quarter of an hour-ish", as typically the marginal benefit to my friend of believing "13 [10-16]" versus (say) "15 [10-25]" is minimal.
Yet not always: some numbers are much more important than others, and worth traversing a very long way along a concision/precision efficient frontier. "How many COVID-19 deaths will be averted if we adopt costly policy X versus less costly variant X′?" is the sort of question where one basically wants as much precision as possible (e.g. you'd probably want to be a lot more verbose about the spread, or just give the distribution with subjective error bars, rather than offer a standard error for the mean, etc.).
In these important cases, one is hamstrung if one only has "quick and dirty" ways to communicate uncertainty in one's arsenal: our powers of judgement are feeble enough without saddling them with lossy and ambiguous communication too. Important cases are also the ones EA-land is often interested in.