It is true that given the primary source (presumably this), the implication is that rounding supers to 0.1 hurt them, but 0.05 didn’t:
To explore this relationship, we rounded forecasts to the nearest 0.05, 0.10, or 0.33 to see whether Brier scores became less accurate on the basis of rounded forecasts rather than unrounded forecasts. [...]
For superforecasters, rounding to the nearest 0.10 produced significantly worse Brier scores [by implication, rounding to the nearest 0.05 did not]. However, for the other two groups, rounding to the nearest 0.10 had no influence. It was not until rounding was done to the nearest 0.33 that accuracy declined.
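For concreteness, the comparison described is something like the following (a minimal sketch in Python with made-up forecasts and outcomes, not the study's data or code):

```python
# Minimal sketch of the comparison described above: round forecasts to a grid
# and see how mean Brier score changes. Data and names are illustrative only.
import numpy as np

def brier(forecasts, outcomes):
    """Mean Brier score for binary outcomes (0/1) and probability forecasts."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return np.mean((forecasts - outcomes) ** 2)

def round_to_grid(forecasts, step):
    """Round each forecast to the nearest multiple of `step` (e.g. 0.05, 0.10, 0.33)."""
    return np.clip(np.round(np.asarray(forecasts) / step) * step, 0.0, 1.0)

# Toy example: a handful of forecasts and resolved outcomes.
forecasts = np.array([0.03, 0.12, 0.37, 0.62, 0.88])
outcomes  = np.array([0,    0,    1,    1,    1   ])

for step in (0.05, 0.10, 0.33):
    penalty = brier(round_to_grid(forecasts, step), outcomes) - brier(forecasts, outcomes)
    print(step, penalty)   # how much accuracy is lost at each level of coarsening
```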
Prolonged aside:
That said, despite the absence of evidence, I’m confident that accuracy for superforecasters (and ~anyone else; more later, and elsewhere) does numerically drop with rounding to 0.05 (or anything else), even if this has not been demonstrated to be statistically significant:
From first principles, if the estimate carries signal, shaving bits of information from it by rounding should make it less accurate (and it obviously shouldn’t make it more accurate, which pretty reliably caps the upper end of our uncertainty at ‘no effect’).
Further, there seems to be very little motivation for the idea that we have n discrete ‘bins’ of probability across the number line (often equidistant!) inside our heads, and that n increases as we become better forecasters. That our guesses carry some standard error (which falls ~smoothly with increasing skill) seems significantly more plausible. As such, the ‘rounding’ tests should be taken as loose proxies for assessing this error.
Yet if the error process is this, rather than ‘n real values + jitter of no more than 0.025’, undersampling and aliasing should introduce a further distortion. Even if you think there really are n bins someone can ‘really’ discriminate between, intermediate values are best seen as a form of anti-aliasing (“I think it is more likely 0.1 than 0.15, but I’m not sure; maybe it’s 60/40 between them, so I’ll say 0.12”) which rounding ablates. In other words, ‘accurate to the nearest 0.1’ does not mean the second decimal place carries no information.
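To put numbers on that toy ‘60/40 between 0.10 and 0.15’ case (my own arithmetic, not anything from the source), the blended report of 0.12 scores better in expectation, under the forecaster’s own belief, than snapping to either grid point:

```python
# Expected Brier score under the forecaster's own belief, for the "60/40 between
# 0.10 and 0.15" example above. Numbers are the toy values from the text.
def expected_brier(report, belief):
    """E[(report - outcome)^2] when the event happens with probability `belief`."""
    return belief * (report - 1) ** 2 + (1 - belief) * report ** 2

belief = 0.6 * 0.10 + 0.4 * 0.15   # = 0.12, the probability-weighted blend
for report in (0.10, 0.12, 0.15):
    print(report, round(expected_brier(report, belief), 4))
# 0.12 scores (slightly) better than snapping to either 0.10 or 0.15, i.e. the
# second decimal place is doing work even if discrimination is 'really' at 0.05.
```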
Also, if you are forecasting distributions rather than point estimates (cf. Metaculus), those forecast distributions typically imply many intermediate-value forecasts.
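For instance (an illustrative snippet, not Metaculus’s actual machinery), any smooth forecast distribution hands you non-round probabilities for derived binary questions:

```python
# A smooth forecast distribution implies non-round probabilities for any derived
# binary question. The distribution and threshold here are arbitrary examples.
from scipy.stats import norm

date_estimate = norm(loc=2030, scale=4)   # e.g. a forecast over "when will X happen?"
print(date_estimate.cdf(2027))            # P(before 2027) ~= 0.2266, not a round bin
```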
Empirically, there’s much to suggest a type 2 error explains the lack of a ‘significant’ drop. As you’d expect, the size of the accuracy loss grows with both how coarsely things are rounded and how skilled the forecaster is. So even if relatively fine coarsening makes things slightly worse, we should expect to miss it. This looks better to me on priors than these trends ‘hitting a wall’ at a given level of granularity (so I’d guess untrained forecasters are also numerically worse when rounded to 0.1, even though their worse performance means there is less signal to be lost, which in turn makes the loss hard to detect as ‘statistically significant’).
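As a back-of-envelope check (a toy quantization-noise model of my own, not the study’s analysis): if rounding adds a roughly uniform error across a grid of width ‘step’, the expected Brier penalty is about step^2 / 12, which shrinks fast as the grid gets finer:

```python
# Toy magnitude check: modelling the rounding error as uniform on a grid of width
# `step`, the expected Brier penalty is roughly the error variance, step**2 / 12.
for step in (0.05, 0.10, 0.33):
    print(step, round(step ** 2 / 12, 5))
# ~0.00021, ~0.00083, ~0.00907: the 0.05 penalty is ~40x smaller than the 0.33 one,
# so the same number of questions has far less power to detect it 'significantly'.
```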
I’d adduce other facts against the ‘bins’ picture too. One is simply that superforecasters are prone not to give forecasts on a 5% scale, using intermediate values instead: given their good calibration, you’d expect them to have ironed out this Brier-score-costly jitter (also, this would be one of the few things they do worse than regular forecasters). You’d also expect discretization to show up in things like their calibration curve (e.g. events they say happen 12% of the time in fact happen 10% of the time, whilst events they say happen 13% of the time in fact happen 15% of the time), or in other derived figures like the ROC curve.
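The calibration-curve check I have in mind would look something like this (a sketch which assumes you already have arrays of forecasts and resolved outcomes; the 1% binning and count threshold are arbitrary choices of mine):

```python
# Bin forecasts at 1% resolution and look for a 'staircase' in the calibration curve
# (which the discrete-bins story predicts) versus a roughly smooth/monotone curve.
import numpy as np

def calibration_curve_1pct(forecasts, outcomes, min_count=30):
    """Return (stated %, n, observed frequency) for each whole-percent forecast value."""
    pct = np.round(np.asarray(forecasts) * 100).astype(int)
    outcomes = np.asarray(outcomes)
    rows = []
    for p in range(101):
        mask = pct == p
        if mask.sum() >= min_count:          # skip sparsely-used values
            rows.append((p, int(mask.sum()), outcomes[mask].mean()))
    return rows

# Under 'bins + jitter', observed frequency should be ~flat between multiples of 5
# and jump at them; under 'smooth error' it should track the stated percentage.
```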
This is ironically foxy, so I wouldn’t be shocked for it to be slain by the numerical data. But I’d bet at good odds (north of 3:1) on claims like “Typically, for ‘superforecasts’ of X%, these events happened more frequently than those forecast at (X-1)%, (X-2)%, etc.”
It always seemed strange to me that the idea was expressed as ‘rounding’. Replacing 50.4% with 50% seems relatively innocuous to me; replacing 0.6% with 1% - or worse, 0.4% with 0% - seems like a very different thing altogether!
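To put rough numbers on that asymmetry (my own illustration, using odds and a log score to make the tail effect visible, since the Brier score largely hides it):

```python
# Why rounding near 0 is not like rounding near 50%: compare how much the implied odds
# move, and the log score paid if the event does happen. Numbers are from the comment above.
import math

def odds(p):
    return p / (1 - p)

for before, after in [(0.504, 0.50), (0.006, 0.01), (0.004, 0.0)]:
    shift = 'to zero (impossible)' if after == 0 else f'x{odds(after) / odds(before):.2f}'
    loss = 'inf' if after == 0 else f'{-math.log(after):.2f}'
    print(f'{before:.3f} -> {after:.2f}: odds move {shift}; log loss if it happens: {loss}')
# 50.4% -> 50% leaves the odds ~unchanged; 0.6% -> 1% multiplies them by ~1.7;
# 0.4% -> 0% asserts impossibility, an unbounded loss under a log score.
```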
I think I broadly agree with what you say and will not bet against your last paragraph, except in the trivial sense that I expect most studies to be too underpowered to detect those differences.
You might be interested in knowing that this part is not supported by the evidence, see KrisMoore’s comment on Metaculus.
I also personally doubt that the “calibration precision” of most supers is as fine as 101 units, i.e. distinguishing every whole percentage point (though it’s certainly possible!)
Thanks, I’ll update the text when I get access to Metaculus again (I’ve blocked myself from it for productivity reasons lol)