I have gripes with the methodology of the article, but I don’t think favoring the geometric mean of odds over the mean of probabilities is a major fault. The core problem is assuming independence between the predictions at each stage. The right move would have been to aggregate each forecaster’s total P(doom) using the geometric mean of odds (not that I think that asking random people and aggregating their beliefs like this is particularly strong evidence).
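For concreteness, a minimal sketch of that aggregation (Python; the forecast numbers are hypothetical):

```python
import numpy as np

def geo_mean_odds_aggregate(probs):
    """Aggregate probabilities via the geometric mean of their odds."""
    probs = np.asarray(probs, dtype=float)
    # Convert probabilities to odds, average them in log space,
    # then map the aggregate odds back to a probability.
    odds = probs / (1.0 - probs)
    agg_odds = np.exp(np.mean(np.log(odds)))
    return agg_odds / (1.0 + agg_odds)

# Aggregate each forecaster's *total* P(doom) directly, rather than
# aggregating stage by stage and multiplying the stages together.
p_doom_per_forecaster = [0.02, 0.10, 0.35]  # hypothetical forecasts
print(geo_mean_odds_aggregate(p_doom_per_forecaster))  # ~0.097
# (An exact 0% or 100% forecast breaks the odds transform,
# which is the objection discussed below.)
```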
The intuition pump that the geomean aggregate breaks down when someone assigns a zero percent chance is flawed:
There is an equally compelling pump the other way around: the arithmetic mean of probabilities defers unduly to people assigning a high chance. A single dissenter among ten experts bounds the aggregate from below at one tenth of their preferred probability.
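To make that arithmetic concrete (hypothetical numbers):

```python
import numpy as np

# Nine experts think the event is very unlikely; one dissenter says 90%.
# The arithmetic mean of probabilities can never fall below the
# dissenter's forecast divided by the number of experts: 0.9 / 10 = 0.09.
probs = np.array([0.001] * 9 + [0.9])
print(probs.mean())  # 0.0909, dominated by the single dissenter
```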
And surely, if anyone assigns a zero percent chance to something, you can safely assume they are not taking the situation seriously and ignore them.
Ultimately we can theorize all we want, but as a matter of fact the best performance when predicting complex events is achieved by taking the geometric mean of odds, both in terms of log loss and Brier scores. Without more compelling evidence, or a very clear theoretical reason to distinguish between the contexts, it seems weird to argue that we should treat AI risk differently.
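For reference, a minimal sketch of the two scoring rules in question, for binary outcomes; to compare aggregation methods you would score each method’s aggregate forecast on a set of resolved questions and average:

```python
import numpy as np

def log_loss(p, outcome):
    # Log score of forecast p against a 0/1 outcome; lower is better.
    return -np.log(p if outcome else 1.0 - p)

def brier_score(p, outcome):
    # Squared error of forecast p against a 0/1 outcome; lower is better.
    return (p - outcome) ** 2

print(log_loss(0.9, 1), brier_score(0.9, 1))  # 0.105..., 0.010...
```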
And if you are still worried about dissenters skewing the predictions, one common strategy is to winsorize: clip the predictions to the 5th and 95th percentiles, for example.
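A minimal sketch of that, clipping to the pooled 5th and 95th percentiles:

```python
import numpy as np

def winsorize_probs(probs, lower=5, upper=95):
    """Clip forecasts to the pool's [5th, 95th] percentile range."""
    probs = np.asarray(probs, dtype=float)
    lo, hi = np.percentile(probs, [lower, upper])
    return np.clip(probs, lo, hi)

# Hypothetical pool: the exact-zero forecast is pulled up to the 5th
# percentile, so the geometric mean of odds no longer collapses to zero.
print(winsorize_probs([0.0, 0.05, 0.1, 0.2, 0.4]))
# -> [0.01 0.05 0.1  0.2  0.36]
```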