Thanks for this post—I think this was a very useful conversation to have started (at least for my own work!), even if I’m less confident than you in some of these conclusions (both because I just feel confused and because I’ve heard other people give good-sounding arguments for other conclusions).
In the two large scale empirical evaluations I am aware of [1][2], it surpasses the mean of probabilities and the median (*).
But it seems worth noting that in one of those cases the geometric mean of probabilities outperformed the geometric mean of odds.
You later imply that you think this is at least partly because of a specific bias among Metaculus forecasts. But I’m not sure if you think it’s fully because of that or whether that’s the right explanation (I only skimmed the linked thread). And in any case the basic fact that geometric mean of probabilities performed best in this dataset seems worth noting if you’re using performance in that dataset as evidence for some other aggregation method.
Thanks for this post—I think this was a very useful conversation to have started (at least for my own work!), even if I’m less confident than you in some of these conclusions
Thank you for your kind words! To dispel any impression of confidence: these are just my best guesses. I am also quite confused.
I’ve heard other people give good-sounding arguments for other conclusions
I’d be really curious if you can dig these up!
You later imply that you think [the geo mean of probs outperforming the geo mean of odds] is at least partly because of a specific bias among Metaculus forecasts. But I’m not sure if you think it’s fully because of that or whether that’s the right explanation
I am confident that the geometric mean of probs outperformed the geo mean of odds because of this bias. If you change the coding of all binary questions so that True becomes False and vice versa, then the geo mean of probs performs worse than the geo mean of odds.
This is because the geometric mean of probabilities does not treat predictions and their complements consistently. As a basic example, suppose we have p1 = 0.01 and p2 = 0.3. Then √(p1 · p2) + √((1 − p1)(1 − p2)) ≈ 0.89 < 1.
So in this sense the geometric mean of probabilities is not a consistent probability: it does not map the complements of the predictions to the complement of the geometric mean, as we would expect. (The geometric mean of odds, the mean of probabilities, and the median all satisfy this basic property.)
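This consistency check is easy to verify numerically. A minimal sketch, using the two made-up forecasts from the example above (not actual Metaculus data):

```python
import numpy as np

def geo_mean_probs(ps):
    # Geometric mean of probabilities.
    return float(np.prod(ps) ** (1 / len(ps)))

def geo_mean_odds(ps):
    # Geometric mean of odds, converted back to a probability.
    odds = np.array(ps) / (1 - np.array(ps))
    pooled = np.prod(odds) ** (1 / len(odds))
    return float(pooled / (1 + pooled))

ps = [0.01, 0.3]
complements = [1 - p for p in ps]

# Geo mean of probs: the aggregate of the complements is NOT
# the complement of the aggregate.
print(geo_mean_probs(ps) + geo_mean_probs(complements))  # ≈ 0.89, not 1

# Geo mean of odds: the two aggregates sum to exactly 1,
# because odds invert under complementation.
print(geo_mean_odds(ps) + geo_mean_odds(complements))    # 1.0
```

The mean of probabilities and the median pass the same check, since both commute with taking complements.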
So I would recommend viewing the geometric mean of probabilities as a hack that adjusts the geometric mean of odds down. This is also why I think better adjustments likely exist, since this one isn't particularly well motivated. It does, however, seem to slightly improve Metaculus predictions, so I included it in the flowchart.
To drive this point home, here is what we would get if we aggregated the predictions on the last 860 resolved Metaculus binary questions by mapping each prediction to its complement, taking the geo mean of probs, and taking the complement again:
As you can see, this change (which would not affect the other aggregates) significantly weakens the geo mean of probs.
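In code, the complement-flipped variant described above looks like this (a sketch with two made-up forecasts, not the actual Metaculus questions):

```python
import numpy as np

def geo_mean_probs(ps):
    # Plain geometric mean of probabilities.
    return float(np.prod(ps) ** (1 / len(ps)))

def geo_mean_probs_flipped(ps):
    # Recode the question: take complements, pool, take the complement again.
    return 1 - geo_mean_probs([1 - p for p in ps])

forecasts = [0.01, 0.3]
print(geo_mean_probs(forecasts))          # ≈ 0.055: adjusts the aggregate down
print(geo_mean_probs_flipped(forecasts))  # ≈ 0.168: adjusts it up instead
```

The two versions land on opposite sides of the (complement-consistent) geo mean of odds, which is why flipping the coding of the questions changes the geo mean of probs' performance while leaving the other aggregates untouched.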
Is there a map which doesn’t have a discontinuity at .5?
What do you mean exactly? None of these maps have a discontinuity at .5