Thank you for the superb analysis!
This increases my confidence in the geo mean of the odds, and decreases my confidence in the extremization bit.
I find it very interesting that the extremized version was consistently below by a narrow margin. I wonder if this means that there is a subset of questions where it works well, and another where it underperforms.
One question / nitpick: what do you mean by geometric mean of the probabilities? If you just take the geometric mean of probabilities then you do not get a valid probability: the pooled p and the pooled (1 − p) do not sum to 1. You need to rescale them, at which point you end up with the geometric mean of odds.
Unexpected Values explains this better than me here.
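As a quick numeric sketch of the nitpick (Python; the four forecasts are made up purely for illustration): the two geometric means fall short of summing to 1, and rescaling them recovers exactly the geometric mean of odds.

```python
import math

def geo_mean(xs):
    """Geometric mean of a list of positive numbers."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical forecasts from four predictors for one binary question.
ps = [0.6, 0.7, 0.8, 0.9]

gm_p = geo_mean(ps)                       # pooled "yes" component, ~0.742
gm_not_p = geo_mean([1 - p for p in ps])  # pooled "no" component, ~0.221

print(gm_p + gm_not_p)  # ~0.96 < 1, so gm_p alone is not a normalized probability

# Rescaling the two components so they sum to 1...
rescaled = gm_p / (gm_p + gm_not_p)

# ...recovers exactly the geometric mean of the odds, mapped back to a probability.
gm_odds = geo_mean([p / (1 - p) for p in ps])
print(rescaled, gm_odds / (1 + gm_odds))  # both ~0.770
```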
I think it’s actually that historically the Metaculus community was underconfident (see the track record here, before 2020 vs after 2020).
Extremizing fixes that underconfidence, but the average predictor improving their ability also fixed that underconfidence.
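For concreteness, the extremizing I have in mind is the common transform that raises the pooled odds to a power d. A minimal sketch, where d is a tuning parameter usually fit on past resolved questions and the 1.5 below is just a placeholder:

```python
def extremize(pooled_prob, d=1.5):
    """Push a pooled probability away from 0.5 by raising its odds to the power d.

    d > 1 extremizes, which helps when the crowd is underconfident;
    d = 1 leaves the pool unchanged. d = 1.5 here is only a placeholder.
    """
    odds = pooled_prob / (1 - pooled_prob)
    ext_odds = odds ** d
    return ext_odds / (1 + ext_odds)

print(extremize(0.77))  # ~0.86, nudged away from 0.5
print(extremize(0.30))  # ~0.22, also away from 0.5, on the other side
```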
Metaculus users have a known bias towards overestimating the probabilities of questions resolving positive (again, see the track record). Taking a geometric mean of the probabilities of the events happening will give a number between 0 and 1 (that is, a valid probability). It will be inconsistent with the estimate you’d get if you flipped the question. However, Metaculus users also seem to be inconsistent in that way, so I thought it was a neat way to attempt to fix that bias. I should have made it more explicit, that’s fair.
Edit: Updated for clarity based on comments below.
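To spell out the flip inconsistency with some made-up forecasts (my illustration, not Metaculus data): pooling the question as asked and pooling the flipped question give different answers, with the direct pool sitting lower.

```python
import math

def geo_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

ps = [0.6, 0.7, 0.8, 0.9]  # hypothetical forecasts for "Will X happen?"

pool_direct = geo_mean(ps)                        # pool the question as asked
pool_flipped = 1 - geo_mean([1 - p for p in ps])  # pool "Will X not happen?", then flip back

print(pool_direct, pool_flipped)  # ~0.742 vs ~0.779: not the same estimate
# The direct pool sits lower, which is the downward pull that offsets
# the overestimation bias described above.
```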
"but the average predictor improving their ability also fixed that underconfidence"
What do you mean by this?
"Metaculus users have a known bias towards overestimating the probabilities of questions resolving positive"
Oh I see!
It is very cool that this works.
One thing that confuses me: when you take the geometric mean of probabilities you end up with p_pooled + (1 − p)_pooled < 1. So the pooled probability gets slightly nudged towards 0 compared with what you would get from the geometric mean of odds. Doesn’t that mean that it should be less accurate, given the bias towards questions resolving positively?
What am I missing?
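For what it’s worth, the downward nudge seems systematic rather than an artifact of any particular numbers; here is a small random check (illustrative only):

```python
import math
import random

def geo_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

random.seed(0)
for _ in range(5):
    ps = [random.uniform(0.05, 0.95) for _ in range(10)]  # random mock forecasts
    gm_probs = geo_mean(ps)
    gm_not = geo_mean([1 - p for p in ps])
    odds_pool = gm_probs / (gm_probs + gm_not)  # geometric mean of odds, as a probability
    # gm_probs + gm_not < 1 always (by AM-GM), so gm_probs <= odds_pool always:
    print(round(gm_probs, 3), "<=", round(odds_pool, 3))
```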
"What do you mean by this?"
I mean in the past people were underconfident (so extremizing would make their predictions better). Since then they’ve stopped being underconfident. My assumption is that this is because the average predictor is now more skilled or because more predictors improves the quality of the average.
"Doesn’t that mean that it should be less accurate, given the bias towards questions resolving positively?"
The bias isn’t that more questions resolve positively than users expect. The bias is that users expect more questions to resolve positive than actually resolve positive. Shifting probabilities lower fixes this.
Basically lots of questions on Metaculus are “Will X happen?” where X is some interesting event people are talking about, but the base rate is perhaps low. People tend to overestimate the probability of X relative to what actually occurs.
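Here’s a toy simulation of that point, with entirely synthetic numbers chosen just to show the direction of the effect: if questions truly resolve positive 30% of the time but the forecaster reports 40%, shifting the forecasts lower improves the Brier score.

```python
import random

random.seed(1)

def brier(forecasts, outcomes):
    """Mean squared error between probabilistic forecasts and 0/1 outcomes (lower is better)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

n = 10_000
outcomes = [1 if random.random() < 0.30 else 0 for _ in range(n)]  # true base rate 30%
raw = [0.40] * n       # forecasts systematically too high
shifted = [0.33] * n   # the same forecasts, nudged lower

print(brier(raw, outcomes))      # ~0.22
print(brier(shifted, outcomes))  # ~0.21, better
```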
"The bias isn’t that more questions resolve positively than users expect. The bias is that users expect more questions to resolve positive than actually resolve positive."
I don’t get what the difference between these is.
“more questions resolve positively than users expect”
Users expect 50 to resolve positively, but actually 60 resolve positive.
“users expect more questions to resolve positive than actually resolve positive”
Users expect 50 to resolve positive, but actually 40 resolve positive.
I have now edited the original comment to be clearer.
Cheers
Gotcha!