I see a few ways of assessing “global overconfidence”:
Lump all predictions into two bins (under and over 50%) and check that the lower point sits above the diagonal and the upper one below it. I just did this and the points land where you'd expect if we were overconfident, but the 90% credible intervals still overlap the diagonal, so pooling all the bins this way provides only weak evidence of overconfidence.
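A minimal sketch of this pooled check, assuming a uniform Beta(1, 1) prior for the credible intervals (I haven't said which prior I actually used) and synthetic data standing in for the real predictions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: replace p and y with the actual
# predicted probabilities and 0/1 resolutions.
p = rng.uniform(0.05, 0.95, size=743)
y = rng.binomial(1, p)

for name, mask in [("under 50%", p < 0.5), ("over 50%", p >= 0.5)]:
    k, n = int(y[mask].sum()), int(mask.sum())
    # Posterior for the true frequency under a uniform Beta(1, 1) prior;
    # 90% equal-tailed credible interval.
    lo, hi = stats.beta.ppf([0.05, 0.95], k + 1, n - k + 1)
    print(f"{name}: mean prediction {p[mask].mean():.3f}, "
          f"observed rate {k / n:.3f}, 90% CI [{lo:.3f}, {hi:.3f}]")
```

Overconfidence would show up as the under-50% interval sitting above that bucket's mean prediction and the over-50% interval sitting below it.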
Calculate the OC score as defined by Metaculus (scroll down to the bottom of the page and click the (+) sign next to Details). A score between 0 and 1 indicates overconfidence. Open Phil's score is 0.175, so this is evidence that we're overconfident. I don't know how to put a meaningful confidence/credible interval on that number, though, so it's hard to say how strong this evidence is.
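One generic option would be a nonparametric bootstrap over the resolved predictions; a rough sketch, with `oc_score` left as a placeholder for Metaculus' formula (not reproduced here):

```python
import numpy as np

def oc_score(p, y):
    """Placeholder: plug in Metaculus' OC formula from their Details box."""
    raise NotImplementedError

def bootstrap_ci(p, y, n_boot=10_000, alpha=0.05, seed=0):
    # Resample (prediction, outcome) pairs with replacement and recompute
    # the score each time; take percentiles as the confidence interval.
    rng = np.random.default_rng(seed)
    n = len(p)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        scores[b] = oc_score(p[idx], y[idx])
    return np.quantile(scores, [alpha / 2, 1 - alpha / 2])
```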
Run a linear regression on the calibration curve and check that the slope is <1. When I do this for the original curve with 10 points, statsmodels' OLS method spits out [0.772, 0.996] as the 95% confidence interval for the slope. I see this as stronger evidence of overconfidence than the previous two checks.
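For concreteness, a minimal sketch of that regression, with made-up bin values standing in for the real curve:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical 10-bin calibration curve: x = mean prediction per bin,
# y = observed frequency per bin. Replace with the real values.
x = np.linspace(0.05, 0.95, 10)
y = np.array([0.09, 0.14, 0.22, 0.30, 0.44, 0.52, 0.61, 0.68, 0.79, 0.88])

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params)                # [intercept, slope]
print(fit.conf_int(alpha=0.05)) # rows: [const, slope]; slope CI entirely < 1 suggests overconfidence
```

One caveat: the bins presumably contain different numbers of predictions, so weighting them equally (as plain OLS does) isn't quite right; `sm.WLS` with per-bin counts as weights would be a natural refinement.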
One thing to note here: it's plausible that your errors are not symmetric in expectation if there's some bias towards phrasing questions one way or another (for example, frequently asking "will [event] happen?", where optimism might cause you to be too high in general). This might mean the linearity assumption is wrong.
This is probably easier for you to tell since you can see the underlying data.
I haven’t seen a rigorous analysis of this, but I like looking at the slope, and I expect that it’s best to include each resolved prediction as a separate data point. So there would be 743 data points, each with a y value of either 0 or 1.
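A sketch of that per-prediction version, again with synthetic data standing in for the 743 resolved predictions:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# One row per resolved prediction: p is the stated probability,
# y is the 0/1 resolution. Replace with the real data.
p = rng.uniform(0.05, 0.95, size=743)
y = rng.binomial(1, p)

# With 0/1 outcomes the errors are heteroskedastic, so use robust
# standard errors rather than the OLS defaults.
fit = sm.OLS(y, sm.add_constant(p)).fit(cov_type="HC1")
print(fit.conf_int(alpha=0.05)[1])  # 95% CI for the slope; entirely < 1 => overconfidence
```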
Very good point!