Minor point, but I disagree, at least a little, with the unqualified claim here of being well calibrated except in the 90% bucket.
Weak evidence that you are overconfident in each of the 0-10, 10-20, 70-80, 80-90 and 90%+ buckets is decent evidence of an overconfidence bias overall, even if those errors are mostly individually within the margin of error.
Very good point!
I see a few ways of assessing "global overconfidence":
Lump all predictions into two bins (under and over 50%) and check that the lower point is above the diagonal and the upper one is below the diagonal. I just did this and the points are where you'd expect if we were overconfident, but the 90% credible intervals still overlap with the diagonal, so pooling all the bins in this way still provides weak evidence of overconfidence.
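For concreteness, here is a minimal sketch of that pooling in Python. The DataFrame and column names are hypothetical (not the actual Open Phil data), and a Jeffreys interval stands in for whatever 90% credible interval was actually used:

```python
# Sketch of the two-bin pooling check. Column names are hypothetical:
#   prediction   - stated probability in [0, 1]
#   resolved_yes - 1 if the event happened, 0 otherwise
import pandas as pd
from statsmodels.stats.proportion import proportion_confint

def two_bin_check(df: pd.DataFrame) -> None:
    for name, part in [("under 50%", df[df["prediction"] < 0.5]),
                       ("50% and over", df[df["prediction"] >= 0.5])]:
        mean_pred = part["prediction"].mean()
        hit_rate = part["resolved_yes"].mean()
        # Jeffreys interval at the 90% level, standing in for the
        # credible intervals mentioned above.
        lo, hi = proportion_confint(part["resolved_yes"].sum(), len(part),
                                    alpha=0.10, method="jeffreys")
        print(f"{name}: mean prediction {mean_pred:.2f}, "
              f"hit rate {hit_rate:.2f}, 90% interval ({lo:.2f}, {hi:.2f})")
```

Overconfidence would show up as a hit rate above the mean prediction in the lower bin and below it in the upper bin, with the intervals telling you whether that gap is distinguishable from noise.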
Calculating the OC score as defined by Metaculus (scroll down to the bottom of the page and click the (+) sign next to Details). A score between 0 and 1 indicates overconfidence. Open Phil's score is 0.175, so this is evidence that we're overconfident. I don't know how to put a meaningful confidence/credible interval on that number, so it's hard to say how strong this evidence is.

Run a linear regression on the calibration curve and check that the slope is <1. When I do this for the original curve with 10 points, statsmodels' OLS method spits out [0.772, 0.996] as a 95% confidence interval for the slope. I see this as stronger evidence of overconfidence than the previous ones.
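Here is a sketch of that slope check with statsmodels. The bin values at the bottom are made-up placeholders for illustration, not Open Phil's actual numbers:

```python
# Sketch of the calibration-slope regression. Bin values below are
# illustrative placeholders, not the actual Open Phil data.
import numpy as np
import statsmodels.api as sm

def slope_ci(bin_centers, observed_freq, alpha=0.05):
    """OLS of observed frequency on predicted probability.

    A confidence interval for the slope lying entirely below 1 means the
    calibration curve is flatter than the diagonal, i.e. overconfidence.
    """
    X = sm.add_constant(np.asarray(bin_centers, dtype=float))  # intercept + slope
    fit = sm.OLS(np.asarray(observed_freq, dtype=float), X).fit()
    lo, hi = fit.conf_int(alpha=alpha)[1]   # row 1 = slope
    return fit.params[1], (lo, hi)

# Hypothetical example call (frequencies are made up):
centers = np.arange(0.05, 1.0, 0.10)        # midpoints of the ten bins
freqs = [0.07, 0.15, 0.24, 0.33, 0.45, 0.52, 0.61, 0.69, 0.79, 0.93]
print(slope_ci(centers, freqs))             # (slope, (ci_low, ci_high))
```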
One thing to note here is that it is plausible your errors are not symmetric in expectation, if there's some bias towards phrasing questions one way or another (for example, frequently asking "will [event] happen?", where optimism might cause you to be too high in general). This might mean that the assumption of linearity is wrong.
This is probably easier for you to tell since you can see the underlying data.
I haven't seen a rigorous analysis of this, but I like looking at the slope, and I expect that it's best to include each resolved prediction as a separate data point. So there would be 743 data points, each with a y value of either 0 or 1.
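A sketch of that per-prediction version, reusing the hypothetical prediction/resolved_yes columns from the earlier sketch:

```python
# Per-prediction slope: one 0/1 observation per resolved prediction
# (same hypothetical column names as in the two-bin sketch).
import pandas as pd
import statsmodels.api as sm

def per_prediction_slope(df: pd.DataFrame, alpha: float = 0.05):
    X = sm.add_constant(df["prediction"])      # intercept + slope
    fit = sm.OLS(df["resolved_yes"], X).fit()  # y is 0 or 1 for each prediction
    return (fit.params["prediction"],
            tuple(fit.conf_int(alpha=alpha).loc["prediction"]))
```

One caveat: OLS on 0/1 outcomes violates the usual homoskedasticity assumption, so the resulting interval is only approximate.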
I'm probably missing something, but doesn't the graph show OP is underconfident in the 0-10 and 10-20 bins? E.g., those data points are above the dotted grey line of perfect calibration, whereas the 90%+ bin is far below it.
I think overconfident and underconfident aren't crisp terms to describe this. With binary outcomes, you can invert the prediction and it means the same thing (20% chance of X == 80% chance of not X). So being below the calibration line in the 90% bucket and above the line in the 10% bucket are functionally the same thing.
I'm using overconfident here to mean that the predictions are closer to extreme confidence (0 or 100, depending on whether they are below or above 50%, respectively) than they should be.