There are several different sorts of systematic errors that you could look for in this kind of data, although checking for them would require recording more features of each prediction than are included here.
For example, to check for optimism bias you’d want to code whether each prediction is of the form “good thing will happen”, “bad thing will happen”, or neither. Then you can check if probabilities were too high for “good thing will happen” predictions and too low for “bad thing will happen” predictions. (Most of the example predictions were “good thing will happen” predictions, and it looks like probabilities were not generally too high, so probably optimism bias was not a major issue.)
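If predictions were coded that way, the check itself would be straightforward. Here's a minimal sketch in Python/pandas; the column names ("framing", "probability", "resolved_true") are hypothetical stand-ins for however the data are actually stored:

```python
# Minimal sketch of the optimism-bias check, assuming a pandas DataFrame with
# hypothetical columns: "probability" (the stated probability), "resolved_true"
# (1 if the predicted event happened, else 0), and "framing" (hand-coded as
# "positive", "negative", or "neither").
import pandas as pd

def framing_calibration(df: pd.DataFrame) -> pd.DataFrame:
    """Compare mean stated probability to realized frequency within each framing group.

    Optimism bias would show up as mean_probability > realized_frequency for
    "positive" predictions and mean_probability < realized_frequency for
    "negative" ones.
    """
    summary = df.groupby("framing").agg(
        n=("probability", "size"),
        mean_probability=("probability", "mean"),
        realized_frequency=("resolved_true", "mean"),
    )
    summary["gap"] = summary["mean_probability"] - summary["realized_frequency"]
    return summary
```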
Some other things you could check for:
- tracking what the “default outcome” would be, or whether there is a natural base rate, to see if there has been a systematic tendency to overestimate (or underestimate) the chances of a non-default outcome
- dividing predictions up into different types, such as predictions about outcomes in the world (e.g. >20 new global cage-free commitments), predictions about inputs / changes within the organization (e.g. will hire a comms person within 9 months), and predictions about people’s opinions (e.g. [expert] will think [the grantee’s] work is ‘very good’), to check calibration & accuracy on each type of prediction
- trying to distinguish the relative accuracy of different forecasters. If there are too few predictions per forecaster to evaluate each one individually, you could instead check whether any forecaster-level features (e.g., experience within the org, experience making these predictions, some measure of quantitative skill) are correlated with overconfidence or with Brier score. The aggregate pattern of overconfidence in the >80% and <20% bins can show up even if most forecasters are well calibrated and only (say) 25% are overconfident, because overconfident predictions get averaged together with well-calibrated ones. And that 25% influences these sorts of results graphs more than it might seem, because well-calibrated forecasters use the extreme bins less often: even if overconfident forecasters make only 25% of all predictions, they might account for half of the predictions in the >80% bins. (A sketch of these per-type and per-forecaster checks follows this list.)
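To make the per-type and per-forecaster checks concrete, here is a rough sketch of how they might look in Python/pandas. The column names ("forecaster", "category", "probability", "resolved_true") are hypothetical, and the extreme-bin summary is just one way of quantifying the point about overconfident forecasters dominating the >80% and <20% bins:

```python
# Rough sketch of per-group calibration and Brier-score checks, assuming a
# pandas DataFrame with hypothetical columns: "forecaster", "category",
# "probability", and "resolved_true" (1 if the event happened, else 0).
import pandas as pd

def brier_and_calibration(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Per-group mean Brier score, calibration summary, and extreme-bin usage.

    group_col can be a prediction-type column ("category") or a "forecaster"
    column, depending on which breakdown you want.
    """
    df = df.copy()
    df["brier"] = (df["probability"] - df["resolved_true"]) ** 2

    out = df.groupby(group_col).agg(
        n=("probability", "size"),
        mean_brier=("brier", "mean"),
        mean_probability=("probability", "mean"),
        realized_frequency=("resolved_true", "mean"),
    )

    # Predictions in the extreme (>80% or <20%) bins, where the aggregate
    # overconfidence showed up. A "hit" is a confident prediction on the
    # correct side: probability > 0.5 and it happened, or < 0.5 and it didn't.
    extreme = df[(df["probability"] > 0.8) | (df["probability"] < 0.2)].copy()
    extreme["hit"] = (extreme["probability"] > 0.5) == (extreme["resolved_true"] == 1)

    out["extreme_share"] = extreme.groupby(group_col).size() / out["n"]
    out["extreme_hit_rate"] = extreme.groupby(group_col)["hit"].mean()
    return out

# Usage (hypothetical data): brier_and_calibration(df, "category") compares
# prediction types; brier_and_calibration(df, "forecaster") compares people.
# A few forecasters with a high extreme_share and a low extreme_hit_rate would
# produce exactly the aggregate overconfidence pattern described above.
```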
We do track whether predictions have a positive (“good thing will happen”) or negative (“bad thing will happen”) framing, so testing for optimism/pessimism bias is definitely possible. However, only 2% of predictions have a negative framing, so our sample size is too low to say anything conclusive about this yet.
Enriching our database with base rates and categories would be fantastic, but my hunch is that, given the nature and phrasing of our questions, this would be impossible to do at scale. I’m much more bullish on per-predictor analyses, and that’s more or less what we’re doing with the individual dashboards.