A complaint about using average Brier scores
Comparing average Brier scores between people only makes sense if they have made predictions on exactly the same questions, because making predictions on more certain questions (such as “will there be a 9.0 earthquake in the next year?”) will tend to give you a much better Brier score than making predictions on more uncertain questions (such as “will this coin come up heads or tails?”). This is one of those things that lots of people know, but everyone (including me) keeps using average Brier scores anyway because they’re a nice simple number to look at.
To explain:
The Brier score for a binary prediction is the squared difference between the predicted probability and the actual outcome: (O − p)². For a given forecast, predicting the correct probability gives you the minimum possible expected Brier score (which is what you want). But this minimum varies depending on the true probability of the event happening.
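As a sketch of that definition (the function name is mine, not from the post; O is 1 if the event happened and 0 otherwise):

```python
def brier(p, outcome):
    """Brier score for a single binary forecast: (O - p)^2.
    Lower is better; 0 is a maximally confident, correct forecast."""
    return (outcome - p) ** 2

# A confident correct forecast scores near 0; a confident wrong one near 1.
print(round(brier(0.9, 1), 4))  # 0.01
print(round(brier(0.9, 0), 4))  # 0.81
```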
For the coin flip the true probability is 0.5, so a perfect prediction gives you an expected Brier score of 0.25 (= 0.5∗(1−0.5)² + 0.5∗(0−0.5)²). For the earthquake question maybe the correct probability is 0.1, so the best expected Brier score you can get is 0.09 (= 0.1∗(1−0.1)² + 0.9∗(0−0.1)²), and it’s only if you are really badly wrong (you think p > 0.5) that you can get a score higher than the best score you can get for the coin flip.
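That arithmetic can be checked directly. A small illustrative helper (my naming) for the expected Brier score when you predict p and the event truly happens with probability q:

```python
def expected_brier(p, q):
    """Expected Brier score of predicting p when the event's true
    probability is q: q * (1 - p)^2 + (1 - q) * (0 - p)^2."""
    return q * (1 - p) ** 2 + (1 - q) * p ** 2

# Perfect forecast on a fair coin: expected score 0.25.
print(expected_brier(0.5, 0.5))
# Perfect forecast on the 10%-likely earthquake: expected score 0.09.
print(round(expected_brier(0.1, 0.1), 4))
# Only a badly wrong forecast (p > 0.5) does worse than the coin optimum:
print(expected_brier(0.6, 0.1) > 0.25)  # True
print(expected_brier(0.4, 0.1) > 0.25)  # False
```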
So if forecasters have a choice of questions to make predictions on, someone who mainly goes for things that are pretty certain will end up with a (much!) better average Brier score than someone who predicts things that are genuinely closer to 50/50. This also acts as a disincentive for forecasting more uncertain things, which seems bad.
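A toy simulation of this (the question pools and helper are hypothetical, chosen just to make the gap visible): both forecasters below are perfectly calibrated, and the only difference is which questions they pick.

```python
import random

def avg_brier(true_probs, n_questions=10_000, seed=0):
    """Average Brier score of a perfectly calibrated forecaster who always
    predicts the true probability, over questions drawn from a pool of
    true probabilities."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_questions):
        q = rng.choice(true_probs)
        outcome = 1 if rng.random() < q else 0
        total += (outcome - q) ** 2
    return total / n_questions

# Same skill (perfect calibration), different question selection:
safe_picker = avg_brier([0.05, 0.95])     # sticks to near-certain questions
bold_picker = avg_brier([0.4, 0.5, 0.6])  # tackles genuinely uncertain ones
print(safe_picker, bold_picker)  # safe_picker is much lower (better)
```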
We’ve just added Fatebook (which is great!) to our Slack and I’ve noticed this putting me off making forecasts for things that are highly uncertain. I’m interested in whether there is some lore around dealing with this among people who use Metaculus or other platforms where Brier scores are an important metric. I only really use prediction markets, which don’t suffer from this problem.
Note: this also applies to log scores and other proper scoring rules.
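The same floor shows up for the log score: even a perfect forecast has an expected (negative) log score equal to the entropy of the question, so uncertain questions are unavoidably penalised. A quick check (helper name is mine):

```python
import math

def expected_log_score(p, q):
    """Expected negative log score (lower is better) of predicting p
    when the event's true probability is q."""
    return -(q * math.log(p) + (1 - q) * math.log(1 - p))

# Perfect forecasts, yet the 50/50 question scores worse than the 10% one:
print(round(expected_log_score(0.5, 0.5), 3))  # 0.693 (entropy of a fair coin)
print(round(expected_log_score(0.1, 0.1), 3))  # 0.325
```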
Yeah, I’m starting to believe that a severe limitation of Brier scores is this inability to use them in a forward-looking way. Brier scores reflect the performance of specific people on specific questions, and using them as evidence about future prediction performance seems really fraught... but it’s the best we have, as far as I can tell.