Thank you for this post! I want to raise another potential issue with forecasting tournaments: using Brier scores.
My understanding is that Brier scores take the squared difference between your forecast and the outcome (1 if the event happens, 0 if it doesn’t). For example, if I say there’s a 70% chance something will happen, and then it happens, my Brier score is (1 - 0.7) squared, i.e. 0.09.
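For concreteness, here is a minimal sketch of that calculation in Python (the `brier_score` helper is just my own illustration, not a function from any forecasting platform):

```python
def brier_score(forecast: float, outcome: int) -> float:
    """Squared difference between the forecast probability and the outcome (1 = happened, 0 = didn't)."""
    return (forecast - outcome) ** 2

# A 70% forecast on an event that then happens:
print(round(brier_score(0.7, 1), 4))  # 0.09 -- lower is better
```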
I think the fact that Brier scores use the squared difference (as opposed to the absolute difference) is non-trivial. I’ll illustrate this a bit with a simple example.
Consider two forecasters who forecast on three events. Let’s also say that all three events happen.
Forecaster A believed that all three events had a 70% chance of happening.
Forecaster B believed that two of the events had an 80% chance of happening, and one event had a 50% chance of happening.
Who is the better forecaster? I think the answer is pretty unclear. If we use absolute differences, the forecasters are tied:
Forecaster A-- (1-0.7) + (1-0.7) + (1-0.7) = 0.9
Forecaster B-- (1-0.8) + (1-0.8) + (1-0.5) = 0.9
But if we use Brier scores, Forecaster A has the edge (lower Brier scores are better):
Forecaster A-- (1-0.7)^2 + (1-0.7)^2 + (1-0.7)^2 = 0.27
Forecaster B-- (1-0.8)^2 + (1-0.8)^2 + (1-0.5)^2 = 0.33
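Here is a quick sketch that reproduces both tallies (my own illustration of the arithmetic above; note that the standard Brier score averages over questions rather than summing, but that doesn’t change the comparison):

```python
# Probabilities each forecaster assigned to three events, all of which happened (outcome = 1).
forecasts = {"A": [0.7, 0.7, 0.7], "B": [0.8, 0.8, 0.5]}
outcome = 1

for name, probs in forecasts.items():
    absolute = sum(abs(p - outcome) for p in probs)   # absolute-difference total
    brier = sum((p - outcome) ** 2 for p in probs)    # squared-difference (Brier) total
    print(f"Forecaster {name}: absolute = {absolute:.2f}, Brier = {brier:.2f}")

# Forecaster A: absolute = 0.90, Brier = 0.27
# Forecaster B: absolute = 0.90, Brier = 0.33
```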
In other words, Brier scores penalize you more heavily for being “very wrong” than a scoring system based on absolute differences does. You could make an argument that this is justified, because people ought to be penalized more for being “more wrong.” But I haven’t seen this argument laid out—and I especially haven’t seen an argument to suggest that the penalty should be a “squared” penalty.
I haven’t considered all of the implications of this, but I imagine that people who are trying to win forecasting tournaments could find some ways to “game” the scoring system. At first glance, for instance, it seems like Brier scores penalize people for making “risky” forecasts (because being off by a lot is much worse than being off by a little bit).
I’m curious if others think this is a problem or think there are solutions.
In the particular example you propose, forecaster A assigns a higher probability to all three events happening (0.7*0.7*0.7 = 0.343) than forecaster B does (0.8*0.8*0.5 = 0.320). This seems intuitively correct.
Also, note that the squares are necessary to keep the scoring rule proper (the highest expected reward is obtained by reporting the true probability distribution), and this is in principle a crucial property (otherwise people could lie about what they think their probabilities are and get a better score). In particular, if you take out the square, then the “probability” which maximizes your expected score is either 0% or 100% (i.e., imagine that your probability was 60%, and just calculate the expected value of writing 60% vs 100% down).
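To make that concrete, here is a small worked sketch (my own example, using the 60% belief from the parenthetical above) comparing the expected penalty of an honest report versus an exaggerated one under both rules:

```python
p = 0.6  # the probability you actually believe

def expected_absolute(r: float) -> float:
    # Expected absolute-difference penalty for reporting r: the event happens
    # with probability p (penalty |1 - r|) and fails with probability 1 - p (penalty |0 - r|).
    return p * abs(1 - r) + (1 - p) * abs(0 - r)

def expected_brier(r: float) -> float:
    return p * (1 - r) ** 2 + (1 - p) * (0 - r) ** 2

for r in (0.6, 1.0):
    print(f"report {r}: expected absolute = {expected_absolute(r):.3f}, "
          f"expected Brier = {expected_brier(r):.3f}")

# report 0.6: expected absolute = 0.480, expected Brier = 0.240
# report 1.0: expected absolute = 0.400, expected Brier = 0.400
# Under the absolute rule, exaggerating to 100% lowers your expected penalty;
# under the Brier rule, honestly reporting 0.6 is what minimizes it.
```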
An alternative to the Brier score which might interest you (or which you may have had in mind) is the logarithmic scoring rule, which in a sense tries to quantify how much information you add or subtract from the aggregate. But it has other downsides, like being very harsh on mistakes. And it would also assign a worse score to forecaster B.
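As a quick illustration of that last point, here is a sketch using the natural-log reward version of the rule (higher is better); the specific numbers are just my own calculation on the example above:

```python
import math

forecasts = {"A": [0.7, 0.7, 0.7], "B": [0.8, 0.8, 0.5]}

for name, probs in forecasts.items():
    # Log score: sum of ln(probability assigned to what actually happened).
    log_score = sum(math.log(p) for p in probs)
    print(f"Forecaster {name}: total log score = {log_score:.3f}")

# Forecaster A: total log score = -1.070
# Forecaster B: total log score = -1.139
# Exponentiating recovers the joint probabilities from the parent comment:
# exp(-1.070) ~= 0.343 and exp(-1.139) ~= 0.320. The harshness: assigning 0%
# to something that happens gives log(0) = -infinity, an unrecoverable score.
```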