This is great, and it deals with a few points I didn’t, but here’s my tweetstorm from the beginning of last year about the distortion of scoring rules alone:
If you’re interested in probability scoring rules, here’s a somewhat technical and nit-picking tweetstorm about why proper scoring for predictions and supposedly “incentive compatible” scoring systems often aren’t actually a good idea.
First, some background. Scoring rules are how we “score” predictions, that is, decide how good they are. Proper scoring rules are ones where a predictor’s expected score is maximized when they give their true best guess. Wikipedia explains; en.wikipedia.org/wiki/Scoring_r…
A typical improper scoring rule is the “better side of even” rule, where every time your highest probability is assigned to the actual outcome, you get credit. In that case, people have no reason to report probabilities correctly—just pick a most likely outcome and say 100%.
There are many proper scoring rules. Examples include logarithmic scoring, where your score is the log of the probability assigned to the correct answer, and the Brier score, which is the mean squared error between the forecast probabilities and the actual outcomes. de Finetti et al. lay out the details here; link.springer.com/chapter/10.100…
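To make the proper/improper distinction concrete, here is a minimal Python sketch. It assumes a single binary event with true probability 0.7 and searches a grid of possible reports for the one with the highest expected score. The function names and the grid are my own illustration, not taken from any of the linked sources, and the Brier score is negated so that higher is better for all three rules.

```python
import math

def log_score(q, outcome):
    # Log of the probability assigned to what actually happened.
    return math.log(q if outcome == 1 else 1 - q)

def brier_score(q, outcome):
    # Negated squared error, so that a higher score is better.
    return -(q - outcome) ** 2

def better_side_of_even(q, outcome):
    # Credit whenever the outcome given more than 50% is the one that occurred.
    return 1.0 if (q > 0.5) == (outcome == 1) else 0.0

def expected_score(rule, p, q):
    # Expected score of reporting q when the event has true probability p.
    return p * rule(q, 1) + (1 - p) * rule(q, 0)

def best_report(rule, p, grid):
    # First report on the grid that maximizes the expected score.
    return max(grid, key=lambda q: expected_score(rule, p, q))

grid = [i / 100 for i in range(1, 100)]  # candidate reports 0.01 .. 0.99
p = 0.7  # assumed true probability, chosen only for illustration

for name, rule in [("log", log_score), ("brier", brier_score)]:
    print(name, best_report(rule, p, grid))
print("better side of even at 0.51:", expected_score(better_side_of_even, p, 0.51))
print("better side of even at 0.99:", expected_score(better_side_of_even, p, 0.99))
```

Under the two proper rules the best report is the true 0.7. Under the “better side of even” rule every report above 0.5 earns the same expected credit, so nothing rewards saying 0.7 rather than 0.99.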
These scoring rules are all fine as long as people’s ONLY incentive is to get a good score.
In fact, in situations where we use quantitative rules, this is rarely the case. Simple scoring rules don’t account for this problem. So what kind of misaligned incentives exist?
Bad places to use proper scoring rules #1 - In many forecasting applications, like tournaments, there is a prestige factor in doing well without a corresponding penalty for doing badly. In that case, proper scoring rules incentivise “risk taking” in predictions, not honesty.
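A toy illustration of the tournament problem, under strong assumptions that are mine rather than from any real tournament: five independent questions each with true probability 0.6, nine rivals who all report 0.6 honestly, Brier scoring, the prize going only to the single top scorer, and ties broken at random. The sketch enumerates every outcome sequence exactly and compares an honest forecaster with one who exaggerates to 0.99.

```python
from itertools import product

def brier(q, outcome):
    # Negated squared error, so that a higher score is better.
    return -(q - outcome) ** 2

def compare(p, q_extreme, n_questions, n_rivals):
    """Exact comparison of honest reporting (p) with exaggerated reporting (q_extreme)."""
    extreme_win_prob = 0.0
    expected_honest = 0.0
    expected_extreme = 0.0
    for outcomes in product([0, 1], repeat=n_questions):
        prob = 1.0
        for o in outcomes:
            prob *= p if o == 1 else 1 - p
        honest_total = sum(brier(p, o) for o in outcomes)
        extreme_total = sum(brier(q_extreme, o) for o in outcomes)
        expected_honest += prob * honest_total
        expected_extreme += prob * extreme_total
        # All rivals report p, so beating the honest total means winning outright.
        if extreme_total > honest_total:
            extreme_win_prob += prob
    # An honest forecaster ties with every rival and wins a random tie-break.
    honest_win_prob = 1 / (n_rivals + 1)
    return expected_honest, expected_extreme, honest_win_prob, extreme_win_prob

exp_h, exp_x, win_h, win_x = compare(p=0.6, q_extreme=0.99, n_questions=5, n_rivals=9)
print("expected Brier score: honest", round(exp_h, 4), "exaggerated", round(exp_x, 4))
print("chance of winning:    honest", round(win_h, 4), "exaggerated", round(win_x, 4))
```

In this setup the exaggerator has the worse expected score but roughly a one-in-three chance of finishing first, against one in ten for honesty. The rule is still proper; the prize structure is what rewards the risk taking.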
Bad places to use proper scoring rules #2 - In machine learning, scoring rules are used for training models that make probabilistic predictions. If the predictions are then used to make decisions that have asymmetric payoffs for different types of mistakes, the training objective is misaligned with how the predictions are used.
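A small worked example of that misalignment, with made-up numbers: a missed event costs 9, a false alarm costs 1, so the cost-minimizing rule is to act whenever the forecast is at least 1 / (1 + 9) = 0.1. The two hypothetical models below (`model_a`, `model_b`) are compared on expected Brier loss and on expected decision cost across two equally likely situations.

```python
def expected_brier_loss(p, q):
    # Expected squared error of reporting q when the true probability is p.
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

def expected_decision_cost(p, q, cost_fp, cost_fn):
    # Act when the forecast reaches the cost-minimizing threshold.
    threshold = cost_fp / (cost_fp + cost_fn)
    if q >= threshold:
        return (1 - p) * cost_fp  # acted: pay only if the event does not happen
    return p * cost_fn            # did not act: pay only if the event happens

def average(values):
    return sum(values) / len(values)

COST_FP, COST_FN = 1.0, 9.0   # assumed costs, for illustration only
true_probs = [0.05, 0.5]      # two equally likely kinds of situation
model_a = [0.15, 0.5]         # small error, but it crosses the 0.1 threshold
model_b = [0.05, 0.25]        # larger error, but the decision is unchanged

for name, model in [("model_a", model_a), ("model_b", model_b)]:
    loss = average([expected_brier_loss(p, q) for p, q in zip(true_probs, model)])
    cost = average([expected_decision_cost(p, q, COST_FP, COST_FN)
                    for p, q in zip(true_probs, model)])
    print(name, "Brier loss", round(loss, 5), "decision cost", round(cost, 3))
```

Here `model_a` has the lower Brier loss but the higher decision cost, because its only error sits right at the decision threshold, and the scoring rule does not know where that threshold is.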
Bad places to use proper scoring rules #3 - Any time you want the forecasters to have the option to say “answer unknown.” If this is important (and it usually is), proper scoring rules can disincentivise or over-incentivise not guessing, depending on how that option is treated.
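One way to see how much depends on the treatment of “answer unknown” is the sketch below. It assumes Brier loss on a binary question (lower is better) and a fixed loss, here called `abstain_loss`, charged for declining to answer; both the name and the specific values are assumptions of mine. An honest forecaster with belief p expects a loss of p(1 - p), so declining is attractive exactly when that exceeds the fixed charge.

```python
def prefers_to_abstain(p, abstain_loss):
    # Expected Brier loss of honestly reporting belief p is p * (1 - p).
    return p * (1 - p) > abstain_loss

beliefs = [i / 20 for i in range(21)]  # 0.0, 0.05, ..., 1.0

for abstain_loss in (0.25, 0.1, 0.0):
    abstainers = [p for p in beliefs if prefers_to_abstain(p, abstain_loss)]
    print("abstain_loss", abstain_loss, "-> beliefs that decline:", abstainers)
```

With a charge of 0.25 (the same as answering 50%) nobody ever gains by declining, with a charge of 0 everyone who is not certain declines, and with 0.1 only beliefs between 0.15 and 0.85 on this grid decline. The chosen number, not the forecaster’s knowledge, decides who answers.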
Using a metric that isn’t aligned with incentives is bad. (If you want to hear more, follow me. I can’t shut up about it.)
https://twitter.com/davidmanheim/status/1080458380806893568
Carvalho discusses how proper scoring is misused; https://viterbi-web.usc.edu/~shaddin/cs699fa17/docs/Carvalho16.pdf
Anyways, this paper shows a bit of how to do better; https://pubsonline.informs.org/doi/abs/10.1287/deca.1110.0216
Fin.
I enjoyed this tweetstorm when you mentioned it to me and should have highlighted it in the article as useful further reading, thanks for posting it!