Great post! As you allude to, I’m increasingly of the opinion that the best way to evaluate forecaster performance is via how much respect other forecasters give them. This has a number of problems:
- The signal is not fully transparent: people who don’t do at least a bit of forecasting (or aren’t otherwise engaged with forecasters) will be at a loss about which forecasters others respect.
- The signal is not fully precise: I can give you a list of forecasters I respect and a loose approximation of how much I respect them, but I’d be hard-pressed to give a precise rank ordering.
- Forecasters are not immune to common failures of human cognition: we might expect demographic or ideological biases to creep into forecasters’ evaluations of each other.
  - Though at least in GJP/Metaculus-style forecasting, the frequent pattern of (relative) anonymity hopefully alleviates this a lot.
- There are other systematic biases in subjective evaluations of ability that may diverge from “Platonic” forecasting skill.
  - One that’s especially salient to me: I suspect verbal ability correlates much better with respect than it does with accuracy.
  - I also think it’s plausible that, especially in conversation, forecasters tend to overweight complex explanations/nuance more than the evidence warrants (in ML terms, this would be overparameterization).
- It just pushes the evaluation problem up one level: how do forecasters evaluate each other?
However, as you mention, other metrics have as many problems, if not more. So on balance, I think that as of 2020, the metric “who do other forecasters respect” carries more signal than any other metric I’m aware of.
That said, part of me still holds out hope that “as of 2020” is doing most of the work here. Forecasting in many ways seems to me like a nascent, pre-paradigm field, and it would not shock me if in 5-15 years we had much better ontologies and measurement tools, so that (as in more mature fields) quantified metrics end up beating loose subjective human impressions for evaluating forecasters.
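To make “quantified metric” a bit more concrete, here’s a minimal sketch (Python, with invented forecasts and outcomes purely for illustration) of the standard accuracy-based candidate, a proper scoring rule like the Brier score:

```python
# Brier score: the standard quantified accuracy metric for binary questions.
# All forecasts and outcomes below are invented for illustration.

def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and realized outcomes (0/1).
    Lower is better; always forecasting 50% scores 0.25."""
    assert len(probabilities) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

outcomes = [1, 0, 1, 1, 0]                    # hypothetical resolved questions
forecaster_a = [0.8, 0.3, 0.7, 0.9, 0.2]      # reasonably calibrated
forecaster_b = [0.95, 0.1, 0.4, 0.99, 0.6]    # more confident, less accurate

print("A:", round(brier_score(forecaster_a, outcomes), 3))  # ~0.054
print("B:", round(brier_score(forecaster_b, outcomes), 3))  # ~0.147
```

Of course, scores like this are exactly the kind of thing the Goodhart worries below apply to; the hope is just that better-designed versions of them eventually beat subjective impressions.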
> Alice, who starts off a poor forecaster but through diligent practice becomes a good (but not great) and regular contributor to a prediction platform has typically done something more valuable and praiseworthy than Bob
I think this is an underrated point. Debating praiseworthiness seems like it can get political real fast, but I want to emphasize the point about value: there are different reasons you may care about participation in a forecasting platform, for example:

- “ranking” people on a leaderboard, so you can use good forecasters for other projects
- caring about the results of the actual questions and the epistemic process used to reach them
For the latter use case, I think people who participate regularly on forecasting platforms, contribute a lot of comments, etc., usually improve group epistemics much more than people who are unerringly accurate on just a few questions.
Metaculus, as you mention, is aware of this and (relative to GJO) rewards activity more than accuracy. I think this has large costs (in particular, it makes the leaderboard a worse signal of accuracy), but it is still better on balance.
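To make the activity-vs-accuracy trade-off concrete, here is a toy leaderboard score (entirely hypothetical, not Metaculus’s actual formula) that blends a forecaster’s mean accuracy with a log-scaled activity bonus; as the activity weight grows, a prolific but mediocre forecaster overtakes a sparse but sharp one:

```python
# Toy activity-weighted leaderboard. This is NOT Metaculus's real scoring rule;
# it just illustrates how weighting activity changes who ends up on top.
import math

def leaderboard_score(mean_accuracy, n_questions, activity_weight):
    """Blend accuracy (e.g. 1 - mean Brier) with a log-scaled activity bonus."""
    return mean_accuracy + activity_weight * math.log1p(n_questions)

sparse_but_sharp = {"mean_accuracy": 0.95, "n_questions": 10}
prolific_but_mediocre = {"mean_accuracy": 0.80, "n_questions": 400}

for w in (0.0, 0.05, 0.2):
    a = leaderboard_score(**sparse_but_sharp, activity_weight=w)
    b = leaderboard_score(**prolific_but_mediocre, activity_weight=w)
    leader = "sparse-but-sharp" if a > b else "prolific-but-mediocre"
    print(f"activity_weight={w:.2f}: {leader} leads")
```

The cost mentioned above is visible here: once activity carries weight, leaderboard position stops being a clean readout of accuracy.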
---
A side note about Goodhart’s law: I directionally agree with you that Goodhart’s law (related: the optimizer’s curse, specification gaming) is a serious issue to be aware of, but (as with nuance) I worry that in EA discussions of Goodhart’s law there’s a risk of being “too clever.” Whenever you try to collapse the complex/subtle/multidimensional nature of reality into a small set of easily measurable, quantifiable dimensions (sometimes just one), you end up losing information. You hope that none of the information you lose is particularly important, but in practice this is rarely true.
Nonetheless, to a first approximation, imperfect metrics often work for getting the things you want done. For example, image and speech recognition benchmarks often have glaring robustness holes that are easy to point out, yet I think it’s relatively uncontroversial that in many practical use cases, ML perception classifiers, created in large part by academics and industry optimizing along those metrics, are currently at or will soon approach superhuman quality.
Likewise, in many businesses, a common partial solution to principal-agent problems is for managers to give employees metrics of success (usually gameable ones that are only moderately correlated with the eventual goal of profit maximization). This can result in wasted effort via specification gaming, but many businesses nonetheless end up profitable as a direct result of employees having concrete targets.
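Here is a toy simulation of both halves of that point (an invented model, not anything from your post): the true objective depends on two dimensions of effort, but the proxy metric only sees one. Pushing effort toward the measured dimension raises the proxy monotonically, while the true objective peaks at a balanced split, yet even the pure proxy-chaser still captures a decent chunk of the achievable value:

```python
# Toy Goodhart illustration (invented model, for intuition only).
import math

BUDGET = 10.0  # fixed total effort to allocate across two dimensions

def true_value(x_effort, y_effort):
    # Both dimensions matter, with diminishing returns on each.
    return math.sqrt(x_effort) + math.sqrt(y_effort)

def proxy_value(x_effort):
    # The metric only captures the measured dimension.
    return math.sqrt(x_effort)

for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    x, y = frac * BUDGET, (1 - frac) * BUDGET
    print(f"effort on measured dimension: {frac:.2f}  "
          f"proxy: {proxy_value(x):.2f}  true: {true_value(x, y):.2f}")

# Output pattern: the proxy rises from 0.00 to 3.16 as proxy-directed effort grows,
# while the true value peaks at 4.47 with a balanced split and falls back to 3.16
# for the pure proxy-chaser. Optimizing the proxy loses something real, but it
# still gets you roughly 70% of what the balanced allocation achieves.
```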
> Perhaps best is a mix of practical wisdom and balance—taking the metrics as useful indicators but not as targets for monomaniacal focus. Some may be better at this than others.
I think (as with some of our other “disagreements”) I am again violently agreeing with you. Your position seems to be “we should take metrics as useful indicators but we should be worried about taking them too seriously” whereas my position is closer to “we should be worried about taking metrics too seriously, but we should care a lot about the good metrics, and in the absence of good metrics, try really hard to find better ones.”