Linch comments on Challenges in evaluating forecaster performance

Linch 12 Sep 2020 12:02 UTC
3 points
0 ∶ 0
Suppose Alice encountered the important poll result because she was looking for it (as part of her effort to come up with a new forecast).
This makes Alice a better forecaster, at least if the primary metric is accuracy. (If the metric includes other factors like efficiency, then we need to know, e.g., how many more minutes, if any, Alice spends than Bob.)
At the end of the day what we really care about is how much weight we should place on any given forecast made by Alice/Bob.
If Alice updates daily and Bob updates once a month, and Alice has a lower average daily Brier score, then all else being equal, if you saw their forecasts on a random day, you should trust Alice’s forecasts more*.
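To make the “average daily Brier score” comparison concrete, here is a minimal sketch. It assumes a binary question, that each forecaster’s most recent probability is carried forward on days they don’t update, and that every day is weighted equally; real platforms may score differently. The function names and the numbers for Alice and Bob are hypothetical, chosen only for illustration.

```python
from typing import Dict


def brier(p: float, outcome: int) -> float:
    """Brier score for one binary forecast: squared error, lower is better."""
    return (p - outcome) ** 2


def average_daily_brier(updates: Dict[int, float], n_days: int, outcome: int) -> float:
    """Average the Brier score over every day the question is open.

    `updates` maps a day index to the probability submitted that day.
    Assumption: on days without an update, the last forecast still stands.
    """
    if 0 not in updates:
        raise ValueError("need a forecast on day 0")
    current = updates[0]
    total = 0.0
    for day in range(n_days):
        current = updates.get(day, current)  # carry the stale forecast forward
        total += brier(current, outcome)
    return total / n_days


# Hypothetical 30-day question that resolves "yes" (outcome = 1).
N_DAYS = 30
# Alice updates every day, drifting from 0.5 toward 0.9 as evidence arrives.
alice_updates = {d: 0.5 + 0.4 * d / (N_DAYS - 1) for d in range(N_DAYS)}
# Bob forecasts once, on day 0, and never revisits it.
bob_updates = {0: 0.5}

alice_score = average_daily_brier(alice_updates, N_DAYS, outcome=1)
bob_score = average_daily_brier(bob_updates, N_DAYS, outcome=1)
print(f"Alice: {alice_score:.3f}  Bob: {bob_score:.3f}")
```

Under these assumptions Alice’s average is lower than Bob’s simply because her forecast tracks new information while his goes stale, which is the sense in which a forecast seen on a random day deserves more trust from her.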
If you happen to see their forecasts on the day Bob updates, I agree this is a harder comparison, but I also don’t think this is a particularly common use case.
I think part of the thing driving our intuition differences here is that I think lack of concurrency of forecasts (timeliness of opinions) is often a serious problem “in real life,” rather than just an artifact of the platforms. In other words, you are imagining that whether to trust Alice at time t vs Bob at time t-1 is an unfortunate side effect of forecasting platforms, and “in real life” you generally have access to concurrent predictions by Alice and Bob. Whereas I think the timeliness tradeoff is a serious problem in most attempts to get accurate answers.
If you’re trying to decide whether, e.g., a novel disease is airborne, you might have the choice of a meta-analysis from several months back, an expert opinion from 2 weeks ago, a prediction market median from a market that closed last week, or a single forecaster’s opinion today.
___
Griping aside, I agree that there are situations where you do want to know “conditional upon two people making a forecast at the same time, whose forecasts do I trust more?” There are different proposed and implemented approaches to this. For example, prediction markets implicitly get around the problem: the only people trading are people who implicitly believe that their forecasts are current, so the latest trades reflect the most accurate market beliefs (though markets have other problems, like greater-fool dynamics, especially since the existing prediction markets are much smaller than other markets).
*I’ve noticed this in myself. I used to update my Metaculus forecasts several times a week, and climbed the leaderboard fairly quickly in March and April. I’ve since slowed down to averaging one update every 3-6 weeks for most questions (except for a few “hot” ones or ones I’m unusually interested in). My score has slipped as a result. On the one hand, I think this is a bit unfair, since I feel like there’s an important “meta” sense in which I’ve gotten better (more intuitive sense of probability, more acquired subject matter knowledge on the questions I’m forecasting). On the other hand, I think there’s a very real sense, which alex alludes to, in which LinchSeptember is just a worse object-level forecaster than LinchApril, even if in some important meta-level senses (I like to imagine) I’ve gotten better.
- Ofer 12 Sep 2020 14:19 UTC
  2 points
  0 ∶ 0
  Parent
  
  This makes Alice a better forecaster
  
  As long as we keep asking Alice and Bob questions via the same platform, and their incentives don’t change, I agree. But if we now need to decide whether to hire Alice and/or Bob to do some forecasting for us, comparing their average daily Brier score is problematic. If Bob just wasn’t motivated enough to update his forecast every day like Alice did, his lack of motivation can be fixed by paying him.