Ofer comments on Challenges in evaluating forecaster performance

Ofer Sep 11, 2020, 7:43 PM
1 point
1 vote

After thinking for a few more minutes, it seems that forecasting more often but at random moments shouldn’t impact the expected Brier score.

In my toy example (where the forecasting moments are predetermined), Alice’s Brier score for day X will be based on a “fresh” prediction made on that day (perhaps influenced by a new surprising poll result), while Bob’s Brier score for that day may be based on a prediction he made 3 weeks earlier (not taking into account the new poll result). So we should expect that the average daily Brier score will be affected by the forecasting frequency (even if the forecasting moments are uniformly sampled).

In this toy example the best solution seems to be using the average Brier score over the set of days in which both Alice and Bob made a forecast. If in practice this tends to leave us with too few data points, a more sophisticated solution is called for. ~~(Maybe partitioning days into bins and sampling a random forecast from each bin? [EDIT: this mechanism can be gamed.])~~
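
A minimal sketch of the "shared days" comparison described above, for illustration only. The function names, the single-sided Brier score `(p - outcome) ** 2`, and the forecast values are all assumptions made for this example, not anything a forecasting platform necessarily uses.

```python
from typing import Dict, List, Tuple


def brier(p: float, outcome: int) -> float:
    """Single-sided Brier score of one binary forecast (lower is better)."""
    return (p - outcome) ** 2


def mean_brier_on_shared_days(
    a: Dict[int, float], b: Dict[int, float], outcome: int
) -> Tuple[float, float, List[int]]:
    """Average each forecaster's Brier score over days on which both forecast."""
    shared = sorted(set(a) & set(b))
    if not shared:
        raise ValueError("the forecasters have no forecasting days in common")
    mean_a = sum(brier(a[d], outcome) for d in shared) / len(shared)
    mean_b = sum(brier(b[d], outcome) for d in shared) / len(shared)
    return mean_a, mean_b, shared


# Hypothetical forecasts, keyed by day number; the event ends up happening.
alice = {1: 0.60, 2: 0.65, 3: 0.70, 4: 0.80}
bob = {1: 0.55, 4: 0.75}

mean_a, mean_b, shared = mean_brier_on_shared_days(alice, bob, outcome=1)
print(shared, mean_a, mean_b)
```

Only days 1 and 4 enter the comparison here, which also shows the drawback mentioned above: an infrequent forecaster can leave very few data points.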
- alex lawsen Sep 11, 2020, 9:17 PM
  3 points
  2 votes
  The long-term solution here is to allow forecasters to predict functions rather than just static values. This solves problems of things like people needing to update for time left.
  
  In terms of the specific example though, I think if a significant new poll comes out and Alice updates and Bob doesn’t, Alice is a better forecaster and deserves more reward than Bob.
  - Ofer Sep 12, 2020, 5:29 AM
    2 points
    2 votes
    
    > The long-term solution here is to allow forecasters to predict functions rather than just static values. This solves problems of things like people needing to update for time left.
    
    Do these functions map events to conditional probabilities (i.e. mapping an event to the probability of something conditioned on that event happening)? What would this look like for the example of forecasting an election result?
    
    > In terms of the specific example though, I think if a significant new poll comes out and Alice updates and Bob doesn’t, Alice is a better forecaster and deserves more reward than Bob.
    
    Suppose Alice encountered the important poll result because she was looking for it (as part of her effort to come up with a new forecast). At the end of the day what we really care about is how much weight we should place on any given forecast made by Alice/Bob. We don’t directly care about the average daily Brier score (which may be affected by the forecasting frequency). [EDIT: this isn’t true if the forecasting platform and the forecasters’ incentives are the same when we evaluate the forecasters and when we ask the questions we care about.]
    - Linch Sep 12, 2020, 12:02 PM
      3 points
      2 votes
      > Suppose Alice encountered the important poll result because she was looking for it (as part of her effort to come up with a new forecast).
      This makes Alice a better forecaster, at least if the primary metric is accuracy. (If the metric includes other factors like efficiency, then we need to know, e.g., how many more minutes, if any, Alice spends than Bob.)
      > At the end of the day what we really care about is how much weight we should place on any given forecast made by Alice/Bob.
      If Alice updates daily and Bob updates once a month, and Alice has a lower average daily Brier score, then all else being equal, if you saw their forecasts on a random day, you should trust Alice’s forecasts more*.
      If you happen to see their forecasts on the day Bob updates, I agree this is a harder comparison, but I also don’t think this is an unusually common use case.
      I think part of the thing driving our intuition differences here is that I think lack of concurrency of forecasts (timeliness of opinions) is often a serious problem “in real life,” rather than just an artifact of the platforms. In other words, you are imagining that whether to trust Alice at time t vs Bob at time t-1 is an unfortunate side effect of forecasting platforms, and “in real life” you generally have access to concurrent predictions by Alice and Bob. Whereas I think the timeliness tradeoff is a serious problem in most attempts to get accurate answers.
      If you’re trying to decide whether, e.g., a novel disease is airborne, you might have the choice of a meta-analysis from several months back, an expert opinion from 2 weeks ago, a prediction market median that was closed last week, or a single forecaster’s opinion today.
      ___
      Griping aside, I agree that there are situations where you do want to know “conditional upon two people making a forecast at the same time, whose forecasts do I trust more?” There are different proposed and implemented approaches around this, for example prediction markets implicitly get around this problem since the only people trading are people who implicitly believe that their forecasts are current, so the latest trades reflect the most accurate market beliefs, etc. (though markets have other problems like greater fool, especially since the existing prediction markets are much smaller than other markets).
      *I’ve noticed this in myself. I used to update my Metaculus forecasts several times a week, and climbed the leaderboard fairly quickly in March and April. I’ve since slowed down to averaging an update once every 3-6 weeks for most questions (except for a few “hot” ones or ones I’m unusually interested in). My score has slipped as a result. On the one hand I think this is a bit unfair since I feel like there’s an important “meta” sense in which I’ve gotten better (more intuitive sense of probability, more acquired subject matter knowledge on the questions I’m forecasting). On the other, I think there’s a very real sense that alex alludes to in which LinchSeptember is just a worse object-level forecaster than LinchApril, even if in some important meta-level senses (I like to imagine) I’ve gotten better.
      - Ofer Sep 12, 2020, 2:19 PM
        2 points
        2 votes
        
        > This makes Alice a better forecaster
        
        As long as we keep asking Alice and Bob questions via the same platform, and their incentives don’t change, I agree. But if we now need to decide whether to hire Alice and/or Bob to do some forecasting for us, comparing their average daily Brier score is problematic. If Bob just wasn’t motivated enough to update his forecast every day like Alice did, his lack of motivation can be fixed by paying him.
- Misha_Yagudin Sep 11, 2020, 11:16 PM
  1 point
  1 vote
  Here is a sketch of a formal argument, which will show that freshness doesn’t matter much.
  Let’s calculate the average Brier score of a forecaster. We can see the contribution of a hypothetical forecast made on day $d$ toward the sum: $\text{Brier score of the forecast made on day } d \times E[\text{num. days the forecast made on day } d \text{ is active}]$. If forecasts are sufficiently random, the expected number of days each forecast is active should be equal. Because $\sum_{d} E[\text{num. days the forecast made on day } d \text{ is active}] = \text{total number of active days}$, the expected average Brier score is equal to the average of the Brier scores for all days.
  - axioman Sep 12, 2020, 8:21 AM
    3 points
    3 votes
    I’m also not sure I follow your exact argument here. But frequency clearly matters whenever the forecast is essentially resolved before the official resolution date, or when the best forecast based on evidence at time t behaves monotonically (think of questions of the type “will event x, which has (approximately) a small fixed probability of happening each day, happen before day y?”, where each day passing without x happening should reduce your credence).
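    To make the monotonic case concrete, here is a small worked formula under assumptions that go beyond the comment: the daily probability is a constant $h$ and days are independent. Then $P(x \text{ before day } y \mid x \text{ has not happened by day } t) = 1 - (1 - h)^{y - t}$, which shrinks as $t$ approaches $y$. For example, with $h = 0.001$ this is roughly $0.31$ when $y - t = 365$ and roughly $0.03$ when $y - t = 30$, so a forecast left untouched for most of the year ends up far above the best available credence.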
    - Misha_Yagudin Sep 12, 2020, 11:39 AM
      1 point
      1 vote
      I mildly disagree. I think the intuition to use here is that the sample mean is an unbiased estimator of expectation (this doesn’t depend on frequency/number of samples). One complication here is that we are weighing samples potentially unequally, but if we expect each forecast to be active for an equal number of days this doesn’t matter.
      ETA: I think the assumption of “forecasts have an equal expected number of active days” breaks around the closing date, which impacts things in the monotonic example (this effect is linear in the expected number of active days and could be quite big in extremes).
        - Gregory Lewis🔸 Sep 12, 2020, 1:33 PM
        3 points
        2 votes
        I’m afraid I’m also not following. Take an extreme case (which is not that extreme, given that I think the average number of forecasts per forecaster per question on GJO is 1.something). Alice predicts a year out P(X) = 0.2 and never touches her forecast again, whilst Bob predicts P(X) = 0.3, but decrements proportionately as time elapses. Say X doesn’t happen (and say the right ex ante probability a year out was indeed 0.2). Although Alice > Bob on the initial forecast (and so if we just scored that day she would be better), if we carry forward Bob overtakes her overall [I haven’t checked the maths for this example, but we can tweak initial forecasts so he does].
        As time elapses, Alice’s forecast steadily diverges from the ‘true’ ex ante likelihood, whilst Bob’s converges to it. A similar story applies if new evidence emerges which dramatically changes the probability, if Bob updates on it and Alice doesn’t. This seems roughly consonant with things like the stock market: trading off month-old (or older) prices rather than current prices seems unlikely to go well.
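        A rough check of this example, under assumptions that are not stated in the comment: the score is the single-sided Brier score $(p - o)^2$ with $o = 0$, Bob’s forecast decays linearly as $p_t = 0.3\,(1 - t/T)$ over a question of length $T$, and time is treated as continuous. Alice’s average daily score is then $0.2^2 = 0.04$, while Bob’s is $\frac{1}{T}\int_0^T 0.3^2 \left(1 - \frac{t}{T}\right)^2 dt = \frac{0.09}{3} = 0.03$. So Bob starts out worse ($0.09$ vs. $0.04$ on the first day), his daily score drops below Alice’s once $0.3\,(1 - t/T) < 0.2$, i.e. after $t = T/3$, and with these numbers he comes out ahead on the carried-forward average.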
        Misha_Yagudin Sep 12, 2020, 3:35 PM
        2 points
        2 votes
        Thanks, everyone, for engaging with me. I will summarize my thoughts and will likely not actively comment here anymore:
        I think the argument holds given the assumptions I made [(a) the probabilities of forecasting on each day are proportional across forecasters (previously we assumed uniformity) + (b) each forecast has an equal expected number of active days].
        > I think intuition to use here is that the sample mean is an unbiased estimator of expectation (this doesn’t depend on the frequency/number of samples). One complication here is that we are weighing samples potentially unequally, but if we expect each forecast to be active for an equal number of days this doesn’t matter.
        The second assumption seems to be approximately correct assuming the uniformity but stops working on the edge [around the resolution date], which impacts the average score on the order of $\text{expected num. active days} / \text{total num. days}$.
        This effect could be noticeable, this is an update.
        Overall, given the setup, I think that forecasting weekly vs. daily shouldn’t differ much for forecasts with a resolution date in 1y.
        I intended to use this toy model to emphasize that the important difference between the active and semi-active forecasters is the distribution of days they forecast on.
        This difference, in my opinion, is mostly driven by the ‘information gain’ (e.g. breaking news, a poll is published, etc).
        This makes me skeptical about features such as automatic decay and so on.
        This makes me curious about ways to integrate information sources automatically.
        And less so about notifications that community/followers forecasts have significantly changed. [It is already possible to sort by the magnitude of crowd update since your last forecast on GJO].
        On a meta-level, I am
        Glad I had the discussion and wrote this comment :)
        Confused about people’s intuitions about the linearity of EV.
        I would encourage people to think more carefully through my argument.
        This makes me doubt I am correct, but still, I am quite certain. I undervalued the corner cases in the initial reasoning. I think I might undervalue other phenomena, where models don’t capture reality well and hence trigger people’s intuitions:
        E.g. randomness of the resolution day might magnify the effect of the second assumption not holding, but it seems like it shouldn’t be given that in expectation one resolves the question exactly once.
        Confused about not being able to communicate my intuitions effectively.
        I would appreciate any feedback [not necessarily on communication], I have a way to submit it anonymously: https://admonymous.co/misha
        Misha_Yagudin Sep 12, 2020, 3:37 PM
        1 point
        1 vote
        This example is somewhat flawed (because forecasting only once breaks the assumption I am making) but might challenge your intuitions a bit :)
  - Ofer Sep 12, 2020, 5:26 AM
    1 point
    1 vote
    I didn’t follow that last sentence.
    
    Notice that in the limit it’s obvious we should expect the forecasting frequency to affect the average daily Brier score: Suppose Alice makes a new forecast every day while Bob only makes a single forecast (which is equivalent to him making an initial forecast and then blindly making the same forecast every day until the question closes).
    - Misha_Yagudin Sep 12, 2020, 10:43 AM
      2 points
      2 votes
      Re the limit: a nice example. Please notice that Bob makes a forecast on a (uniformly) random day, so when you take an expectation over the days he is making forecasts on you get the average of scores for all days as if he forecasted every day.
      Let $N$ be the number of total days, $P_{d} = \frac{1}{N}$ be the probability Bob forecasted on day $d$, and ${Brier}_{d}$ be the Brier score of the forecast made on day $d$:
      $E[\text{avg. Brier}] = \sum_{d} P_{d} \times \frac{{Brier}_{d} \times \text{num. days forecast will be active}}{\text{total num. of active days}} = \sum_{d} P_{d} \times \frac{{Brier}_{d} \times (N - d)}{N - d} = \sum_{d} P_{d} \times {Brier}_{d} = \frac{\sum_{d} {Brier}_{d}}{N}.$
      I am a bit surprised that it worked out here because it breaks the assumption of the equality of the expected number of days a forecast will be active. Lack of this assumption will play out when aggregating over multiple questions [weighted by the number of active days]. Still, I hope this example gives helpful intuitions.
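
      The calculation above can be sketched numerically. This is only an illustration under assumptions not in the comment: days are indexed $d = 0, \dots, N - 1$ so that $N - d$ is never zero, the function name is hypothetical, and the per-day Brier scores are made-up numbers.

      ```python
      from typing import Sequence


      def expected_avg_brier_single_forecast(brier_by_day: Sequence[float]) -> float:
          """Expected average Brier score when one forecast is made on a uniformly
          random day d and carried forward until the question closes."""
          n = len(brier_by_day)
          p_d = 1.0 / n  # probability of forecasting on any given day
          total = 0.0
          for d, brier_d in enumerate(brier_by_day):
              active_days = n - d  # the forecast stays active from day d to the close
              # Every active day scores brier_d, so the average over active days is brier_d.
              avg_if_forecast_on_d = brier_d * active_days / active_days
              total += p_d * avg_if_forecast_on_d
          return total


      # Hypothetical Brier scores of the forecast that would be made on each day.
      brier_by_day = [0.25, 0.20, 0.16, 0.09, 0.04]

      expected = expected_avg_brier_single_forecast(brier_by_day)
      daily_mean = sum(brier_by_day) / len(brier_by_day)
      print(expected, daily_mean)
      ```

      The two numbers agree because the number of active days cancels within a single question; as noted above, that cancellation need not survive aggregation over several questions weighted by active days.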
      - Ofer Sep 12, 2020, 2:13 PM
        1 point
        1 vote
        Thanks for the explanation!
        
        I don’t think this formal argument conflicts with the claim that we should expect the forecasting frequency to affect the average daily Brier score. In the example that Flodorner gave where the forecast is essentially resolved before the official resolution date, Alice will have perfect daily Brier scores: ${Brier}_{d} = 0$, for any $d > N^{'}$, while on those days Bob will have imperfect Brier scores: ${Brier}_{d} = {Brier}_{N^{'}}$.
        Misha_Yagudin Sep 12, 2020, 3:40 PM
        1 point
        1 vote
        Thanks for challenging me :) I wrote my takes after this discussion above.