# Flodorner comments on Challenges in evaluating forecaster performance

• I’m also not sure I follow your exact argument here. But frequency clearly matters whenever the forecast is essentially resolved before the official resolution date, or when the best forecast based on evidence at time t behaves monotonically (think of questions of the type “will event x, which has (approximately) a small fixed probability of happening each day, happen before day y?”, where each day passing without x happening should reduce your credence).
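To make the monotone case concrete, here is a minimal sketch of how the credence would decay, assuming each remaining day is an independent trial with the same small probability. The function name `credence_before_deadline` and the numbers (a 0.1% daily chance, a deadline 365 days out) are illustrative assumptions, not taken from any real question.

```python
def credence_before_deadline(daily_hazard: float, days_left: int) -> float:
    """P(x happens within the remaining days), given it has not happened yet.

    Assumes (hypothetically) that every remaining day is an independent
    trial with the same small probability `daily_hazard`.
    """
    return 1.0 - (1.0 - daily_hazard) ** days_left


# Illustrative numbers only: 0.1% chance per day, deadline 365 days away.
# credences[d] is the best forecast after d days have passed without x.
credences = [credence_before_deadline(0.001, 365 - day) for day in range(366)]
```

Under this assumption the list falls strictly from about 0.31 on the first day to 0 at the deadline, so a forecast that is never updated drifts further from the best available credence every day.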

• I mildly disagree. I think the intuition to use here is that the sample mean is an unbiased estimator of the expectation (this doesn’t depend on frequency/number of samples). One complication here is that we are weighting samples potentially unequally, but if we expect each forecast to be active for an equal number of days this doesn’t matter.

ETA: I think the assumption of “forecasts have an equal expected number of active days” breaks around the closing date, which impacts things in the monotonic example (this effect is linear in the expected number of active days and could be quite big in extreme cases).

• I’m afraid I’m also not following. Take an extreme case (which is not that extreme, given that I think the average number of forecasts per forecaster per question on GJO is 1.something). Alice predicts a year out P(X) = 0.2 and never touches her forecast again, whilst Bob predicts P(X) = 0.3, but decrements proportionately as time elapses. Say X doesn’t happen (and say the right ex ante probability a year out was indeed 0.2). Although Alice > Bob on the initial forecast (and so if we just scored that day she would be better), if we carry forward Bob overtakes her overall [I haven’t checked the maths for this example, but we can tweak initial forecasts so he does].

As time elapses, Alice’s forecast steadily diverges from the ‘true’ ex ante likelihood, whilst Bob’s converges to it. A similar story applies if new evidence emerges which dramatically changes the probability, if Bob updates on it and Alice doesn’t. This seems roughly consonant with things like the stock-market—trading off month (or more) old prices rather than current prices seems unlikely to go well.
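A quick way to check the Alice/Bob example is to score both forecasters every day and average. The sketch below makes assumptions that are not in the comment above: scoring is the one-sided Brier score (forecast minus outcome, squared), every day is weighted equally, and "decrements proportionately" is read as a linear decay to zero at the resolution date.

```python
DAYS = 365  # the question stays open for a year and X does not happen


def brier(p: float, outcome: int) -> float:
    """One-sided Brier score: squared error of the forecast (lower is better)."""
    return (p - outcome) ** 2


# Alice forecasts 0.2 once and her forecast is carried forward unchanged.
alice = [0.2] * DAYS
# Bob starts at 0.3 and (assumption) decays linearly towards 0 at the deadline.
bob = [0.3 * (1 - day / DAYS) for day in range(DAYS)]

alice_score = sum(brier(p, 0) for p in alice) / DAYS
bob_score = sum(brier(p, 0) for p in bob) / DAYS

print(f"Day-one scores: Alice {brier(alice[0], 0):.4f}, Bob {brier(bob[0], 0):.4f}")
print(f"Time-averaged:  Alice {alice_score:.4f}, Bob {bob_score:.4f}")
```

Under these assumptions Alice wins on the first day (0.04 against 0.09), but Bob's time-averaged score is about 0.03 against Alice's 0.04, so in this setup he overtakes her without any tweak to the initial forecasts.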

• Thanks, everyone, for engaging with me. I will summarize my thoughts and would likely not actively comment here anymore:

• I think the argument holds given the assumptions I made [(a) the probabilities of forecasting on each day are proportional across forecasters (previously we assumed uniformity) + (b) an equal expected number of active days].

• > I think the intuition to use here is that the sample mean is an unbiased estimator of the expectation (this doesn’t depend on the frequency/number of samples). One complication here is that we are weighting samples potentially unequally, but if we expect each forecast to be active for an equal number of days this doesn’t matter.

• The second assumption seems to be approximately correct assuming the uniformity but stops working on the edge [around the resolution date], which impacts the average score on the order of .

• This effect could be noticeable; this is an update for me.

• Overall, given the setup, I think that forecasting weekly vs. daily shouldn’t differ much for forecasts with a resolution date in 1y.

• I intended to use this toy model to emphasize that the important difference between the active and semi-active forecasters is the distribution of days they forecast on.

• This difference, in my opinion, is mostly driven by the ‘information gain’ (e.g. breaking news, a poll is published, etc.).

• This makes me skeptical about features such as automatic decay and so on.

• This makes me curious about ways to integrate information sources automatically.

• And less so about notifications that community/​followers forecasts have significantly changed. [It is already possible to sort by the magnitude of crowd update since your last forecast on GJO].
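The weekly-versus-daily claim in the summary above can be sanity-checked in a toy version of the monotone setting. This is only a sketch under loudly stated assumptions: a constant daily hazard chosen so that the credence a year out is 0.2, a forecaster who always reports the true credence when they update, carried-forward forecasts between updates, equal daily weighting, one-sided Brier scoring, and conditioning on the event not happening (the argument itself is about expectations, so this covers one branch only). All names are hypothetical.

```python
DAYS = 365
# Daily hazard chosen so that the credence a year out is 0.2 (illustrative).
HAZARD = 1 - 0.8 ** (1 / DAYS)


def credence(days_left: int) -> float:
    """P(event within `days_left` days) under a constant daily hazard."""
    return 1 - (1 - HAZARD) ** days_left


def mean_brier_if_no_event(update_every: int) -> float:
    """Time-averaged one-sided Brier score when the event never happens.

    The forecaster reports the true credence every `update_every` days and
    the last forecast is carried forward on the days in between.
    """
    total = 0.0
    for day in range(DAYS):
        last_update = day - day % update_every  # most recent update day
        p = credence(DAYS - last_update)        # possibly stale forecast
        total += p ** 2                         # outcome is 0
    return total / DAYS


daily = mean_brier_if_no_event(1)
weekly = mean_brier_if_no_event(7)
print(f"daily: {daily:.5f}  weekly: {weekly:.5f}  gap: {weekly - daily:.5f}")
```

In this setup the weekly updater is never better than the daily one, since stale forecasts sit above the current credence, but the gap is small relative to the scores themselves. A forecaster who updates only once would break assumption (b) and is not covered by this sketch.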

On a meta-level, I am

• I would encourage people to think more carefully through my argument.

• This makes me doubt I am correct, but still, I am quite certain. I undervalued the corner cases in the initial reasoning. I think I might undervalue other phenomena, where models don’t capture reality well and hence trigger people’s intuitions:

• E.g. randomness of the resolution day might magnify the effect of the second assumption not holding, but it seems like it shouldn’t be given that in expectation one resolves the question exactly once.

• Confused about not being able to communicate my intuitions effectively.

• This example is somewhat flawed (because forecasting only once breaks the assumption I am making) but might challenge your intuitions a bit :)