Updating on the passage of time and conditional prediction curves
What this is: A technical post about how I believe binary forecasts should be made. There are probably some minor mathematical mistakes, but I doubt there are major mistakes.
Introduction
Probabilistic forecasts can rationally be updated without anything happening except the passage of time. This appears to be a violation of conservation of expected evidence but it isn’t so, as the passage of time is often evidence in itself.
To analyze how the passage of time affects binary forecasts I use conditional prediction curves $p_s(t)$: your forecast at time $t$ given the information available at time $s \leq t$. It's often possible to construct well-motivated conditional prediction curves automatically, and I provide some details in the context of the "Will x happen by time τ?" type of question.
The benefits of thinking with conditional prediction curves are large, both from the point of view of the individual forecaster, the forecast aggregator, and the scorer.
The forecaster has no need to repeatedly update his predictions when no new information has come in,
The aggregator can aggregate the probabilistic forecasts at any point in time,
The scorer can score a continuous stream of predictions instead of a dislocated bunch of them.
Some downsides of using prediction curves include:
Different kinds of questions require different models. Questions of the sort “Will the event x happen before the event y?” shouldn’t be handled in the same way as the “Will x happen by time τ?” kind of question.
Making informal forecasts using conditional prediction curves is harder than making point forecasts. Tutorials and non-technical explanations would probably be required.
It would take some effort to create technical solutions for them.
But prediction curves help us in understanding forecasting too.
Any rational forecaster’s conditional prediction curve will be decreasing if the question looks like “Will x happen by time τ?”
But the conditional prediction curve will be constant for questions like “Will the event x happen on time τ?”
More complicated kinds of questions won't be as regular. Questions of the form "Will the event x happen before the event y?" can have arbitrary conditional prediction curves.
Prediction curves
Define the variables

X: the binary outcome variable, success if equal to 1,
T: the event time, i.e., the time when the outcome X becomes known.

You can think about X and T using Metaculus questions. For instance, if the question is "Will a non-state actor develop their own nuclear weapon by 2030?", X would be 1 if a non-state actor develops a nuke by 2030 and 0 otherwise. The random variable T equals 2030 if X=0 and the point in time the nuke is developed if X=1.
A prediction curve for X, T is a random function $p(t)\in[0,1]$ that forecasts the outcome X at every time point t. If t>T, I'll assume that p(t)=0 if X=0 and 1 otherwise, so you can't make any stupid predictions after the event time to screw yourself over.
We can score prediction curves using the integrated scoring rule $s'(p,X)=\int_{t_0}^{\infty} s(p(t),X)\,dt$, where s is any proper scoring rule (with 0 being best) and $t_0$ is the starting time. Then s′ is a proper scoring rule for prediction curves (the exact formulation of what this means is a little technical; see the appendix for a proof), meaning it will always be beneficial to supply the prediction curve you believe in the most. The scoring rule might be strictly proper too, with the proper definitions and assumptions, but I haven't investigated that yet. One reason to use an integrated scoring rule is to incentivize honest reporting even when the event time T is far away.
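To make the integrated Brier score concrete, here is a minimal Python sketch (my own illustration, not part of the post); the prediction curve, outcome, and integration grid in the example are made up, and the integral is truncated at the resolution time since the score is zero afterwards.

import numpy as np

def integrated_brier_score(prediction_curve, outcome, t0, t_end, n_grid=1000):
    # Approximates s'(p, X) = integral from t0 to t_end of (p(t) - X)^2 dt
    # with the trapezoid rule. prediction_curve maps a time to a probability.
    grid = np.linspace(t0, t_end, n_grid)
    scores = np.array([(prediction_curve(t) - outcome) ** 2 for t in grid])
    return float(np.sum((scores[1:] + scores[:-1]) / 2 * np.diff(grid)))

# Made-up prediction curve drifting from 0.8 toward 1, question resolving negatively at t = 1.
curve = lambda t: 0.8 ** (1.0 - t)
print(integrated_brier_score(curve, outcome=0, t0=0.0, t_end=1.0))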
Prediction curves are not used by forecasting sites such as Metaculus. That might be because supplying prediction curves is too much to ask of their audience. But it is possible to construct reasonable prediction curves without too much additional work.
The rational forecaster
Define the information set

$F_t$: the information about X, T available at time t.

A rational forecaster with information set $F_t$ is one who makes probabilistic forecasts at each time point t using his best available evidence $F_t$. Define the rational prediction curve as the stochastic process $p(t)=P(X=1\mid F_t)$. Here p(t) is random since $F_t$ is random. When s is a proper scoring rule, p(t) is the optimal prediction given the information set $F_t$ according to s′, in the sense that $$p(t)=\operatorname{argmin}_{f\in\mathcal{F}}\int_{t_0}^{\infty}E[s(f(t),X)\mid F_t]\,dt,$$ where $\mathcal{F}$ is a suitable class of random functions.
The conditional prediction curve
Define the conditional prediction curve as the expected forecast at time t given the information at time s, but conditioned on the event not yet having happened: $$p_s(t)=E(p(s)\mid T>t)=E(E[X\mid F_s]\mid T>t)=P(X=1\mid F_s, T>t).$$ Then $p_s(t)$ is the best possible prediction curve based on $F_s$ in the sense that $$p_s(t)=\operatorname{argmin}_f\int_s^{\infty}E[s(f(t),X)\mid F_s, T>t]\,dt.$$ You can interpret $p_s(t)$ as
the rational prediction at time t of a forecaster who missed all the information from time s to time t.
the rational prediction at time t of a lazy forecaster who did not bother to look for any new information after time s.
the actually rational prediction at time t when information arrives in bursts, not continuously, and the last bit of information became available at time s.
In practice we need the conditional prediction curve because no one is able to update continuously. Call it bounded rationality if you want. The idea is to have each forecaster update their prediction curve whenever they make a forecast, yielding the final prediction curve. If a forecaster provides conditional prediction curves at the times $t_0=s_0<s_1<\cdots<s_k\leq T$, the final prediction curve is $$p(t)=\begin{cases} p_{s_i}(t) & \text{when } s_i<t<s_{i+1},\\ X & \text{when } t>T.\end{cases}$$
The prediction curve below contains two updates, one at $s_1=0.6$ and one at $s_2=0.8$. The curves in between updates are conditional prediction curves: the black curve is the conditional prediction curve $p_0(t)$, the red is $p_{0.6}(t)$, the blue is $p_{0.8}(t)$. Together they form your final prediction curve. The event time is T=1, but that is random and unknown to the forecaster. The conditional forecasts are made using the constant hazard model, discussed in a later section.
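Here is a minimal Python sketch of this piecewise construction (my own illustration). The update times 0.6 and 0.8 and the event time T = 1 follow the figure; the forecast values made at the updates and the resolution X = 1 are assumed, and the conditional curves use the constant hazard model from the later section.

import numpy as np

TAU = 1.0  # resolution deadline

def constant_hazard_curve(t, s, p_at_s, tau=TAU):
    # Conditional prediction curve p_s(t) = p(s)^((tau - t)/(tau - s)),
    # derived in the "Constant hazard" section below.
    return p_at_s ** ((tau - t) / (tau - s))

def final_prediction_curve(t, update_times, forecasts, event_time, outcome):
    # Piecewise curve: the latest conditional curve at or before t,
    # and the resolved outcome after the event time.
    # Assumes t is not earlier than the first update time.
    if t > event_time:
        return float(outcome)
    i = max(j for j, u in enumerate(update_times) if u <= t)
    return constant_hazard_curve(t, update_times[i], forecasts[i])

# Assumed forecasts 0.5, 0.4, 0.7 made at times 0, 0.6, 0.8; event at T = 1 with X = 1.
updates, forecasts = [0.0, 0.6, 0.8], [0.5, 0.4, 0.7]
for t in np.linspace(0.0, 1.2, 7):
    print(round(t, 2), round(final_prediction_curve(t, updates, forecasts, 1.0, 1), 3))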
Three common question categories
Questions of the sort described above are probably too general to work with, but most can be placed into one of three categories.
Type 1: “Will the event x fail to happen by time τ?”
Most questions on Metaculus can be written in the form "Will x happen by time τ?". Examples include "Will a coup or regime change take place in Russia in 2022 or 2023?" and "Will Putin and Zelenskyy meet to discuss the peaceful resolution of the Russian-Ukrainian conflict before 2023?". We will look at questions formulated as "Will the event x fail to happen by time τ?", as this makes the mathematics slightly cleaner.
For these questions we don't need to model the probability of X at all, yielding the prediction curve $p(t)=P(X=1\mid F_t)=P(T\geq\tau\mid F_t)$ and the conditional prediction curve $p_s(t)=P(T\geq\tau\mid F_s, T>t)$.
Proposition 1
The conditional prediction curve $p_s(t)$ is non-decreasing in t for every s. Moreover, $p_s(t)$ is strictly increasing in t under the very minor condition that the hazard rate of T is strictly positive. (This means that there's a possibility of the question resolving at every time point t.)
Thus a rational forecaster will always expect the probability of a positive resolution to increase over time. Expecting the probability to decrease is irrational.
Aside from being non-decreasing and starting at $p(s)=p_s(s)$, there are no restrictions on the conditional prediction curve. There are plenty of examples of conditional prediction curves for this kind of question in the next section.
Type 2: “Will the event x happen on time τ?”
Some questions on Metaculus can be written in this form. Examples include "Will Ontario's Conservative Party (PC) win a majority in the election on 2022-06-02?" and "Will Volodymyr Zelenskyy be named Time Person of the Year in 2022?". In these questions the resolution date is fixed at τ, so time has no influence except through the information source $F_t$. Thus the conditional prediction curve is constant in t, and we're in the intuitive setting where we cannot expect our prediction to change in the future.
Proposition 2
For questions of type "Will the event x happen on time τ?", the conditional prediction curve is constant: $p_s(t)=P(X=1\mid F_s, T>t)=p(s)$.
Type 3: “Will the event x happen before the event y?”
Questions of this nature are uncommon on Metaculus. The only example I found in my search was "Alexei Navalny to become president or prime minister of Russia in his lifetime?" This question resolves positively if Navalny becoming PM/president (x) happens before Navalny dying (y). Models for problems of this nature are known as competing risk models.
To model it, define the two times S and R together with $X=1[S<R]$ and $T=\min(S,R)$. Then $$p_s(t)=P(X=1\mid F_s, T>t)=P(S<R\mid F_s, S>t, R>t).$$ There is no general regularity in $p_s(t)$ unless we know something special about the hazard rates of S and R.
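As an illustration (my own sketch, not from the post), the conditional prediction curve for a competing-risks question can be computed numerically from the two hazard rates, assuming S and R are independent and using the identity $p_s(t)=\int_t^\infty S_R(u)\,h_S(u)\,S_S(u)\,du\,/\,(S_S(t)S_R(t))$ that also appears in the proof of Proposition 3. The example hazards are made up.

import numpy as np
from scipy.integrate import quad

def survival(hazard, t):
    # S(t) = exp(-integral_0^t h(x) dx), computed numerically.
    return np.exp(-quad(hazard, 0.0, t)[0])

def competing_risk_curve(t, hazard_S, hazard_R):
    # p_s(t) = P(S < R | S > t, R > t) for independent S and R.
    integrand = lambda u: survival(hazard_R, u) * hazard_S(u) * survival(hazard_S, u)
    numerator = quad(integrand, t, np.inf)[0]
    return numerator / (survival(hazard_S, t) * survival(hazard_R, t))

# Made-up hazards: the x-event hazard decreases over time, the y-event hazard increases.
h_S = lambda u: 0.5 * np.exp(-u)
h_R = lambda u: 0.1 + 0.05 * u
for t in [0.0, 1.0, 2.0, 5.0]:
    print(t, round(competing_risk_curve(t, h_S, h_R), 3))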
Proposition 3
Let f(t) be any differentiable function taking values in (0,1). Then there is a model for S, R and an information set $F_s$ so that $p_s(t)=f(t)$.
On one hand, the additional complexity suggests that questions of the "Will the event x happen before the event y?" form should be avoided. On the other hand, they are quite easy to model, provided you're willing to supply the prediction curve or hazard rate for both variables S and R. This can be done using the techniques in the next section.
Parametric conditional prediction curves in “Will the event x fail to happen by time τ?” types of questions
Suppose we know the conditional hazard function at time s and denote it $h_s(t)=f(t\mid T>t, F_s)$. Then we can write the conditional prediction curve as $$p_s(t)=\exp\left[-\int_t^\tau h_s(x)\,dx\right];$$ see the appendix for the proof. This formulation of the conditional prediction curve is helpful as it's relatively easy to interpret hazard rates. Much of the literature on survival analysis / time-to-event data is formulated in terms of hazard rates too. If you're willing to assume a parametric form for the hazard rate you can construct conditional prediction curves (semi-)automatically. We'll take a closer look at three examples: constant hazards, Weibull hazards, and Gompertz—Makeham hazards.
Constant hazard
Suppose we may assume the hazard rate is constant, i.e., $h_s(t)=\lambda_s$ with $\lambda_s$ unknown. Using the equation $p_s(t)=\exp[-\int_t^\tau h_s(x)\,dx]$ we see that $p_s(t)=e^{-\lambda_s(\tau-t)}$. If you know the point forecast $p(s)=p_s(s)$, we may use it to derive $\lambda_s$. Solving for $\lambda_s$, we find that $\lambda_s=-\frac{\log p(s)}{\tau-s}$, so the implied conditional prediction curve is $$p_s(t)=e^{\log p(s)\,\frac{\tau-t}{\tau-s}}=p(s)^{\frac{\tau-t}{\tau-s}}.$$
Example
Suppose the current date is in the middle of 2022 and we consider the question "Will Putin and Zelenskyy not meet to discuss the peaceful resolution of the Russian-Ukrainian conflict before 2023?". Then we can put $\tau=1$ and $s=0.5$. In the plot below we show $p(s)=0.2, 0.6, 0.8$, where 0.8 was the Metaculus prediction at the time. When p(s) is reasonably large, the conditional prediction curve is almost linear, making $p(s)+(1-p(s))\frac{t-s}{\tau-s}$ a reasonable approximation to $p_s(t)$.
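A minimal Python sketch of the constant-hazard curve and the near-linear approximation (my own illustration, using the numbers from this example: τ = 1, s = 0.5, and the Metaculus point forecast p(s) = 0.8):

def constant_hazard_prediction(t, s, p_s, tau):
    # p_s(t) = p(s)^((tau - t)/(tau - s)), implied by the constant hazard
    # lambda_s = -log(p(s))/(tau - s).
    return p_s ** ((tau - t) / (tau - s))

def linear_approximation(t, s, p_s, tau):
    # The approximation p(s) + (1 - p(s))(t - s)/(tau - s) discussed above.
    return p_s + (1 - p_s) * (t - s) / (tau - s)

s, tau, p_s = 0.5, 1.0, 0.8
for t in [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    print(t, round(constant_hazard_prediction(t, s, p_s, tau), 3),
          round(linear_approximation(t, s, p_s, tau), 3))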
The benefit of assuming a constant hazard rate lies in its simplicity.
The forecaster doesn't have to put in any extra work in the constant hazard model; everything happens automatically.
From the aggregator's point of view, the constant hazard model allows you to do principled aggregation using only one data point per forecaster, as you can derive the most up-to-date prediction for every forecaster.
The scorer can calculate principled scores using the scoring rule s(p(T),X) straight from the data.
Weibull hazard
The Weibull hazard is usually written in the form $h(t)=bkt^{k-1}$. It is used to model increasing (when $k>1$) or decreasing ($k<1$) hazard rates. The conditional prediction curve is $p_s(t)=\exp[-b(\tau^k-t^k)]$.
To use the Weibull hazard you can provide a point estimate at the current time and then either
visually modify the curve until you’re pleased with the look,
provide another point estimate and deduce the values of b, k mathematically,
provide more than two points and use e.g. least squares to find the best-fitting curve.
Visual modification
Take the logarithm of p(s) and solve for b: $$b=-\frac{\log p(s)}{\tau^k-s^k}.$$ The conditional prediction curve can then be written in terms of k and p(s): $$p_s(t)=p(s)^{\frac{\tau^k-t^k}{\tau^k-s^k}}.$$
Now you can plot the hazard rate and the conditional prediction curve while sliding k around. You can stop at the k you’re most comfortable with.
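A small Python sketch of this parametrization (my own illustration; the interactive slider is omitted, you would simply recompute the curve for different values of k):

import numpy as np

def weibull_prediction(t, s, p_s, tau, k):
    # Weibull conditional prediction curve p_s(t) = p(s)^((tau^k - t^k)/(tau^k - s^k)),
    # parametrized by the shape k and the point forecast p(s) made at time s.
    return p_s ** ((tau ** k - t ** k) / (tau ** k - s ** k))

# Try a few shapes for a forecast p(s) = 0.8 made at s = 0.5 with deadline tau = 1.
t_grid = np.linspace(0.5, 1.0, 6)
for k in [0.5, 1.0, 2.0, 3.0]:
    print(k, np.round(weibull_prediction(t_grid, 0.5, 0.8, 1.0, k), 3))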
Two predictions
Suppose $r>t$ and $1>p_s(r)>p_s(t)>0$. We need to solve $$\frac{\log p_s(t)}{\tau^k-t^k}=\frac{\log p_s(r)}{\tau^k-r^k}.$$ This is equivalent to solving $$\frac{\tau^k-r^k}{\tau^k-t^k}=\frac{\log p_s(r)}{\log p_s(t)}.$$ Since $1>p_s(r)>p_s(t)>0$ we have $1>\frac{\log p_s(r)}{\log p_s(t)}>0$. In addition, $0<\frac{\tau^k-r^k}{\tau^k-t^k}<1$ is increasing in k, as can be verified by taking its derivative, and it tends to 0 as $k\to-\infty$ and to 1 as $k\to\infty$, so the equality has a solution that can be found using root-finding.
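A sketch of the root-finding step (my own illustration; the two forecast values and the search bracket are made up, and the bracket must contain the root while avoiding k = 0, where the left-hand side is 0/0):

import numpy as np
from scipy.optimize import brentq

def solve_weibull_shape(t, r, p_t, p_r, tau, k_lo, k_hi):
    # Find the shape k so that the Weibull curve passes through (t, p_t) and (r, p_r),
    # i.e., solve (tau^k - r^k)/(tau^k - t^k) = log(p_r)/log(p_t) for k.
    target = np.log(p_r) / np.log(p_t)
    g = lambda k: (tau ** k - r ** k) / (tau ** k - t ** k) - target
    return brentq(g, k_lo, k_hi)

# Made-up example: forecasts p_s(0.5) = 0.8 and p_s(0.75) = 0.9 with deadline tau = 1.
k = solve_weibull_shape(0.5, 0.75, 0.8, 0.9, 1.0, k_lo=0.1, k_hi=10.0)
b = -np.log(0.8) / (1.0 ** k - 0.5 ** k)
print("k =", round(k, 3), "b =", round(b, 3))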
Example: “Will India have at least 200 nuclear warheads at the end of 2023?”
This is plausibly a question with an increasing hazard rate. The description says that "As of May 2021, the Federation of American Scientists estimated India as having 160 nuclear warheads." In order to reach 200 warheads, they first have to reach 161, then 162, and so on, making it more likely that they finally reach 200 at any given instant as time goes on.
Due to the way I’ve formulated the mathematics, we have to analyze the opposite question “Will India have less than 200 nuclear warheads at the end of 2023?” instead.
Suppose I make the forecast $p(s)=0.95$ at time $s=0$, equal to the last day of 2022, and suppose that $\tau=1$ corresponds to the last day of 2023. Then $$p(s)^{\frac{\tau^k-t^k}{\tau^k-s^k}}=0.95^{(1-t^k)}.$$ The plot below shows the resulting prediction curves for $k\in\{1/3,1/2,1,2,3\}$. To interpret the red line, observe that the prediction barely changes when t is small enough. This reflects that the probability of India reaching 200 warheads is small in the short term. However, as t approaches 1 and they still haven't reached 200 warheads, the probability of them not reaching that number increases rapidly.
Gompertz—Makeham hazard
The Gompertz—Makeham hazard has the form $h(t)=\alpha e^{\beta t}+\lambda$. From $p_s(t)=\exp[-\int_t^\tau h_s(x)\,dx]$ we find that $$p_s(t)=\exp\left[-\int_t^\tau\left(\alpha e^{\beta x}+\lambda\right)dx\right]=\exp\left[-\frac{\alpha}{\beta}\left(e^{\beta\tau}-e^{\beta t}\right)\right]\exp[-\lambda(\tau-t)].$$ The Gompertz—Makeham hazard has an age-dependent term $\alpha e^{\beta t}$ (the Gompertz term) and an age-independent term $\lambda$ (the Makeham term). We can potentially think of them independently. In some cases there are multiple sources of both age-dependent and age-independent terms, making it a multi-Gompertz—Makeham hazard. If we have k Gompertz components, the k-Gompertz—Makeham hazard is $$h(t)=\sum_{i=1}^k\alpha_i e^{\beta_i t}+\lambda,$$ with conditional prediction curve $$p_s(t)=\exp\left[-\int_t^\tau\left(\sum_{i=1}^k\alpha_i e^{\beta_i x}+\lambda\right)dx\right]=\exp[-\lambda(\tau-t)]\prod_{i=1}^k\exp\left[-\frac{\alpha_i}{\beta_i}\left(e^{\beta_i\tau}-e^{\beta_i t}\right)\right].$$
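A minimal Python sketch of this conditional prediction curve (my own illustration; the parameter values in the example are arbitrary):

import numpy as np

def gompertz_makeham_prediction(t, tau, alphas, betas, lam):
    # Conditional prediction curve for a k-component Gompertz-Makeham hazard
    # h(x) = sum_i alpha_i * exp(beta_i * x) + lambda, integrated from t to tau.
    integral = lam * (tau - t)
    for a, b in zip(alphas, betas):
        integral += (a / b) * (np.exp(b * tau) - np.exp(b * t))
    return np.exp(-integral)

# Arbitrary example with one decreasing and one increasing Gompertz component.
print(gompertz_makeham_prediction(0.0, 1.0, alphas=[0.1, 0.02], betas=[-1.0, 0.2], lam=0.01))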
Example: "Will Putin stay in power until August 11th 2030?"
We can divide the hazard into four parts: mortality, a time-independent hazard for being kicked out of power, a time-independent hazard for a coup, and a time-dependent hazard for a coup.
Mortality. This document estimates a Gamma—Gompertz—Makeham model on US data and finds parameters $\beta\approx 0.1$, $\lambda=0.001$, and $\alpha_{30}=0.00035$ (this means the baseline age is 30, i.e., mortality starts increasing with age only at age 30). This is not the right country nor the right model, but the parameters should be close enough. Since there are some rumors of Putin being sick, I'll modify the constant hazard to $\lambda=0.01$. Since $e^{0.1\cdot 39}\approx 50$, the Gompertz part of the hazard is $0.00035e^{0.1(t+39)}\approx 0.02e^{0.1t}$.
Time-independent hazard for a coup. I haven’t found a good source on this, but it’s probably not too hard to find following the leads in e.g. this paper. I’m guessing a 1% yearly ambient risk of a coup.
Time-dependent hazard for a coup. For instance, one might reasonably think this one will decrease with temporal distance from the start of the Ukraine war. Let's say the Ukraine conflict adds an annual hazard of 5% right now, expected to decrease to 1% in two years' time. Thus $\alpha=0.05$ and $0.05e^{2\beta}=0.01$, which implies $\beta=\log(1/5)/2\approx -0.8$.
We end up with the hazard rate $0.05e^{-0.8t}+0.02e^{0.1t}+0.03$, a sum of two Gompertz components and one Makeham component.
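To see how this compares with the constant hazard model discussed in the next paragraph, here is a sketch (my own, reusing the gompertz_makeham_prediction function from the previous code sketch) that computes the curve over a horizon of roughly eight years together with a constant hazard curve matched to the same starting forecast:

import numpy as np

def gompertz_makeham_prediction(t, tau, alphas, betas, lam):
    integral = lam * (tau - t)
    for a, b in zip(alphas, betas):
        integral += (a / b) * (np.exp(b * tau) - np.exp(b * t))
    return np.exp(-integral)

# Roughly 8 years from mid-2022 to August 2030 (an assumed horizon).
tau = 8.0
alphas, betas, lam = [0.05, 0.02], [-0.8, 0.1], 0.03

t_grid = np.linspace(0.0, tau, 9)
gm_curve = np.array([gompertz_makeham_prediction(t, tau, alphas, betas, lam) for t in t_grid])

# Constant hazard curve with the same point forecast p(0).
p0 = gm_curve[0]
constant_curve = p0 ** ((tau - t_grid) / tau)

print(np.round(gm_curve, 3))
print(np.round(constant_curve, 3))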
It appears that my complicated Gompertz—Makeham modelling has been for naught, as the prediction curve is virtually identical to the constant hazard prediction curve. I don't know if we should expect this to happen in general or not. It might be because the two Gompertz components cancel each other out.
As a side effect, this analysis also yields a density for Date Putin Exits Presidency of Russia. The expression for the survival curve is $$P(T>t)=\exp\left[-\sum_{i=1}^k\frac{\alpha_i}{\beta_i}\left(e^{\beta_i t}-1\right)-\lambda t\right],$$ which can be differentiated to find the density f(t) as seen below.
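Carrying out the differentiation explicitly, using the standard survival-analysis identity $f(t)=h(t)P(T>t)$ (added here for completeness): $$f(t)=\left(\sum_{i=1}^k\alpha_i e^{\beta_i t}+\lambda\right)\exp\left[-\sum_{i=1}^k\frac{\alpha_i}{\beta_i}\left(e^{\beta_i t}-1\right)-\lambda t\right].$$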
Concluding thoughts
I feel quite confident that conditional prediction curves are the best option for handling the time problem in binary forecasts. There are some alternatives, such as providing the entire distribution p(x,t), but that looks quite cumbersome. There are many benefits to using conditional prediction curves (for forecasters, aggregators, and scorers), they are not too difficult to implement for forecasting platforms, and it should be possible to develop good tutorials that make forecasters comfortable with them.
It would be great to find out whether the complicated hazard functions are worth the hassle, or whether the constant hazard is enough for most purposes. The Putin example suggests a constant hazard rate might be enough, as the complicated multi-Gompertz—Makeham prediction curve is virtually the same as the constant hazard prediction curve based on the same p(s)!
I don’t have too many hints for how to choose among the different hazard functions. But you might use empirics as a guide. For instance, the Gompertz—Makeham hazard appears to fit mortality data better than the Weibull—Makeham hazard but the difference appears to be marginal. If you’re dealing with questions such as “Will Putin be ousted as president of Russia by 2030?”, such observations might help you. There are also theoretical reasons to prefer one over the other in some cases, but I don’t know if they are useful.
It could be reasonable to mix the Weibull and Gompertz components too, for instance following the same kind of reasoning as in the Putin example above. There are infinitely many hazard functions I haven’t talked about at all, such as the log-normal hazard. Some of these may have nice interpretations that could help the forecaster.
Appendix
Proof that s′w is proper
We show that the weighted version $s'_w(q,X)=\int_{t_0}^{\infty}w(t)s(q(t),X)\,dt$ is a proper scoring rule for any positive weighting function w. Let p(t) denote the true probability $P(X=1\mid F_t)$, where $F_t$ is the information observed until time t. Let q(t) be any other stochastic process adapted to $F_t$. Since $w(t)s(q(t),X)$ is non-negative, we can apply Fubini's theorem to get
$$E\left[\int_{t_0}^{\infty}w(t)s(q(t),X)\,dt\right]=\int_{t_0}^{\infty}E[w(t)s(q(t),X)]\,dt=\int_{t_0}^{\infty}E\big[w(t)E[s(q(t),X)\mid F_t]\big]\,dt,$$ where the second equality follows from iterated expectations. Since p(t) is the true probability of X=1 conditioned on $F_t$, and s is a proper scoring rule, we have $$E[s(q(t),X)\mid F_t]\geq E[s(p(t),X)\mid F_t].$$ It follows that $$E\left[\int_{t_0}^{\infty}w(t)s(q(t),X)\,dt\right]\geq E\left[\int_{t_0}^{\infty}w(t)s(p(t),X)\,dt\right],$$ hence $s'_w$ is a proper scoring rule.
Comment on the scoring rule
The scoring rule $s'(p,X)=\int_{t_0}^{\infty}s(p(t),X)\,dt$ has the weakness that early forecasters are penalized. If the scoring rule is bounded above, such as the Brier score, early forecasting can be incentivized by setting $p(t)=1-X$ for all time points t before the forecaster made their first forecast. Other than that, it appears to me to be a reasonable scoring rule for evaluating forecasts in time. There are other potential scoring rules, such as $s(p(T),X)$, which do not appear to be proper for prediction curves; but it might also be that prediction curves aren't the correct abstraction.
Proof that $p_s(t)=\exp[-\int_t^\tau h_s(x)\,dx]$.
We know that $$p_s(t)=P(T=\tau\mid T>t, F_s)=1-P(T<\tau\mid T>t, F_s).$$ Using the equality $S(t)=\exp[-\int_0^t h(x)\,dx]$, where S(t) is the survival function and h(t) the hazard rate, we find that $$P(T<\tau\mid T>t, F_s)=\frac{P(T<\tau\mid F_s)-P(T<t\mid F_s)}{P(T>t\mid F_s)}=\frac{\left(1-\exp\left[-\int_0^\tau h_s(x)\,dx\right]\right)-\left(1-\exp\left[-\int_0^t h_s(x)\,dx\right]\right)}{\exp\left[-\int_0^t h_s(x)\,dx\right]}=1-\exp\left[-\int_t^\tau h_s(x)\,dx\right].$$ The equality $p_s(t)=\exp[-\int_t^\tau h_s(x)\,dx]$ follows from the definition of $p_s(t)$.
Proof of Proposition 1, that $p_s(t)$ is non-decreasing in t for every s.
Suppose that $r>t$. Using $p_s(t)=\exp[-\int_t^\tau h_s(x)\,dx]$ we find that $$p_s(r)/p_s(t)=\exp\left[-\int_r^\tau h_s(x)\,dx+\int_t^\tau h_s(x)\,dx\right]=\exp\left[\int_t^r h_s(x)\,dx\right].$$ Since $h_s(x)\geq 0$, $p_s(r)/p_s(t)\geq 1$, hence $p_s(r)\geq p_s(t)$. In the same way, if the hazard rate is strictly positive, we have $h_s(x)>0$ for all x, so $p_s(r)/p_s(t)>1$, hence $p_s(r)>p_s(t)$.
Proof of Proposition 3
We can ignore the dependence on $F_s$ and work directly with probability measures, assuming S and R are independent with hazard rates $h_S$ and $h_R$. In this case $P(S<R\mid S>t, R>t)=p_s(t)$, and we see that $$P(S<R\mid S>t, R>t)=\frac{P(t<S<R)}{P(S>t)P(R>t)}=\frac{P(t<S<R)}{\exp\left[-\int_0^t h_S(x)\,dx\right]\exp\left[-\int_0^t h_R(x)\,dx\right]}.$$

We find that $$P(t<S<R)=\int_t^\infty P(s<R)f_S(s)\,ds=\int_t^\infty \exp\left[-\int_0^s h_R(x)\,dx\right]f_S(s)\,ds=\int_t^\infty \exp\left[-\int_0^s h_R(x)\,dx\right]h_S(s)\exp\left[-\int_0^s h_S(x)\,dx\right]ds,$$ where $f_S$ is the density of S. Thus we need to equate $$\frac{\int_t^\infty \exp\left[-\int_0^s h_R(x)\,dx\right]h_S(s)\exp\left[-\int_0^s h_S(x)\,dx\right]ds}{\exp\left[-\int_0^t h_S(x)\,dx\right]\exp\left[-\int_0^t h_R(x)\,dx\right]}=f(t).$$

Multiply both sides by $\exp[-\int_0^t h_S(x)\,dx]\exp[-\int_0^t h_R(x)\,dx]$ to obtain $$\int_t^\infty \exp\left[-\int_0^s h_R(x)\,dx\right]h_S(s)\exp\left[-\int_0^s h_S(x)\,dx\right]ds=f(t)\exp\left[-\int_0^t h_S(x)\,dx\right]\exp\left[-\int_0^t h_R(x)\,dx\right],$$ and differentiate with respect to t to get $$\exp\left[-\int_0^t h_R(x)\,dx\right]h_S(t)\exp\left[-\int_0^t h_S(x)\,dx\right]=f(t)\left(h_S(t)+h_R(t)\right)\exp\left[-\int_0^t h_S(x)\,dx\right]\exp\left[-\int_0^t h_R(x)\,dx\right]-f'(t)\exp\left[-\int_0^t h_S(x)\,dx\right]\exp\left[-\int_0^t h_R(x)\,dx\right].$$ Multiply both sides by $\exp[\int_0^t h_S(x)\,dx]\exp[\int_0^t h_R(x)\,dx]$ to obtain $h_S(t)=f(t)(h_S(t)+h_R(t))-f'(t)$, which can be rearranged to $$h_R(t)=\frac{h_S(t)+f'(t)}{f(t)}-h_S(t).$$ The function $h_R(t)$ is a hazard function if and only if it is non-negative, hence we require $\frac{h_S(t)+f'(t)}{f(t)}\geq h_S(t)$, i.e., $h_S(t)\geq f(t)h_S(t)-f'(t)$. Solving the equality $h_S(t)=f(t)h_S(t)-f'(t)$ yields $h_S(t)=-f'(t)/(1-f(t))$, but this function is negative when $f'(t)$ is positive, hence it's not in general a hazard function.

We can fix this by defining $$h_S(t)=\frac{\max(-f'(t),0)}{1-f(t)},$$ for if $-f'(t)$ is non-positive, $h_S(t)=0$ while $f(t)h_S(t)-f'(t)=-f'(t)\leq 0$.
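As a small illustration of this construction (my own sketch, not part of the post), the two hazards can be computed numerically from a target curve f; the target below is made up and the derivative is taken numerically.

import numpy as np

def hazards_from_target(f, t, eps=1e-5):
    # Construct h_S(t) and h_R(t) so that P(S < R | S > t, R > t) = f(t),
    # following the proof: h_S = max(-f', 0)/(1 - f) and h_R = (h_S + f')/f - h_S.
    f_t = f(t)
    f_prime = (f(t + eps) - f(t - eps)) / (2 * eps)  # numerical derivative
    h_S = max(-f_prime, 0.0) / (1.0 - f_t)
    h_R = (h_S + f_prime) / f_t - h_S
    return h_S, h_R

# Made-up target conditional prediction curve taking values in (0, 1).
f = lambda t: 0.3 + 0.2 * np.sin(t)
for t in [0.5, 1.5, 2.5, 3.5]:
    print(t, hazards_from_target(f, t))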