Forecasting Techniques
Question Decomposition
If we say “$X$ will happen if and only if $Y_1$ and $Y_2$ and
$Y_3$… all happen, so we estimate $P(Y_1)$ and $P(Y_2|Y_1)$
and $P(Y_3|Y_1,Y_2)$ &c, and then multiply them together to estimate
$P(X)=P(Y_1)\cdot P(Y_2|Y_1)\cdot P(Y_3|Y_2,Y_1)\cdot\ldots$”, do we usually get a
probability that is close to $P(X)$? Does this improve on forecasts where
one tries to estimate $P(X)$ directly?
Lawrence et al. 2006 summarize the state of research on the question:
Decomposition methods are designed to improve accuracy by splitting
the judgmental task into a series of smaller and cognitively less
demanding tasks, and then combining the resulting judgements. Armstrong
(2001)
distinguishes between decomposition, where the breakdown of
the task is multiplicative (e.g. sales forecast=market size
forecast×market share forecast), and segmentation, where it is
additive (e.g. sales forecast=Northern region forecast+Western
region forecast+Central region forecast), but we will use the
term for both approaches here. Surprisingly, there has been
relatively little research over the last 25 years into the value
of decomposition and the conditions under which it is likely to
improve accuracy. In only a few cases has the accuracy of forecasts
resulting from decomposition been tested against those of control
groups making forecasts holistically. One exception is Edmundson
(1990)
who found that for a time series extrapolation task, obtaining separate
estimates of the trend, seasonal and random components and then combining
these to obtain forecasts led to greater accuracy than could be obtained
from holistic forecasts. Similarly, Webby, O’Connor and Edmundson
(2005)
showed that, when a time series was disturbed in some periods by several
simultaneous special events, accuracy was greater when forecasters were
required to make separate estimates for the effect of each event, rather
than estimating the combined effects holistically. Armstrong and Collopy
(1993) also constructed
more accurate forecasts by structuring the selection and weighting
of statistical forecasts around the judge’s knowledge of separate
factors that influence the trends in time series (causal forces).
Many other proposals for decomposition methods have been based on an
act of faith that breaking down judgmental tasks is bound to improve
accuracy or upon the fact that decomposition yields an audit trail
and hence a defensible rationale for the forecasts (Abramson & Finizza,
1991; Bunn & Wright, 1991; Flores, Olson, & Wolfe, 1992; Saaty &
Vargas, 1991; Salo & Bunn, 1995; Wolfe & Flores, 1990).
Yet, as Goodwin and Wright (1993)
point out, decomposition is not guaranteed to improve accuracy and may
actually reduce it when the decomposed judgements are psychologically more
complex or less familiar than holistic judgements, or where the increased
number of judgements required by the decomposition induces fatigue.
(Emphasis mine).
The types of decomposition described here seem quite different from
the ones used in the sources above: Decomposed time series are quite
dissimilar to multiplied probabilities for binary predictions, and in
combination with the conceptual counter-arguments the evidence appears
quite weak.
It appears as if a team of a few (let’s say 4) dedicated forecasters could
run a small experiment to determine whether multiplicative decomposition
for binary forecasts is a good method, by spending 20 minutes per question
on either an explicitly decomposed forecast or a control forecast, assigned
at random (although the exact method for control needs to be elaborated on).
Working in parallel, making 70 forecasts should take
$70\text{ forecasts}\cdot\frac{1\text{ hr}}{3\text{ forecasts}}\cdot\frac{1}{4}\approx 5.8\text{ hr}$,
i.e. less than 6 hours, although it’d be useful to search for
more recent literature on the question.
Would decomposition work better if one were operating with log-odds instead of probabilities?
Classification and Improvements
The description of such decomposition in this
section is, of course, lacking: A
better way of decomposition would be, for a specific outcome,
to find a set of preconditions for X that are mutually
exclusive
and collectively
exhaustive, find
a chain that precedes them (or another MECE decomposition), and iterate
until a whole (possibly interweaving) tree of options has been found.
Thus one can define three types of question decomposition:
Multiplicative Decomposition: Given an event $X$, find conditions $Y_1,\ldots,Y_n$ so that $X$ happens if and only if all of $Y_1,\ldots,Y_n$ happen. Estimate $P(Y_1)$ and $P(Y_2|Y_1)$ and $P(Y_3|Y_1,Y_2)$ &c, and then multiply them together to estimate $P(X)=P(Y_1)\cdot P(Y_2|Y_1)\cdot P(Y_3|Y_2,Y_1)\cdots P(Y_n|Y_{n-1},\ldots,Y_2,Y_1)$.
Additive Decomposition or MECE Decomposition: Given an event $X$, find a set of scenarios $Y_1,\ldots,Y_n$ such that $X$ happens if and only if at least one $Y_i$ happens, and no two $Y_k,Y_l$ with $k\neq l$ have $P(Y_k\cap Y_l)>0$. Estimate $P(Y_1),P(Y_2),\ldots,P(Y_n)$ and then estimate $P(X)=\sum_{i=1}^{n}P(Y_i)$.
Recursive Decomposition: For each scenario $X'$, decide to pursue one of the following strategies:
Estimate $P(X')$ directly
Multiplicative decomposition of $P(X')$
Find a multiplicative decomposition $Y'_1,\ldots,Y'_n$ for $X'$
Estimate $P(Y'_1),\ldots,P(Y'_n|Y'_1,\ldots,Y'_{n-1})$ each via recursive decomposition
Determine $P(X')=P(Y'_1)\cdot P(Y'_2|Y'_1)\cdot P(Y'_3|Y'_2,Y'_1)\cdots P(Y'_n|Y'_{n-1},\ldots,Y'_2,Y'_1)$.
Additive decomposition of $P(X')$
Find an additive decomposition $Y'_1,\ldots,Y'_n$ for $X'$
Estimate $P(Y'_1),\ldots,P(Y'_n)$ each via recursive decomposition
Determine $P(X')=P(Y'_1)+P(Y'_2)+\ldots+P(Y'_n)$.
A keen reader will notice that recursive decomposition is similar to
Bayes nets. True, though
it doesn’t deal as well with conditional probabilities.
Using LLMs
This is a scenario where large language models are quite useful, and
we have a testable hypothesis: Does question decomposition (or MECE
decomposition) improve language model forecasts by any amount?
Frontier LLMs are at best mediocre at
forecasting real-world events, but just as asking
for calibration improves performance, perhaps
chain-of-thought-like
question decomposition improves (or reduces) their performance (and
therefore gives us reason to believe that similar practices will (or
won’t) work with human forecasters).
Direct:
Provide your best probabilistic estimate for the following question.
Give ONLY the probability, no other words or explanation. For example:
10%. Give the most likely guess, as short as possible; not a complete
sentence, just the guess!
The question is: ${QUESTION}. ${RESOLUTION_CRITERIA}.
Multiplicative decomposition:
Provide your best probabilistic estimate a question.
Your output should be structured in three parts.
First, determine a list of factors X₁, …, X_n that are necessary
and sufficient for the question to be answered "Yes". You can choose
any number of factors.
Second, for each factor X_i, estimate and output the conditional
probability P(X_i|X₁, X₂, …, X_{i-1}), the probability that X_i
will happen, given all the previous factors *have* happened. Then, arrive
at the probability for Q by multiplying the conditional probabilities
P(X_i):
P(Q)=P(X₁)*P(X₂|X₁)…P(X_n|X₁, X₂, …, X_{n-1}).
Third and finally, In the last line, report P(Q), WITHOUT ANY ADDITIONAL
TEXT. Just write the probability, and nothing else.
Example (Question: "Will my wife get bread from the bakery today?"):
Necessary factors:
1. My wife remembers to get bread from the bakery.
2. The car isn't broken.
3. The bakery is open.
4. The bakery still has bread.
1. P(My wife remembers to get bread from the bakery)=0.75
2. P(The car isn't broken|My wife remembers to get bread from the bakery)=0.99
3. P(The bakery is open|The car isn't broken, My wife remembers to get bread from the bakery)=0.7
4. P(The bakery still has bread|The bakery is open, The car isn't broken, My wife remembers to get bread from the bakery)=0.9
Multiplying out the probabilities: 0.75*0.99*0.7*0.9=0.467775
46.7775%
(End of output)
The question is: ${QUESTION}. ${RESOLUTION_CRITERIA}
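To score outputs from the multiplicative prompt, one has to pull the numbers back out of the model's text. The following is a minimal sketch of such a parser, assuming the model follows the example format above (numbered `P(...)=<number>` lines and a bare percentage on the last line). The names `parse_decomposed_answer` and `is_consistent` are hypothetical helpers, not part of any existing tool, and real model outputs will likely need more forgiving parsing.

```python
from __future__ import annotations

import math
import re

# Conditional estimates as in the example output, e.g. "1. P(...)=0.75".
FACTOR_LINE_RE = re.compile(r"^\s*\d+\.\s*P\(.*=\s*([01](?:\.\d+)?)\s*$")
# A bare percentage on its own line, e.g. "46.7775%".
PERCENT_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*%\s*$")


def parse_decomposed_answer(output: str) -> tuple[list[float], float]:
    """Return (conditional probabilities, reported final probability)."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    factors = []
    for ln in lines:
        match = FACTOR_LINE_RE.match(ln)
        if match:
            factors.append(float(match.group(1)))
    # The final answer is the last line that is only a percentage.
    for ln in reversed(lines):
        match = PERCENT_RE.match(ln)
        if match:
            return factors, float(match.group(1)) / 100.0
    raise ValueError("no final percentage line found")


def is_consistent(factors: list[float], final: float, tol: float = 1e-3) -> bool:
    """Check that the reported probability equals the product of the factors."""
    return math.isclose(math.prod(factors), final, abs_tol=tol)


EXAMPLE = """\
1. P(My wife remembers to get bread from the bakery)=0.75
2. P(The car isn't broken|My wife remembers to get bread from the bakery)=0.99
3. P(The bakery is open|The car isn't broken, My wife remembers to get bread from the bakery)=0.7
4. P(The bakery still has bread|The bakery is open, The car isn't broken, My wife remembers to get bread from the bakery)=0.9
Multiplying out the probabilities: 0.75*0.99*0.7*0.9=0.467775
46.7775%
"""

factors, final = parse_decomposed_answer(EXAMPLE)
print(factors, final, is_consistent(factors, final))
```

The consistency check is there because a model may report a final number that does not match its own factors. Whether to then trust the product or the reported number is a design choice this sketch leaves open.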
Forecasters: What Do They Know? Do They Know Things?? Let’s Find Out!
In the spirit of mandatory draft amnesty day
Beginnings of a research agenda about judgmental forecasting.
Judgmental forecasting is a fairly recent and (in my humble opinion) fairly under-researched & under-appreciated human endeavour & field of research, with some low-hanging fruit (which are getting picked almost as fast as I can write them up).
The Five Horsemen of Hard Forecasting
In general, judgmental forecasting methods operate best in areas with fast feedback loops, large existing datasets (or at least good reference classes for base rates) and continuous historical trends.
We can therefore identify the five horsemen of hard forecasting:
Long time horizons: Because most forecasters and traders discount the future (either due to rewards further in the future being less certain, or because whatever investment is bound up in a bet could be used in the meantime, or because they actually weigh the future lower), and because long-term thinking activates far mode from construal level theory, the incentives to perform well on long-term questions are weaker than on short-term questions. Additionally, forecasters receive much more & better feedback on short-term questions. One would expect long-term questions to receive less accurate forecasts because of this, and the evidence points to this being the case (Dillon 2021, Niplav 2022). But we’re often especially interested in long-term questions: How can we incentivize or create good forecasts on those questions?
Reward-correlated predictions: The clearest examples of this problem are questions on extinction events: If you forecast doom, you’re never going to get rewarded for it, because the resolution happens only in worlds where the bad outcome didn’t occur. Forecasters are embedded agents in the world they are predicting on, and there is no Cartesian boundary. This can happen with prediction markets as well: when making predictions on the outcome of a decision, with the payout of the prediction market being in a currency that is affected by the decision (for example by devaluing it relative to other currencies), the market might choose the “worse” decision (according to the metric used for scoring it) because it prevents the currency from being devalued as much.
Low probability events: Some events are very important, but have a low probability (extreme stock market crashes, extinction events, rare diseases, encounters with aliens etc.). But low probability events are maybe even harder to forecast than long time horizon events: they often don’t have good reference classes, while long time horizon questions do (that’s why we have history and time series data!), and forecasters very rarely encounter them. We might just round all probabilities <1% to 0%, lest we get Pascal’s mugged, but in doing so we close our eyes to possible dangers (and prizes) out there. The Talebian approach of erring on the side of caution by “rounding them up”, on the other hand, condemns us to eternal overcaution and conservatism. So as a first step we definitely want our probabilities to be as accurate as possible.
Out-of-distribution situations: Whenever things with no clear existing reference class occur, such as novel technologies (social media, the internet in general, nuclear weapons, international shipping logistics, and in the future potentially genetic engineering or self-driving cars), forecasters struggle to anticipate the consequences (or foresee those shifts). This isn’t limited to forecasters and prediction markets: if regular people, pundits and domain experts on average do worse than top forecasters (though as a counterpoint to forecasters>experts see Leech & Yagudin 2022), then we wouldn’t expect them to do much better specifically in very novel & unforeseen situations (reasons why this could still happen: experts might have detailed causal models that are outperformed by simple heuristics in the modal case, but as we go outside of the normal course of events, those causal & theoretical models break down much more gracefully than simple surface heuristics).
Hard-to-specify events: Maybe we are slicing up forecasting the wrong way: as the old adage goes, the hard part is not coming up with the answer, it is coming up with the right question to ask. Similarly, for forecasting, we often run into the problem of specifying exactly what we want to know about: Too broad and you drive away forecasters and traders who don’t want to waste their time on predicting the whims of whoever resolves the market in the end, too narrow and you miss what you actually care about or invite Goodharting. An additional layer of complexity is added when hobbyists do your forecasting, in which case narrow questions just aren’t very interesting to do predictions on. This could be seen with the Metaculus clean meat tournament: many questions were just different combinatorial variations on each other, with maybe five being interesting to predict on, but not all fourteen, leading to many questions receiving less than 100 predictions during the tournament. But “interestingness” and “specifiability” appear to be tugging in opposite directions: hobbyists are probably most interested in making broad claims that flow from their worldview, instead of finding minutiae for very specific questions. Finding ways to create more specific questions on events (or avoid doing so with clever tricks while still receiving accurate forecasts) is important and difficult. Latent variable prediction markets offer one approach—how easy are they to implement with acceptable UX?
We can use these categories as guideposts: How bad are these as problems? What approaches have been proposed/tried/implemented so far? If we can improve one of them without harming our ability to perform well on the others, we have made progress; if we improve several in tandem, that’s even better.
How Good Are We At Forecasting?
How good are long-term forecasts?
How quickly does our forecasting ability decrease with increasing range of the question/forecast?
Does it decrease at all, or just oscillate wildly?
How quickly does performance degrade in different categories of questions (finance, meteorology, global economics, technological development) and by different forecasters (prediction markets, superforecasters & teams)?
Are there people who are better long-term forecasters and people who are better short-term forecasters?
See here
How good are our forecasts on low-probability events?
How good are our forecasts on extinction events?
How good are our forecasts in situations where we have historical discontinuities?
How quickly/slowly do our forecasts converge to the final answer?
When don’t they converge?
Can we classify convergence/divergence/oscillation behaviors?
How do prediction markets, professional forecasting teams, internet enthusiasts and large language models compare?
Arb 2022
GPT-3 forecasting ability
What is a good formalization of the idea of a forecaster being accurate at a level of n%?
See Precision of Sets of Forecasts
Are better short-term forecasters also better long-term forecasters?
Do forecasters become better at forecasting over time?
How quickly?
Over time/over more forecasts
How much does forecaster quantity affect forecast quality on continuous questions? (i.e., extend Dillon 2021 to continuous data)
How much does forecasting time affect forecast quality? That is, what is the relation of accuracy of prediction to the time spent on refining that prediction?
Generally, scaling laws for forecasting would be interesting/cool to see.
How much do number of resolutions/forecasts matter for forecast quality/learning?
Do laypeople/pundits/domain experts perform better than forecasters/superforecasters/forecasting teams/prediction markets specifically under novel & unforeseen situations?
Are more extreme views or more conservative views more accurate?
Question originally asked in Hanson 2007
How well does forecasting expertise in one domain transfer to another?
That is, if a forecaster starts by forecasting in some domain D, and after a while switches to domain D′, how much better is the forecaster than if he’d started out in D′ without any other experience?
This would be even more interesting when also having a metric for the difference between D and D′.
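As a starting point for the questions about accuracy and forecast range, here is a minimal sketch that groups resolved binary forecasts by range and reports the mean Brier score per group. The input format (tuples of probability, outcome and range in days), the function name `brier_by_range` and the 30-day bucket width are assumptions made for illustration. A real analysis would also need to handle question difficulty varying with range.

```python
from collections import defaultdict


def brier_by_range(forecasts, bucket_days=30):
    """Mean Brier score per range bucket.

    forecasts: iterable of (probability, outcome, range_days), where outcome
    is 0 or 1 and range_days is the time between forecast and resolution.
    Returns a dict mapping the lower edge of each bucket (in days) to the
    mean Brier score of the forecasts falling into it.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for prob, outcome, range_days in forecasts:
        bucket = int(range_days // bucket_days)
        sums[bucket] += (prob - outcome) ** 2
        counts[bucket] += 1
    return {b * bucket_days: sums[b] / counts[b] for b in sorted(sums)}


# Made-up data, only to show the call shape.
toy = [(0.9, 1, 5), (0.8, 1, 10), (0.6, 0, 40), (0.3, 1, 45)]
print(brier_by_range(toy))
```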
How Can We Become Better At Forecasting?
Scoring Rules
What possible forecasting scoring rules could we develop?
Taking into account:
Accuracy compared to others
Importance of question
That incentivize collaboration and positive-sum interactions instead of information-hiding
The literature on information elicitation could be useful here
How can we compare the skill and reliability of forecasters to one another?
Metaculus at the moment does this by “who writes good comments”. That seems inadequate.
Taking into account:
Number of questions each forecaster predicted on
Calibration
Resolution
Importance of questions
Two boundary methods:
Compare using a scoring rule on any question the forecasters predicted on
Compare using a scoring rule on the intersection of the questions the forecasters predicted on
Two functions of scoring rules: Rewarding or comparing forecasters
Related field: honest reporting and information elicitation
See also: Section 27.4.2 from Algorithmic Game Theory (Nisan et al. 2007)
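The two boundary methods for comparing forecasters can be sketched as follows, assuming binary questions and the Brier score as the scoring rule. Any proper scoring rule could be swapped in. The dictionaries mapping question ids to probabilities and the function names are hypothetical.

```python
def mean_brier(preds, outcomes, questions):
    """Mean Brier score of `preds` over those `questions` that have resolved."""
    scored = [q for q in questions if q in preds and q in outcomes]
    if not scored:
        raise ValueError("no resolved questions to score")
    return sum((preds[q] - outcomes[q]) ** 2 for q in scored) / len(scored)


def compare(preds_a, preds_b, outcomes):
    """Compare two forecasters under both boundary methods (lower is better)."""
    shared = set(preds_a) & set(preds_b)
    return {
        # Method 1: each forecaster is scored on everything they predicted.
        "all": (
            mean_brier(preds_a, outcomes, preds_a),
            mean_brier(preds_b, outcomes, preds_b),
        ),
        # Method 2: both are scored only on questions they both predicted.
        "intersection": (
            mean_brier(preds_a, outcomes, shared),
            mean_brier(preds_b, outcomes, shared),
        ),
    }


# Made-up data, only to show the call shape.
outcomes = {"q1": 1, "q2": 0, "q3": 1}
alice = {"q1": 0.9, "q2": 0.2}
bob = {"q2": 0.4, "q3": 0.5}
print(compare(alice, bob, outcomes))
```

Scoring on all questions rewards picking easy questions, while scoring on the intersection can leave very few questions to compare on. That is one reason to treat these as boundary cases rather than finished methods.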
Difficult Types of Questions
How can we deal with questions with unclear resolution criteria?
Collect Metaculus experiments on this
How do we incentivise good predictions on long-term questions?
Ideas:
chained temporal forecasts
How do we incentivise good predictions on low-probability events?
Ideas:
chained conditional forecasts
Is there any conceivable way of incentivizing good predictions on extinction events?
Forecasting Techniques
Question Decomposition
If we say “$X$ will happen if and only if $Y_1$ and $Y_2$ and $Y_3$… all happen, so we estimate $P(Y_1)$ and $P(Y_2|Y_1)$ and $P(Y_3|Y_1,Y_2)$ &c, and then multiply them together to estimate $P(X)=P(Y_1)\cdot P(Y_2|Y_1)\cdot P(Y_3|Y_2,Y_1)\cdot\ldots$”, do we usually get a probability that is close to $P(X)$? Does this improve on forecasts where one tries to estimate $P(X)$ directly?
This type of question decomposition (which one could call multiplicative decomposition) appears to be a relatively common method for forecasting, see Allyn-Feuer & Sanders 2023, Silver 2016, Kaufman 2011, Carlsmith 2022 and Hanson 2011, but there have been conceptual arguments against this technique, see Yudkowsky 2017, AronT 2023 and Gwern 2019, which all argue that it reliably underestimates the probability of events.
What is the empirical evidence for decomposition being a technique that improves forecasts?
Lawrence et al. 2006 summarize the state of research on the question:
(Emphasis mine).
The types of decomposition described here seem quite different from the ones used in the sources above: Decomposed time series are quite dissimilar to multiplied probabilities for binary predictions, and in combination with the conceptual counter-arguments the evidence appears quite weak.
It appears as if a team of a few (let’s say 4) dedicated forecasters could run a small experiment to determine whether multiplicative decomposition for binary forecasts is a good method, by spending 20 minutes per question on either an explicitly decomposed forecast or a control forecast, assigned at random (although the exact method for control needs to be elaborated on). Working in parallel, making 70 forecasts should take $70\text{ forecasts}\cdot\frac{1\text{ hr}}{3\text{ forecasts}}\cdot\frac{1}{4}\approx 5.8\text{ hr}$, i.e. less than 6 hours, although it’d be useful to search for more recent literature on the question.
Would decomposition work better if one were operating with log-odds instead of probabilities?
Classification and Improvements
The description of such decomposition in this section is, of course, lacking: A better way of decomposition would be, for a specific outcome, to find a set of preconditions for X that are mutually exclusive and collectively exhaustive, find a chain that precedes them (or another MECE decomposition), and iterate until a whole (possibly interweaving) tree of options has been found.
Thus one can define three types of question decomposition:
Multiplicative Decomposition: Given an event $X$, find conditions $Y_1,\ldots,Y_n$ so that $X$ happens if and only if all of $Y_1,\ldots,Y_n$ happen. Estimate $P(Y_1)$ and $P(Y_2|Y_1)$ and $P(Y_3|Y_1,Y_2)$ &c, and then multiply them together to estimate $P(X)=P(Y_1)\cdot P(Y_2|Y_1)\cdot P(Y_3|Y_2,Y_1)\cdots P(Y_n|Y_{n-1},\ldots,Y_2,Y_1)$.
Additive Decomposition or MECE Decomposition: Given an event $X$, find a set of scenarios $Y_1,\ldots,Y_n$ such that $X$ happens if and only if at least one $Y_i$ happens, and no two $Y_k,Y_l$ with $k\neq l$ have $P(Y_k\cap Y_l)>0$. Estimate $P(Y_1),P(Y_2),\ldots,P(Y_n)$ and then estimate $P(X)=\sum_{i=1}^{n}P(Y_i)$.
Recursive Decomposition: For each scenario $X'$, decide to pursue one of the following strategies:
Estimate $P(X')$ directly
Multiplicative decomposition of $P(X')$
Find a multiplicative decomposition $Y'_1,\ldots,Y'_n$ for $X'$
Estimate $P(Y'_1),\ldots,P(Y'_n|Y'_1,\ldots,Y'_{n-1})$ each via recursive decomposition
Determine $P(X')=P(Y'_1)\cdot P(Y'_2|Y'_1)\cdot P(Y'_3|Y'_2,Y'_1)\cdots P(Y'_n|Y'_{n-1},\ldots,Y'_2,Y'_1)$.
Additive decomposition of $P(X')$
Find an additive decomposition $Y'_1,\ldots,Y'_n$ for $X'$
Estimate $P(Y'_1),\ldots,P(Y'_n)$ each via recursive decomposition
Determine $P(X')=P(Y'_1)+P(Y'_2)+\ldots+P(Y'_n)$.
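The three strategies can be sketched as a small tree evaluator. This is only an illustration under stated assumptions: the `Node` class and the `probability` function are hypothetical names, and every estimate below a multiplicative node is assumed to already be conditional on the preceding siblings, since the code has no way to check that.

```python
from dataclasses import dataclass, field
from math import prod
from typing import List


@dataclass
class Node:
    label: str
    kind: str = "direct"    # "direct", "multiplicative" or "additive"
    estimate: float = 0.0   # only used when kind == "direct"
    children: List["Node"] = field(default_factory=list)


def probability(node: Node) -> float:
    """Evaluate a decomposition tree bottom-up."""
    if node.kind == "direct":
        p = node.estimate
    elif node.kind == "multiplicative":
        # Children are ordered: child i is conditional on children 1..i-1.
        p = prod(probability(c) for c in node.children)
    elif node.kind == "additive":
        # Children are assumed mutually exclusive and collectively exhaustive.
        p = sum(probability(c) for c in node.children)
    else:
        raise ValueError(f"unknown node kind: {node.kind}")
    if not -1e-9 <= p <= 1.0 + 1e-9:
        raise ValueError(f"{node.label}: {p} is not a probability")
    return min(max(p, 0.0), 1.0)


# Made-up tree: X needs A, then (B1 or B2) given A, then C given both.
tree = Node("X", "multiplicative", children=[
    Node("A", estimate=0.75),
    Node("B given A", "additive", children=[
        Node("B1 given A", estimate=0.5),
        Node("B2 given A", estimate=0.2),
    ]),
    Node("C given A, B", estimate=0.9),
])
print(probability(tree))
```

The range check catches one common failure of additive decompositions, namely scenarios that overlap and therefore sum to more than 1.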
A keen reader will notice that recursive decomposition is similar to Bayes nets. True, though it doesn’t deal as well with conditional probabilities.
Using LLMs
This is a scenario where large language models are quite useful, and we have a testable hypothesis: Does question decomposition (or MECE decomposition) improve language model forecasts by any amount?
Frontier LLMs are at best mediocre at forecasting real-world events, but just as asking for calibration improves performance, perhaps chain-of-thought-like question decomposition improves (or reduces) their performance (and therefore gives us reason to believe that similar practices will (or won’t) work with human forecasters).
Direct:
Multiplicative decomposition:
Discussions
LessWrong
Effective Altruism Forum
How Can We Ask Better Forecasting Questions?
What are methods of scoring/defining how good a question was?
How many questions resolve due to technicalities in the resolution criteria?
Are the ratios here different across different question categories?
How does this ratio develop as one puts more effort into specifying resolution criteria?
This might be studied qualitatively/semi-quantitatively.
Other Questions
Where are the big datasets of past judgmental forecasts?
What is the rate of positive resolution by range?
How good a predictor is forecasting performance of intra-individual cognitive performance?
How difficult is it to manipulate real existing prediction platforms?
Markets
PredictIt
BetFair
Hobbyist sites
Metaculus
PredictionBook
How can we develop better forecast aggregation methods?
Use momentum of past forecasts
Use the generalized mean with changing p as the time to question resolution shrinks
Should p be increasing/decreasing/following a more complicated pattern?
Can we do something cool with the quasi-arithmetic mean?
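The generalized-mean idea can be sketched as below. The linear schedule for $p$ (from 1, the arithmetic mean, far from resolution, down to 0, the geometric mean, at resolution) is purely an assumption for illustration, since the direction and shape of the schedule is exactly the open question. `p_schedule` and `aggregate` are hypothetical names.

```python
import math


def generalized_mean(probs, p):
    """Power mean of strictly positive probabilities; p=0 gives the geometric mean."""
    if not probs:
        raise ValueError("need at least one forecast")
    if any(x <= 0 for x in probs):
        raise ValueError("probabilities must be strictly positive")
    n = len(probs)
    if abs(p) < 1e-12:
        return math.exp(sum(math.log(x) for x in probs) / n)
    return (sum(x ** p for x in probs) / n) ** (1.0 / p)


def p_schedule(days_left, total_days, p_start=1.0, p_end=0.0):
    """Hypothetical schedule: interpolate p linearly as resolution approaches."""
    frac = min(max(days_left / total_days, 0.0), 1.0)
    return p_end + (p_start - p_end) * frac


def aggregate(probs, days_left, total_days):
    """Aggregate forecasts with a time-dependent generalized mean."""
    return generalized_mean(probs, p_schedule(days_left, total_days))


print(aggregate([0.2, 0.8], days_left=100, total_days=100))
print(aggregate([0.2, 0.8], days_left=0, total_days=100))
```

Note that a generalized mean of probabilities with $p\neq 1$ is not symmetric between an event and its complement, so a practical version might aggregate odds instead and convert back.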