[Relevant context/COI: I’m CEO at the Forecasting Research Institute (FRI), an organization which I co-founded with Phil Tetlock and others. Much of the below is my personal perspective, though it is informed by my work. I don’t speak for others on my team. I’m sharing an initial reply now, and our team at FRI will share a larger post in future that offers a more comprehensive reflection on these topics.]
Thanks for the post — I think it’s important to critically question the value of funds going to forecasting, and this post offers a good opportunity for reflection and discussion.
In brief, I share many of your concerns about forecasting and related research, but I’m also more positive on both its impact so far and its future expected impact.
A summary of some key points:
Much of the impact of forecasting research on specific decision-makers is not public. For example, FRI has informed decisions on frontier AI companies’ capability scaling policies, has advised senior US national security decision-makers, and has informed research at key US and UK government agencies. But, we are not able to share many details of this work publicly. However, there is also public evidence that forecasting research is widely cited and informs discourse and some decision-making (some examples below).
AI timelines, adoption, and risk forecasts play a huge role in both individual career decisions and the broader AI discourse. Forecasting research still seems like one of the best tools available for getting specific and accountable beliefs on these topics. For example, comparing ‘AI safety’ community forecasts to more ‘typical’ experts’ forecasts seems especially important for understanding how much to trust each group’s views. These comparisons will become increasingly relevant for government policymakers over time, especially if there is extremely rapid AI capabilities progress that leads to major societal impacts in the short-run.
When evaluating the impact of FRI-style forecasting research, I think the closest relevant comparison classes are more like broad public goods/measurement-oriented research (e.g., Our World in Data, Epoch) or think-tank research (e.g. GovAI, IAPS). By its nature, the impact of this kind of research tends to be more diffuse and difficult to measure. However, I’d be interested in more intensive comparative evaluation of this type of research and agree that funders should be responsive to evidence about relative impact in these fields.
Forecasting research still has a ton of flaws, and its impact has been far from the dream I’ve long had for it. There are still big challenges around identifying accurate forecasters on questions related to AI, integrating conditional policy forecasts with actual decision-makers’ needs, and combining deep, individual qualitative research with high-quality, group-generated quantitative forecasts.
My extremely simplified narrative is: Tetlock et al. established the modern judgmental forecasting field and created a proof of concept for better forecasts on important topics (“superforecasting”), though this work was largely academic; some forecasting platforms were created to build on that work and apply it to a range of important issues; targeted efforts to make forecasting more directly useful to decision-makers are relatively nascent (i.e., have largely begun in the last few years), and are accumulating impact over time, but still have room for improvement.
FRI’s research, in particular, aims to close many of the gaps left by prediction markets and historical forecasting approaches: it is particularly focused on conditional policy forecasts, medium-to-long-run forecasts that do not get much detailed engagement on prediction markets/platforms, and systematically eliciting forecasts from experts who would not typically participate in forecasting platforms but whom decision-makers want to rely on (while also eliciting forecasts from generalists with strong forecasting track records).
However, some factors make the future potential impact of this work look more promising:
AI-enhanced forecasting research is a huge factor that will unlock cheaper, faster, high-quality forecasts on any question of one’s choosing.
The next few years of forecasting AI progress/adoption/impact seem critical, and like they’ll deliver a lot of answers on whose forecasts we should trust. It seems good to be ready to support decision-makers during this time.
Leaders in the AI space seem particularly interested in using forecasting in their decision-making; they tend to be both quantitative and open-minded. This creates more potential for forecasting to be useful. More minorly, prediction markets and forecasting are generally becoming more credible within governments.
More detail on some select points below. This comment already got very long (!), so I’ll reserve more elaboration for a future, more comprehensive post.
Examples of impact
Forecasting research has informed some very important decisions. Unfortunately, many of the details of the relevant evidence here cannot be made public. However, there is evidence of substantial public citation of this research, and some public evidence of affecting particular decisions.
A few examples of relevant impact include:
Forecasting has been particularly relevant for decision-making around capability scaling policies. The near-term magnitude of AI-biorisk, how growing AI capabilities may increase it, and what safeguards need to be in place to respond to it, are highly uncertain. Frontier AI companies, the EU AI Code of Practice, and other governments are trying to track and respond to AI impacts on biorisk, cybersecurity, AI R&D, and other domains. We’ve had substantial engagement with the relevant actors, including some focused partnerships, and believe our work in this area has affected important decisions, though we unfortunately cannot share many of the details publicly.
Our work on ForecastBench, a benchmark of AI’s ability to do forecasting, showed that AI-produced forecasts could catch up to top human forecasters in roughly the next year if trends persist. This generated interest among senior decision-makers in U.S. national security. We cannot share details, but this is another example of important decision-makers paying attention to and using forecasts.
We have completed commissioned research to directly inform grantmaking at Coefficient Giving, and also have indirectly affected grantmaking. For an example of the latter, our work on the Existential Risk Persuasion Tournament (XPT) partially inspired Coefficient Giving (formerly Open Philanthropy) to launch an RFP on improved AI benchmarks. The XPT forecasts predicted that most existing benchmarks would likely saturate in the next few years, and showed that progress on these benchmarks was not crux-y for disagreements about AI impact. We were told that this played a role in the launch and conception of the RFP, and the XPT is cited in the public write-up.
Some examples of more diffuse impacts (e.g., impact on public understanding of AI and research for policymakers or philanthropists) include:
FRI has given presentations to, and has ongoing connections and conversations with, important government agencies such as the Congressional Budget Office, US CAISI, the UK Department of Science, Innovation, and Technology, and others. We cannot share many details, but the potential to inform decisions at these organizations is highly important.
For context: FRI has been operating for a little over 3 years, and we’re accumulating substantially more momentum in terms of connections to top decision-makers as time goes on.
(To be clear: I am mostly discussing FRI here since it’s what I’m most familiar with.)
AI timelines, impact, and adoption forecasts drive a huge amount of career decision-making, attention, etc.
Forecasts about AI timelines and risk have had major effects on people’s career decisions and the broader AI discourse. AI 2027 underlies popular YouTube videos, 80,000 Hours advises people on career decisions based on timelines forecasts, Dario Amodei’s “country of geniuses in a datacenter by 2027” forecast informs a lot of Anthropic’s work and policy outreach, the AI Impacts survey on AI researchers’ forecasts of existential risk is highly cited, etc.
A major reason I got into this field is that many people are making very intense claims about the effect that AI will have on the world soon, and I want to bring as much rigor and reflection as possible to those claims. So far, it looks like most forecasters are substantially underestimating AI capabilities progress (with some exceptions, e.g. on uplift studies); the evidence on forecasts about AI adoption, societal impacts, and risk is less clear, but I expect we will have more evidence soon, particularly from the Longitudinal Expert AI Panel (LEAP), especially as some forecasters are predicting transformative change in the next few years.
As the expected impact and timing of AI progress is sharpened and clarified, talent and money can be allocated more efficiently.
Case study: Economic impacts of AI
In some cases, it looks to me like forecasting research is picking relatively low-hanging fruit.
The economic impact of AI is a prominent topic of public discussion right now, and it is likely that governments will spend many billions of dollars to address it in the coming years.
Currently, economists hold major sway in public policy about the economic impacts of AI. Perhaps you think top economists, as a group, are badly mistaken about the likely near-term impacts of AI, as some Epoch researchers and others believe. Perhaps you think they are likely to be fairly accurate, as Tyler Cowen, Séb Krier, or typical economists believe. It seems like a valuable common sense intervention to at least document what various groups believe, so that when we are making economic policy going forward we can rely on that evidence to determine who is trustworthy. I believe that studies like this one (and its follow-ups) will be the clearest evidence on the topic.
Relevant comparison class for forecasting research
When thinking about the impact and cost-effectiveness of forecasting, I think it’s more appropriate to compare this work to public goods-oriented research organizations (e.g., Our World in Data, Epoch, etc.) and policy-oriented think-tank research (e.g. GovAI, IAPS, CSET, etc.).
I’ve been disappointed by most impact evaluation of think-tanks and public goods-oriented research that I’ve seen. I believe this is partly because it is very difficult to quantify the impact of this type of work because it has diffuse benefits. But, I still think it’s possible to do better and I would like FRI to do better on this front going forward.
That said, I still believe there are reasonable heuristics for why this research area could be highly cost-effective. There are many billions of dollars of philanthropic and government capital being spent on AI policy topics. If there is a meaningful indication that forecasting is changing people’s views on these questions (as I believe there is; see discussion above), it seems reasonable to me to spend a very small fraction of that capital on getting more epistemic clarity.
My critiques of forecasting research
Forecasting research, and FRI’s research in particular, still has major areas for improvement.
Examples of a few key issues:
I’ve been underwhelmed by the accuracy of typical experts and superforecasters on questions about AI capabilities progress (as measured by benchmarks); they often underestimate AI progress (with exceptions). I think this underestimation is a useful fact to document, but it would be much more helpful if our research identified experts you should trust. We’re in the process of identifying ‘Top AI forecasters’ through LEAP and aim to share updates on this soon.
I think forecasting research is at its best when combined with in-depth research reports that provide more narratives and key arguments underlying forecasts. For example, Luca Righetti’s work on estimating (certain kinds of) AI-biorisk provides a lot of valuable analysis that usefully complements our expert panel study on the topic. [Note: Luca is an FRI senior advisor and a co-author of our forecasting study.] For decision-makers to build sufficiently detailed models, and for forecasters to test their arguments, we’d ideally have detailed research like Luca’s on most major topics where we collect forecasts — ideally from a few experts who disagree with each other. Unfortunately, this research often doesn’t readily exist, but we are investigating ways to generate it.
I have been somewhat surprised by how few experts in AI industry, AI policy, and other domains predict transformative impacts of AI similar to what are commonly discussed by AI lab leaders, people in the AI safety community, and others. This has made it harder to have a true horse-race between the ‘transformative AI’ school of thought that seems to drive a lot of discourse and decision-making vs. more gradual views of AI impacts. Though we have some transformative AI forecasters in our studies, in future work we aim to explicitly collect more forecasts from the ‘transformative AI’ school of thought in order to set up clearer comparisons between worldviews and to better anticipate what will happen if the ‘transformative AI’ school makes more accurate forecasts.
I will save other thoughts on how forecasting, and FRI’s research, could be made more useful to decision-makers for a future post.
But, to be clear: I have a lot of genuine uncertainty about whether forecasting research will be sufficiently impactful going forward. There are promising signs, and increasing momentum, but to more fully deliver on its promise, more improvements will be necessary.
Some notes on FRI-style forecasting research vs. other forecasting interventions
On the value of FRI-style forecasting research in particular:
Prediction markets do not have good ways to collect causal policy forecasts, but in our experience, conditional policy forecasts (e.g., how much would various safeguards reduce AI-cyber risk) are often the most helpful forecasts for decision-makers.
Similarly, prediction markets do not create good incentives for longer run forecasts or low-probability forecasts, and incentivize against sharing the rationales behind forecasts. Directly paying and incentivizing relevant experts and forecasters to answer questions is often more useful.
Typical forecasting platforms do not get forecasts from the kinds of experts that policymakers typically rely on, and aren’t the kind of evidence that can easily be cited in government reports. (This may be unfortunate, but it is the current state of the world.)
Reasons for optimism about future impact
Finally, there are a few factors that have the potential to dramatically change the field going forward:
It looks like AI may soon make it >100x cheaper and faster to get high-quality forecasts on any topic of one’s choosing. Policy researchers will be able to ask the precise question they’re interested in, will be able to upload confidential documents to inform forecasts (something we’ve heard is especially important to decision-makers), and will be able to get detailed explanations for all forecasts. AI-produced forecasts will also be much easier to test for accuracy due to the volume of forecasts they can provide, and it will be easier to generate ‘crux’ questions since AI will not get bored of producing huge numbers of conditional forecasts (which are necessary for identifying cruxes). Building benchmarks and tooling to harness AI-produced forecasts will be a much larger part of our work going forward.
The next few years seem very unusual in human history: very thoughtful researchers are predicting “Superhuman Coders” by 2029, with attendant large impacts. There is a spectrum of views, but the scope for disagreement among reasonable people about what the world will look like in 2030 is huge. This is a particularly important time to make predictions testable, update on what we observe, and make better policy and personal decisions on the basis of this information.
People working in the AI space seem particularly interested in using forecasting, perhaps due to a mix of being quantitatively oriented and because they’re facing unusual degrees of uncertainty. This bodes well for forecasting being useful in the coming years. More minorly, it appears that there is a broader cultural change around forecasting-related topics. Prediction markets are increasingly being cited by government officials, and the public is paying more attention to them than ever before. Much of the impact for prediction markets specifically seems negative (e.g. via incentivizing gambling on low-value topics), but the broader cultural shift suggests there may be an opportunity for better uses of forecasting to enter public consciousness as well.
Stripped of all AI-centred argumentation, the reply is left mostly empty. This suggests that judgmental forecasting, at least as exercised by FRI, should perhaps be thought of as a sub-domain of AI safety. In such a case, its impact would need to be evaluated in the portfolio context of all AI safety budgets, meaning a much higher hurdle rate would have to be cleared to justify its activities.
What applies more broadly to judgmental forecasting and online betting platforms, and is also the basis for many arguments in this defence of forecasting, is the circular reasoning regarding the field’s importance, frequently repeated by the field’s own members and those adjacent to it. But, in contrast to the opinionated voices, the evidence is lacking. Merely stating that forecasting has informed some policy or that career decisions have been influenced is not sufficient. Similarly, whether its impact is positive or negative is taken at face value and never substantiated.
All this isn’t to say that judgmental forecasting research or its funding should be dispensed with. In fact, hybrids that combine quantitative predictive models with expert judgment are among the foundational tools of large organisations’ decision-making processes. However, I believe the field’s association with online betting (high time we called things for what they are) as well as over-reliance on AI for its services is actually hurting it.
Whose job is it to identify EA questions which could benefit from better forecasts?
Consider two different hypotheses:
Forecasting is only helpful for AI
Forecasting is helpful outside of AI, but AI has captured much more forecasting interest than other cause areas
How much time are non-AI org leaders spending trying to think up decision-relevant forecasts related to their cause areas?
If leaders are not spending any time trying to think up such forecasts, maybe there is low-hanging fruit here. Maybe EA has latent forecasting capability which can be tapped to improve organizational decision-making. Or maybe such forecasting capability will free up in a few years if AI turns out to be a nothingburger.
If leaders have spent a lot of time trying to think up useful forecasts, and failed, maybe forecasting really is fairly useless outside of AI.
If I was leading a non-AI EA organization, and I had a forecast I really wanted to see the result of, who would I even talk to? Which forecasting organizations are actively soliciting ideas for EA-related forecast questions?
It seems to me that a lot of what EA does is implicit forecasting in some sense, e.g. if you give someone a grant, it’s an implicit forecast about the probability that they will be able to accomplish something with that grant. EA is often critiqued for neglecting “systemic change”. If you want to do systemic change, being able to forecast the effects of various systemic changes is really useful. If you take any action, there’s an implicit forecast that it will lead to a good outcome and not backfire somehow. Wouldn’t it be better to make this forecast explicit? All else equal, wouldn’t it be good to get some perspective from people outside of the organization, who are perhaps forecasting in their free time as a replacement for watching TV or other downtime activities?
My understanding of the original post’s intent is that it calls for evidence of the field’s impact, given the funding it receives. I don’t believe it critiques judgmental forecasting as an analytical method and neither do I think that I signal this in my comment.
I stand by my opinion, however, that the community is correct to ask for tangible proof, the burden of which rests on the organizations that receive the funding.
I regret if this doesn’t satisfy the questions in your comment.
“Stripped of all AI-centred argumentation, the reply is left mostly empty.”
The bulk of our funding has gone toward AI-focused forecasting projects (e.g. LEAP, AI-biorisk, economic effects of AI) or ‘automating forecasting research’-type work that has the ultimate goal of assisting decisionmakers (e.g. ForecastBench), so I think this is most of what FRI should be evaluated on.
“...meaning a much higher hurdle rate would have to be cleared to justify its activities.”
I’m not sure what comparison class people had in mind previously, but I agree it seems broadly correct to consider this work alongside other AI-related funding opportunities. As noted above, I’d argue that it is appropriate and valuable to have “AI measurement” as an important funding domain alongside areas like “AI governance,” “Technical AI safety research,” “AI field-building,” etc. It seems valuable for one part of the AI grantmaking portfolio to be generating evidence that can be used to sharpen views on AI timelines, to assess risk in various domains (bio, cyber, catastrophic risk), to assess magnitudes of benefits (for calibrating cost-benefit analyses on policies), and to predict the likelihood and impact of various policies (e.g. the effectiveness of DNA synthesis screening for biorisk), etc. This type of fundamental research can inform and support more effective action in the other domains.
I also think forecasting research can have direct impacts on AI governance via direct decision-making partnerships like I described above: i.e., directly partnering with and advising important government agencies and frontier AI companies, among others, on high-stakes decisions related to AI regulation, implementing effective safeguards to reduce AI-cyber risk, and more. We have already seen some early impacts along these lines, as previously mentioned.
“Merely stating that forecasting has informed some policy or that career decisions have been influenced is not sufficient. Similarly, whether its impact is positive or negative is taken at face value and never substantiated.”
I agree. Due to confidentiality, we have primarily shared details of our impact case studies with our funders and had them assess the value of the impact we are making. Establishing evidence of impact publicly is more challenging due to confidentiality considerations. But elsewhere in the thread people have mentioned citations as one reasonable metric for evidence of impact for research organizations that have more diffuse impacts. We have targets for growing our prominent citations over time to assess our impact, and I’ve shared examples of prominent citations to FRI research in my comment above. I also hope that over time, we can share more case studies publicly and provide more of the reasoning for why we believe we had an impact and whether it was positive. The benchmarks RFP case study described above is one example that can be discussed relatively publicly.
“All this isn’t to say that judgmental forecasting research or its funding should be dispensed with. In fact, hybrids that combine quantitative predictive models with expert judgment are among the foundational tools of large organisations’ decision-making processes. However, I believe the field’s association with online betting (high time we called things for what they are) as well as over-reliance on AI for its services is actually hurting it.”
I broadly agree on these points. We are running longitudinal expert panels, partnering with important institutions to improve their decision-making, and automating forecasting research, so I see our work as distinct from online betting/forecasting platforms.
I hate to do this, especially at the start, but I want to point out for you and others who have jobs related to forecasting that it’s difficult to convince someone of something when their job relies on them not believing it. I think you should assume that you will think forecasting is more useful than it is.
As for your points, I’ll respond to some of them.
If you want to DM me, I can sign an NDA, and I may update my opinion depending on what these non-public uses of forecasting are.
I don’t think this is all that relevant. I’m not sure what forecasting research has really elicited on AI timelines. I agree that talk about timelines creates a lot of “buzz” around AI but depending on your views, this is good or bad.
I agree that the impact of measurement-oriented research is difficult to measure, but importantly, not impossible. OWID, for example, should count how much their work is being cited and looked up. Likewise, I think it would be good to estimate, for FRI, how much $$ each changed decision was worth and by what amount/percentage FRI made that change more likely. I don’t think you really gave a good reason that FRI should be funded over anything else that simply has very diffuse benefits.
When do you think it’s reasonable, if ever, for the EA community to “give up” on funding more forecasting work?
If I’m being cynical, almost every field can say “AI will transform the field” though I’m not sure how much this is worth debating.
Not Josh, and also conflicted through the Social Science Prediction Platform (though we had pretty minimal funding from EA sources), but I wonder if it would be worth pooling non-public projects we know of and making BOTE estimates of hypothetical impact. It’s tricky because I don’t know of any RCTs (though I’m working on one now). But I’m extremely confident that across us we would think of some combination of orgs/governments that collectively spend over $100 billion per year (… I can think of that alone) that are interested in forecasts in different ways. Now, imo the vast majority of places interested in forecasting are not going to do anything substantive with it, and it’s hard to know what it means for one of these places to integrate forecasts—for example, for an org spending $X, do forecasts inform 1% of their funding or what? Of the share they inform, how much do they move the needle? If estimates from people who work on forecasts may be optimistic (I’m not paid at all for it, but I choose to work on it because I think it’s useful), happy to describe the situation to an outside observer privately.
I think the Social Science Prediction Platform (alongside a similar project a friend of mine is running for clinical trials) is among the more interesting uses of forecasting/PMs, but I’m skeptical it will be taken up to the degree, or have the impact, you might hope for.
“do forecasts inform 1% of their funding or what?”
I’m skeptical of things of the form “small percentage chance * big number”. I think humans are really bad at estimating small percentages.
Would be happy to talk privately about any situations you are thinking of.
Thanks! I agree, I’m also generally skeptical of small chance * big number things—I was not intending 1% as an anchor but as an open question—and not as a probability but as a concrete percent of the funding. For example, a big org uses forecasts, but perhaps they only use them in particular workstreams responsible for X% of funding, and those workstreams could be tracked. Then out of X%, how much do they move the needle?
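To make that decomposition concrete, here is a minimal back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function name, the parameter names, and especially the numbers, which are placeholders rather than estimates for any real organization. It simply multiplies an org's annual spend by the share of funding in workstreams that use forecasts, and then by how far forecasts move allocation within that share.

```python
def forecast_influenced_spend(annual_spend, share_informed, needle_shift):
    """Dollars per year whose allocation plausibly changed because of forecasts.

    annual_spend   -- total yearly spend of the org (hypothetical input)
    share_informed -- fraction of spend in workstreams that use forecasts (0..1)
    needle_shift   -- fraction of that share actually reallocated due to forecasts (0..1)
    """
    for name, value in (("share_informed", share_informed),
                        ("needle_shift", needle_shift)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return annual_spend * share_informed * needle_shift


# Placeholder scenarios (NOT real data): a $1B/year org under a
# pessimistic and an optimistic set of assumptions.
scenarios = {
    "pessimistic": forecast_influenced_spend(1_000_000_000, 0.01, 0.05),
    "optimistic": forecast_influenced_spend(1_000_000_000, 0.05, 0.10),
}

for label, dollars in scenarios.items():
    print(f"{label}: ~${dollars:,.0f} per year influenced")
```

The point of writing it this way is that each factor is a concrete, trackable quantity (which workstreams use forecasts, what they spend, what changed) rather than a small probability multiplied by a big number.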
“Prediction markets are increasingly being cited by government officials, and the public is paying more attention to them than ever before. Much of the impact for prediction markets specifically seems negative (e.g. via incentivizing gambling on low-value topics), but the broader cultural shift suggests there may be an opportunity for better uses of forecasting to enter public consciousness as well.”
I think that this is a reason for pessimism on impact, not optimism. Kalshi and Polymarket are primarily sports gambling platforms by volume, immune to state regulation for reasons that may, in the perspective of a cynic, be related to them paying Donald Trump Jr. undisclosed sums of money for undisclosed quantities of work. This does not, I think, inspire particular trust in their efficacy or accuracy. The new legislative push could shift this (I haven’t dug into it deeply), but by default I expect the shift from “odd thing some experts claim is good” to “the tool for corruption, leaking military secrets, insider trading, and sports gambling” to worsen perceptions of accuracy (broadly defined halo effect).
Our work on ForecastBench, a benchmark of AI’s ability to do forecasting, showed that AI-produced forecasts could catch up to top human forecasters in roughly the next year if trends persist. This generated interest among senior decision-makers in U.S. national security. We cannot share details, but this is another example of important decision-makers paying attention to and using forecasts.
We have completed commissioned research to directly inform grantmaking at Coefficient Giving, and also have indirectly affected grantmaking. For an example of the latter, our work on the Existential Risk Persuasion Tournament (XPT) partially inspired Coefficient Giving (formerly Open Philanthropy) to launch an RFP on improved AI benchmarks. The XPT forecasts predicted that most existing benchmarks would likely saturate in the next few years, and showed that progress on these benchmarks was not crux-y for disagreements about AI impact. We were told that this played a role in the launch and conception of the RFP, and the XPT is cited in the public write-up.
Some examples of more diffuse impacts (e.g., impact on public understanding of AI and research for policymakers or philanthropists) include:
FRI has given presentations to, and has ongoing connections and conversations with, important government agencies such as the Congressional Budget Office, US CAISI, the UK Department for Science, Innovation and Technology, and others. We cannot share many details, but the potential to inform decisions at these organizations is highly important.
Major reports for policymakers, like the International AI Safety Report, the AI Index, and relevant RAND reports, also prominently cite FRI research.
FRI research is cited in places like the New York Times, The Economist, and Bloomberg to inform readers about the economic impacts of AI, AI-biorisk, general catastrophic and existential risk, AI-enhanced forecasting, and the future of AI more generally.
Forecasts are widely cited in cause prioritization research and by experts in relevant domains: as a few examples, see citations from Ethan Mollick on AI progress, 80,000 Hours on biorisk, Dr. Richard Moulange on AI-biorisk, Tyler Cowen on the economic effects of AI, Will MacAskill on AI progress and risk, etc.
For context: FRI has been operating for a little over 3 years, and we’re accumulating substantially more momentum in terms of connections to top decision-makers as time goes on.
(To be clear: I am mostly discussing FRI here since it’s what I’m most familiar with.)
AI timelines, impact, and adoption forecasts drive a huge amount of career decision-making, attention, etc.
Forecasts about AI timelines and risk have had major effects on people’s career decisions and the broader AI discourse. AI 2027 underlies popular YouTube videos, 80,000 Hours advises people on career decisions based on timelines forecasts, Dario Amodei’s “country of geniuses in a datacenter by 2027” forecast informs a lot of Anthropic’s work and policy outreach, the AI Impacts survey on AI researchers’ forecasts of existential risk is highly cited, etc.
A major reason I got into this field is that many people are making very intense claims about the effect that AI will have on the world soon, and I want to bring as much rigor and reflection as possible to those claims. So far, it looks like most forecasters are substantially underestimating AI capabilities progress (with some exceptions, e.g. on uplift studies); the evidence on forecasts about AI adoption, societal impacts, and risk is less clear, but I expect we will have more evidence soon, particularly from the Longitudinal Expert AI Panel (LEAP), especially as some forecasters are predicting transformative change in the next few years.
As the expected impact and timing of AI progress are sharpened and clarified, talent and money can be allocated more efficiently.
Case study: Economic impacts of AI
In some cases, it looks to me like forecasting research is picking relatively low-hanging fruit.
The economic impact of AI is a prominent topic of public discussion right now, and it is likely that governments will spend many billions of dollars to address it in the coming years.
Currently, economists hold major sway in public policy about the economic impacts of AI. Perhaps you think top economists, as a group, are badly mistaken about the likely near-term impacts of AI, as some Epoch researchers and others believe. Perhaps you think they are likely to be fairly accurate, as Tyler Cowen, Séb Krier, or typical economists believe. It seems like a valuable common sense intervention to at least document what various groups believe, so that when we are making economic policy going forward we can rely on that evidence to determine who is trustworthy. I believe that studies like this one (and its follow-ups) will be the clearest evidence on the topic.
Relevant comparison class for forecasting research
When thinking about the impact and cost-effectiveness of forecasting, I think it’s more appropriate to compare this work to public goods-oriented research organizations (e.g., Our World in Data, Epoch, etc.) and policy-oriented think-tank research (e.g. GovAI, IAPS, CSET, etc.).
I’ve been disappointed by most impact evaluation of think-tanks and public goods-oriented research that I’ve seen. I believe this is partly because it is very difficult to quantify the impact of this type of work because it has diffuse benefits. But, I still think it’s possible to do better and I would like FRI to do better on this front going forward.
That said, I still believe there are reasonable heuristics for why this research area could be highly cost-effective. There are many billions of dollars of philanthropic and government capital being spent on AI policy topics. If there is a meaningful indication that forecasting is changing people’s views on these questions (as I believe there is; see discussion above), it seems reasonable to me to spend a very small fraction of that capital on getting more epistemic clarity.
My critiques of forecasting research
Forecasting research, and FRI’s research in particular, still has major areas for improvement.
Examples of a few key issues:
I’ve been underwhelmed by the accuracy of typical experts and superforecasters on questions about AI capabilities progress (as measured by benchmarks); they often underestimate AI progress (with exceptions). I think this underestimation is a useful fact to document, but it would be much more helpful if our research identified experts you should trust. We’re in the process of identifying ‘Top AI forecasters’ through LEAP and aim to share updates on this soon.
I think forecasting research is at its best when combined with in-depth research reports that provide more narratives and key arguments underlying forecasts. For example, Luca Righetti’s work on estimating (certain kinds of) AI-biorisk provides a lot of valuable analysis that usefully complements our expert panel study on the topic. [Note: Luca is an FRI senior advisor and a co-author of our forecasting study.] For decision-makers to build sufficiently detailed models, and for forecasters to test their arguments, we’d ideally have detailed research like Luca’s on most major topics where we collect forecasts — ideally from a few experts who disagree with each other. Unfortunately, this research often doesn’t readily exist, but we are investigating ways to generate it.
I have been somewhat surprised by how few experts in AI industry, AI policy, and other domains predict transformative impacts of AI similar to those commonly discussed by AI lab leaders, people in the AI safety community, and others. This has made it harder to have a true horse-race between the ‘transformative AI’ school of thought that seems to drive a lot of discourse and decision-making vs. more gradual views of AI impacts. Though we have some transformative AI forecasters in our studies, in future work we aim to explicitly collect more forecasts from the ‘transformative AI’ school of thought in order to set up clearer comparisons between worldviews and to better anticipate what will happen if the ‘transformative AI’ school makes more accurate forecasts.
I will save other thoughts on how forecasting, and FRI’s research, could be made more useful to decision-makers for a future post.
But, to be clear: I have a lot of genuine uncertainty about whether forecasting research will be sufficiently impactful going forward. There are promising signs, and increasing momentum, but to more fully deliver on its promise, more improvements will be necessary.
Some notes on FRI-style forecasting research vs. other forecasting interventions
On the value of FRI-style forecasting research in particular:
Prediction markets do not have good ways to collect causal policy forecasts, but in our experience, conditional policy forecasts (e.g., how much would various safeguards reduce AI-cyber risk) are often the most helpful forecasts for decision-makers.
Similarly, prediction markets do not create good incentives for longer run forecasts or low-probability forecasts, and incentivize against sharing the rationales behind forecasts. Directly paying and incentivizing relevant experts and forecasters to answer questions is often more useful.
Typical forecasting platforms do not get forecasts from the kinds of experts that policymakers typically rely on, and aren’t the kind of evidence that can easily be cited in government reports. (This may be unfortunate, but it is the current state of the world.)
Reasons for optimism about future impact
Finally, there are a few factors that have the potential to dramatically change the field going forward:
It looks like AI may soon make it >100x cheaper and faster to get high-quality forecasts on any topic of one’s choosing. Policy researchers will be able to ask the precise question they’re interested in, will be able to upload confidential documents to inform forecasts (something we’ve heard is especially important to decision-makers), and will be able to get detailed explanations for all forecasts. AI-produced forecasts will also be much easier to test for accuracy due to the volume of forecasts they can provide, and it will be easier to generate ‘crux’ questions since AI will not get bored of producing huge numbers of conditional forecasts (which are necessary for identifying cruxes). Building benchmarks and tooling to harness AI-produced forecasts will be a much larger part of our work going forward.
The next few years seem very unusual in human history: very thoughtful researchers are predicting “Superhuman Coders” by 2029, with attendant large impacts. There is a spectrum of views, but the scope for disagreement among reasonable people about what the world will look like in 2030 is huge. This is a particularly important time to make predictions testable, update on what we observe, and make better policy and personal decisions on the basis of this information.
People working in the AI space seem particularly interested in using forecasting, perhaps due to a mix of being quantitatively oriented and because they’re facing unusual degrees of uncertainty. This bodes well for forecasting being useful in the coming years. More minorly, it appears that there is a broader cultural change around forecasting-related topics. Prediction markets are increasingly being cited by government officials, and the public is paying more attention to them than ever before. Much of the impact for prediction markets specifically seems negative (e.g. via incentivizing gambling on low-value topics), but the broader cultural shift suggests there may be an opportunity for better uses of forecasting to enter public consciousness as well.
Stripped of all AI-centred argumentation, the reply is left mostly empty. This suggests that judgmental forecasting, at least as exercised by FRI, should perhaps be thought of as a sub-domain of AI safety. In such a case, its impact would need to be evaluated in the portfolio context of all AI safety budgets, meaning a much higher hurdle rate would have to be cleared to justify its activities.
What applies more broadly to judgmental forecasting and online betting platforms, and is also the basis for many arguments in this defence of forecasting, is circular reasoning about the field’s importance, frequently repeated by those in the field and those adjacent to it. But, in contrast to the opinionated voices, the evidence is lacking. Merely stating that forecasting has informed some policy or that career decisions have been influenced is not sufficient. Similarly, whether its impact is positive or negative is taken at face value and never substantiated.
All this isn’t to say that judgmental forecasting research or its funding should be dispensed with. In fact, hybrids that combine quantitative predictive models with expert judgment are among the foundational tools of large organisations’ decision-making processes. However, I believe the field’s association with online betting (high time we called things what they are) as well as over-reliance on AI for its services is actually hurting it.
Whose job is it to identify EA questions which could benefit from better forecasts?
Consider two different hypotheses:
Forecasting is only helpful for AI
Forecasting is helpful outside of AI, but AI has captured much more forecasting interest than other cause areas
How much time are non-AI org leaders spending trying to think up decision-relevant forecasts related to their cause areas?
If leaders are not spending any time trying to think up such forecasts, maybe there is low-hanging fruit here. Maybe EA has latent forecasting capability which can be tapped to improve organizational decision-making. Or maybe such forecasting capability will free up in a few years if AI turns out to be a nothingburger.
If leaders have spent a lot of time trying to think up useful forecasts, and failed, maybe forecasting really is fairly useless outside of AI.
If I were leading a non-AI EA organization, and I had a forecast I really wanted to see the result of, who would I even talk to? Which forecasting organizations are actively soliciting ideas for EA-related forecast questions?
It seems to me that a lot of what EA does is implicit forecasting in some sense, e.g. if you give someone a grant, it’s an implicit forecast about the probability that they will be able to accomplish something with that grant. EA is often critiqued for neglecting “systemic change”. If you want to do systemic change, being able to forecast the effects of various systemic changes is really useful. If you take any action, there’s an implicit forecast that it will lead to a good outcome and not backfire somehow. Wouldn’t it be better to make this forecast explicit? All else equal, wouldn’t it be good to get some perspective from people outside of the organization, who are perhaps forecasting in their free time as a replacement for watching TV or other downtime activities?
My understanding of the original post’s intent is that it calls for evidence of the field’s impact, given the funding it receives. I don’t believe it critiques judgmental forecasting as an analytical method and neither do I think that I signal this in my comment.
I stand by my opinion, however, that the community is correct to ask for tangible proof, the burden of which rests on the organizations that receive the funding.
I regret if this doesn’t satisfy the questions in your comment.
The bulk of our funding has gone toward AI-focused forecasting projects (e.g. LEAP, AI-biorisk, economic effects of AI) or ‘automating forecasting research’-type work that has the ultimate goal of assisting decisionmakers (e.g. ForecastBench), so I think this is most of what FRI should be evaluated on.
I’m not sure what comparison class people had in mind previously, but I agree it seems broadly correct to consider this work alongside other AI-related funding opportunities. As noted above, I’d argue that it is appropriate and valuable to have “AI measurement” as an important funding domain alongside areas like “AI governance,” “Technical AI safety research,” “AI field-building,” etc. It seems valuable for one part of the AI grantmaking portfolio to be generating evidence that can be used to sharpen views on AI timelines, to assess risk in various domains (bio, cyber, catastrophic risk), to assess magnitudes of benefits (for calibrating cost-benefit analyses on policies), and to predict the likelihood and impact of various policies (e.g. the effectiveness of DNA synthesis screening for biorisk), etc. This type of fundamental research can inform and support more effective action in the other domains.
I also think forecasting research can have direct impacts on AI governance via direct decision-making partnerships like I described above: i.e., directly partnering with and advising important government agencies and frontier AI companies, among others, on high-stakes decisions related to AI regulation, implementing effective safeguards to reduce AI-cyber risk, and more. We have already seen some early impacts along these lines, as previously mentioned.
I agree. Due to confidentiality, we have primarily shared details of our impact case studies with our funders and had them assess the value of the impact we are making. Establishing evidence of impact publicly is more challenging due to confidentiality considerations. But elsewhere in the thread people have mentioned citations as one reasonable metric for evidence of impact for research organizations that have more diffuse impacts. We have targets for growing our prominent citations over time to assess our impact, and I’ve shared examples of prominent citations to FRI research in my comment above. I also hope that over time, we can share more case studies publicly and provide more of the reasoning for why we believe we had an impact and whether it was positive. The benchmarks RFP case study described above is one example that can be discussed relatively publicly.
I broadly agree on these points. We are running longitudinal expert panels, partnering with important institutions to improve their decision-making, and automating forecasting research, so I see our work as distinct from online betting/forecasting platforms.
Hi Josh, thanks for the response.
I hate to do this, especially at the start, but I want to point out for you and others who have jobs related to forecasting that it’s difficult to convince someone of something when their job relies on them not believing it. I think you should assume that you will think forecasting is more useful than it is.
As for your points, I’ll respond to some of them.
If you want to DM me, I can sign an NDA, and I may update my opinion depending on what these non-public uses of forecasting are.
I don’t think this is all that relevant. I’m not sure what forecasting research has really elicited on AI timelines. I agree that talk about timelines creates a lot of “buzz” around AI but depending on your views, this is good or bad.
I agree that the impact of measurement-oriented research is difficult to measure, but importantly, not impossible. OWID, for example, should count how much their work is being cited and looked up. Similarly, I think it would be good to estimate, for FRI, how much $$ the changed decision was worth and by what amount/percentage FRI made that change more likely. I don’t think you really gave a good reason that FRI should be funded over anything else that simply has very diffuse benefits.
When do you think it’s reasonable, if ever, for the EA community to “give up” on funding more forecasting work?
If I’m being cynical, almost every field can say “AI will transform the field” though I’m not sure how much this is worth debating.
Not Josh, and also conflicted through the Social Science Prediction Platform (though we had pretty minimal funding from EA sources), but I wonder if it would be worth pooling non-public projects we know of and making BOTE estimates of hypothetical impact. It’s tricky because I don’t know of any RCTs (though I’m working on one now). But I’m extremely confident that across us we would think of some combination of orgs/governments that collectively spend over $100 billion per year (… I can think of that alone) that are interested in forecasts in different ways. Now, imo the vast majority of places interested in forecasting are not going to do anything substantive with it, and it’s hard to know what it means for one of these places to integrate forecasts. For example, for an org spending $X, do forecasts inform 1% of their funding or what? Of the share they inform, how much do they move the needle? Since estimates from people who work on forecasts may be optimistic (I’m not paid at all for it, but I choose to work on it because I think it’s useful), I’m happy to describe the situation to an outside observer privately.
Hi Eva,
I think the Social Science Prediction Platform (along with a similar project a friend of mine is running for clinical trials) is among the more interesting uses of forecasting/PMs, but I’m skeptical it will be taken up to the degree, or with the impact, you might hope for.
I’m skeptical of things of the form “small percentage chance * big number”. I think humans are really bad at estimating small percentages.
Would be happy to talk privately about any situations you are thinking of.
Thanks! I agree, I’m also generally skeptical of small chance * big number things—I was not intending 1% as an anchor but as an open question—and not as a probability but as a concrete percent of the funding. For example, a big org uses forecasts, but perhaps they only use them in particular workstreams responsible for X% of funding, and those workstreams could be tracked. Then out of X%, how much do they move the needle?
Anyway, happy to chat sometime!
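As a rough illustration of the back-of-the-envelope decomposition discussed above (an org's spending, times the share of that spending in workstreams that use forecasts, times how much forecasts move the needle within those workstreams), here is a minimal Python sketch. The function name, parameter names, and every number are hypothetical placeholders chosen for illustration; none of them are estimates from anyone in this thread, and the simple multiplicative form is itself an assumption.

```python
def forecast_attributable_value(budget, share_informed, needle_shift):
    """Back-of-the-envelope value of forecasts to one organization.

    budget: annual spending in dollars (hypothetical input)
    share_informed: fraction of spending in workstreams that use forecasts
    needle_shift: fraction by which forecasts improve outcomes there
    """
    # Both fractions are assumed to lie in [0, 1]; this sketch does not
    # model the case where forecasts make decisions worse.
    for name, frac in (("share_informed", share_informed),
                       ("needle_shift", needle_shift)):
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return budget * share_informed * needle_shift


# Hypothetical scenarios only; none of these figures come from the thread.
scenarios = {
    "pessimistic": (1_000_000_000, 0.001, 0.01),
    "middle": (1_000_000_000, 0.01, 0.05),
    "optimistic": (1_000_000_000, 0.05, 0.10),
}

for label, (budget, share, shift) in scenarios.items():
    value = forecast_attributable_value(budget, share, shift)
    print(f"{label}: ${value:,.0f}")
```

The point of writing it out is that the two fractions are separate quantities that could each be tracked, rather than one small probability multiplied by a big number; the spread across scenarios shows how sensitive the result is to both.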
(COI flag: I have an application out with FRI)
“Prediction markets are increasingly being cited by government officials, and the public is paying more attention to them than ever before. Much of the impact for prediction markets specifically seems negative (e.g. via incentivizing gambling on low-value topics), but the broader cultural shift suggests there may be an opportunity for better uses of forecasting to enter public consciousness as well.”
I think that this is a reason for pessimism on impact, not optimism. Kalshi and Polymarket are primarily sports gambling platforms by volume, immune to state regulation for reasons that may, in the perspective of a cynic, be related to them paying Donald Trump Jr. undisclosed sums of money for undisclosed quantities of work. This does not, I think, inspire particular trust in their efficacy or accuracy. The new legislative push could shift this (I haven’t dug into it deeply), but by default I expect the shift from “odd thing some experts claim is good” to “the tool for corruption, leaking military secrets, insider trading, and sports gambling” to worsen perceptions of accuracy (broadly defined halo effect).