Yarrow Bouchard 🔸 comments on A major flaw in the Forecasting Research Institute’s “Longitudinal Expert AI Panel” survey

Yarrow Bouchard 🔸 17 Nov 2025 23:16 UTC
Thanks for replying again. This is helpful. (I am strongly upvoting your comments because I’m grateful for your contribution to the conversation and I think you deserve to have that little plant icon next to your name go away.)

Apologies for the word count of this comment. I’m really struggling to compress what I’m trying to say to something shorter.
On “extreme”: Thank you for clarifying that non-standard/technical use of the word “extreme”. I was confused because I just interpreted it in the typical, colloquial way.
On the content of the three scenarios: I have a hard time understanding how someone could say the slow progress scenario has already been achieved (or that it represents barely an improvement over existing capabilities), but the more I have these kinds of discussions, the more I realize people interpret exactly the same descriptions of hypothetical future AI systems in wildly different ways.

This seems like a problem for forecasting surveys — different respondents may mean completely different things yet, on paper, their responses are exactly the same. (I don’t fault you or your co-authors for this, though, because you didn’t create this problem and I don’t think that I could do any better at writing unambiguous scenarios.)
But, more importantly, it’s also a problem that goes far beyond the scope of just forecasting surveys. It’s a problem for the whole community of people who want to have discussions about AI progress, which we have a shared responsibility to address. I am not sure quite what to do yet, but I’ve been thinking about it a bit over the last few weeks.^[1]
On intersubjective resolution/metaprediction: My confusion about the intersubjective resolution or metaprediction for the three-scenarios question is that I don’t know how respondents are supposed to express their probability of a scenario being the best match vs. how ambiguous or unambiguous they think the resolution of the prediction will be. If I think there’s a 51% chance that the Singularity will happen before the end of 2030, in which case the prediction would resolve completely unambiguously for the rapid progress scenario, what should my response to the survey be?
Should I predict 100% of respondents will agree, retrospectively, that the rapid progress scenario is the best matching one, since that is what will happen in the scenario I think is 51% probable? Or should I predict 51% of respondents will pick the rapid progress scenario, even though that’s not what the question is literally asking, because 51% is my probability? (Let’s say for simplicity I think there’s a 51% chance of an unambiguous Singularity of the sort described by futurists like Ray Kurzweil or Vernor Vinge before December 2030 and a 49% chance AI will make no meaningful progress between now and December 2030. And nothing in between.)
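To make the dilemma concrete, here is a minimal Python sketch of the two readings under the simplified two-outcome assumption above. The function names and constants are hypothetical and only mirror my toy numbers; this is an illustration of my confusion, not a description of how the survey actually defines or scores the question.

```python
# Toy two-outcome model from the paragraph above. All numbers are the
# hypothetical ones stated there, not survey data.
P_SINGULARITY = 0.51  # my probability of an unambiguous Singularity by end of 2030

# Assumed share of respondents who would retrospectively pick "rapid progress"
# in each world: everyone in the Singularity world, no one otherwise.
SHARE_IF_SINGULARITY = 1.0
SHARE_IF_NO_PROGRESS = 0.0


def expected_share(p, share_if_true, share_if_false):
    """Probability-weighted average of the share picking the scenario."""
    return p * share_if_true + (1 - p) * share_if_false


def most_likely_share(p, share_if_true, share_if_false):
    """Share picking the scenario in the single most probable world."""
    return share_if_true if p > 0.5 else share_if_false


mean_answer = expected_share(
    P_SINGULARITY, SHARE_IF_SINGULARITY, SHARE_IF_NO_PROGRESS
)
mode_answer = most_likely_share(
    P_SINGULARITY, SHARE_IF_SINGULARITY, SHARE_IF_NO_PROGRESS
)

print(f"expected share picking rapid progress: {mean_answer:.0%}")
print(f"share in the most likely world:        {mode_answer:.0%}")
```

Under these assumptions the first reading gives 51% and the second gives 100%, while the share that actually materializes would be either 0% or 100% and never 51%. That gap between the two candidate answers is exactly what I don’t know how to resolve.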

It’s possible I just have no idea how intersubjective resolution/metaprediction is supposed to work, but then, was this explained to the respondents? Can you count on them understanding how it works?

On “tasks” vs. “occupations”: I agree that, once you think about it, you can understand why people would think automating all “tasks” and automating all “occupations” wouldn’t mean the same thing. However, this is not obvious (at least, not to everyone) in advance of asking two variants of the question and noticing the difference in the responses. The reasoning is that, logically, an occupation is just a set of tasks, so an AI that can do all tasks can also do all occupations. The authors of the AI Impacts survey were themselves surprised by the framing effect here. On page 7 of their pre-print about the survey, they say (emphasis added by me):
Predictions for a 50% chance of the arrival of FAOL are consistently more than sixty years later than those for a 50% chance of the arrival of HLMI. This was seen in the results from the surveys of 2023, 2022, and 2016. This is surprising because HLMI and FAOL are quite similar: FAOL asks about the automation of all occupations; HLMI asks about the feasible automation of all tasks. Since occupations might naturally be understood either as complex tasks, composed of tasks, or closely connected with one of these, achieving HLMI seems to either imply having already achieved FAOL, or suggest being close.

We do not know what accounts for this gap in forecasts. Insofar as HLMI and FAOL refer to the same event, the difference in predictions about the time of their arrival would seem to be a framing effect.

However, the relationship between “tasks” and “occupations” is debatable. And the question sets do differ beyond definitions: only the HLMI questions are preceded by the instruction to “assume that human scientific activity continues without major negative disruption,” and the FAOL block asks a sequence of questions about the automation of specific occupations before asking about full automation of labor. So conceivably this wide difference could be caused by respondents expecting major disruption to scientific progress, or by the act of thinking through specific examples shifting overall anticipations. From our experience with question testing, it also seems possible that the difference is due to other differences in interpretation of the questions, such as thinking of automating occupations but not tasks as including physical manipulation, or interpreting FAOL to require adoption of AI in automating occupations, not mere feasibility (contrary to the question wording).
The broader problem with Benjamin Tereick’s reply is that he seems to be saying (if I’m understanding correctly) you can conclude there is no significant framing effect just by looking at the responses to one variant of one question. But if the AI Impacts survey had only asked about HLMI and not FAOL, and had just assumed the two were logically equivalent and equivalent in the eyes of respondents, how would the authors know, just from that information, whether the HLMI question was susceptible to a significant framing effect? They wouldn’t know.

I don’t see how someone could argue that the authors of the AI Impacts survey would be able to infer from the results of just the HLMI question, without comparing it to anything else, whether or not the framing of the question introduced significant bias. They wouldn’t know. You have to run the experiment to know — that’s the whole point. Benjamin’s argument, which I may just be misunderstanding, seems analogous to the argument that a clinical trial of a drug doesn’t need a control group because you can tell how effective the drug is just from the experimental group. (Benjamin, what am I missing here?)

That’s why I brought up the AI Impacts survey example and the 2023 Forecasting Research Institute survey example. Just to drive home the point that framing effects/question wording bias/anchoring effects can be extremely significant, and we don’t necessarily know that until we run two versions of the same question. So, I’m glad that you at least agree with the general point that this is an important topic to consider.
1. ^
  I think, unfortunately, it’s not a problem that’s easily or quickly resolved, but will most likely involve a lot of reading and writing to get everyone on the same page about some core concepts. I’ve tried to do a little bit of this work already in posts like this one, but that’s just a tiny step in the right direction. Concepts like data efficiency, generalization, continual learning, and fluid intelligence are helpful and much under-discussed. Open technical challenges like learning efficiently from video data (a topic the AI researcher Yann LeCun has talked a lot about) and complex, long-term hierarchical planning (a longstanding problem in reinforcement learning) are also helpful for understanding what the disagreements are about and are also much under-discussed.
  
  One of the distinctions that seems to be causing trouble is understanding intelligence as the ability to complete tasks vs. intelligence as the ability to learn to complete tasks.
  
  Another problem is people interpreting (sometimes despite instructions or despite what’s stipulated in the scenario) an AI system’s ability to complete a task in a minimal, technical sense vs. in a robust, meaningful sense, e.g., an LLM writing a terrible, incoherent novel that nobody reads or likes vs. a good, commercially successful, critically well-received novel (or a novel at that quality level).
  
  A third problem is (again, sometimes despite warnings or qualifications that were meant to forestall this) around reliability: the distinction between an AI system being able to successfully complete a task sometimes, e.g., 50% or 80% or 95% of the time, vs. being able to successfully complete it at the same rate as humans, e.g. 99.9% or 99.999% of the time.
  
  I suspect, but don’t know, that another interpretive difficulty for scenarios like the ones in your survey is around people filling in the gaps (or not). If we say in a scenario that an AI system can do these five things we describe, like make a good song, write a good novel, load a dishwasher, and so on, some people can interpret that to mean the AI system can only do those five things. Other people can interpret these tasks as just representative of the overall set of tasks the AI system can do, such that there are a hundred or a thousand or a million other things it can do, and these are just a few examples.
  
  A little discouragingly, similar problems have persisted in discussions around philosophy of mind, cognitive science, and AI for decades — for example, in debates around the Turing test — despite the masterful interventions of brilliant writers who have tried to clear up the ambiguity and confusion (e.g. the philosopher Daniel Dennett’s wonderful essay on the Turing test “Can machines think?” in the anthology Brainchildren).