(An author of the report here.) Thanks for engaging with this question and providing your feedback! I'll share a few of my thoughts, but I will first note that EA Forum posts by individuals affiliated with FRI do not constitute official positions.
I do think the following qualification we provided to forecasters (also noted by Benjamin) is important: "Reasonable people may disagree with our characterization of what constitutes slow, moderate, or rapid AI progress. Or they may expect to see slow progress observed with some AI capabilities and moderate or fast progress in others. Nevertheless, we ask you to select which scenario, in sum, you feel best represents your views."
I would also agree with Benjamin that "best matching" covers scenarios with slower and faster progress than the slow and fast progress scenarios, respectively. And I believe our panel is sophisticated and capable of understanding this feature. Additionally, this question was not designed to capture the extreme possibilities for AI progress, and I personally wouldn't use it to inform my views on those extreme possibilities (I think the mid-probability space is interesting and underexplored, and we want LEAP to fill this gap). That said, you are correct that we ought to include the "best matching" qualification when we present these results, and I've added this to our paper revision to-do list. Thanks for pointing that out.
I think other questions in the survey do a better job of covering the full range of possibilities, both in scenarios questions (i.e., TRS) and our more traditional, easily resolvable forecasting questions. The latter group comprises the vast majority of our surveys. I think it's impossible to write a single forecasting question that satisfies any reasonable and comprehensive set of desiderata, so I'd view LEAP as a portfolio of questions.
On edit #2, I would first note that it is challenging to write a set of scenarios for AI progress without an explosion of scenarios (and an associated increase in survey burden, which would itself degrade response quality); we face a tradeoff between parsimony and completeness. This specific question in the first survey is uniquely focused on parsimony, and we attempted to include questions that take other stances on that tradeoff. However, we'd love to hear any suggestions you have for writing these types of questions, as we could certainly improve on this front. I think you've identified many of the shortcomings in this specific question already. Second, I would defend our choice to present the results as probabilities (but we should add the "best matching" qualifier). We're making an appeal to intersubjective resolution; Witkowski et al. (2017) is one example, and some people at FRI have done similar work (Karger et al. 2021). These metrics rely on wisdom-of-the-crowd effects. Again, however, I don't think it's clear that we're making this appeal, so I've added a note to clarify this in the paper. We use a resolution criterion (metaprediction) that some find unintuitive, but it allows us to incentivize this question. But others might argue that incentives are less important.
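To make the metaprediction idea a bit more concrete, here is a minimal sketch (in Python) of one way a forecast over the three scenarios could be scored against the panel's own retrospective judgments. The Brier-style rule and all of the numbers below are illustrative assumptions only, not necessarily the exact rule or data we use.

```python
# Purely illustrative sketch of a metaprediction-style resolution:
# a forecast over the three scenarios is scored against the share of the
# panel that retrospectively judges each scenario "best matching".
# The Brier-style rule and all numbers below are assumptions for illustration.

def brier_score(forecast: dict, resolution: dict) -> float:
    """Mean squared error between forecast probabilities and the
    retrospective panel shares (lower is better)."""
    return sum((forecast[s] - resolution[s]) ** 2 for s in forecast) / len(forecast)

# A forecaster's probabilities that each scenario best matches reality in 2030.
forecast = {"slow": 0.25, "moderate": 0.55, "rapid": 0.20}

# Hypothetical shares of the panel picking each scenario in a 2030 retrospective poll.
panel_shares = {"slow": 0.10, "moderate": 0.70, "rapid": 0.20}

print(round(brier_score(forecast, panel_shares), 4))  # 0.015
```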
While I think framing effects obviously matter in surveys, I do think that your edit #3 is conflating an elicitation/measurement/instrumentation issue in low-probability forecasting with the broader phenomenon of framing, which I view as being primarily but not exclusively about question phrasing. We're including tests on framing and the elicitation environment in LEAP itself to make sure our results aren't too sensitive to any framing effects, and we'll be sharing more on those in the future. I'd love to hear any ideas for experiments we should run there.
In sum, I largely defend the choices we made in writing this question. LEAP includes many different types of questions, because consumers of the research will have different views of the types of questions they will find informative and convincing. I will note that even within FRI some people personally find the scenarios questions much less compelling than the other questions in the survey. Nevertheless, I think you identified issues with our framing of the results, and we will make some changes. I appreciate you laying out your criticisms of the paper clearly so that we can dig into them, and I'd welcome any additional feedback!
Thank you very much for your reply. I especially appreciate your willingness to revise how your results are described in the report. (I hope you will make the same revision in public communications as well, such as blog posts or posts on this forum.) A few responses, which I tried and failed to keep succinct:
I don't think the "best matching" qualifier mitigates the reasonable concern (not just mine, but others' as well) with the three-scenario framing. The concern is that the design of the question may create an anchoring effect. The sophistication of the respondents also does not mitigate this concern.
I don't think the disclaimer you and Benjamin Tereick both quoted makes much difference to our discussion here, on this forum. It's just an elaboration on what "best matching" means, which is good to include in a survey, but which is already intuitive to readers here.
You seem to be saying that something significantly less than the slow progress scenario would be an "extreme possibility". Is that correct, or am I misunderstanding? If so, I strongly disagree. If you and the other authors of this report view that as an extreme possibility, I would worry that you are baking your own personal AI forecasts into the design of the survey.
If I were designing a survey question like this, I would make the lowest progress scenario one in which there is very little progress toward more powerful, more general AI systems. For example, a scenario in which LLM and AI progress more or less stagnates from now until 2031, or a scenario in which only modest, incremental progress occurs. If you wanted to stick with three scenarios, that could be the slow progress scenario. (I'm no expert on survey design, but I don't see how you could reasonably avoid having something like that as your lowest progress scenario and still have a well-designed question.)
My complaint about framing the results as experts' "probabilities" is that this directly contradicts the "best matching" qualifier. I didn't raise a complaint with the intersubjective resolution framing of the question.
That being said, I do find the intersubjective resolution framing counterintuitive. I didn't bring this up until now because I find it difficult to wrap my head around, I wasn't sure if I was misunderstanding the logic behind it, and it seemed like a distraction from the more important point. The reason I find this framing counterintuitive is best explained through an analogy. Let's say you want to ask experts about the probability of the volcano Eyjafjallajökull in Iceland erupting again between now and 2050, and you ask, "In 2050, what percentage of volcanologists will say Eyjafjallajökull has erupted?" This is confusing because you would think the percentage would be around 0% or 100% in virtually any realistic scenario. If someone thinks there's a 51% chance the volcano will erupt, then they should say 100% of volcanologists in 2050 will think that. If they think there's a 49% chance it will erupt, they will say 0%. I don't understand how you translate this into probabilities, since the only numbers the respondents are telling you are 0% and 100%, and neither is their probability of an eruption. Even if there is a logical way to do this that I'm not getting, can you rely on all your survey respondents to understand the question in the way you intended and answer accordingly?
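To make my worry concrete, here is a small sketch with made-up respondent probabilities: if everyone answers the literal question by reporting the outcome they consider most likely, the individual answers collapse to 0% or 100%, and the crowd average tracks the share of respondents above 50% rather than their average probability.

```python
# Made-up respondent probabilities of an eruption by 2050 (illustrative only).
respondent_probs = [0.45, 0.52, 0.55, 0.58, 0.60]

# Literal reading described above: each respondent reports the percentage of
# volcanologists they consider most likely, i.e. ~100% if they think an
# eruption is more likely than not, and ~0% otherwise.
literal_answers = [100 if p > 0.5 else 0 for p in respondent_probs]

avg_probability = 100 * sum(respondent_probs) / len(respondent_probs)
avg_literal_answer = sum(literal_answers) / len(literal_answers)

print(f"average stated probability: {avg_probability:.0f}%")   # 54%
print(f"average literal answer:     {avg_literal_answer:.0f}%")  # 80%
```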
With regard to edit #3, I was indeed only pointing out an example of the broader phenomenon of framing bias and question wording bias. Another example I raised in the comments was the AI Impacts survey that found a 69-year difference in the median date of AGI essentially just by changing the word "tasks" to "occupations". Most people would not expect a difference that large ahead of time, and many have expressed surprise at how much the responses to the two versions of the question diverge. It's relevant to bring up because Benjamin Tereick's argument was that the results of the one version of the three-scenarios question don't seem to him to indicate a question wording bias. My counterargument is that you can't tell how much bias there is from just one version of the question; you have to ask at least two different versions and compare the results. For what it's worth, the concern with the three-scenarios framing is a potential anchoring effect, and I think it makes sense to understand percentage probabilities as causing an anchoring effect for the reason titotal explained here.
Thanks again for a helpful, cooperative, and open reply.
Thanks for following up! I am using "extreme" in a very narrow sense, meaning anything above or below the scale provided for this specific question, rather than in any normative sense, or making any statement about probabilities. I think people interpret this word differently. I additionally think we have some questions that represent a broader swath of possible outcomes (e.g., TRS), taking a different position on the parsimony and completeness frontier. I suspect we have different goals in mind for this question.
I think others would argue that the slow progress scenario is barely an improvement over current capabilities. Given the disagreement people have over current capabilities, this disagreement about how much progress a certain scenario represents will always exist. Notably, we had some people who take the opposite stance from yours: that the slow progress scenario has already been achieved.
I would maintain that we can express these results as the probability that reality best matches a certain scenario, hence the needed addition of the "best matches" qualifier. So, I'm not following your points here, apologies.
And for what it's worth, I think the view that tasks = occupations is reasonably disputed. Again, I still grant the point that framing matters, and absolutely could be at play here. In fact, I'd argue that it's always at play everywhere, and we can and should do our best to limit its influence.
This is a really great exchange, and thank you for responding to the post.
I just wanted to leave a quick comment to say: It seems crazy to me that someone would say the "slow" scenario has "already been achieved"!
Unless I'm missing something, the "slow" scenario says that half of all freelance software engineering jobs taking <8 hours can be fully automated, that any task a competent human assistant can do in <1 hour can be fully automated with no drop in quality (what if I ask my human assistant to solve some ARC-2 problems for me?), that the majority of customer complaints in a typical business will be fully resolved by AI in those businesses that use it, and that AI will be capable of writing hit songs (at least if humans aren't made aware that they are AI-generated)?
I suppose the scenario is framed only to say that AI is capable of all of the above, rather than that it is being used like this in practice. That still seems like an incorrect summary of current capability to me, but is slightly more understandable. But in that case, it seems the scenario should have just been framed that way: "Slow progress: No significant improvement in AI capabilities from 2025, though possibly a significant increase in adoption". There could then be a separate question about where people think current capabilities stand.
Otherwise disagreements about current capabilities and progress are getting blurred in the single question. Describing the "slow" scenario as "slow" and putting it at the extreme end of the spectrum is inevitably priming people to think about current capabilities in a certain way. Still struggling to understand the point of view that says this is an acceptable way to frame this question.
Thanks for the thoughts! The question is indeed framed as being about capabilities and not adoption, and this is absolutely central.
Second, people have a wide range of views on any given topic, and surveys reflect this distribution. I think this is a feature, not a bug. Additionally, if you take any noisy measurement (which all surveys are), reading too much into the tails can lead one astray (I don't think that's happening in this specific instance, but I want to guard against the view that the existence of noise implies the nonexistence of signal). Nevertheless, I do appreciate the careful read.
Your comments here are part of why I think the third disclaimer we added, which allows for jagged capabilities, is important. Additionally, we don't require that all capabilities are achieved, hence the "best matching" qualifier, rather than looking at the minimum across the capabilities space.
We indeed developed/tested versions of this question which included a section on current capabilities. Survey burden is another source of noise/bias in surveys, so such modifications are not costless. I absolutely agree that current views of progress will impact responses to this question.
I'll reiterate that LEAP is a portfolio of questions, and I think we have other questions where disagreement about current capabilities is less of an issue because the target is much less dependent on subjective assessment, but those questions will sacrifice some degree of being complete pictures of AI capabilities. Lastly, any expectation of the future necessarily includes some model of the present.
Always happy to hear suggestions for a new question or revised version of this question!
Thanks for replying again. This is helpful. (I am strongly upvoting your comments because I'm grateful for your contribution to the conversation and I think you deserve to have that little plant icon next to your name go away.)
Apologies for the word count of this comment. I'm really struggling to compress what I'm trying to say to something shorter.
On "extreme": Thank you for clarifying that non-standard/technical use of the word "extreme". I was confused because I just interpreted it in the typical, colloquial way.
On the content of the three scenarios: I have a hard time understanding how someone could say the slow progress scenario has already been achieved (or that it represents barely an improvement over existing capabilities), but the more I have these kinds of discussions, the more I realize people interpret exactly the same descriptions of hypothetical future AI systems in wildly different ways.
This seems like a problem for forecasting surveys: different respondents may mean completely different things, yet on paper their responses are exactly the same. (I don't fault you or your co-authors for this, though, because you didn't create this problem and I don't think that I could do any better at writing unambiguous scenarios.)
But, more importantly, it's also a problem that goes far beyond the scope of just forecasting surveys. It's a problem for the whole community of people who want to have discussions about AI progress, which we have a shared responsibility to address. I am not sure quite what to do yet, but I've been thinking about it a bit over the last few weeks.[1]
On intersubjective resolution/metaprediction: My confusion about the intersubjective resolution or metaprediction for the three-scenarios question is that I don't know how respondents are supposed to express their probability of a scenario being best matching vs. expressing how ambiguous or unambiguous they think the resolution of the prediction will be. If I think there's a 51% chance that before the end of 2030 the Singularity will happen, in which case the prediction would resolve completely unambiguously for the rapid progress scenario, what should my response to the survey be?
Should I predict 100% of respondents will agree, retrospectively, that the rapid progress scenario is the best matching one, since that is what will happen in the scenario I think is 51% probable? Or should I predict 51% of respondents will pick the rapid progress scenario, even though that's not what the question is literally asking, because 51% is my probability? (Let's say for simplicity I think there's a 51% chance of an unambiguous Singularity of the sort described by futurists like Ray Kurzweil or Vernor Vinge before December 2030 and a 49% chance AI will make no meaningful progress between now and December 2030. And nothing in between.)
It's possible I just have no idea how intersubjective resolution/metaprediction is supposed to work, but then, was this explained to the respondents? Can you count on them understanding how it works?
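For concreteness, here is the 51/49 toy case worked through under the two readings I can imagine. The numbers come from my example above, and I genuinely don't know which reading the survey intends.

```python
# Toy case from above: 51% chance of an unambiguous Singularity by Dec 2030,
# 49% chance of no meaningful progress, and nothing in between.
p_singularity = 0.51

# Reading 1: report the retrospective share in the world I consider most likely.
# In the Singularity world, presumably ~100% of respondents would pick "rapid".
answer_most_likely_world = 100 if p_singularity > 0.5 else 0          # -> 100

# Reading 2: report my expected retrospective share across both worlds,
# which in this toy case just reproduces my probability.
answer_expected_share = p_singularity * 100 + (1 - p_singularity) * 0  # -> 51.0

print(answer_most_likely_world, answer_expected_share)
```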
On "tasks" vs. "occupations": I agree that, once you think about it, you can understand why people would think automating all "tasks" and automating all "occupations" wouldn't mean the same thing. However, this is not obvious (at least, not to everyone) in advance of asking two variants of the question and noticing the difference in the responses. The reasoning is that, logically, an occupation is just a set of tasks, so an AI that can do all tasks can also do all occupations. The authors of the AI Impacts survey were themselves surprised by the framing effect here. On page 7 of their pre-print about the survey, they say (emphasis added by me):
Predictions for a 50% chance of the arrival of FAOL are consistently more than sixty years later than those for a 50% chance of the arrival of HLMI. This was seen in the results from the surveys of 2023, 2022, and 2016. This is surprising because HLMI and FAOL are quite similar: FAOL asks about the automation of all occupations; HLMI asks about the feasible automation of all tasks. Since occupations might naturally be understood either as complex tasks, composed of tasks, or closely connected with one of these, achieving HLMI seems to either imply having already achieved FAOL, or suggest being close.
We do not know what accounts for this gap in forecasts. Insofar as HLMI and FAOL refer to the same event, the difference in predictions about the time of their arrival would seem to be a framing effect.
However, the relationship between "tasks" and "occupations" is debatable. And the question sets do differ beyond definitions: only the HLMI questions are preceded by the instruction to "assume that human scientific activity continues without major negative disruption," and the FAOL block asks a sequence of questions about the automation of specific occupations before asking about full automation of labor. So conceivably this wide difference could be caused by respondents expecting major disruption to scientific progress, or by the act of thinking through specific examples shifting overall anticipations. From our experience with question testing, it also seems possible that the difference is due to other differences in interpretation of the questions, such as thinking of automating occupations but not tasks as including physical manipulation, or interpreting FAOL to require adoption of AI in automating occupations, not mere feasibility (contrary to the question wording).
The broader problem with Benjamin Tereick's reply is that he seems to be saying (if I'm understanding correctly) that you can conclude there is no significant framing effect just by looking at the responses to one variant of one question. But if the AI Impacts survey had only asked about HLMI and not FAOL, and had just assumed the two were logically equivalent and equivalent in the eyes of respondents, how would they know, just from that information, whether the HLMI question was susceptible to a significant framing effect? They wouldn't know.
I don't see how someone could argue that the authors of the AI Impacts survey would be able to infer from the results of just the HLMI question, without comparing it to anything else, whether or not the framing of the question introduced significant bias. They wouldn't know. You have to run the experiment to know; that's the whole point. Benjamin's argument, which I may just be misunderstanding, seems analogous to the argument that a clinical trial of a drug doesn't need a control group because you can tell how effective the drug is just from the experimental group. (Benjamin, what am I missing here?)
That's why I brought up the AI Impacts survey example and the 2023 Forecasting Research Institute survey example: to drive home the point that framing effects, question wording bias, and anchoring effects can be extremely significant, and we don't necessarily know that until we run two versions of the same question. So, I'm glad that you at least agree with the general point that this is an important topic to consider.
I think, unfortunately, it's not a problem that's easily or quickly resolved, but will most likely involve a lot of reading and writing to get everyone on the same page about some core concepts. I've tried to do a little bit of this work already in posts like this one, but that's just a tiny step in the right direction. Concepts like data efficiency, generalization, continual learning, and fluid intelligence are helpful and much under-discussed. Open technical challenges like learning efficiently from video data (a topic the AI researcher Yann LeCun has talked a lot about) and complex, long-term hierarchical planning (a longstanding problem in reinforcement learning) are also helpful for understanding what the disagreements are about and are also much under-discussed.
One of the distinctions that seems to be causing trouble is understanding intelligence as the ability to complete tasks vs. intelligence as the ability to learn to complete tasks.
Another problem is people interpreting (sometimes despite instructions or despite what's stipulated in the scenario) an AI system's ability to complete a task in a minimal, technical sense vs. in a robust, meaningful sense, e.g., an LLM writing a terrible, incoherent novel that nobody reads or likes vs. a good, commercially successful, critically well-received novel (or a novel at that quality level).
A third problem is (again, sometimes despite warnings or qualifications that were meant to forestall this) around reliability: the distinction between an AI system being able to successfully complete a task sometimes, e.g., 50% or 80% or 95% of the time, vs. being able to successfully complete it at the same rate as humans, e.g. 99.9% or 99.999% of the time.
I suspect, but don't know, that another interpretive difficulty for scenarios like the ones in your survey is around people filling in the gaps (or not). If we say in a scenario that an AI system can do the five things we describe, like make a good song, write a good novel, load a dishwasher, and so on, some people will interpret that to mean the AI system can only do those five things. Other people will interpret these tasks as just representative of the overall set of tasks the AI system can do, such that there are a hundred or a thousand or a million other things it can do, and these are just a few examples.
A little discouragingly, similar problems have persisted in discussions around philosophy of mind, cognitive science, and AI for decades (for example, in debates around the Turing test), despite the masterful interventions of brilliant writers who have tried to clear up the ambiguity and confusion (e.g., the philosopher Daniel Dennett's wonderful essay on the Turing test, "Can machines think?", in the anthology Brainchildren).