Scoring forecasts from the 2016 “Expert Survey on Progress in AI”
Summary
This document looks at the predictions made by AI experts in The 2016 Expert Survey on Progress in AI, analyses the predictions on ‘Narrow tasks’, and gives a Brier score to the median of the experts’ predictions.
My analysis suggests that the experts did a fairly good job of forecasting (Brier score = 0.21), and that they would have been less accurate if they had predicted each development in AI to come generally 1.5 times later (Brier score = 0.26) or 1.5 times sooner (Brier score = 0.29) than they actually predicted.
I judge that the experts expected 9 milestones to have happened by now—and that 10 milestones have now happened.
But there are important caveats to this, such as:
I have only analysed whether milestones have been publicly met. AI labs may have achieved more milestones in private this year without disclosing them. This means my analysis of how many milestones have been met is probably conservative.
I have taken the point probabilities given, rather than estimating probability distributions for each milestone, meaning I often round down, which skews the expert forecasts towards being more conservative and unfairly penalises their forecasts for low precision.
It’s not apparent that forecasting accuracy on these nearer-term questions is very predictive of forecasting accuracy on the longer-term questions.
My judgements regarding which forecasting questions have resolved positively vs negatively were somewhat subjective (justifications for each question in the separate appendix).
This is a blog post, not a research report, meaning it was produced quickly and is not to Rethink Priorities’ typical standards of substantiveness and careful checking for accuracy.
This post was edited on 3 March to reflect my updated view on two of the milestones (‘Starcraft’ and ‘Explain’). This has not changed the conclusion, although it weakens the experts’ forecasts somewhat (from a Brier score of 0.19 to 0.21).
Introduction
In 2016, AI Impacts published The Expert Survey on Progress in AI: a survey of machine learning researchers, asking for their predictions about when various AI developments will occur. The results have been used to inform general and expert opinions on AI timelines.
The survey largely focused on timelines for general/human-level artificial intelligence (median forecast of 2056). However, included in this survey were a collection of questions about shorter-term milestones in AI. Some of these forecasts are now resolvable. Measuring how accurate these shorter-term forecasts have been is probably somewhat informative of how accurate the longer-term forecasts are. More broadly, the accuracy of these shorter-term forecasts seems somewhat informative of how accurate ML researchers’ views are in general. So, how have the experts done so far?
Findings
I analysed the 32 ‘Narrow tasks’ about which the following question was asked:
How many years until you think the following AI tasks will be feasible with:
a small chance (10%)?
an even chance (50%)?
a high chance (90%)?
Let a task be ‘feasible’ if one of the best resourced labs could implement it in less than a year if they chose to. Ignore the question of whether they would choose to.[1]
I interpret a milestone as ‘feasible’ by a given date if, within one year after that date, AI models had passed it and this was disclosed publicly. Since it is now (February 2023) 6.5 years since this survey, I am therefore looking at any forecasts for events happening within 5.5 years of the survey.
Across these milestones, I judge that 10 have now happened and 22 have not happened. My 90% confidence interval is that 7-15 of them have now happened. A full description of milestones, and justification of my judgments, are in the appendix (separate doc).
The experts forecast that:
4 milestones had a <10% chance of happening by now,
20 had a 10-49% chance,
7 had a 50-89% chance,
1 had a >90% chance.
So they expected 6-17 of these milestones to have happened by now. By eyeballing the forecasts for each milestone, my estimate is that they expected ~9 to have happened.[2] I did not estimate the implied probability distributions for each milestone, which would make this more accurate.
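That eyeball estimate is just a probability-weighted count. A minimal sketch of the arithmetic (the bucket probabilities are the ones used in footnote [2]; the variable names are mine):

```python
# Number of milestones assigned each (eyeballed) probability of having
# happened by now, per footnote [2].
buckets = {0.05: 4, 0.15: 14, 0.30: 6, 0.55: 5, 0.80: 2, 0.90: 1}

# Expected number of milestones resolved true, under those probabilities.
expected = sum(p * n for p, n in buckets.items())
print(round(expected, 2))  # 9.35
```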
Using the 10, 50, and 90% point probabilities, we get the following calibration curve:
But, firstly, the dataset here is small (there are 7 data points at the 50% mark and only 1 at the 90% mark). Secondly, my methodology for this graph, and in the below Brier calculations, is based on rounding down to the nearest given forecast. For example, if a 10% chance was given at 3 years, and a 50% chance at 10 years, the forecast was taken to be 10%, rather than estimating a full probability distribution and finding the 5.5 years point. This skews the expert forecasts towards being more conservative and unfairly penalises a lack of precision.
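Concretely, the rounding-down rule can be sketched like this (a minimal illustration; the function name and the explicit 5.5-year horizon are my framing of the method described above):

```python
def point_forecast(years_10, years_50, years_90, horizon=5.5):
    """Round down to the nearest given forecast (or up to the 10% mark)."""
    if horizon >= years_90:
        return 0.90
    if horizon >= years_50:
        return 0.50
    # Below the 50% year (including below the 10% year), use 10%.
    return 0.10

# The example above: a 10% chance at 3 years, 50% at 10 years, 90% at 20.
print(point_forecast(3, 10, 20))  # 0.1
```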
Brier scores
Overall, across every forecast made, the experts come out with a Brier score of 0.21.[3] The score breakdown and explanation of the method is here.[4]
For reference, a lower Brier score is better: 0 would mean perfect confidence in everything that eventually happened, a 50% hedge on every question would score 0.25, and guessing a uniformly random probability for every question would yield an expected score of 0.33.[5] As the experts were asked to give 10%, 50%, and 90% forecasts, they would have averaged a score of 0.36 by giving completely random answers.[6]
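For concreteness, the Brier score and the random-answer baseline from footnote [6] can be computed like this (a sketch; the toy forecast list just pairs each of the 10/50/90% answers with both possible outcomes):

```python
def brier(forecasts, outcomes):
    """Mean squared error between probabilistic forecasts and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Random answers at the 10/50/90% marks, each correct half the time:
baseline = brier([0.1, 0.1, 0.5, 0.5, 0.9, 0.9], [1, 0, 1, 0, 1, 0])
print(round(baseline, 2))  # 0.36
```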
Also interesting is the Brier score relative to others who forecast the same events. We don’t have that when looking at the median of our experts—but we could simulate a few other versions:
Bearish[7] - if the experts all thought each milestone would take 1.5 times longer than they actually thought, they would’ve gotten a Brier score of 0.26.[8]
Slightly Bearish—if the experts all thought each milestone would take 1.2 times longer than they actually thought, they would’ve gotten a Brier score of 0.25.
Actual forecasts—a Brier score of 0.21.
Slightly Bullish—if the experts all thought each milestone would take 1.2 times less than they actually thought, they would’ve gotten a Brier score of 0.24.
Bullish—if the experts all thought each milestone would take 1.5 times less than they actually thought, they would’ve gotten a Brier score of 0.29.
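The bearish/bullish variants above just rescale each milestone’s forecast horizons before applying the same rounding rule. A sketch (the function name is mine, and this is an illustration of the idea rather than the exact spreadsheet method):

```python
def scaled_point_forecast(y10, y50, y90, factor, horizon=5.5):
    """Scale the 10/50/90% forecast years by `factor` (>1 = bearish,
    <1 = bullish), then round down to the nearest given forecast."""
    y10, y50, y90 = y10 * factor, y50 * factor, y90 * factor
    if horizon >= y90:
        return 0.90
    if horizon >= y50:
        return 0.50
    return 0.10

# 'Win World Series of Poker' (10/50/90% at 1 / 3 / 5.5 years):
print(scaled_point_forecast(1, 3, 5.5, 1.0))  # 0.9 (actual forecast)
print(scaled_point_forecast(1, 3, 5.5, 1.5))  # 0.5 (bearish version)
```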
So, the experts were in general pretty accurate, and would have been less so if they had been uniformly more or less bullish on the speed of AI development (with the same relative expectations between each milestone). Taken together, I think this should slightly update us towards the expert forecasts being useful in as yet unresolved cases, and away from the usefulness of estimates which fall more than 1.5 times further out or closer in than the expert forecasts.
Randomised—if the experts’ forecast for each specific milestone were randomly assigned to any forecasted date for a different milestone in the collection, they would’ve gotten a Brier score of 0.31 (in the random assignment I received from a random number generator).
I think this should update us slightly towards the surveyed experts generally being accurate on which areas of AI would progress fastest. My assessment is that, compared to the experts’ predictions, AI has progressed more quickly in text generation and coding and more slowly in game playing and robotics. It is not clear now whether this trend will continue, or whether other areas in AI will unexpectedly progress more quickly in the next 5 year period.
Summary of milestones and forecasts
In the below table, the numbers in the cells are the median expert response to “Years after the (2016) survey for which there is a 10, 50 and 90% probability of the milestone being feasible”. The final column is my judgement of whether the milestone was in fact feasible after 5.5 years. Orange shading shows forecasts falling within the 5.5 years between the survey and today. A full description of milestones, and justification of my judgments, are in the appendix.
| Milestone / Confidence of AI reaching the milestone within X years | 10 percent | 50 percent | 90 percent | True by Feb 2023? (5.5 + 1 years) |
| --- | --- | --- | --- | --- |
Translate a new-to-humanity language | 10 | 20 | 50 | FALSE |
Translate a new-to-it language | 5 | 10 | 15 | FALSE |
Translate as well as bilingual humans | 3 | 7 | 15 | FALSE |
Phone bank as well as humans | 3 | 6 | 10 | FALSE |
Correctly group unseen objects | 2 | 4.5 | 6.5 | TRUE |
One-shot image labeling | 4.5 | 8 | 20 | FALSE |
Generate video from a photograph | 5 | 10 | 20 | TRUE |
Transcribe as well as humans | 5 | 10 | 20 | TRUE |
Read aloud better than humans | 5 | 10 | 15 | FALSE |
Prove and generate top theorems | 10 | 50 | 90 | FALSE |
Win Putnam competition | 15 | 35 | 55 | FALSE |
Win Go with less gametime | 3.5 | 8.5 | 19.5 | FALSE |
Win Starcraft | 2 | 5 | 10 | FALSE |
Win any random computer game | 5 | 10 | 15 | FALSE |
Win angry birds | 2 | 4 | 6 | FALSE |
Beat professionals at all Atari games | 5 | 10 | 15 | FALSE |
Win Atari with 20 minutes training | 2 | 5 | 10 | FALSE |
Fold laundry as well as humans | 2 | 5.5 | 10 | FALSE |
Beat a human in a 5km race | 5 | 10 | 20 | FALSE |
Assemble any LEGO | 5 | 10 | 15 | FALSE |
Efficiently sort very large lists | 3 | 5 | 10 | TRUE |
Write good Python code | 3 | 10 | 20 | TRUE |
Answers factoids better than experts | 3 | 5 | 10 | TRUE |
Answer open-ended questions well | 5 | 10 | 15 | TRUE |
Answer unanswered questions well | 4 | 10 | 17.5 | TRUE |
High marks for a high school essay | 2 | 7 | 15 | FALSE |
Create a top forty song | 5 | 10 | 20 | FALSE |
Produce a Taylor Swift song | 5 | 10 | 20 | FALSE |
Write a NYT bestseller | 10 | 30 | 50 | FALSE |
Concisely explain its game play | 5 | 10 | 15 | TRUE |
Win World Series of Poker | 1 | 3 | 5.5 | TRUE |
Output laws of physics of virtual world | 5 | 10 | 20 | FALSE |
Caveats:
My judgements of which forecasts have turned out true or false are a little subjective. This was made harder by the survey question asking which tasks were ‘feasible’, where feasible meant ‘if one of the best resourced labs could implement it in less than a year if they chose to. Ignore the question of whether they would choose to.’ I have interpreted this as, one year after the forecasted date, have AI labs achieved these milestones, and disclosed this publicly?
Given (a) ‘has happened’ implies ‘feasible’, but ‘feasible’ does not imply ‘has happened’ and (b) labs may have achieved some of these milestones but not disclosed it, I am probably being conservative in the overall number of tasks which have been completed by labs. I have not attempted to offset this conservatism by using my judgement of what labs can probably achieve in private. If you disagree or have insider knowledge of capabilities, you may be interested in editing my working here. Please reach out if you want an explanation of the method, or to privately share updates—patrick at rethinkpriorities dot org.
It’s not obvious that forecasting accuracy on these nearer-term questions is very predictive of forecasting accuracy on the longer-term questions. Dillon (2021) notes “There is some evidence that forecasting skill generalises across topics (see Superforecasting, Tetlock, 2015 and for a brief overview see here) and this might inform a prior that good forecasters in the short term will also be good over the long term, but there may be specific adjustments which are worth emphasising when forecasting in different temporal domains.” I have not found any evidence either way on whether good forecasters in the short term will also be good over the long term, but this does seem possible to analyse from the data that Dillon and niplav collect.[9]
Finally, there are caveats in the original survey worth noting here, too. For example, how the question is framed makes a difference to forecasts, even when the meaning is the same. To illustrate this, the authors note
“People consistently give later forecasts if you ask them for the probability in N years instead of the year that the probability is M. We saw this in the straightforward HLMI (high-level machine intelligence) question and most of the tasks and occupations, and also in most of these things when we tested them on mturk people earlier. For HLMI for instance, if you ask when there will be a 50% chance of HLMI you get a median answer of 40 years, yet if you ask what the probability of HLMI is in 40 years, you get a median answer of 30%.”
This is commonly true of the ‘Narrow tasks’ forecasts (although I disagree with the authors that it is consistently so).[10] For example, when asked when there is a 50% chance AI can write a top forty hit, respondents gave a median of 10 years. Yet when asked about the probability of this milestone being reached in 10 years, respondents gave a median of 27.5%.
What does this all mean for us?
Maybe not a huge amount at this point. It is probably a little too early to get a good picture of the experts’ accuracy, and there are a few important caveats. But this should update you slightly towards the experts’ timelines if you were sceptical of their forecasts. Within another five years, we will have ~twice the data and a good sense of how the experts performed across their 50% estimates.
It is also limiting to have only one comprehensive survey of AI experts which includes both long-term and shorter-term timelines. What would be excellent for assessing accuracy is detailed forecasts from various different groups, including political pundits, technical experts, and professional forecasters, so that we could compare accuracy between groups. It would also be easier to analyse forecasting accuracy for questions about which developments have happened, rather than which developments are feasible. And we could try closer to home: maybe the average EA would be better at forecasting developments than the average AI expert. It seems worth testing now to give us some more data in ten years!
Acknowledgements
I’m grateful to Alex Lintz, Amanda El-Dakhakhni, Ben Cottier, Charlie Harrison, Oliver Guest, Michael Aird, Rick Korzekwa, Scott Alexander, and Zach Stein-Perlman for comments on an earlier draft.
If you are interested in RP’s work, please visit our research database and subscribe to our newsletter.
Cross-posted to AI Impacts, LessWrong and this google doc.
- ^
I only analysed this ‘fixed probabilities’ question and not the alternative ‘fixed years’ question, which asked:
“How likely do you think it is that the following AI tasks will be feasible within the next:
- 10 years?
- 20 years?
- 50 years?”
We are not yet at any of these dates, so the analysis would be much more unclear.
- ^
9 ≈ 4*5% + 14*15% + 6*30% + 5*55% + 2*80% + 1*90%
- ^
A precise number as a Brier score does not imply an accurate assessment of forecasting ability—ideally, we could work with a larger dataset (i.e. more surveys, with more questions) to get more accuracy.
- ^
My methodology for the Brier score calculations is based on rounding down to the nearest given forecast, or rounding up to the 10% mark. For example, if a 10% chance was given at 3 years, and a 50% chance at 10 years, the forecast was taken to be 10%, rather than estimating a full probability distribution and finding the 5.5 years point. This skews the expert forecasts towards being more conservative and unfairly penalises them. Separately, if the experts gave a 10% chance of X happening in 3 years, I didn’t check whether it had happened within 3 years, but instead checked whether it had happened by now. I estimate these two factors together give a roughly 5-10% increase to the Brier score, given most milestones included a probability at the 5 year mark. A better analysis would estimate the probability distributions implied by each 10, 50, 90% point probability, then assess the probability implied at 5.5 years.
- ^
For more detail, see Brier score—Wikipedia.
- ^
[(1-0.1)^2 + (0-0.1)^2 + (1-0.5)^2 + (0-0.5)^2 + (1-0.9)^2 + (1-0.1)^2] / 6
- ^
By ‘bearish’ and ‘bullish’ I mean expecting AI milestones to be met later or sooner, respectively.
- ^
The score breakdown and method for these calculations is also here.
- ^
This seems valuable, and I’m not sure why it hasn’t been analysed yet.
Somewhat relevant sources:
https://www.lesswrong.com/posts/MquvZCGWyYinsN49c/range-and-forecasting-accuracy
https://www.openphilanthropy.org/research/how-feasible-is-long-range-forecasting/
https://forum.effectivealtruism.org/topics/long-range-forecasting
- ^
I sampled ten forecasts where probabilities were given on a 10 year timescale, and five of them (Subtitles, Transcribe, Top forty, Random game, Explain) gave later forecasts when asked with a ‘probability in N years’ framing rather than a ‘year that the probability is M’ framing, three of them (Video scene, Read aloud, Atari) gave the same forecasts, and two of them (Rosetta, Taylor) gave an earlier forecast. This is why I disagree it leads to consistently later forecasts.
Thank you for doing this! I was working on a similar project and mostly came up with the same headline finding as you: the experts seemed well-calibrated. I did decide a few of the milestones a little differently, and would like to hear why you chose the way you did so I can decide whether or not to change mine.
Zach Stein-Perlman from AI Impacts said that he thought “efficiently sort very large lists” and “write good Python code” were false, because the questions said it had to be done in a certain way by a certain type of neural net, and that wasn’t how it was done.
I was planning to count “transcribe as well as humans” as false, based on https://interactiveaimag.org/columns/is-ai-at-human-parity-yet-a-case-study-on-speech-recognition/ . Maybe the top labs could achieve this with a year of work, but I think the question specifies they need to do as well as the best human transcriptionists, and right now they don’t seem close.
I counted “translate as well as bilingual humans” as true based on a few quick tests of ChatGPT; I’m curious if you have some specific source for why it’s false.
I don’t think AI has won at Starcraft. The last word I’ve heard from this was https://www.extremetech.com/extreme/301325-deepminds-starcraft-ii-ai-can-now-defeat-99-8-percent-of-human-players , where AlphaStar could beat 99.8% of humans but not the absolute champions. I haven’t seen any further progress on this since 2019. Again, it’s possible that a year of concerted effort could change this, but that seems speculative. See also https://www.reddit.com/r/starcraft/comments/uakohx/why_cant_we_make_a_perfect_ai_for_starcraft/
I’m surprised you judged “high marks for a high school essay” as false; this seems like a central use case for ChatGPT and Bing/GPT4.
I was planning to judge “concisely explain game play” as true, based on https://www.forbes.com/sites/carlieporterfield/2022/11/22/metas-ai-gamer-beat-humans-in-diplomacy-using-strategy-and-negotiation/, which is testing basically this skill. Also, I was able to play a partial game of chess with ChatGPT where it explained all its moves—before it started hallucinating and making moves which were impossible. Still, it seemed to have the “explanation” skill down pat! I imagine if you asked it to explain why a chess engine made a given move, it would give a pretty plausible answer.
Beyond those quibbles—I was also looking at https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/#Data (the dataset itself; the summary doesn’t include the milestones). This new version seems like total garbage. The experts continue to predict several of the milestones are five years out, including milestones that were achieved by ChatGPT (ie a few months after the survey) and at least one milestone that had already clearly been achieved by the time the survey was released! Unless there’s some reason to think the new crop of experts is worse than the old one, this makes me think they only did okay last time by luck/coincidence, and actually they have no idea what they’re doing.
(I don’t think it works to say that the period 2017-2022.5 was predictable, but the period 2022.5-2023 wasn’t, because part of what the 2017 experts were right about was ChatGPT, which came out in late 2022.)
This is great—thanks for this comment! I’ve gone through each to explain my reasoning. Your comments/sources changed my opinion on Starcraft and Explain—I’ve updated the post and scores to reflect this, and think the conclusion is now the same but slightly weaker, because the experts’ Brier score is 0.02 worse, but the comparative Brier scores are also worse by a similar amount. There’s also my reasoning for other milestones in the appendix (and I’ve copy-pasted some of them below).
On “efficiently sort very large lists”, I think Zach ended up deferring to me on this. But no one has red-teamed my thinking on this. Here’s my source:
On “write good Python code”, Zach and I (at least initially) disagreed. The criteria specified
Suppose the system is given only:
A specification of what counts as a sorted list
Several examples of lists undergoing sorting by quicksort
The concern was that LLMs like Codex are trained on open source code, including examples of quicksort algorithms. So they are given more than just ‘a specification of what counts’ and ‘examples of lists undergoing sorting’. I’m unsure, but I think ‘given’ in the requirement refers to the goal specification post-training. Otherwise, the wording should be ‘suppose the system is trained only on...’ Therefore, Codex does count as only being given the stated requirements (or less).
For example, the ‘Starcraft’ milestone uses the wording ‘Beat the best human Starcraft 2 players at least 50% of the time, given a video of the screen’. For this milestone, I think the AI can be trained on more than the video of the screen, it just can’t use more than this while playing.
But I could definitely be convinced otherwise on this; I might be making a conceptual misunderstanding. And I haven’t given Zach a chance to review my argument above.
The question specifies the AI needs to do “as well as the typical human”—I agree AI is not quite as good as the best human transcriptionists, but think it is better than the typical human.
OpenAI says Whisper “approaches human level robustness and accuracy on English speech recognition.” More detail is given in page 10 of the OpenAI report, which concludes “Whisper’s English ASR performance is not perfect but very close to human-level accuracy”.
However, this is when compared to “professional human transcribers”. Given Whisper sometimes performs better than professional human transcribers, and always performs at a similar level, it seems very likely that Whisper regularly does “as well as a typical human”. I think Lorenzo’s analysis below seems reasonable too.
I have questionable anecdata! I asked two friends who are fluent in two languages but not skilled in translation, and one who has done a university module in translation, and all three thought they were better than AI at translating web pages (I think all three seem modest in general), although AI does a fine job for most uses.
This article notes the worse performance from AI when interpreting context and cultural nuances and says that “Le Monde explains that “the team (of its English language edition) is composed of eight journalists (…) responsible for selecting articles for translation, editing the translated versions and managing the English home page of the site. The translation is done by international agencies with the help of an artificial intelligence tool. Native English-speaking journalists do the selection and editing of the articles”.”
Not-skilled-in-translation humans are still paid to translate writing between English and Chinese (I know of one organisation that does this and the CEO chooses to use AI to assist but not replace human translators, where the human translators are bilingual students not trained in translation). It does seem possible that this is just business lagging behind though.
I’m convinced by your argument/sources here, and this reddit thread. It does seem open to interpretation whether “the best human players” refers to pros (whom it did beat >50% of the time), or the literal very best player—but I think this should probably be interpreted as the very best. Given it reached a top 20 score but only beat Serral 2 out of 3 times when heavy restrictions were placed on Serral, I’ve updated this to false.
I was also surprised by this. Others disagreed with me and we discussed a bit more in this LW comment thread. Here’s my reasoning:
I couldn’t find this done and think, by now, someone would have submitted a fully ChatGPT-generated high school essay and talked about it publicly if it had gotten high marks. I’ve seen some evidence of cherry-picking paragraphs leading to a mid/low-level mark; e.g. this article describes someone who got a passing mark (53) on a university social policy essay.
This high school teacher judged two ChatGPT-generated history essays as “below average, scoring a 9⁄20 or lower”. This Guardian article says, uncited, that ‘academics have generated responses to exam queries that they say would result in full marks if submitted by an undergraduate’. I think, if this claim were true, there would be more evidence.
Good point! This is an update for me—I wouldn’t have guessed ChatGPT could explain chess moves. The question specifies “For any computer game that can be played well by a machine”. I just played Scrabble with ChatGPT and it explained its moves well. And I asked ChatGPT to imagine it was watching Agent57 play Montezuma’s revenge for 10 seconds and it gave a very clear summary of what it imagined Agent57 was doing and why.
So I’m just about convinced enough that it could do this for all games—and have changed to True.
(Scott is correct that I said—and strongly feel—that LLMs don’t count for the sorting milestones. Patrick’s source for the “sorting large lists” milestone is not an LLM, and Patrick is correct that I later read a draft of this blogpost and deferred to him on whether the “sorting large lists” milestone had been achieved.)
Thank you. I misremembered the transcription question. I now agree with all of your resolutions, with the most remaining uncertainty on translation.
Update: I think Bing passes the high school essay bar, based on the section “B- Essays No More” at https://oneusefulthing.substack.com/p/i-hope-you-werent-getting-too-comfortable
Yeah good find, I also think that passes the bar. Although I do think people have generally overestimated GPT’s essay-writing ability compared to humans, and think I might be falling for that here.
I’m not planning to change the doc because Bing’s AI wasn’t released by Feb 2023, but if you think it should be included (which would be reasonable given OpenAI pretty obviously made this before Feb 2023), it would mean:
Experts expected 9 milestones to be met vs actually 11 milestones
The calibration curve looks four percentage points worse at the 10% mark
Bulls’ Brier score: 0.29
Experts’ Brier score: 0.24
Bears’ Brier score: 0.29
I’ve added it to this tracker of milestones (feel free to request edit access).
I think it’s reasonable to go either way on Starcraft. It’s true that the versions of Alphastar from three years ago were not beating the best humans more than half the time, and they did not take screen pixels as inputs.
But those models were substantially inhibited in their actions per minute, because computers that can beat humans by being fast are boring. Given that the version of Alphastar that beat MaNa was already throttled (albeit not in the right way to play like a human), I don’t see why an AI with no APM restrictions couldn’t beat the best humans. And I don’t see any particular reason you couldn’t train an image classifier to get from screen pixels to Alphastar’s inputs.
So I think this mostly comes down to whether you think a realistic APM limit was implied in the prediction, and what your bar is for “feasible”.
I think the question says:
As a data point, it seems to me that OpenAI’s Whisper large model is probably above typical human transcription quality for standard accents in non-noisy environments. E.g. it transcribes correctly “Hyderabad” from here (while YouTube transcribes it as “hyper bus and”).[1]
For “noisy environments with a variety of accents”, it was surprisingly hard to find a sample. From this, it generates this, which does seem worse than a typical human, so I would also resolve this as “false” if OpenAI’s Whisper is the state of the art, but I wouldn’t say that it doesn’t seem close.
As another data point, for English <-> Italian it’s usually better than me. But it really struggles with things like idioms.
Here’s the full transcription of that talk. (It does transcribe “Jacy” as “JC”, but I still think the typical human would have made more mistakes, or at the very least it does seem close).
I’ve only given the data a quick look and found it hard to analyse—but yeah, many of the forecasts look bad. But some of the medians (I think, from eyeballing the data) seem not terrible—the ‘create top forty song’ median shifted from 10 years to ~5 years, and the ‘answer open-ended questions’ median shifted from 10 years to ~3 years.
But like you say, for many of the milestones I resolved as being met before this survey went out, they still have medians >0 years from now so—if I’m right in my judgements—the experts seem pretty poorly clued up on recent developments across the field.
This is great, thanks for writing it! I’m curating it. I really appreciate the table, the fact that you went back and analyzed the results, the very clear flags about reasons to be skeptical of these conclusions or this methodology, etc.
I’d also highlight this recent post: Why I think it’s important to work on AI forecasting
Also, this is somewhat wild:
Great analysis!
I wonder what would happen if you were to do the same exercise with the fixed-year predictions under a ‘constant risk’ model, i.e.
P(t) = 1 - exp(-l*t)
with l = -log(1 - P(year)) / year
to get around the problem that we’re still 3 years away from 2026. Given that timelines are systematically longer with a fixed-year framing, I would expect the Brier score of those predictions would be worse. OTOH, the constant risk model doesn’t seem very reasonable here, so the results wouldn’t have a straightforward interpretation.

I’d be interested in seeing this too! Although I’m not planning to spend the time to do this soon, I’d be up for having a quick chat if someone else was up for it. The constant risk model seems to me like a better-than-nothing model, so could be a slight update if it gives different results. Or could just use a linear increase of probability from 0 years to 10 years.
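To make the proposal concrete, here is a minimal sketch of the constant-risk extrapolation (the function name and example numbers are mine, not from the survey):

```python
import math

def constant_risk_prob(p_at_year, year, t):
    """P(t) = 1 - exp(-lam * t), with lam chosen so the curve passes
    through the fixed-year forecast P(year) = p_at_year."""
    lam = -math.log(1 - p_at_year) / year
    return 1 - math.exp(-lam * t)

# A 50% forecast at 10 years implies roughly a 32% chance at 5.5 years:
print(round(constant_risk_prob(0.5, 10, 5.5), 2))  # 0.32
```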
FWIW—from this survey alone, I’m not convinced that timelines are systematically longer with a fixed-year framing. I sampled ten forecasts where probabilities were given in both framings on a 10 year timescale, and five of them (Subtitles, Transcribe, Top forty, Random game, Explain) gave later forecasts when asked with ‘probability in N years’ rather than ‘year that the probability is M’, three of them (Video scene, Read aloud, Atari) gave the same forecasts, and two of them (Rosetta, Taylor) gave an earlier forecast. So from this sample, it seems they’re commonly longer.
Important missing context from this is that Brier score = 0.25 is what you would get if you predicted randomly (i.e., put 50% on everything no matter what, or assigned 0% or 100% by coin flip). So that means here that systematically predicting “later” or “sooner” would make you “worse than random” whereas the actual predictions are “better than random” (though not by much).
So I think the main takeaways are: (1) predicting this AI stuff is very hard and (2) we are at least not systematically biased as far as we can tell so far in predicting progress too slow or too fast. (1) is not great but (2) is at least reassuring.
Hmm I disagree on the numbers—have I got something wrong in the below?
If you assigned 0% or 100% by coin flip, you would get a Brier score of 0.5 (half the time you would get 0, half the time you would get 1), and if you assigned a random probability between 0% and 100% for every question, you would get a Brier score of 0.33. If you put 50% on everything you would indeed get 0.25.
As the experts had to give 10%, 50%, and 90% forecasts, if they had done this at random they would have ended up with a score of 0.36 [1].
So I think they—including the bullish and bearish groups—still did a fair bit better than random, which would be 0.36 in this context. And all simulated groups did better than the ‘randomized’ group which got a Brier score of 0.31 in my randomization. This does seem like worthwhile context to add though.
[(1-0.1)^2 + (0-0.1)^2 + (1-0.5)^2 + (0-0.5)^2 + (1-0.9)^2 + (0-0.9)^2] / 6 ≈ 0.36
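The baseline scores discussed in this thread (0.25 for always-50%, 0.5 for all-or-nothing coin flips, 1/3 for uniform random probabilities, and ~0.36 for random 10%/50%/90% forecasts) can be checked with a quick Monte Carlo simulation. The seed and sample size below are arbitrary:

```python
import random

def brier(p, outcome):
    # Squared error between a forecast p and a binary outcome (0 or 1).
    return (p - outcome) ** 2

random.seed(0)
N = 100_000
outcomes = [random.random() < 0.5 for _ in range(N)]  # coin-flip resolutions

# Always predict 50%: expected Brier score 0.25.
b_half = sum(brier(0.5, o) for o in outcomes) / N

# Predict 0% or 100% by coin flip: expected score 0.5.
b_extreme = sum(brier(random.random() < 0.5, o) for o in outcomes) / N

# Uniform random probability on [0, 1]: expected score 1/3.
b_uniform = sum(brier(random.random(), o) for o in outcomes) / N

# Random choice among the survey's 10%/50%/90% forecasts: ~0.36.
b_survey = sum(brier(random.choice([0.1, 0.5, 0.9]), o) for o in outcomes) / N
```

Note this checks only the baselines; it is not the same exercise as the ‘randomized’ group mentioned above, which randomized resolutions against the experts’ actual forecasts.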
No, it was me who got this wrong. Thanks!
Super interesting to see this analysis, especially the table of current capabilities—thank you!
It seems to me that this ends up being more conservative than the original “Ignore the question of whether they would choose to”, which presumably makes the expert forecasts look worse here than they really were.
For example, a task like “win angry birds” seems pretty achievable to me, just that no one’s thinking about angry birds these days so it probably hasn’t been attempted. Does that sound right to you?
I’m curious if you have a rough estimate of how many of these tasks would be achievable within a year if top labs attempted them?
Thanks Adam :)
I have a rough (i.e. considered for <15 minutes) take: if top labs one year ago had attempted these particular milestones, and had the same policies on disclosing capabilities as they currently seem to have, then there’s a 40-50% chance they would have achieved 2 of Angry Birds, Atari fifty, Laundry and Go low by now. But I don’t put much weight on my prediction, whereas I put a lot more weight on my analysis of what has happened (though this is also somewhat subjective!).
I agree though that checking what has actually happened ends up being more conservative than the original “Ignore the question of whether they would choose to”, which makes the expert forecasts look worse here than they really were. This is a weakness of this analysis, and of the resolvability of the original survey.
Do you have an estimate of how many of the tasks would have been achieved by now if labs tried a year ago?
Could anyone explain why the experts think Angry Birds will be so hard? It seems like absolutely ideal conditions for reinforcement learning, in the sense that inputs are very simple and there is a very straightforward way of determining how successful each shot is. Is the limitation that it has to be an Artificial Intelligence which succeeds at the problem, rather than a dumb reinforcement algorithm which happens to be really well suited for the task?
The experts thought beating humans at Angry Birds would be relatively easy—they put a 90% chance of it being feasible by now. The surprise is that it has not been done—although I think this is mostly explained by no labs seriously trying it.
Yeah I think an AI researcher working for a serious company who solved Angry Birds might be fired for timewasting :D
Ilya Sutskever, co-founder of OpenAI, who understands deep learning as well as anyone, said “I don’t know” on what the future of AI would look like. So if he doesn’t know, nobody does. Just some food for thought.