Thank you for doing this! I was working on a similar project and mostly came up with the same headline finding as you: the experts seemed well-calibrated. I did decide a few of the milestones a little differently, and would like to hear why you chose the way you did so I can decide whether or not to change mine.
Zach Stein-Perlman from AI Impacts said that he thought “efficiently sort very large lists” and “write good Python code” were false, because the questions said it had to be done in a certain way by a certain type of neural net, and that wasn’t how it was done.
I was planning to count “transcribe as well as humans” as false, based on https://interactiveaimag.org/columns/is-ai-at-human-parity-yet-a-case-study-on-speech-recognition/ . Maybe the top labs could achieve this with a year of work, but I think the question specifies they need to do as well as the best human transcriptionists, and right now they don’t seem close.
I counted “translate as well as bilingual humans” as true based on a few quick tests of ChatGPT; I’m curious if you have some specific source for why it’s false.
I don’t think AI has won at Starcraft. The last word I’ve heard on this was https://www.extremetech.com/extreme/301325-deepminds-starcraft-ii-ai-can-now-defeat-99-8-percent-of-human-players , where AlphaStar could beat 99.8% of humans but not the absolute champions. I haven’t seen any further progress on this since 2019. Again, it’s possible that a year of concerted effort could change this, but that seems speculative. See also https://www.reddit.com/r/starcraft/comments/uakohx/why_cant_we_make_a_perfect_ai_for_starcraft/
I’m surprised you judged “high marks for a high school essay” as false; this seems like a central use case for ChatGPT and Bing/GPT4.
I was planning to judge “concisely explain game play” as true, based on https://www.forbes.com/sites/carlieporterfield/2022/11/22/metas-ai-gamer-beat-humans-in-diplomacy-using-strategy-and-negotiation/, which is testing basically this skill. Also, I was able to play a partial game of chess with ChatGPT where it explained all its moves—before it started hallucinating and making moves which were impossible. Still, it seemed to have the “explanation” skill down pat! I imagine if you asked it to explain why a chess engine made a given move, it would give a pretty plausible answer.
Beyond those quibbles—I was also looking at https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/#Data (the dataset itself; the summary doesn’t include the milestones). This new version seems like total garbage. The experts continue to predict several of the milestones are five years out, including milestones that were achieved by ChatGPT (ie a few months after the survey) and at least one milestone that had already clearly been achieved by the time the survey was released! Unless there’s some reason to think the new crop of experts is worse than the old one, this makes me think they only did okay last time by luck/coincidence, and actually they have no idea what they’re doing.
(I don’t think it works to say that the period 2017-2022.5 was predictable, but the period 2022.5-2023 wasn’t, because part of what the 2017 experts were right about was ChatGPT, which came out in late 2022.)
This is great—thanks for this comment! I’ve gone through each point to explain my reasoning. Your comments/sources changed my opinion on Starcraft and Explain—I’ve updated the post and scores to reflect this, and think the conclusion is now the same but slightly weaker, because the experts’ Brier score is 0.2 points worse, but the comparative Brier scores are also worse by a similar amount. There’s also my reasoning for other milestones in the appendix (and I’ve copy-pasted some of them below).
Zach Stein-Perlman from AI Impacts said that he thought “efficiently sort very large lists” and “write good Python code” were false, because the questions said it had to be done in a certain way by a certain type of neural net, and that wasn’t how it was done.
On “efficiently sort very large lists”, I think Zach ended up deferring to me, but no one has red-teamed my thinking on this. Here’s my source:
“We compared this algorithm against common sorting approaches and measured its performance for up to 1 billion normally-distributed double-precision keys. The results show that our approach yields an average 3.38x performance improvement over C++ STL sort, which is an optimized Quicksort hybrid, 1.49x improvement over sequential Radix Sort, and 5.54x improvement over a C++ implementation of Timsort, which is the default sorting function for Java and Python.”
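For context on what a comparison like that involves, here is a toy sketch of the benchmark setup (mine, not the paper’s code): it generates normally-distributed double-precision keys and reports the speedup of one sort over another, at a far smaller scale than the paper’s 1 billion keys.

```python
import time
import numpy as np

def time_sort(label, sort_fn, data):
    """Time a sorting callable on a fresh copy of the data."""
    arr = data.copy()
    start = time.perf_counter()
    sort_fn(arr)
    elapsed = time.perf_counter() - start
    print(f"{label:>10}: {elapsed:.3f}s for {len(arr):,} keys")
    return elapsed

# Normally-distributed double-precision keys, as in the quoted benchmark,
# scaled down from 1 billion so this toy example runs quickly.
rng = np.random.default_rng(0)
keys = rng.normal(size=5_000_000)

t_timsort = time_sort("Timsort", lambda a: sorted(a.tolist()), keys)
t_quick = time_sort("quicksort", lambda a: np.sort(a, kind="quicksort"), keys)

print(f"Speedup of NumPy quicksort over Python's Timsort: {t_timsort / t_quick:.2f}x")
```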
On “write good Python code”, Zach and I (at least initially) disagreed. The criteria specified:
Suppose the system is given only:
A specification of what counts as a sorted list
Several examples of lists undergoing sorting by quicksort
The concern was that LLMs like Codex are trained on open source code, including examples of quicksort algorithms. So they are given more than just ‘a specification of what counts’ and ‘examples of lists undergoing sorting’. I’m unsure, but I think ‘given’ in the requirement refers to the goal specification post-training. Otherwise, the wording would be ‘suppose the system is trained only on...’. On that reading, Codex does count as only being given the stated requirements (or less).
For example, the ‘Starcraft’ milestone uses the wording ‘Beat the best human Starcraft 2 players at least 50% of the time, given a video of the screen’. For this milestone, I think the AI can be trained on more than the video of the screen, it just can’t use more than this while playing.
But I could definitely be convinced otherwise on this; I might be making a conceptual misunderstanding. And I haven’t given Zach a chance to review my argument above.
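To make the milestone’s setup concrete, here is a hedged sketch of what ‘given only a specification and examples’ could look like, together with the kind of check you might run on whatever function a model returns. The prompt wording, example lists, and candidate implementation are illustrative only; none of this is taken from the survey or from actual Codex output.

```python
import random

# Hypothetical inputs of the kind the milestone describes.
SPECIFICATION = (
    "A list xs is sorted if xs[i] <= xs[i+1] for every adjacent pair, "
    "and the output must be a permutation of the input."
)
EXAMPLES = [
    ([3, 1, 2], [1, 2, 3]),
    ([5, 5, 0, -2], [-2, 0, 5, 5]),
    ([], []),
]

def candidate_quicksort(xs):
    """Stand-in for the code a model might produce from the spec and examples."""
    if len(xs) <= 1:
        return list(xs)
    pivot, rest = xs[0], xs[1:]
    left = [x for x in rest if x <= pivot]
    right = [x for x in rest if x > pivot]
    return candidate_quicksort(left) + [pivot] + candidate_quicksort(right)

def satisfies_spec(sort_fn, trials=100):
    """Check a returned function against the worked examples and random lists."""
    for inp, expected in EXAMPLES:
        if sort_fn(inp) != expected:
            return False
    for _ in range(trials):
        xs = [random.randint(-1000, 1000) for _ in range(random.randint(0, 50))]
        # Comparing against sorted(xs) checks sortedness and permutation at once.
        if sort_fn(xs) != sorted(xs):
            return False
    return True

print(satisfies_spec(candidate_quicksort))  # True for this stand-in implementation
```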
I was planning to count “transcribe as well as humans” as false, based on https://interactiveaimag.org/columns/is-ai-at-human-parity-yet-a-case-study-on-speech-recognition/ . Maybe the top labs could achieve this with a year of work, but I think the question specifies they need to do as well as the best human transcriptionists, and right now they don’t seem close.
The question specifies the AI needs to do “as well as the typical human”—I agree AI is not quite as good as the best human transcriptionists, but think it is better than the typical human.
OpenAI says Whisper “approaches human level robustness and accuracy on English speech recognition.” More detail is given on page 10 of the OpenAI report, which concludes “Whisper’s English ASR performance is not perfect but very close to human-level accuracy”.
However, this is when compared to “professional human transcribers”. Given Whisper sometimes performs better than professional human transcribers, and always performs to a similar level, it seems very likely that Whisper regularly does “as well as a typical human”. I think Lorenzo’s analysis below seems reasonable too.
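As a concrete way to probe the “typical human” bar, here is a hedged sketch using the open-source openai-whisper package and the jiwer library to compare a Whisper transcript against a human reference by word error rate. The audio path and reference text are placeholders, and a fair test would score the model and a typical human against the same carefully checked ground truth.

```python
# pip install openai-whisper jiwer
import whisper
import jiwer

AUDIO_PATH = "accented_speech_noisy.wav"  # placeholder: clip with accents / background noise
HUMAN_REFERENCE = "placeholder transcript typed by a typical human listener"

# "large" is Whisper's most accurate released checkpoint; smaller ones trade accuracy for speed.
model = whisper.load_model("large")
result = model.transcribe(AUDIO_PATH)
machine_transcript = result["text"]

# Word error rate: lower is better.
print("Whisper output:", machine_transcript)
print("WER vs reference:", jiwer.wer(HUMAN_REFERENCE, machine_transcript))
```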
I counted “translate as well as bilingual humans” as true based on a few quick tests of ChatGPT; I’m curious if you have some specific source for why it’s false.
I have questionable anecdata! I asked two friends who are fluent in two languages but not skilled in translation, and one who has done a university module in translation, and all three thought they were better than AI at translating web pages (I think all three seem modest in general), although AI does a fine job for most uses.
This article notes AI’s worse performance when interpreting context and cultural nuances, and says that “Le Monde explains that “the team (of its English language edition) is composed of eight journalists (…) responsible for selecting articles for translation, editing the translated versions and managing the English home page of the site. The translation is done by international agencies with the help of an artificial intelligence tool. Native English-speaking journalists do the selection and editing of the articles”.”
Not-skilled-in-translation humans are still paid to translate writing between English and Chinese (I know of one organisation that does this and the CEO chooses to use AI to assist but not replace human translators, where the human translators are bilingual students not trained in translation). It does seem possible that this is just business lagging behind though.
I don’t think AI has won at Starcraft. The last word I’ve heard on this was https://www.extremetech.com/extreme/301325-deepminds-starcraft-ii-ai-can-now-defeat-99-8-percent-of-human-players , where AlphaStar could beat 99.8% of humans but not the absolute champions. I haven’t seen any further progress on this since 2019. Again, it’s possible that a year of concerted effort could change this, but that seems speculative. See also https://www.reddit.com/r/starcraft/comments/uakohx/why_cant_we_make_a_perfect_ai_for_starcraft/
I’m convinced by your argument/sources here, and this reddit thread. It does seem open to interpretation whether “the best human players” refers to pros (which it did beat >50% of the time), or the literal very best player—but I think this should probably be interpreted as the very best. Given it reached a top 20 score but only beat Serral 2⁄3 times when heavy restrictions were placed on Serral, I’ve updated this to false.
I’m surprised you judged “high marks for a high school essay” as false; this seems like a central use case for ChatGPT and Bing/GPT4.
I was also surprised by this. Others disagreed with me and we discussed a bit more in this LW comment thread. Here’s my reasoning:
I couldn’t find an example of this being done, and think that, by now, someone would have submitted a fully ChatGPT-generated high school essay and talked about it publicly if it had gotten high marks. I’ve seen some evidence of cherry-picking paragraphs leading to a mid/low-level mark, e.g. this article describes someone who got a passing mark (53) on a university social policy essay.
This high school teacher judged two ChatGPT-generated history essays as “below average, scoring a 9⁄20 or lower”. This Guardian article claims, without citing a source, that ‘academics have generated responses to exam queries that they say would result in full marks if submitted by an undergraduate’. I think, if this claim were true, there would be more evidence.
I was planning to judge “concisely explain game play” as true, based on https://www.forbes.com/sites/carlieporterfield/2022/11/22/metas-ai-gamer-beat-humans-in-diplomacy-using-strategy-and-negotiation/, which is testing basically this skill. Also, I was able to play a partial game of chess with ChatGPT where it explained all its moves—before it started hallucinating and making moves which were impossible. Still, it seemed to have the “explanation” skill down pat! I imagine if you asked it to explain why a chess engine made a given move, it would give a pretty plausible answer.
Good point! This is an update for me—I wouldn’t have guessed ChatGPT could explain chess moves. The question specifies “For any computer game that can be played well by a machine”. I just played Scrabble with ChatGPT and it explained its moves well. And I asked ChatGPT to imagine it was watching Agent57 play Montezuma’s revenge for 10 seconds and it gave a very clear summary of what it imagined Agent57 was doing and why.
So I’m just about convinced enough that it could do this for all games—and have changed to True.
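For anyone wanting to run the “explain an engine’s move” test described above, here is a hedged sketch: it gets a move from a UCI chess engine via python-chess and then asks a chat model to explain it. The engine path and model name are placeholders, it assumes a local Stockfish binary and an OpenAI API key, and it is not the setup behind the informal chats described in the thread.

```python
# pip install python-chess openai   (also needs a local UCI engine and an OpenAI API key)
import chess
import chess.engine
from openai import OpenAI

ENGINE_PATH = "/usr/local/bin/stockfish"  # placeholder path to a UCI engine binary
MODEL_NAME = "gpt-4"                      # placeholder chat model name

board = chess.Board()  # starting position; substitute any position of interest
with chess.engine.SimpleEngine.popen_uci(ENGINE_PATH) as engine:
    play_result = engine.play(board, chess.engine.Limit(time=0.5))
engine_move = board.san(play_result.move)

prompt = (
    f"In this chess position (FEN: {board.fen()}), a strong engine chose the move "
    f"{engine_move}. Concisely explain the likely reasoning behind that move."
)

client = OpenAI()
response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```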
(Scott is correct that I said—and strongly feel—that LLMs don’t count for the sorting milestones. Patrick’s source for the “sorting large lists” milestone is not an LLM, and Patrick is correct that I later read a draft of this blogpost and deferred to him on whether the “sorting large lists” milestone had been achieved.)
Thank you. I misremembered the transcription question. I now agree with all of your resolutions, with the most remaining uncertainty on translation.
Update: I think Bing passes the high school essay bar, based on the section “B- Essays No More” at https://oneusefulthing.substack.com/p/i-hope-you-werent-getting-too-comfortable
Yeah, good find; I also think that passes the bar. Although I do think people have generally overestimated GPT’s essay-writing ability compared to humans, and think I might be falling for that here.
I’m not planning to change the doc because Bing’s AI wasn’t released by Feb 23, but if you think it should be included (which would be reasonable given OpenAI pretty obviously made this before Feb 23), it would mean:
Experts expected 9 milestones to be met vs 11 actually met
The calibration curve looks four percentage points worse at the 10% mark
Bulls’ Brier score: 0.29
Experts’ Brier score: 0.24
Bears’ Brier score: 0.29
I’ve added it to this tracker of milestones (feel free to request edit access).
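For readers unfamiliar with the metric used above: the Brier score is the mean squared difference between each forecast probability and the 0/1 outcome, so lower is better and forecasting 50% on everything scores 0.25. A minimal sketch with made-up numbers (not the survey’s actual forecasts):

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Made-up illustration: five milestones, forecast probabilities of being met, actual outcomes.
expert_forecasts = [0.7, 0.2, 0.9, 0.4, 0.1]
outcomes = [1, 0, 1, 1, 0]

print(brier_score(expert_forecasts, outcomes))       # 0.102 on these made-up numbers
print(brier_score([0.5] * len(outcomes), outcomes))  # 0.25: always guessing 50%
```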
I think it’s reasonable to go either way on Starcraft. It’s true that the versions of AlphaStar from three years ago were not beating the best humans more than half the time, and they did not take screen pixels as inputs.
But those models were substantially inhibited in their actions per minute, because computers that can beat humans by being fast are boring. Given that the version of AlphaStar that beat MaNa was already throttled (albeit not in the right way to play like a human), I don’t see why an AI with no APM restrictions couldn’t beat the best humans. And I don’t see any particular reason you couldn’t train an image classifier to get from screen pixels to AlphaStar’s inputs.
So I think this mostly comes down to whether you think the prediction implied a realistic APM limit, and what your bar is for “feasible”.
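To be concrete about what an APM restriction means here, below is a toy sketch of throttling an agent so its rolling actions-per-minute stays under a human-like cap. It is only an illustration; DeepMind’s actual AlphaStar constraints were reportedly more detailed than a single rolling cap.

```python
import time
from collections import deque

class APMLimiter:
    """Allow an action only if it keeps the rolling actions-per-minute under a cap."""

    def __init__(self, max_apm=300):
        self.max_apm = max_apm
        self.timestamps = deque()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop actions that fell out of the trailing 60-second window.
        while self.timestamps and now - self.timestamps[0] > 60.0:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_apm:
            self.timestamps.append(now)
            return True
        return False  # the agent must wait (or issue a no-op) this tick

# Usage: the agent checks the limiter before issuing each game action.
limiter = APMLimiter(max_apm=300)
if limiter.allow():
    pass  # send the action to the game here
```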
I think the question says:
Transcribe human speech with a variety of accents in a noisy environment as well as a typical human can.
As a data point, it seems to me that OpenAI’s Whisper large model is probably above typical human transcription quality for standard accents in non-noisy environments. E.g. it correctly transcribes “Hyderabad” from here (while YouTube transcribes it as “hyper bus and”).[1]
For “noisy environments with a variety of accents”, it was surprisingly hard to find a sample. From this, it generates this, which does seem worse than a typical human, so I would also resolve this as “false” if OpenAI’s Whisper is the state of the art, but I wouldn’t say that it doesn’t seem close.
I counted “translate as well as bilingual humans” as true based on a few quick tests of ChatGPT; I’m curious if you have some specific source for why it’s false.
As another data point, for English <-> Italian it’s usually better than me. But it really struggles with things like idioms.
Here’s the full transcription of that talk. (It does transcribe “Jacy” as “JC”, but I still think the typical human would have made more mistakes, or at the very least it does seem close).
I was also looking at https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/#Data (the dataset itself; the summary doesn’t include the milestones). This new version seems like total garbage. The experts continue to predict several of the milestones are five years out, including milestones that were achieved by ChatGPT (ie a few months after the survey) and at least one milestone that had already clearly been achieved by the time the survey was released!
I’ve only given the data a quick look and found it hard to analyse—but yeah, many of the forecasts look bad. But some of the medians (I think, from eyeballing the data) seem not terrible—the ‘create top forty song’ milestone shifted from a 10-year median to a ~5-year median, and ‘answer open-ended questions’ shifted from a 10-year median to ~3 years.
But like you say, many of the milestones I resolved as met before this survey went out still have medians >0 years from now, so—if I’m right in my judgements—the experts seem pretty poorly clued up on recent developments across the field.