I think this announcement should make people think near term AGI, and thus AIXR, is less likely. To me this is what a relatively continuous takeoff world looks like, if there’s a takeoff at all. If Google had announced and proved a massive leap forward, then people would have shrunk their timelines even further. So why, given this was a PR-fueled disappointment, should we not update in the opposite direction?
[...] Gemini release is disappointing. Below many people’s expectations of its performance. Should downgrade future expectations. Near term AGI takeoff v unlikely. Update downwards on AI risk (YMMV).
I think the update here should be pretty small. I’m unsure if you disagree. I would also think the update should be pretty small if Gemini is notably better than GPT4, but not wildly better. It seems plausible to me that people would (incorrectly) have a large update toward shorter timelines if Gemini was merely substantially better than GPT4, but we don’t have to make the same mistake in the other direction.
It’s worth noting there is some asymmetry in the likely updates with a high probability of a mild negative update on near term AI and a low probability of a large positive update toward powerful near term AI. E.g., even if Google were to explode and never release a better LLM than Gemini, this would be a relatively smaller update than if they were to release transformatively powerful AI.
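The asymmetry described above can be sketched with Bayes' rule. This is only an illustration: the prior and the two likelihoods below are made-up numbers, not estimates from anyone in this thread. It shows that when a "massive leap" release is unlikely in advance, a disappointing release moves the probability down a little, a leap would move it up a lot, and the two possible updates still average out to the prior.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Bayes' rule for a binary hypothesis H after observing evidence E."""
    joint_h = prior * p_e_given_h
    joint_not_h = (1 - prior) * p_e_given_not_h
    return joint_h / (joint_h + joint_not_h)


# Hypothetical numbers, chosen only for illustration.
PRIOR = 0.10               # P(H): powerful AI arrives in the near term
P_LEAP_GIVEN_H = 0.30      # P(release is a massive leap | H)
P_LEAP_GIVEN_NOT_H = 0.02  # P(release is a massive leap | not H)

# Probability of seeing a massive leap at all, before the release.
p_leap = PRIOR * P_LEAP_GIVEN_H + (1 - PRIOR) * P_LEAP_GIVEN_NOT_H

# Posterior after each of the two possible observations.
post_if_leap = posterior(PRIOR, P_LEAP_GIVEN_H, P_LEAP_GIVEN_NOT_H)
post_if_no_leap = posterior(PRIOR, 1 - P_LEAP_GIVEN_H, 1 - P_LEAP_GIVEN_NOT_H)

# Conservation of expected evidence: the updates average back to the prior.
expected_posterior = p_leap * post_if_leap + (1 - p_leap) * post_if_no_leap

print(f"P(leap)              = {p_leap:.3f}")
print(f"posterior if leap    = {post_if_leap:.3f}")
print(f"posterior if no leap = {post_if_no_leap:.3f}")
print(f"expected posterior   = {expected_posterior:.3f}")
```

With these assumed inputs the likely outcome (no leap) produces a small downward move and the unlikely outcome (leap) a large upward one. How small or large depends entirely on the likelihoods one plugs in.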
Hey Ryan, thanks for your engagement :) I’m going to respond to your replies in one go if that’s ok
#1:
It’s worth noting there is some asymmetry in the likely updates with a high probability of a mild negative update on near term AI and a low probability of a large positive update toward powerful near term AI.
This is a good point. I think my argument would point to larger updates for people who put substantial probability on near term AGI in 2024 (or even 2023)! Where do they shift that probability in their forecast? I think just dropping it uniformly over their current probability would be suspect to me. So maybe it wouldn’t be a large update for somebody already unsure what to expect from AI development, but I think it should probably be a large update for the ~20% expecting ‘weak AGI’ in 2024 (more in response #3).
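On one reading of the question above, the issue is where the probability mass taken away from 2024 should go. A rough sketch, using an entirely hypothetical four-bucket forecast and two hypothetical helper functions, compares spreading the freed mass in proportion to the existing forecast with pushing it all into the latest bucket. Neither rule is claimed to be the right one; the point is only that the choice changes the rest of the forecast.

```python
def spread_proportionally(forecast, bucket, new_mass):
    """Lower `bucket` to `new_mass`; scale the other buckets up in proportion."""
    freed = forecast[bucket] - new_mass
    rest = sum(p for k, p in forecast.items() if k != bucket)
    return {
        k: new_mass if k == bucket else p + freed * p / rest
        for k, p in forecast.items()
    }


def push_to_tail(forecast, bucket, new_mass, tail):
    """Lower `bucket` to `new_mass`; move all freed mass into `tail`."""
    updated = dict(forecast)
    updated[tail] += forecast[bucket] - new_mass
    updated[bucket] = new_mass
    return updated


def mass_by_2026(forecast):
    """Total probability assigned to arrival in 2026 or earlier."""
    return forecast["2024"] + forecast["2025"] + forecast["2026"]


# Hypothetical forecast over arrival year, for illustration only.
forecast = {"2024": 0.20, "2025": 0.20, "2026": 0.20, "2027+": 0.40}

proportional = spread_proportionally(forecast, "2024", 0.10)
tail_heavy = push_to_tail(forecast, "2024", 0.10, "2027+")

print(f"before:       P(by 2026) = {mass_by_2026(forecast):.2f}")
print(f"proportional: P(by 2026) = {mass_by_2026(proportional):.2f}")
print(f"tail-heavy:   P(by 2026) = {mass_by_2026(tail_heavy):.2f}")
```

Under the proportional rule most of the mass removed from 2024 lands on 2025 and 2026, so the forecast stays nearly as short-term as before; the tail-heavy rule lengthens it more.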
#2:
Further, Manifold doesn’t seem that wrong here on GPT4 vs Gemini? See for instance, this market:
Yeah I suppose ~80%->~60% is a decent update, thanks for showing me the link! My issue here would be that the resolution criteria really seem to be CoT on GSM8K, which is almost orthogonal to ‘better’ imho, especially given issues accounting for dataset contamination, though I suppose the market is technically about wider perception rather than technical accuracy. I think I was basing a lot of my take on the response on Tech Twitter, which is obviously unrepresentative and prone to hype. But there were a lot of people I generally regard as smart and switched-on who really over-reacted in my opinion. Perhaps the median community/AI-Safety researcher response was more measured.
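For a sense of scale, the ~80% to ~60% move can be converted into odds. This is a small sketch that assumes the market price can be read directly as a probability, which is a simplification; the function names are just illustrative.

```python
def odds(p):
    """Convert a probability into odds in favour."""
    return p / (1 - p)


def implied_likelihood_ratio(before, after):
    """Likelihood ratio that would move a Bayesian from `before` to `after`."""
    return odds(after) / odds(before)


before, after = 0.80, 0.60  # approximate figures quoted above
ratio = implied_likelihood_ratio(before, after)

print(f"odds before: {odds(before):.2f} to 1")
print(f"odds after:  {odds(after):.2f} to 1")
print(f"implied likelihood ratio: {ratio:.3f}")
```

On that reading the odds fall from roughly 4:1 to 1.5:1, which corresponds to evidence a bit under three times as likely if the market resolves NO as if it resolves YES.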
#3:
As in, the operationalization seems like a very poor definition for “weakly general AGI” and the tasks being forecast don’t seem very important or interesting.
I’m sympathetic to this, but Metaculus questions are generally meant to be resolved according to strict and unambiguous criteria afaik. So if someone thinks that weakly general AGI is near, but that it wouldn’t do well at the criteria in the question, then they should have longer timelines than the current community response to that question imho. The fact that this isn’t the case indicates to me that many people who made a forecast on this market aren’t paying attention to the details of the resolution and how LLMs are trained and their strengths/limitations in practice. (Of course, if these predictors think that weak AGI will happen from a non-LLM paradigm then fine, but then I’d expect the forecasting community to react less to LLM releases.)
I think where I absolutely agree with you is that we need different criteria to actually track the capabilities and properties of general AI systems that we’re concerned about! The current benchmarks available seem to have many flaws and don’t really work to distinguish interesting capabilities in the trained-on-everything era of LLMs. I think funding, supporting, and popularising research into what ‘good’ benchmarks would be and creating a new test would be high impact work for the AI field—I’d love to see orgs look into this!
Can’t we just use an SAT test created after the data cutoff?...You can see the technical report for more discussion on data contamination (though account for bias accordingly etc.)
For the Metaculus question? I’d be very upset if I had a longer-timeline prediction that failed because this resolution got changed—it says ‘less than 10 SAT exams’ in the training data in black and white! The fact that these systems need such masses of data to do well is a sign against their generality to me.
I don’t doubt that the Gemini team is aware of issues of data contamination (they even say so at the end of page 7 in the technical report), but I’ve become very sceptical about the state of public science on Frontier AI this year. I’m very much in a ‘trust, but verify’ mode, and to me the technical report is more of a fancy press release accompanying the marketing than an honest technical report. (This is not to doubt the integrity of the Gemini research and dev team, just to say that I think they’re losing the internal tug-of-war with Google marketing & strategy.)
#4:
This doesn’t seem to be by Melanie Mitchell FYI. At least she isn’t an author.
Ah good spot. I think I saw Melanie share it on Twitter, and assumed she was sharing some new research of hers (I pulled together the references fairly quickly). I still think the results stand, but I appreciate the correction and have amended my post.
<> <> <> <> <>
I want to thank you again for the interesting and insightful questions and prompts. They definitely made me think about how to express my position slightly more clearly (at least, I hope I make more sense to you after this response, even if we don’t agree on everything) :)
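Thanks for the response!
A few quick responses:
it says ‘less than 10 SAT exams’ in the training data in black and white
Good to know! That certainly changes my view of whether or not this will happen soon, but also makes me think the resolution criteria are poor.
I think funding, supporting, and popularising research into what ‘good’ benchmarks would be and creating a new test would be high impact work for the AI field—I’d love to see orgs look into this!
You might be interested in the recent OpenPhil RFP on benchmarks and forecasting.
Perhaps the median community/AI-Safety researcher response was more measured.
People around me seemed to have a reasonably measured response.
I think we’ll probably get a pretty big update about the power of LLM scaling in the next 1-2 years with the release of GPT5, in the same way that GPT3 and GPT4 were each quite informative even for the relatively savvy.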