Hey Ryan, thanks for your engagement :) I'm going to respond to your replies in one go if that's ok
#1:
It's worth noting there is some asymmetry in the likely updates with a high probability of a mild negative update on near term AI and a low probability of a large positive update toward powerful near term AI.
This is a good point. I think my argument would point to larger updates for people who put substantial probability on near-term AGI in 2024 (or even 2023)! Where do they shift that probability in their forecast? I think just dropping it uniformly over their current probability would be suspect to me. So maybe it wouldn't be a large update for somebody already unsure what to expect from AI development, but I think it should probably be a large update for the ~20% expecting "weak AGI" in 2024 (more in response #3).
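To make that distinction concrete, here is a minimal Python sketch with made-up numbers (the buckets and probabilities are purely illustrative, not anyone's actual forecast) contrasting "spread the missed probability proportionally over what's left" with "explicitly move it onto longer timelines".

```python
# Toy example only: a made-up timeline forecast over coarse buckets (probabilities sum to 1).
forecast = {"2024": 0.20, "2025-2027": 0.30, "2028-2035": 0.30, "2036 or later": 0.20}

def renormalise_proportionally(f, resolved_no="2024"):
    """Drop the bucket that resolved 'no' and spread its mass proportionally over the rest,
    leaving the relative odds between the remaining buckets unchanged."""
    remaining = {k: v for k, v in f.items() if k != resolved_no}
    total = sum(remaining.values())
    return {k: v / total for k, v in remaining.items()}

def shift_mass_later(f, resolved_no="2024", sink="2036 or later"):
    """Drop the bucket that resolved 'no' and move all of its freed-up mass to the latest bucket."""
    updated = {k: v for k, v in f.items() if k != resolved_no}
    updated[sink] = updated[sink] + f[resolved_no]
    return updated

print(renormalise_proportionally(forecast))  # relative odds between the remaining buckets are unchanged
print(shift_mass_later(forecast))            # the freed-up 20% lands entirely on longer timelines
```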
#2:
Further, manifold doesn't seem that wrong here on GPT4 vs gemini? See for instance, this market:
Yeah, I suppose ~80% -> ~60% is a decent update, thanks for showing me the link! My issue here would be that the resolution criterion really seems to be CoT on GSM8K, which is almost orthogonal to "better" imho, especially given issues accounting for dataset contamination, though I suppose the market is technically about wider perception rather than technical accuracy. I think I was basing a lot of my take on the response on Tech Twitter, which is obviously unrepresentative and prone to hype. But there were a lot of people I generally regard as smart and switched-on who really over-reacted in my opinion. Perhaps the median community/AI-Safety researcher response was more measured.
#3:
As in, the operationalization seems like a very poor definition for "weakly general AGI" and the tasks being forecast don't seem very important or interesting.
I'm sympathetic to this, but Metaculus questions are generally meant to be resolved according to strict and unambiguous criteria, afaik. So if someone thinks that weakly general AGI is near, but that it wouldn't do well on the criteria in the question, then they should have longer timelines than the current community response to that question, imho. The fact that this isn't the case indicates to me that many people who made a forecast on this market aren't paying attention to the details of the resolution and how LLMs are trained and their strengths/limitations in practice. (Of course, if these predictors think that weak AGI will happen from a non-LLM paradigm then fine, but then I'd expect the forecasting community to react less to LLM releases.)
I think where I absolutely agree with you is that we need different criteria to actually track the capabilities and properties of general AI systems that we're concerned about! The current benchmarks available seem to have many flaws and don't really work to distinguish interesting capabilities in the trained-on-everything era of LLMs. I think funding, supporting, and popularising research into what "good" benchmarks would be, and creating a new test, would be high impact work for the AI field; I'd love to see orgs look into this!
Can't we just use an SAT test created after the data cutoff?...You can see the technical report for more discussion on data contamination (though account for bias accordingly etc.)
For the Metaculus question? I'd be very upset if I had a longer-timeline prediction that failed because this resolution got changed; it says "less than 10 SAT exams" in the training data in black and white! The fact that these systems need such masses of data to do well is a sign against their generality to me.
I don't doubt that the Gemini team is aware of issues of data contamination (they even say so at the end of page 7 of the technical report), but I've become very sceptical about the state of public science on frontier AI this year. I'm very much in a "trust, but verify" mode, and the technical report reads to me more like a fancy press release that accompanied the marketing than an honest technical report. (Which is not to doubt the integrity of the Gemini research and dev team, just to say that I think they're losing the internal tug-of-war with Google marketing & strategy.)
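(As a side note on what "accounting for data contamination" can mean in practice: one very crude check is n-gram overlap between a benchmark item and training documents. The sketch below is purely illustrative, with function and variable names I made up; it is not how the Gemini report or any lab's actual deduplication pipeline works, which would be far more involved.)

```python
# Illustrative sketch of a crude n-gram overlap contamination check (names are hypothetical).

def word_ngrams(text: str, n: int = 8) -> set:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item: str, training_doc: str,
                       n: int = 8, threshold: float = 0.3) -> bool:
    """Flag a benchmark item if a large fraction of its n-grams appear verbatim in the
    training document. Real pipelines use fuzzier matching and corpus-scale indexing."""
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return False
    overlap = len(item_grams & word_ngrams(training_doc, n)) / len(item_grams)
    return overlap >= threshold

# Toy usage: a question copied verbatim into a training document gets flagged.
question = "If 3x + 5 = 20, what is the value of x? Explain each step of your reasoning in full sentences."
document = "exam prep notes: If 3x + 5 = 20, what is the value of x? Explain each step of your reasoning in full sentences."
print(looks_contaminated(question, document))  # True
```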
#4:
This doesn't seem to be by Melanie Mitchell FYI. At least she isn't an author.
Ah, good spot. I think I saw Melanie share it on Twitter and assumed she was sharing some new research of hers (I pulled together the references fairly quickly). I still think the results stand, but I appreciate the correction and have amended my post.
<> <> <> <> <>
I want to thank you again for the interesting and insightful questions and prompts. They definitely made me think about how to express my position slightly more clearly (at least, I hope I make more sense to you after this response, even if we don't agree on everything) :)
Thanks for the response!
A few quick responses:
it says "less than 10 SAT exams" in the training data in black and white
Good to know! That certainly changes my view of whether or not this will happen soon, but also makes me think the resolution criteria is poor.
I think funding, supporting, and popularising research into what "good" benchmarks would be, and creating a new test, would be high impact work for the AI field; I'd love to see orgs look into this!
You might be interested in the recent OpenPhil RFP on benchmarks and forecasting.
Perhaps the median community/AI-Safety researcher response was more measured.
People around me seemed to have a reasonably measured response.
I think we'll probably get a pretty big update about the power of LLM scaling in the next 1-2 years with the release of GPT5. Like, in the same way that each of GPT3 and GPT4 were quite informative even for the relatively savvy.