Thanks for sharing this Phil, it's very unfortunate it came out just as I went on holiday! To all readers, this will probably be the major substantive response I make in these comments, and to get the most out of it you'll probably need some technical/background understanding of how AI systems work. I'll tag @Ryan Greenblatt directly so he can see my points, but only the first is really directed at him; the rest respond to the ideas and interpretations.
First, to Ryan directly, this is really great work! Like, awesome job. My only sadness here is that I get the impression you think this work is kind of a dead-end? On the contrary, I think this is the kind of research programme that could actually lead to updates (either way) across the different factions on AI progress and AI risk. You get mentioned positively on the xrisk-hostile Machine Learning Street Talk podcast about this! Melanie Mitchell is paying attention (and even appeared in your Substack comments)! I feel like the iron is hot here and it's a promising and exciting vein of research![1]
Second, as others have pointed out, the claimed numbers are not SOTA, but that is because there are different training sets, and I think the ARC-AGI team should be clearer about that. But to be clear for all readers, this is what's happened:
Ryan got a model to achieve 50% accuracy on the public evaluation set provided by Chollet in the original repo. Ryan does not have a score on the private test set, because those answers are kept private on Kaggle to prevent data leakage. Note that Ryan's original claims were based on the different sets being IID and of the same difficulty, which is not true. We should expect performance to be lower on the private set.
The current SOTA on the private test set was Cole, Osman, and Hodel with 34%, though apparently they have now reached 39% on the private set. Ryan has noted this, so I assume we'll have clarifications/corrections to that bit of his piece soon.
Therefore Ryan has not achieved SOTA performance on ARC. That doesn't mean his work isn't impressive, but it is not true that GPT4o improved the ARC SOTA by 16% in 6 days.
Also note from the comments on Substack: when limited to ~128 sample programs per case, the results were 26% on the held-out test of the training set. It's good, but not state of the art, and one wonders whether the juice is worth the squeeze there, especially if Jianghong Ying's calculations of the tokens-per-case are accurate. We seem to need exponentially more samples to improve results.
Currently, as Ryan notes, his solution is ineligible for the ARC prize as it doesn't meet the various restrictions on runtime/compute/internet connection to enter. While the organisers say that this is meant to encourage efficiency,[2] I suspect it may be more of a security-conscious decision to limit people's access to the private test set. It is worth noting that, as the public training and eval sets are on GitHub (as will be most blog pieces about them, and eventually Ryan's own piece as well as my own), dataset contamination remains an issue to be concerned with.[3]
Third, and most importantly, I think Ryan's solution shows that the intelligence is coming from him, and not from GPT4o. skybrian makes this point in the Substack comments. For example:
Ryan came up with the idea and implementation of using ASCII encoding, since the vision capabilities of GPT4o were so unreliable. Ryan did some feature extraction on the ARC problems.
Ryan wrote the prompts and did the prompt engineering in lieu of there being fine-tuning available. He also provides the step-by-step reasoning in his prompts. Those long, carefully crafted prompts seem quite domain/problem-specific, and would probably point more toward ARC's insufficiency as a test for generality than toward an example of general ability in LLMs.
Ryan notes that the additional approaches and tweaks are critical for the performance gain above 'just draw more samples'. I think that meme was a bit unkind, not to mention inaccurate, and I kinda wish it was removed from the piece tbh.
If you check the repo (linked above), it's full of some really cool code to make this solution work, but that's the secret sauce. To my eyes, I think the hard part here was the scaffolding done by Ryan rather than the pre-training[4] of the LLM (this is another cruxy point I highlighted in my article). I think it's much less conceptually hard to scrape the entire internet and shove it through a transformer architecture. A lot of legwork and cost, sure, but the hard part is the ideas bit, and that's still basically all Ryan-GPT.
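To give readers a concrete picture of what this scaffolding involves, here is a rough sketch of the sample-and-select loop. This is my own illustrative reconstruction, not code from Ryan's repo: `ask_gpt4o` is a hypothetical callable standing in for an OpenAI API call, and the grid rendering is just one plausible text encoding (his actual pipeline also does revision steps, feature extraction, etc.).

```python
def render_grid(grid):
    """Render an ARC grid (a list of rows of ints 0-9) as one line of text per row."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def build_prompt(task):
    """Show each demonstration pair as text, then ask for a Python transform function."""
    parts = ["Each example below maps an input grid to an output grid."]
    for pair in task["train"]:
        parts.append("INPUT:\n" + render_grid(pair["input"]))
        parts.append("OUTPUT:\n" + render_grid(pair["output"]))
    parts.append("Reply with only Python code defining transform(grid) "
                 "that implements the rule, with no explanation.")
    return "\n\n".join(parts)

def solve(task, ask_gpt4o, n_samples=128):
    """Sample many candidate programs; keep one that reproduces every training pair."""
    prompt = build_prompt(task)
    for _ in range(n_samples):
        code = ask_gpt4o(prompt)   # one sampled completion per call
        namespace = {}
        try:
            exec(code, namespace)  # demo only; sandbox untrusted code in practice
            transform = namespace["transform"]
            if all(transform(p["input"]) == p["output"] for p in task["train"]):
                return [transform(t["input"]) for t in task["test"]]
        except Exception:
            continue               # most sampled programs fail; that is the point of sampling
    return None
```

The point being: deciding on the text encoding, the prompt, the execution-and-selection loop, and all the tweaks around it is exactly the part Ryan supplied.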
Fourth, I got massively nerdsniped by what 'in-context learning' actually is. A lot of talk about it from a quick search seemed to be vague, wishy-washy, and highly anthropomorphising. I'm quite confused, given the fact that all of the weights in the transformer are frozen after training and RLHF, why it's called learning at all. The model certainly isn't learning anything. After you ask GPT4o a query you can boot up a new instance and it'll be as clueless as when you started the first session, or you could just flood the context window with enough useless tokens so that the original task gets cut off. So, if I accept Ryan's framing of the inconsistent triad, I'd reject the 3rd one, and say that 'Current LLMs never "learn" at runtime (e.g. the in-context learning they can do isn't real learning)'. I'm going to continue following the 'in-context learning' nerdsnipe, but yeah, since we know that the weights are completely fixed and the model isn't learning, what is doing it? And can we think of a better name for it than 'in-context learning'?
Fifth and finally, I'm slightly disappointed at Buck and Dwarkesh for kinda posing this as a 'mic drop' against ARC.[5] Similarly, Zvi seems to dismiss it, though he praises Chollet for making a stand with a benchmark. In contrast, I think that the ability (or not) of models to reason robustly, out-of-distribution, without the ability to learn from trillions of pre-labelled samples, is a pretty big crux for AI Safety's importance. Sure, maybe in a few months we'll see the top score on the ARC Challenge above 85%, but could such a model work in the real world? Is it actually a general intelligence capable of novel or dangerous acts, such as to motivate AI risk? This is what Chollet is talking about in the podcast when he says:
I'm pretty skeptical that we're going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved. If you just train the model on millions or billions of puzzles similar to ARC, you're relying on the ability to have some overlap between the tasks that you train on and the tasks that you're going to see at test time. You're still using memorization.
If you're reading, thanks for making it through this comment! I'd recommend reading Ryan's full post first (which Philb linked above), but there's been a bunch of disparate discussion there, on LessWrong, on HackerNews, etc. If you want to pursue what the LLM-reasoning-sceptics think, I'd recommend following/reading Melanie Mitchell and Subbarao Kambhampati. Finally, if you think this topic/problem is worth collaborating on then feel free to reach out to me. I'd love to hear from anyone who thinks it's worth investigating and would want to pool resources.
(Ofc your time is valuable and you should pursue what you think is valuable; I'd just hope this could be the start of a cross-factional, positive-sum research programme, which would be such a breath of fresh air compared to other AI discourse atm)
Ryan estimates he used 1000x more runtime compute per problem than Cole et al., and also spent $40,000 in API costs alone (I wonder how much it costs for just one run, though?).
In the original interview, Mike mentions that 'there is an asterisk on any score that's reported on against the public test set' for this very reason.
H/t to @Max Nadeau for being on top of some of the clarifications on Twitter.
Perhaps I'm misinterpreting, and I am using them as a proxy for the response of AI Safety as a whole, but it's very much the 'vibe' I got from those reactions.
It sounds like you agree with my claims that ARC-AGI isn't that likely to track progress and that other benchmarks could work better?
(The rest of your response seemed to imply something different.)
At the moment I think ARC-AGI does a good job of showing the limitations of transformer models on simple tasks that they don't come across in their training set. I think if the score were claimed, we'd want to see how it came about. It might be through frontier models demonstrating true understanding, but it might be through shortcut learning/data leakage/an impressive but overly specific and intuitively unsatisfying solution.
If ARC-AGI were to be broken (within the constraints Chollet and Knoop place on it) I'd definitely change my opinions, but what they'd change to depends on how ARC-AGI was solved. That's all I'm trying to say in that section (perhaps poorly).
the claimed numbers are not SOTA, but that is because there are different training sets and I think the ARC-AGI team should be clearer about that
Agreed, though it is possible that my approach is/was SOTA on the private set. (E.g., because Jack Cole et al.'s approach is somewhat more overfit.)
I'm waiting on the private leaderboard results and then I'll revise.
My only sadness here is that I get the impression you think this work is kind of a dead-end?
I don't think it is a dead end.
As I say in the post:
ARC-AGI probably isn't a good benchmark for evaluating progress towards TAI: substantial 'elicitation' effort could massively improve performance on ARC-AGI in a way that might not transfer to more important and realistic tasks.
But, I still think that work like ARC-AGI can be good on the margin for getting a better understanding of current AI capabilities.
So, if I accept Ryan's framing of the inconsistent triad, I'd reject the 3rd one, and say that 'Current LLMs never "learn" at runtime (e.g. the in-context learning they can do isn't real learning)'
You have to reject one of the three. So, if you reject the third (as I do), then you think LLMs do learn at runtime.
I'm quite confused, given the fact that all of the weights in the transformer are frozen after training and RLHF, why it's called learning at all
In RLHF and training, no aspect of the GPU hardware is being updated at all; it's all frozen. So why does that count as learning? I would say that a system can (potentially!) be learning as long as there is some evolving state. In the case of transformers and in-context learning, that state is the activations.
You have to reject one of the three. So, if you reject the third (as I do), then you think LLMs do learn at runtime.
Ah sorry, I misread the trilemma, my bad! I think I'd still hold the 3rd to be true (Current LLMs never 'learn' at runtime), though I'm open to changing my mind on that after looking at further research. I guess I could see ways to reject 1 (e.g. if I copied the answers and just used a lookup table I'd get 100%, but I don't think there's any learning there, so it's certainly feasible for this to be false, though agreed it doesn't feel satisfying), or 2 (maybe Chollet would say selection-from-memorised-templates doesn't count as learning; also agreed, unsatisfying). It's a good challenge!
In RLHF and training, no aspect of the GPU hardware is being updated at all; it's all frozen. So why does that count as learning?
I'm not really referring to hardware here. In pre-training and RLHF the model weights are being changed and updated, and that's where the 'learning' (if we want to call it that) comes in: the model is 'learning' to store/generate information through some combination of accurately predicting the next token in its training data and satisfying the RL model created from human reward labelling. Which is my issue with calling ICL 'learning': since the model weights are fixed, the model isn't learning anything. Similarly, all the activation functions between the layers do not change either. It also doesn't make intuitive sense to me to call the outputs of layers 'learning' - the activations are 'just matmul', which I know is reductionist, but they aren't a thing that acquires a new state in my mind.
But again, this is something I want to do a deep dive into myself, so I accept that my thoughts on ICL might not be very clear
I'm not really referring to hardware here. In pre-training and RLHF the model weights are being changed and updated
Sure, I was just using this as an example. I should have made this clearer.
Here is a version of the exact same paragraph you wrote, but for activations and in-context learning:
in pre-training and RLHF the model activations are being changed and updated by each layer, and that's where the 'in-context learning' (if we want to call it that) comes in: the activations are being updated/optimized to better predict the next token and understand the text. The layers learned to in-context learn (update the activations) across a wide variety of data in pretraining.
(We can show transformers learning to do optimization in [very toy cases](https://www.lesswrong.com/posts/HHSuvG2hqAnGT5Wzp/no-convincing-evidence-for-gradient-descent-in-activation#Transformers_Learn_in_Context_by_Gradient_Descent__van_Oswald_et_al__2022_).)
Fair enough if you want to say 'the model isn't learning, the activations are learning', but then you should also say 'short-term (<1 minute) learning in humans isn't the brain learning, it is the transient neural state learning'.
I'll have to dive into the technical details here I think, but the mystery of in-context learning has certainly shot up my reading list, and I really appreciate that link btw! It seems Blaine has some of the same a-priori scepticism that I do towards it, but the right way for me to proceed is to dive into the empirical side and see if my ideas hold water there.
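For readers who want to see the 'frozen weights, evolving activations' point concretely, here is a toy illustration. It is my own sketch, nothing from Ryan's repo, and it assumes the HuggingFace transformers library and GPT-2: no parameter is updated anywhere below, yet the model's next-token prediction shifts as in-context examples pile up in the prompt.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # inference only: no weight is updated anywhere below

def p_next(prompt, target=" 1"):
    """Probability the frozen model assigns to `target` as the next token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    return probs[tok.encode(target)[0]].item()

# Same parameters every time; only the context (and hence the activations) differs.
print(p_next("A:"))                          # no examples of the A -> 1 pattern
print(p_next("A: 1\nB: 2\nA:"))              # one example
print(p_next("A: 1\nB: 2\nA: 1\nB: 2\nA:"))  # two examples; probability of " 1" typically rises
```

Whether you want to call that shift 'learning' is exactly the disagreement above, but the evolving state Ryan points to is real.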
Third, and most importantly, I think Ryan's solution shows that the intelligence is coming from him, and not from GPT4o. skybrian makes this point in the Substack comments.
[...]
To my eyes, I think the hard part here was the scaffolding done by Ryan rather than the pre-training[4] of the LLM (this is another cruxy point I highlighted in my article).
Certainly some credit goes to me and some to GPT4o.
The solution would be much worse without careful optimization and wouldn't work at all without GPT4o (or another LLM with similar performance).
It's worth noting a high fraction of my time went into writing prompts and optimizing the representation. (Which is perhaps better described as teaching GPT4o and making it easier for it to see the problem.)
There are different analogies here which might be illuminating:
Suppose that you strand a child out in the woods and never teach them anything. I expect they would be much worse at programming. So, some credit for their abilities goes to society and some to their brain.
If you remove my ability to see (or conversely, use fancy tools to make it easier for a blind person to see), this would greatly affect my ability to do ARC-AGI puzzles.
You can build systems around people which remove most of the interesting intelligence from various tasks.
I think what is going on here is analogous to all of these.
I think it's much less conceptually hard to scrape the entire internet and shove it through a transformer architecture. A lot of legwork and cost, sure, but the hard part is the ideas bit,
Quoting from a substack comment I wrote in response:
It is worth noting that hundreds (thousands?) of high-quality researcher years have been put into making GPT4o more performant.
The solution would be much worse without careful optimization and wouldn't work at all without GPT4o (or another LLM with similar performance).
I can buy that GPT4o would be best, but perhaps other LLMs might reach 'ok' scores on ARC-AGI if directly swapped out? I'm not sure what you're referring to by 'careful optimization' here, though.
There are different analogies here which might be illuminating:
Suppose that you strand a child out in the woods and never teach them anything. I expect they would be much worse at programming. So, some credit for their abilities goes to society and some to their brain.
If you remove my ability to see (or conversely, use fancy tools to make it easier for a blind person to see), this would greatly affect my ability to do ARC-AGI puzzles.
You can build systems around people which remove most of the interesting intelligence from various tasks.
I think what is going on here is analogous to all of these.
On these analogies:
This is an interesting point actually. I suppose credit-assignment for learning is a very difficult problem. In this case though, the stranded child would (hopefully!) survive and make a life for themselves and learn the skills they need to survive. They're active agents using their innate general intelligence to solve novel problems (per Chollet). If I put a hard-drive with GPT4o's weights in the forest, it'll just rust. And that'll happen no matter how big we make that model/hard-drive imo.[1]
Agreed here, it will be very interesting to see how improved multimodality affects ARC-AGI scores. I think that we have interesting cases of humans being able to perform these tasks in their head, presumably without sight? e.g. blind chess players with high ratings, or mathematicians who can reason without sight. I think Chollet's point in the interview is that the models seem to be able to parse the JSON inputs fine in various cases, but still can't perform generalisation.
Yep I think this is true, and it's perhaps my greatest fear from delegating power to complex AI systems. This is an empirical question we'll have to find out: can we simply automate away everything humans do/are needed for through a combination of systems, even if each individual part/model used in said system is not intelligent?
Yep, saw Max's comments and think he did a great job on X bringing some clarifications. I still think the hard part is the scaffolding. Money is easy for San Fran VCs to provide, and we know they're all fine to scrape-data-first-ask-legal-forgiveness-later.
I think there's a separate point where enough scaffolding + LLM means the resulting AI system is not well described as an LLM anymore. Take the case of CICERO by Meta. Is that a 'scaffolded LLM'? I'd rather describe it as a system which incorporates an LLM as a particular part. It's harder to naturally scale such a system in the way that you can with the transformer architecture, by stacking more layers or pre-training for longer on more data.
My intuition here is that scaffolding to make a system work well on ARC-AGI would make it less usable on other tasks, so sacrificing generality for specific performance. Perhaps in this case ARC-AGI is best used as a suite of benchmarks, where the same model and scaffolding should be used for each? (Just thinking out loud here)
Final point: I've really appreciated your original work and your comments on Substack/X/here. I do apologise if I didn't make clear which parts were my personal reflections/vibes instead of more technical disagreements on interpretation; these are very complex topics (at least for me) and I'm trying my best to form a good explanation of the various evidence and data we have on this. Regardless of our disagreements on this topic, I've learned a lot :)
Similarly, you can pre-train a model to create weights and get to a humongous size. But it won't do anything until you ask it to generate a token. At least, that's my intuition. I'm quite sceptical of how pre-training a transformer is going to lead to creating a mesa-optimiser.
But it won't do anything until you ask it to generate a token. At least, that's my intuition.
I think this seems like mostly a fallacy. (I feel like there should be a post explaining this somewhere.)
Here is an alternative version of what you said to indicate why I don't think this is a very interesting claim:
Sure, you can have a very smart quadriplegic who is very knowledgeable. But they won't do anything until you let them control some actuator.
If your view is that 'prediction won't result in intelligence', fair enough, though it's notable that the human brain seems to heavily utilize prediction objectives.
(folding in replies to different sub-comments here)
Sure, you can have a very smart quadriplegic who is very knowledgeable. But they won't do anything until you let them control some actuator.
I think our misunderstanding here is caused by the word 'do'. Sure, Stephen Hawking couldn't control his limbs, but nevertheless his mind was always working. He kept writing books and papers throughout his life, and his brain was 'always on'. A transformer model is a set of frozen weights that are only 'on' when a prompt is entered. That's what I mean by 'it won't do anything'.
As far as this project goes, it seems extremely implausible to me that the hard part of this project is the scaffolding work I did.
Hmm, maybe we're differing on what 'hard work' means here! Could be a difference between what's expensive, time-consuming, etc. I'm not sure this holds for any reasonable scheme, and I definitely think that you deserve a lot of credit for the work you've done, much more than GPT4o.
Congrats! I saw that result and am impressed! It's definitely clearly SOTA on the ARC-AGI-PUB leaderboard, but the original '34%->50% in 6 days ARC-AGI breakthrough' claim is still incorrect.
I can buy that GPT4o would be best, but perhaps other LLMs might reach 'ok' scores on ARC-AGI if directly swapped out? I'm not sure what you're referring to by 'careful optimization' here, though.
I think much worse LLMs like GPT-2 or GPT-3 would virtually eliminate performance.
This is very clear as these LLMs can't code basically at all.
If you instead consider LLMs which are only somewhat less powerful like llama-3-70b (which is perhaps 10x less effective compute?), the reduction in perf will be smaller.
Perhaps in this case ARC-AGI is best used as a suite of benchmarks, where the same model and scaffolding should be used for each?
Yes, it seems reasonable to try out general purpose scaffolds (like what METR does) and include ARC-AGI in general purpose task benchmarks.
I expect substantial performance reductions from general-purpose scaffolding, though some fraction of that will be due to not having prefix compute and allocating test-time compute less effectively.
For this project? In general?
As far as this project goes, it seems extremely implausible to me that the hard part of this project is the scaffolding work I did. This probably holds for any reasonable scheme for dividing credit and determining what is difficult.
I don't think the objection is to ARC (the benchmark); I think the objection is to specific (very strong!) claims that Chollet makes.
I think the benchmark is a useful contribution, as I note in another comment.
Oh yeah, this wasn't against you at all! I think you're a great researcher and an excellent interlocutor, and I learned a lot (and am learning a lot) from both your work and your reactions to my reaction.[1] Point five was very much a reaction against a 'vibe' I saw in the wake of your results being published.
Like, let's take Buck's tweet for example. We know now that a) your results aren't technically SOTA and b) it's not an LLM solution, it's an LLM + your scaffolding + program search, and I think that's importantly not the same thing.
I sincerely hope my post + comments have been somewhat more stimulating than frustrating for you.
I think my results are probably SOTA based on more recent updates.
It's not an LLM solution, it's an LLM + your scaffolding + program search, and I think that's importantly not the same thing.
I feel like this is a pretty strange way to draw the line about what counts as an 'LLM solution'.
Consider the following simplified dialogue as an example of why I don't think this is a natural place to draw the line:
Human skeptic: Humans don't exhibit real intelligence. You see, they'll never do something as impressive as sending a human to the moon.
Humans-have-some-intelligence advocate: Didn't humans go to the moon in 1969?
Human skeptic: That wasn't humans sending someone to the moon, that was Humans + Culture + Organizations + Science sending someone to the moon! You see, humans don't exhibit real intelligence!
Humans-have-some-intelligence advocate: ... Ok, but do you agree that if we removed the Humans from the overall approach it wouldn't work?
Human skeptic: Yes, but same with the culture and organization!
Humans-have-some-intelligence advocate: Sure, I guess. I'm happy to just call it humans+etc, I guess. Do you have any predictions for specific technical feats which are possible to do with a reasonable amount of intelligence that you're confident can't be accomplished by building some relatively straightforward organization on top of a bunch of smart humans within the next 15 years?
Human skeptic: No.
Of course, I think actual LLM skeptics often don't answer 'No' to the last question. They often do have something that they think is unlikely to occur with a relatively straightforward scaffold on top of an LLM (a model descended from the current LLM paradigm, perhaps trained with semi-supervised learning and RLHF).
I actually don't know what in particular Chollet thinks is unlikely here. E.g., I don't know if he has strong views about the performance of my method, but with the SOTA multimodal model in 2 years' time swapped in.
Final final edit: Congrats on the ARC-AGI-PUB results, really impressive :)
This will be my final response on this thread, because life is very time-consuming and I'm rapidly reaching the point where I need to dive back into the technical literature and stress-test my beliefs and intuitions again. I hope Ryan and any readers have found this exchange useful/enlightening for seeing two different perspectives hopefully having a productive disagreement.
If you found my presentation of the scaling-skeptical position highly unconvincing, I'd recommend following the work and thoughts of Tan Zhi Xuan (find her on X here). One of my biggest updates was finding her work after she pushed back on Jacob Steinhardt here, and recently she gave a talk about her approach to Alignment. I urge readers to consider spending much more of their time listening to her than to me about AI.
I feel like this is a pretty strange way to draw the line about what counts as an 'LLM solution'.
I don't think so? Again, I wouldn't call CICERO an 'LLM solution'. Surely there'll be some amount of scaffolding which tips over into the scaffolding being the main thing and the LLM just being a component part? It's probably all blurry lines for sure, but I think it's important to separate 'LLM-only systems' from 'systems that include LLMs', because it's very easy to conceptually scale up the former but harder to do the latter.
Human skeptic: That wasn't humans sending someone to the moon, that was Humans + Culture + Organizations + Science sending someone to the moon! You see, humans don't exhibit real intelligence!
I mean, you use this as a reductio, but that's basically the theory of Distributed Cognition, and it's also linked to the ideas of 'collective intelligence', though that's definitely not an area I'm an expert in by any means. It also reminds me a lot of Chalmers and Clark's thesis of the Extended Mind.[1]
Of course, I think actual LLM skeptics often don't answer 'No' to the last question. They often do have something that they think is unlikely to occur with a relatively straightforward scaffold on top of an LLM (a model descended from the current LLM paradigm, perhaps trained with semi-supervised learning and RLHF).
So I can't speak for Chollet and other LLM skeptics, and I think again LLMs+extras (or extras+LLMs) are a different beast from LLMs on their own, and possibly an important crux. Here are some things I don't think will happen in the near-ish future (on the current paradigm):
I believe an adversarial Imitation Game, where the interrogator is aware of both the AI system's LLM-based nature and its failure modes, is unlikely to be consistently beaten in the near future.[2]
Primarily-LLM models, in my view, are highly unlikely to exhibit autopoietic behaviour or develop agentic designs independently (i.e. without prompting/direction by a human controller).
I don't anticipate these models exponentially increasing the rate of scientific research or AI development.[3] They'll more likely serve as tools used by scientists and researchers themselves to frame problems, but new and novel problems will still remain difficult and be bottlenecked by the real world + Hofstadter's law.
I don't anticipate primarily-LLM models becoming good at controlling and manoeuvring robotic bodies in the 3D world. This is especially true in a novel-test-case scenario (if someone could make a physical equivalent of ARC to test this, that'd be great).
This would be even less likely if the scaffolding remained minimal. For instance, if there's no initial sorting code explicitly stating [IF challenge == turing_test GO TO turing_test_game_module].
Finally, as an anti-RSI operationalisation, the idea of LLM-based models assisting in designing and constructing a Dyson Sphere within 15 years seems... particularly far-fetched to me.
I'm not sure if this reply was my best, it felt a little all-over-the-place, but we are touching on some deep and complex topics! So I'll respectfully bow out now, and thanks again for the discussion and for giving me so much to think about. I really appreciate it Ryan :)
Then you get into ideas like embodiment/enactivism etc.
I can think of a bunch of strategies to win here, but I'm not gonna say, so it doesn't end up in GPT-5 or 6's training data!
Of course, with a new breakthrough, all bets could be off, but it's also definitionally impossible to predict those, and it's not robust to draw straight lines on graphs to predict the future if you think breakthroughs will be needed. (Not saying you do this, but some other AIXR people definitely seem to.)
I have thoughts, but a question first: you link a Kambhampati tweet where he says,
...as the context window changes (with additional prompt words), the LLM, by design, switches the CPT used to generate next token - given that all these CPTs have been pre-computed?
What does 'CPT' stand for here? It's not a common ML or computer science acronym that I've been able to find.
Since nobody else has responded, my best guess would be 'conditional probability table'.
I think Ryan's solution shows that the intelligence is coming from him, and not from GPT4o.
If this is true, then substituting in a less capable model should have equally good results; would you predict that to be the case? I claim that plugging in an older/smaller model would produce much worse results, and if that's the case then we should consider a substantial part of the performance to be coming from the model.
This is what Chollet is talking about in the podcast when he says... 'I'm pretty skeptical that we're going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved.'
This seems to me to be Chollet trying to have it both ways. Either a) ARC is an important measure of 'true' intelligence (or at least of the ability to reason over novel problems), and so we should consider LLMs' poor performance on it a sign that they're not general intelligences, or b) ARC isn't a very good measure of true intelligence, in which case LLMs' performance on it isn't very important. Those can't be simultaneously true. I think that nearly everywhere but in the quote, Chollet has claimed (and continues to claim) that a) is true.
I'm quite confused, given the fact that all of the weights in the transformer are frozen after training and RLHF, why it's called learning at all. The model certainly isn't learning anything.
I would frame it as: the model is learning but then forgetting what it's learned (due to its inability to move anything from working/short-term memory to long-term memory). That's something that we see in learning in humans as well (one example: I've learned an enormous number of six-digit confirmation codes, each of which I remember just long enough to enter it into the website that's asking for it), although of course not so consistently.