Third, and most importantly, I think Ryan’s solution shows that the intelligence is coming from him, and not from GPT4o. skybrian makes this point in the substack comments.
[...]
To my eyes, the hard part here was the scaffolding done by Ryan rather than the pre-training[4] of the LLM (this is another cruxy point I highlighted in my article).
Certainly some credit goes to me and some to GPT4o.
The solution would be much worse without careful optimization and wouldn’t work at all without GPT4o (or another LLM with similar performance).
It’s worth noting a high fraction of my time went into writing prompts and optimizing the representation. (Which is perhaps better described as teaching GPT4o and making it easier for it to see the problem.)
There are different analogies here which might be illuminating:
Suppose that you strand a child out in the woods and never teach them anything. I expect they would be much worse at programming. So, some credit for their abilities goes to society and some to their brain.
If you remove my ability to see (or conversely, use fancy tools to make it easier for a blind person to see), this would greatly affect my ability to do ARC-AGI puzzles.
You can build systems around people which remove most of the interesting intelligence from various tasks.
I think what is going on here is analogous to all of these.
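As an aside on ‘optimizing the representation’ mentioned in the quote above: ARC-AGI tasks arrive as JSON grids of integers, and a natural form of representation work is re-rendering them so the 2D structure is visible to the model. A minimal sketch of that idea, assuming a simple row-per-line text rendering (the function and formatting choices here are illustrative, not Ryan’s actual code):

```python
# Illustrative sketch: re-render an ARC-AGI grid (a list of rows of
# integers 0-9) as aligned rows of digits. LLMs often parse this more
# reliably than raw nested JSON, since the 2D structure appears
# line by line in the token stream.
def render_grid(grid: list[list[int]]) -> str:
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

example = [[0, 0, 1],
           [0, 1, 0],
           [1, 0, 0]]
print(render_grid(example))
# 0 0 1
# 0 1 0
# 1 0 0
```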
I think it’s much less conceptually hard to scrape the entire internet and shove it through a transformer architecture. A lot of legwork and cost, sure, but the hard part is the ideas.
It is worth noting that hundreds (thousands?) of high-quality researcher years have been put into making GPT4o more performant.
The solution would be much worse without careful optimization and wouldn’t work at all without GPT4o (or another LLM with similar performance).
I can buy that GPT4o would be best, but perhaps other LLMs might reach ‘ok’ scores on ARC-AGI if directly swapped in? I’m not sure what you mean by ‘careful optimization’ here though.
On these analogies:
This is an interesting point actually. I suppose credit assignment for learning is a very difficult problem. In this case though, the stranded child would (hopefully!) survive, make a life for themselves, and learn the skills they need to survive. They’re active agents using their innate general intelligence to solve novel problems (per Chollet). If I put a hard drive with GPT4o’s weights in the forest, it’ll just rust. And that’ll happen no matter how big we make that model/hard drive, imo.[1]
Agreed here; it will be very interesting to see how improved multimodality affects ARC-AGI scores. I think we have interesting cases of humans being able to perform these tasks in their head, presumably without sight? e.g. blind chess players with high ratings, or mathematicians who can reason without sight. I think Chollet’s point in the interview is that the models seem to be able to parse the JSON inputs fine in various cases, but still can’t perform generalisation.
Yep, I think this is true, and it’s perhaps my greatest fear from delegating power to complex AI systems. This is an empirical question we’ll have to find out: can we simply automate away everything humans do/are needed for through a combination of systems, even if each individual part/model used in said system is not intelligent?
Yep, saw Max’s comments and think he did a great job on X bringing some clarifications. I still think the hard part is the scaffolding. Money is easy for San Francisco VCs to provide, and we know they’re all fine with scraping data first and asking legal forgiveness later.
I think there’s a separate point where enough scaffolding + LLM means the resulting AI system is not well described as an LLM anymore. Take the case of CICERO by Meta. Is that a ‘scaffolded LLM’? I’d rather describe it as a system which incorporates an LLM as a particular part. It’s harder to naturally scale such a system in the way that you can with the transformer architecture, by stacking more layers or pre-training for longer on more data.
My intuition here is that scaffolding to make a system work well on ARC-AGI would make it less usable on other tasks, sacrificing generality for specific performance. Perhaps in this case ARC-AGI is best used as part of a suite of benchmarks, where the same model and scaffolding should be used for each? (Just thinking out loud here)
Final point: I’ve really appreciated your original work and your comments on substack/X/here. I do apologise if I didn’t make clear which parts were my personal reflections/vibes rather than more technical disagreements on interpretation; these are very complex topics (at least for me) and I’m trying my best to form a good explanation of the various evidence and data we have on this. Regardless of our disagreements on this topic, I’ve learned a lot :)
Similarly, you can pre-train a model to create weights and get to a humongous size. But it won’t do anything until you ask it to generate a token. At least, that’s my intuition. I’m quite sceptical of how pre-training a transformer is going to lead to creating a mesa-optimiser.
But it won’t do anything until you ask it to generate a token. At least, that’s my intuition.
I think this seems like mostly a fallacy. (I feel like there should be a post explaining this somewhere.)
Here is an alternative version of what you said to indicate why I don’t think this is a very interesting claim:
Sure, you can have a very smart quadriplegic who is very knowledgeable. But they won’t do anything until you let them control some actuator.
If your view is that “prediction won’t result in intelligence”, fair enough, though it’s notable that the human brain seems to heavily utilize prediction objectives.
(folding in replies to different sub-comments here)
Sure, you can have a very smart quadriplegic who is very knowledgeable. But they won’t do anything until you let them control some actuator.
I think our misunderstanding here is caused by the word ‘do’. Sure, Stephen Hawking couldn’t control his limbs, but nevertheless his mind was always working. He kept writing books and papers throughout his life, and his brain was ‘always on’. A transformer model is a set of frozen weights that are only ‘on’ when a prompt is entered. That’s what I mean by ‘it won’t do anything’.
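To spell out the ‘frozen weights’ point with a toy example: inference is an external loop that repeatedly applies a pure next-token function, and between calls nothing is running. A schematic sketch, where the toy distribution stands in for a real transformer forward pass:

```python
import random

# Toy stand-in for a transformer forward pass: a pure function from a
# token sequence to a distribution over next tokens. The weights are
# inert data until something calls this.
def next_token_distribution(tokens: list[str]) -> dict[str, float]:
    return {"the": 0.5, "cat": 0.3, "<eos>": 0.2}

# All the "doing" lives in this external sampling loop: no loop,
# no activity, however large the model is.
def generate(prompt: list[str], max_steps: int = 10) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_steps):
        dist = next_token_distribution(tokens)
        token = random.choices(list(dist), weights=list(dist.values()))[0]
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens

print(generate(["hello"]))
```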
As far as this project, it seems extremely implausible to me that the hard part was the scaffolding work I did.
Hmm, maybe we’re differing on what ‘hard work’ means here! Could be a difference between what’s expensive, time-consuming, etc. I’m not sure this holds for any reasonable scheme, and I definitely think that you deserve a lot of credit for the work you’ve done, much more than GPT4o.
Congrats! I saw that result and am impressed! It’s clearly SOTA on the ARC-AGI-PUB leaderboard, but the original ’34%->50% in 6 days ARC-AGI breakthrough’ claim is still incorrect.
I can buy that GPT4o would be best, but perhaps other LLMs might reach ‘ok’ scores on ARC-AGI if directly swapped in? I’m not sure what you mean by ‘careful optimization’ here though.
I think using much worse LLMs like GPT-2 or GPT-3 would virtually eliminate performance.
This is very clear, as these LLMs basically can’t code at all.
If you instead consider LLMs which are only somewhat less powerful, like llama-3-70b (which is perhaps 10x less effective compute?), the reduction in performance will be smaller.
Perhaps in this case ARC-AGI is best used as part of a suite of benchmarks, where the same model and scaffolding should be used for each?
Yes, it seems reasonable to try out general-purpose scaffolds (like what METR does) and include ARC-AGI in general-purpose task benchmarks.
I expect substantial performance reductions from general-purpose scaffolding, though some fraction will be due to not having prefix compute and allocating test-time compute less effectively.
For this project? In general? As far as this project, it seems extremely implausible to me that the hard part was the scaffolding work I did. This probably holds for any reasonable scheme for dividing credit and determining what is difficult.
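For concreteness, the kind of scaffold being debated in this thread is often a generate-and-filter loop: ask the model for candidate transformation programs, keep the ones that reproduce every training example, and run a survivor on the test input. A rough sketch of that pattern, with ask_llm_for_program as a hypothetical stand-in for a real API call (an illustration of the general shape, not Ryan’s actual implementation):

```python
# Rough sketch of a generate-and-filter scaffold for ARC-style tasks.
# ask_llm_for_program is a hypothetical stand-in for a real LLM call
# that returns Python source defining transform(grid) -> grid.
def ask_llm_for_program(task_prompt: str) -> str:
    raise NotImplementedError("stand-in for a real LLM API call")

def solve(task_prompt, train_pairs, test_input, n_samples=128):
    for _ in range(n_samples):
        source = ask_llm_for_program(task_prompt)
        namespace = {}
        try:
            # Each candidate program is expected to define transform(grid).
            exec(source, namespace)
            transform = namespace["transform"]
            # Keep only candidates that reproduce every training example.
            if all(transform(x) == y for x, y in train_pairs):
                return transform(test_input)
        except Exception:
            continue  # discard candidates that crash or fail
    return None  # no sampled program fit the training examples
```

Note how little of the task-specific reasoning lives in the outer loop itself; the disagreement above is essentially over how much credit that loop deserves relative to the model that fills in transform.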