This other Ryan Greenblatt is my old account[1]. Here is my LW account.
[1] Account lost to the mists of time and expired university email addresses.
You might be interested in discussion here.
We know now that a) your results aren’t technically SOTA
I think my results are probably SOTA based on more recent updates.
It’s not an LLM solution, it’s an LLM + your scaffolding + program search, and I think that’s importantly not the same thing.
I feel like this is a pretty strange way to draw the line about what counts as an “LLM solution”.
Consider the following simplified dialogue as an example of why I don’t think this is a natural place to draw the line:
Human skeptic: Humans don’t exhibit real intelligence. You see, they’ll never do something as impressive as sending a human to the moon.
Humans-have-some-intelligence advocate: Didn’t humans go to the moon in 1969?
Human skeptic: That wasn’t humans sending someone to the moon; that was Humans + Culture + Organizations + Science sending someone to the moon! You see, humans don’t exhibit real intelligence!
Humans-have-some-intelligence advocate: … Ok, but do you agree that if we removed the Humans from the overall approach, it wouldn’t work?
Human skeptic: Yes, but same with the culture and organization!
Humans-have-some-intelligence advocate: Sure, I guess. I’m happy to just call it humans+etc. Do you have any predictions for specific technical feats which are possible to do with a reasonable amount of intelligence that you’re confident can’t be accomplished by building some relatively straightforward organization on top of a bunch of smart humans within the next 15 years?
Human skeptic: No.
Of course, I think actual LLM skeptics often don’t answer “No” to the last question. They often do have something that they think is unlikely to occur with a relatively straightforward scaffold on top of an LLM (a model descended from the current LLM paradigm, perhaps trained with semi-supervised learning and RLHF).
I actually don’t know what in particular Chollet thinks is unlikely here. E.g., I don’t know if he has strong views about the performance of my method when run with whatever the SOTA multimodal model is in 2 years.
Tom Davidson’s model is often referred to in the Community, but it is entirely reliant on the current paradigm + scale reaching AGI.
This seems wrong.
It does use constants from the history of the deep learning field to provide guesses for parameters, and it assumes that compute is an important driver of AI progress.
These are much weaker assumptions than you seem to be implying.
Note also that this work is based on earlier work like bio anchors which was done just as the current paradigm and scaling were being established. (It was published in the same year as Kaplan et al.)
But it won’t do anything until you ask it to generate a token. At least, that’s my intuition.
I think this seems like mostly a fallacy. (I feel like there should be a post explaining this somewhere.)
Here is an alternative version of what you said to indicate why I don’t think this is a very interesting claim:
Sure, you can have a very smart quadriplegic who is very knowledgeable. But they won’t do anything until you let them control some actuator.
If your view is that “prediction won’t result in intelligence”, fair enough, though it’s notable that the human brain seems to heavily utilize prediction objectives.
I can buy that GPT4o would be best, but perhaps other LLMs might reach ‘ok’ scores on ARC-AGI if directly swapped out? I’m not sure what you’re referring to by ‘careful optimization’ here though.
I think much worse LLMs like GPT-2 or GPT-3 would virtually eliminate performance.
This is very clear, as these LLMs basically can’t code at all.
If you instead consider LLMs which are only somewhat less powerful like llama-3-70b (which is perhaps 10x less effective compute?), the reduction in perf will be smaller.
It is also highly variable to what we mean by AGI though.
I’m happy to do timelines to the singularity and operationalize this with “we have the technological capacity to pretty easily build projects as impressive as a Dyson sphere”.
(Or 1000x electricity production, or whatever.)
In my view, this likely adds only a moderate number of years (3-20, depending on how various details go).
I think there are signal vs. noise tradeoffs, so I’m naively tempted to retreat toward more exclusivity.
This has costs of its own, so maybe I’d be in favor of differentiation (some more exclusive and some less exclusive versions).
Low confidence in this being good overall.
I’m not really referring to hardware here, in pre-training and RLHF the model weights are being changed and updated
Sure, I was just using this as an example. I should have made this more clear.
Here is a version of the exact same paragraph you wrote, but for activations and in-context learning:
in pre-training and RLHF the model activations are being changed and updated by each layer, and that’s where the ‘in-context learning’ (if we want to call it that) comes in—the activations are being updated/optimized to better predict the next token and understand the text. The layers learned to in-context learn (update the activations) across a wide variety of data in pretraining.
(We can show transformers learning to optimize in [very toy cases](https://www.lesswrong.com/posts/HHSuvG2hqAnGT5Wzp/no-convincing-evidence-for-gradient-descent-in-activation#Transformers_Learn_in_Context_by_Gradient_Descent__van_Oswald_et_al__2022_).)
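For intuition, here is a minimal numpy sketch in the spirit of those toy results (the setup, names, and constants are illustrative, not taken from the linked post): with completely frozen parameters, a linear-attention readout over in-context (x, y) pairs produces exactly the same prediction as one explicit gradient-descent step on those pairs, starting from a zero weight vector.

```python
# Toy illustration (assumed setup, not the linked construction verbatim):
# one gradient-descent step on in-context linear-regression data vs. a
# frozen-weight linear-attention readout over the same context.
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 8                 # number of in-context examples, input dimension
lr = 0.1                     # gradient-descent step size

X = rng.normal(size=(n, d))  # in-context inputs x_1..x_n
w_true = rng.normal(size=d)
y = X @ w_true               # in-context targets y_1..y_n
x_q = rng.normal(size=d)     # query input

# (a) One explicit gradient-descent step on L(w) = 1/(2n) * sum_i (w.x_i - y_i)^2,
#     starting from w_0 = 0, then predict on the query.
w_1 = (lr / n) * X.T @ y
pred_gd = w_1 @ x_q

# (b) The same prediction as a frozen-weight linear-attention readout: the query
#     attends to each context token with score x_q.x_i and sums the values y_i.
#     No parameter changes; only the activations computed from the context carry
#     the "learning".
scores = X @ x_q             # unnormalized attention scores
pred_attn = (lr / n) * scores @ y

assert np.allclose(pred_gd, pred_attn)
```

The point is just that the “update” lives entirely in activations computed from the context; no weight is touched.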
Fair enough if you want to say “the model isn’t learning, the activations are learning”, but then you should also say “short term (<1 minute) learning in humans isn’t the brain learning, it is the transient neural state learning”.
Perhaps in this case ARC-AGI is best used as a suite of benchmarks, where the same model and scaffolding should be used for each?
Yes, it seems reasonable to try out general purpose scaffolds (like what METR does) and include ARC-AGI in general purpose task benchmarks.
I expect substantial performance reductions from general purpose scaffolding, though some fraction of this will be due to not having prefix compute and to allocating test-time compute less effectively.
I still think the hard part is the scaffolding.
For this project? In general?
As far as this project goes, it seems extremely implausible to me that the hard part of the project is the scaffolding work I did. This probably holds for any reasonable scheme for dividing credit and determining what is difficult.
Sure, maybe in a few months we’ll see the top score on the ARC Challenge above 85%, but could such a model work in the real world?
It sounds like you agree with my claims that ARC-AGI isn’t that likely to track progress and that other benchmarks could work better?
(The rest of your response seemed to imply something different.)
Fifth and finally, I’m slightly disappointed at Buck and Dwarkesh for kinda posing this as a ‘mic drop’ against ARC.
I don’t think the objection is to ARC (the benchmark); I think the objection is to specific (very strong!) claims that Chollet makes.
I think the benchmark is a useful contribution as I note in another comment.
So, if I accept Ryan’s framing of the inconsistent triad, I’d reject the 3rd one, and say that “Current LLMs never “learn” at runtime (e.g. the in-context learning they can do isn’t real learning)”
You have to reject one of the three. So, if you reject the third (as I do), then you think LLMs do learn at runtime.
I’m quite confused, given the fact that all of the weights in the transformer are frozen after training and RLHF, why it’s called learning at all
In RLHF and training, no aspect of the GPU hardware is being updated at all; it’s all frozen. So why does that count as learning? I would say that a system can (potentially!) be learning as long as there is some evolving state. In the case of transformers and in-context learning, that state is the activations.
Third, and most importantly, I think Ryan’s solution shows that the intelligence is coming from him, and not from Chat-GPT4o. skybrian makes this point in the substack comments.
[...]
To my eyes, I think the hard part here was the scaffolding done by Ryan rather than the pre-training[4] of the LLM (this is another cruxy point I highlighted in my article).
Quoting from a substack comment I wrote in response:
Certainly some credit goes to me and some to GPT4o.
The solution would be much worse without careful optimization and wouldn’t work at all without GPT4o (or another LLM with similar performance).
It’s worth noting that a high fraction of my time went into writing prompts and optimizing the representation. (Which is perhaps better described as teaching GPT4o and making it easier for it to see the problem.)
There are different analogies here which might be illuminating:
Suppose that you strand a child out in the woods and never teach them anything. I expect they would be much worse at programming. So, some credit for their abilities goes to society and some to their brain.
If you remove my ability to see (or conversely, use fancy tools to make it easier for a blind person to see), this would greatly affect my ability to do ARC-AGI puzzles.
You can build systems around people which remove most of the interesting intelligence from various tasks.
I think what is going on here is analogous to all of these.
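To make the scaffolding side of this concrete, here is a minimal sketch of the general sample-and-filter pattern this kind of approach follows (a simplified illustration rather than the actual pipeline; `sample_candidate_programs` is a hypothetical stand-in for the prompting and representation work): ask the LLM for many candidate transformation programs, keep only those that reproduce every training example, and run a surviving candidate on the test input.

```python
# Simplified sketch of sample-and-filter program search over ARC-style grid tasks.
# `sample_candidate_programs` is a hypothetical placeholder for the LLM prompting
# and representation work, which is where most of the effort goes.
from typing import Callable, Optional

Grid = list[list[int]]
Program = Callable[[Grid], Grid]


def solves_training_examples(program: Program, train_pairs: list[tuple[Grid, Grid]]) -> bool:
    """True iff the candidate reproduces every training output."""
    try:
        return all(program(inp) == out for inp, out in train_pairs)
    except Exception:
        return False  # candidate programs are allowed to crash; just discard them


def solve_task(
    train_pairs: list[tuple[Grid, Grid]],
    test_input: Grid,
    sample_candidate_programs: Callable[[list[tuple[Grid, Grid]], int], list[Program]],
    num_samples: int = 1000,
) -> Optional[Grid]:
    # Ask the LLM for many candidate programs, then apply the first one that is
    # consistent with all of the training examples to the test input.
    for program in sample_candidate_programs(train_pairs, num_samples):
        if solves_training_examples(program, train_pairs):
            return program(test_input)
    return None
```

Nothing in the filtering loop is clever; the interesting work is in getting the model to propose good candidates in the first place, which is why the prompt and representation effort dominates.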
Separately, this tweet is relevant: https://x.com/MaxNadeau_/status/1802774696192246133
I think it’s much less conceptually hard to scrape the entire internet and shove it through a transformer architecture. A lot of leg work and cost sure, but the hard part is the ideas bit,
It is worth noting that hundreds (thousands?) of high quality researcher years have been put into making GPT4o more performant.
the claimed numbers are not SOTA, but that is because there are different training sets and I think the ARC-AGI team should be more clear about that
Agreed, though it is possible that my approach is/was SOTA on the private set. (E.g., because Jack Cole et al.’s approach is somewhat more overfit.)
I’m waiting on the private leaderboard results and then I’ll revise.
My only sadness here is that I get the impression you think this work is kind of a dead-end?
I don’t think it is a dead end.
As I say in the post:
ARC-AGI probably isn’t a good benchmark for evaluating progress towards TAI: substantial “elicitation” effort could massively improve performance on ARC-AGI in a way that might not transfer to more important and realistic tasks.
But, I still think that work like ARC-AGI can be good on the margin for getting a better understanding of current AI capabilities.
Pair this with the EA concern that we should be concerned about the counterfactual impact of our actions, and that there are opportunities to do good right here and now,[3] it shouldn’t be a primary EA concern.
As in, your crux is that the probability of AGI within the next 50 years is less than 10%?
I think from an x-risk perspective it is quite hard to beat AI risk even on pretty long timelines. (Where the main question is bio risk and what you think about (likely temporary) civilizational collapse due to nuclear war.)
It’s pretty plausible that on longer timelines technical alignment/safety work looks weak relative to other stuff focused on making AI go better.
I don’t comment or post much on the EA Forum because the quality of discourse there typically seems mediocre at best. This is especially true for x-risk.
I think this has been true for a while.
Farmed animals are also neglected relative to wild animals
Typo?
You might also be interested in discussion here.