Thanks for sharing this Phil, it's very unfortunate it came out just as I went on holiday! To all readers, this will probably be the major substantive response I make in these comments, and to get the most out of it you'll probably need some technical/background understanding of how AI systems work. I'll tag @Ryan Greenblatt directly so he can see my points, but only the first is really directed at him; the rest respond to the ideas and interpretations.
First, to Ryan directly, this is really great work! Like, awesome job. My only sadness here is that I get the impression you think this work is kind of a dead-end? On the contrary, I think this is the kind of research programme that could actually lead to updates (either way) across the different factions on AI progress and AI risk. You get mentioned positively on the xrisk-hostile Machine Learning Street Talk podcast about this! Melanie Mitchell is paying attention (and even appeared in your Substack comments)! I feel like the iron is hot here and it's a promising and exciting vein of research![1]
Second, as others have pointed out, the claimed numbers are not SOTA, but that is because there are different training sets, and I think the ARC-AGI team should be clearer about that. But to be clear for all readers, this is what's happened:
Ryan got a model to achieve 50% accuracy on the public evaluation set provided by Chollet in the original repo. Ryan does not have a score on the private test set, because those answers are kept private on Kaggle to prevent data leakage. Note that Ryan's original claims were based on the different sets being IID and of the same difficulty, which is not true. We should expect performance to be lower on the private set.
The current SOTA on the private test set was Cole, Osman, and Hodel with 34%, though apparently they have now reached 39% on the private set. Ryan has noted this, so I assume we'll have clarifications/corrections to that bit of his piece soon.
Therefore Ryan has not achieved SOTA performance on ARC. That doesn't mean his work isn't impressive, but it is not true that GPT4o improved the ARC SOTA by 16% in 6 days.
Also note from the comments on Substack: when limited to ~128 sample programs per case, the results were 26% on the held-out test of the training set. It's good, but not state of the art, and one wonders whether the juice is worth the squeeze there, especially if Jianghong Ying's calculations of the tokens-per-case are accurate. We seem to need exponentially more samples to improve results.
Currently, as Ryan notes, his solution is ineligible for the ARC prize as it doesn't meet the various restrictions on runtime/compute/internet connection to enter. While the organisers say that this is meant to encourage efficiency,[2] I suspect it may be more of a security-conscious decision to limit people's access to the private test set. It is worth noting that, as the public training and eval sets are on GitHub (as will be most blog pieces about them, and eventually Ryan's own piece as well as my own), dataset contamination remains an issue to be concerned with.[3]
Third, and most importantly, I think Ryan's solution shows that the intelligence is coming from him, and not from GPT4o. skybrian makes this point in the Substack comments. For example:
Ryan came up with the idea and implementation of using ASCII encoding, since the vision capabilities of GPT4o were so unreliable. Ryan did some feature extraction on the ARC problems.
Ryan wrote the prompts and did the prompt engineering in lieu of there being fine-tuning available. He also provides the step-by-step reasoning in his prompts. Those long, carefully crafted prompts seem quite domain/problem-specific, and would probably point more toward ARC's insufficiency as a test for generality than toward an example of general ability in LLMs.
Ryan notes that the additional approaches and tweaks are critical for the performance gain above 'just draw more samples'. I think that meme was a bit unkind, not to mention inaccurate, and I kinda wish it was removed from the piece tbh.
If you check the repo (linked above), it's full of some really cool code to make this solution work, but that's the secret sauce. To my eyes, I think the hard part here was the scaffolding done by Ryan rather than the pre-training[4] of the LLM (this is another cruxy point I highlighted in my article). I think it's much less conceptually hard to scrape the entire internet and shove it through a transformer architecture. A lot of legwork and cost, sure, but the hard part is the ideas bit, and that's still basically all Ryan-GPT.
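To give readers a concrete picture of what this scaffolding involves, here is a rough sketch of the sample-and-select loop. This is my own illustrative reconstruction, not code from Ryan's repo: `ask_gpt4o` is a hypothetical callable standing in for an OpenAI API call, and the grid rendering is just one plausible text encoding (his actual pipeline also does revision steps, feature extraction, etc.).

```python
def render_grid(grid):
    """Render an ARC grid (a list of rows of ints 0-9) as one line of text per row."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def build_prompt(task):
    """Show each demonstration pair as text, then ask for a Python transform function."""
    parts = ["Each example below maps an input grid to an output grid."]
    for pair in task["train"]:
        parts.append("INPUT:\n" + render_grid(pair["input"]))
        parts.append("OUTPUT:\n" + render_grid(pair["output"]))
    parts.append("Reply with only Python code defining transform(grid) "
                 "that implements the rule, with no explanation.")
    return "\n\n".join(parts)

def solve(task, ask_gpt4o, n_samples=128):
    """Sample many candidate programs; keep one that reproduces every training pair."""
    prompt = build_prompt(task)
    for _ in range(n_samples):
        code = ask_gpt4o(prompt)   # one sampled completion per call
        namespace = {}
        try:
            exec(code, namespace)  # demo only; sandbox untrusted code in practice
            transform = namespace["transform"]
            if all(transform(p["input"]) == p["output"] for p in task["train"]):
                return [transform(t["input"]) for t in task["test"]]
        except Exception:
            continue               # most sampled programs fail; that is the point of sampling
    return None
```

The point being: deciding on the text encoding, the prompt, the execution-and-selection loop, and all the tweaks around it is exactly the part Ryan supplied.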
Fourth, I got massively nerdsniped by what 'in-context learning' actually is. A lot of talk about it from a quick search seemed to be vague, wishy-washy, and highly anthropomorphising. I'm quite confused, given the fact that all of the weights in the transformer are frozen after training and RLHF, why it's called learning at all. The model certainly isn't learning anything. After you ask GPT4o a query you can boot up a new instance and it'll be as clueless as when you started the first session, or you could just flood the context window with enough useless tokens so that the original task gets cut off. So, if I accept Ryan's framing of the inconsistent triad, I'd reject the 3rd one, and say that 'Current LLMs never "learn" at runtime (e.g. the in-context learning they can do isn't real learning)'. I'm going to continue following the 'in-context learning' nerdsnipe, but yeah, since we know that the weights are completely fixed and the model isn't learning, what is doing it? And can we think of a better name for it than 'in-context learning'?
Fifth and finally, I'm slightly disappointed at Buck and Dwarkesh for kinda posing this as a 'mic drop' against ARC.[5] Similarly, Zvi seems to dismiss it, though he praises Chollet for making a stand with a benchmark. In contrast, I think that the ability (or not) of models to reason robustly, out-of-distribution, without the ability to learn from trillions of pre-labelled samples, is a pretty big crux for AI Safety's importance. Sure, maybe in a few months we'll see the top score on the ARC Challenge above 85%, but could such a model work in the real world? Is it actually a general intelligence capable of novel or dangerous acts, such as to motivate AI risk? This is what Chollet is talking about in the podcast when he says:
I'm pretty skeptical that we're going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved. If you just train the model on millions or billions of puzzles similar to ARC, you're relying on the ability to have some overlap between the tasks that you train on and the tasks that you're going to see at test time. You're still using memorization.
If you're reading, thanks for making it through this comment! I'd recommend reading Ryan's full post first (which Philb linked above), but there's been a bunch of disparate discussion there, on LessWrong, on HackerNews, etc. If you want to pursue what the LLM-reasoning-sceptics think, I'd recommend following/reading Melanie Mitchell and Subbarao Kambhampati. Finally, if you think this topic/problem is worth collaborating on then feel free to reach out to me. I'd love to hear from anyone who thinks it's worth investigating and would want to pool resources.
(Ofc your time is valuable and you should pursue what you think is valuable; I'd just hope this could be the start of a cross-factional, positive-sum research programme, which would be such a breath of fresh air compared to other AI discourse atm)
Ryan estimates he used 1000x more runtime compute per problem than Cole et al., and also spent $40,000 in API costs alone (I wonder how much it costs for just one run, though?).
In the original interview, Mike mentions that 'there is an asterisk on any score that's reported on against the public test set' for this very reason.
H/t to @Max Nadeau for being on top of some of the clarifications on Twitter.
Perhaps I'm misinterpreting, and I am using them as a proxy for the response of AI Safety as a whole, but it's very much the 'vibe' I got from those reactions.
It sounds like you agree with my claims that ARC-AGI isn't that likely to track progress and that other benchmarks could work better?
(The rest of your response seemed to imply something different.)
At the moment I think ARC-AGI does a good job of showing the limitations of transformer models on simple tasks that they don't come across in their training set. I think if the score were claimed, we'd want to see how it came about. It might be through frontier models demonstrating true understanding, but it might be through shortcut learning/data leakage/an impressive but overly specific and intuitively unsatisfying solution.
If ARC-AGI were to be broken (within the constraints Chollet and Knoop place on it) I'd definitely change my opinions, but what they'd change to depends on how ARC-AGI was solved. That's all I'm trying to say in that section (perhaps poorly).
the claimed numbers are not SOTA, but that is because there are different training sets and I think the ARC-AGI team should be clearer about that
Agreed, though it is possible that my approach is/was SOTA on the private set. (E.g., because Jack Cole et al.'s approach is somewhat more overfit.)
I'm waiting on the private leaderboard results and then I'll revise.
My only sadness here is that I get the impression you think this work is kind of a dead-end?
I don't think it is a dead end.
As I say in the post:
ARC-AGI probably isn't a good benchmark for evaluating progress towards TAI: substantial 'elicitation' effort could massively improve performance on ARC-AGI in a way that might not transfer to more important and realistic tasks.
But, I still think that work like ARC-AGI can be good on the margin for getting a better understanding of current AI capabilities.
So, if I accept Ryan's framing of the inconsistent triad, I'd reject the 3rd one, and say that 'Current LLMs never "learn" at runtime (e.g. the in-context learning they can do isn't real learning)'
You have to reject one of the three. So, if you reject the third (as I do), then you think LLMs do learn at runtime.
I'm quite confused, given the fact that all of the weights in the transformer are frozen after training and RLHF, why it's called learning at all
In RLHF and training, no aspect of the GPU hardware is being updated at all; it's all frozen. So why does that count as learning? I would say that a system can (potentially!) be learning as long as there is some evolving state. In the case of transformers and in-context learning, that state is the activations.
You have to reject one of the three. So, if you reject the third (as I do), then you think LLMs do learn at runtime.
Ah sorry, I misread the trilemma, my bad! I think I'd still hold the 3rd to be true (Current LLMs never 'learn' at runtime), though I'm open to changing my mind on that after looking at further research. I guess I could see ways to reject 1 (e.g. if I copied the answers and just used a lookup table I'd get 100%, but I don't think there's any learning there, so it's certainly feasible for this to be false, though agreed it doesn't feel satisfying), or 2 (maybe Chollet would say selection-from-memorised-templates doesn't count as learning; also agreed, unsatisfying). It's a good challenge!
In RLHF and training, no aspect of the GPU hardware is being updated at all; it's all frozen. So why does that count as learning?
I'm not really referring to hardware here. In pre-training and RLHF the model weights are being changed and updated, and that's where the 'learning' (if we want to call it that) comes in: the model is 'learning' to store/generate information through some combination of accurately predicting the next token in its training data and satisfying the RL model created from human reward labelling. Which is my issue with calling ICL 'learning': since the model weights are fixed, the model isn't learning anything. Similarly, all the activation functions between the layers do not change either. It also doesn't make intuitive sense to me to call the outputs of layers 'learning' - the activations are 'just matmul', which I know is reductionist, but they aren't a thing that acquires a new state in my mind.
But again, this is something I want to do a deep dive into myself, so I accept that my thoughts on ICL might not be very clear
I'm not really referring to hardware here. In pre-training and RLHF the model weights are being changed and updated
Sure, I was just using this as an example. I should have made this clearer.
Here is a version of the exact same paragraph you wrote, but for activations and in-context learning:
in pre-training and RLHF the model activations are being changed and updated by each layer, and that's where the 'in-context learning' (if we want to call it that) comes in: the activations are being updated/optimized to better predict the next token and understand the text. The layers learned to in-context learn (update the activations) across a wide variety of data in pretraining.
(We can show transformers learning to do optimization in [very toy cases](https://www.lesswrong.com/posts/HHSuvG2hqAnGT5Wzp/no-convincing-evidence-for-gradient-descent-in-activation#Transformers_Learn_in_Context_by_Gradient_Descent__van_Oswald_et_al__2022_).)
Fair enough if you want to say 'the model isn't learning, the activations are learning', but then you should also say 'short-term (<1 minute) learning in humans isn't the brain learning, it is the transient neural state learning'.
I'll have to dive into the technical details here I think, but the mystery of in-context learning has certainly shot up my reading list, and I really appreciate that link btw! It seems Blaine has some of the same a-priori scepticism that I do towards it, but the right way for me to proceed is to dive into the empirical side and see if my ideas hold water there.
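For readers who want to see the 'frozen weights, evolving activations' point concretely, here is a toy illustration. It is my own sketch, nothing from Ryan's repo, and it assumes the HuggingFace transformers library and GPT-2: no parameter is updated anywhere below, yet the model's next-token prediction shifts as in-context examples pile up in the prompt.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # inference only: no weight is updated anywhere below

def p_next(prompt, target=" 1"):
    """Probability the frozen model assigns to `target` as the next token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    return probs[tok.encode(target)[0]].item()

# Same parameters every time; only the context (and hence the activations) differs.
print(p_next("A:"))                          # no examples of the A -> 1 pattern
print(p_next("A: 1\nB: 2\nA:"))              # one example
print(p_next("A: 1\nB: 2\nA: 1\nB: 2\nA:"))  # two examples; probability of " 1" typically rises
```

Whether you want to call that shift 'learning' is exactly the disagreement above, but the evolving state Ryan points to is real.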
Third, and most importantly, I think Ryan's solution shows that the intelligence is coming from him, and not from GPT4o. skybrian makes this point in the Substack comments.
[...]
To my eyes, I think the hard part here was the scaffolding done by Ryan rather than the pre-training[4] of the LLM (this is another cruxy point I highlighted in my article).
Certainly some credit goes to me and some to GPT4o.
The solution would be much worse without careful optimization and wouldn't work at all without GPT4o (or another LLM with similar performance).
It's worth noting a high fraction of my time went into writing prompts and optimizing the representation. (Which is perhaps better described as teaching GPT4o and making it easier for it to see the problem.)
There are different analogies here which might be illuminating:
Suppose that you strand a child out in the woods and never teach them anything. I expect they would be much worse at programming. So, some credit for their abilities goes to society and some to their brain.
If you remove my ability to see (or conversely, use fancy tools to make it easier for a blind person to see), this would greatly affect my ability to do ARC-AGI puzzles.
You can build systems around people which remove most of the interesting intelligence from various tasks.
I think what is going on here is analogous to all of these.
I think it's much less conceptually hard to scrape the entire internet and shove it through a transformer architecture. A lot of legwork and cost, sure, but the hard part is the ideas bit,
Quoting from a substack comment I wrote in response:
It is worth noting that hundreds (thousands?) of high-quality researcher years have been put into making GPT4o more performant.
The solution would be much worse without careful optimization and wouldn't work at all without GPT4o (or another LLM with similar performance).
I can buy that GPT4o would be best, but perhaps other LLMs might reach 'ok' scores on ARC-AGI if directly swapped out? I'm not sure what you're referring to by 'careful optimization' here, though.
There are different analogies here which might be illuminating:
Suppose that you strand a child out in the woods and never teach them anything. I expect they would be much worse at programming. So, some credit for their abilities goes to society and some to their brain.
If you remove my ability to see (or conversely, use fancy tools to make it easier for a blind person to see), this would greatly affect my ability to do ARC-AGI puzzles.
You can build systems around people which remove most of the interesting intelligence from various tasks.
I think what is going on here is analogous to all of these.
On these analogies:
This is an interesting point actually. I suppose credit-assignment for learning is a very difficult problem. In this case though, the stranded child would (hopefully!) survive and make a life for themselves and learn the skills they need to survive. They're active agents using their innate general intelligence to solve novel problems (per Chollet). If I put a hard-drive with GPT4o's weights in the forest, it'll just rust. And that'll happen no matter how big we make that model/hard-drive imo.[1]
Agreed here, it will be very interesting to see how improved multimodality affects ARC-AGI scores. I think that we have interesting cases of humans being able to perform these tasks in their head, presumably without sight? e.g. blind chess players with high ratings, or mathematicians who can reason without sight. I think Chollet's point in the interview is that the models seem to be able to parse the JSON inputs fine in various cases, but still can't perform generalisation.
Yep I think this is true, and it's perhaps my greatest fear from delegating power to complex AI systems. This is an empirical question we'll have to find out: can we simply automate away everything humans do/are needed for through a combination of systems, even if each individual part/model used in said system is not intelligent?
Yep, saw Max's comments and think he did a great job on X bringing some clarifications. I still think the hard part is the scaffolding. Money is easy for San Fran VCs to provide, and we know they're all fine to scrape-data-first-ask-legal-forgiveness-later.
I think there's a separate point where enough scaffolding + LLM means the resulting AI system is not well described as an LLM anymore. Take the case of CICERO by Meta. Is that a 'scaffolded LLM'? I'd rather describe it as a system which incorporates an LLM as a particular part. It's harder to naturally scale such a system in the way that you can with the transformer architecture, by stacking more layers or pre-training for longer on more data.
My intuition here is that scaffolding to make a system work well on ARC-AGI would make it less usable on other tasks, so sacrificing generality for specific performance. Perhaps in this case ARC-AGI is best used as a suite of benchmarks, where the same model and scaffolding should be used for each? (Just thinking out loud here)
Final point: I've really appreciated your original work and your comments on Substack/X/here. I do apologise if I didn't make clear which parts were my personal reflections/vibes instead of more technical disagreements on interpretation; these are very complex topics (at least for me) and I'm trying my best to form a good explanation of the various evidence and data we have on this. Regardless of our disagreements on this topic, I've learned a lot :)
Similarly, you can pre-train a model to create weights and get to a humongous size. But it won't do anything until you ask it to generate a token. At least, that's my intuition. I'm quite sceptical of how pre-training a transformer is going to lead to creating a mesa-optimiser.
But it won't do anything until you ask it to generate a token. At least, that's my intuition.
I think this seems like mostly a fallacy. (I feel like there should be a post explaining this somewhere.)
Here is an alternative version of what you said to indicate why I don't think this is a very interesting claim:
Sure, you can have a very smart quadriplegic who is very knowledgeable. But they won't do anything until you let them control some actuator.
If your view is that 'prediction won't result in intelligence', fair enough, though it's notable that the human brain seems to heavily utilize prediction objectives.
(folding in replies to different sub-comments here)
Sure, you can have a very smart quadriplegic who is very knowledgeable. But they won't do anything until you let them control some actuator.
I think our misunderstanding here is caused by the word 'do'. Sure, Stephen Hawking couldn't control his limbs, but nevertheless his mind was always working. He kept writing books and papers throughout his life, and his brain was 'always on'. A transformer model is a set of frozen weights that are only 'on' when a prompt is entered. That's what I mean by 'it won't do anything'.
As far as this project goes, it seems extremely implausible to me that the hard part of this project is the scaffolding work I did.
Hmm, maybe we're differing on what 'hard work' means here! Could be a difference between what's expensive, time-consuming, etc. I'm not sure this holds for any reasonable scheme, and I definitely think that you deserve a lot of credit for the work you've done, much more than GPT4o.
Congrats! I saw that result and am impressed! It's definitely clearly SOTA on the ARC-AGI-PUB leaderboard, but the original '34%->50% in 6 days ARC-AGI breakthrough' claim is still incorrect.
I can buy that GPT4o would be best, but perhaps other LLMs might reach 'ok' scores on ARC-AGI if directly swapped out? I'm not sure what you're referring to by 'careful optimization' here, though.
I think much worse LLMs like GPT-2 or GPT-3 would virtually eliminate performance.
This is very clear as these LLMs can't code basically at all.
If you instead consider LLMs which are only somewhat less powerful like llama-3-70b (which is perhaps 10x less effective compute?), the reduction in perf will be smaller.
Perhaps in this case ARC-AGI is best used as a suite of benchmarks, where the same model and scaffolding should be used for each?
Yes, it seems reasonable to try out general purpose scaffolds (like what METR does) and include ARC-AGI in general purpose task benchmarks.
I expect substantial performance reductions from general-purpose scaffolding, though some fraction of that will be due to not having prefix compute and allocating test-time compute less effectively.
For this project? In general?
As far as this project goes, it seems extremely implausible to me that the hard part of this project is the scaffolding work I did. This probably holds for any reasonable scheme for dividing credit and determining what is difficult.
I don't think the objection is to ARC (the benchmark); I think the objection is to specific (very strong!) claims that Chollet makes.
I think the benchmark is a useful contribution, as I note in another comment.
Oh yeah, this wasn't against you at all! I think you're a great researcher and an excellent interlocutor, and I learned a lot (and am learning a lot) from both your work and your reactions to my reaction.[1] Point five was very much a reaction against a 'vibe' I saw in the wake of your results being published.
Like, let's take Buck's tweet for example. We know now that a) your results aren't technically SOTA and b) it's not an LLM solution, it's an LLM + your scaffolding + program search, and I think that's importantly not the same thing.
I sincerely hope my post + comments have been somewhat more stimulating than frustrating for you.
I think my results are probably SOTA based on more recent updates.
It's not an LLM solution, it's an LLM + your scaffolding + program search, and I think that's importantly not the same thing.
I feel like this is a pretty strange way to draw the line about what counts as an 'LLM solution'.
Consider the following simplified dialogue as an example of why I don't think this is a natural place to draw the line:
Human skeptic: Humans don't exhibit real intelligence. You see, they'll never do something as impressive as sending a human to the moon.
Humans-have-some-intelligence advocate: Didn't humans go to the moon in 1969?
Human skeptic: That wasn't humans sending someone to the moon, that was Humans + Culture + Organizations + Science sending someone to the moon! You see, humans don't exhibit real intelligence!
Humans-have-some-intelligence advocate: ... Ok, but do you agree that if we removed the Humans from the overall approach it wouldn't work?
Human skeptic: Yes, but same with the culture and organization!
Humans-have-some-intelligence advocate: Sure, I guess. I'm happy to just call it humans+etc, I guess. Do you have any predictions for specific technical feats which are possible to do with a reasonable amount of intelligence that you're confident can't be accomplished by building some relatively straightforward organization on top of a bunch of smart humans within the next 15 years?
Human skeptic: No.
Of course, I think actual LLM skeptics often don't answer 'No' to the last question. They often do have something that they think is unlikely to occur with a relatively straightforward scaffold on top of an LLM (a model descended from the current LLM paradigm, perhaps trained with semi-supervised learning and RLHF).
I actually don't know what in particular Chollet thinks is unlikely here. E.g., I don't know if he has strong views about the performance of my method, but with the SOTA multimodal model in 2 years' time swapped in.
Final final edit: Congrats on the ARC-AGI-PUB results, really impressive :)
This will be my final response on this thread, because life is very time-consuming and I'm rapidly reaching the point where I need to dive back into the technical literature and stress-test my beliefs and intuitions again. I hope Ryan and any readers have found this exchange useful/enlightening for seeing two different perspectives hopefully having a productive disagreement.
If you found my presentation of the scaling-skeptical position highly unconvincing, I'd recommend following the work and thoughts of Tan Zhi Xuan (find her on X here). One of my biggest updates was finding her work after she pushed back on Jacob Steinhardt here, and recently she gave a talk about her approach to Alignment. I urge readers to consider spending much more of their time listening to her than to me about AI.
I feel like this is a pretty strange way to draw the line about what counts as an 'LLM solution'.
I don't think so? Again, I wouldn't call CICERO an 'LLM solution'. Surely there'll be some amount of scaffolding which tips over into the scaffolding being the main thing and the LLM just being a component part? It's probably all blurry lines for sure, but I think it's important to separate 'LLM-only systems' from 'systems that include LLMs', because it's very easy to conceptually scale up the former but harder to do the latter.
Human skeptic: That wasn't humans sending someone to the moon, that was Humans + Culture + Organizations + Science sending someone to the moon! You see, humans don't exhibit real intelligence!
I mean, you use this as a reductio, but that's basically the theory of Distributed Cognition, and it's also linked to the ideas of 'collective intelligence', though that's definitely not an area I'm an expert in by any means. It also reminds me a lot of Chalmers and Clark's thesis of the Extended Mind.[1]
Of course, I think actual LLM skeptics often don't answer 'No' to the last question. They often do have something that they think is unlikely to occur with a relatively straightforward scaffold on top of an LLM (a model descended from the current LLM paradigm, perhaps trained with semi-supervised learning and RLHF).
So I can't speak for Chollet and other LLM skeptics, and I think again LLMs+extras (or extras+LLMs) are a different beast from LLMs on their own, and possibly an important crux. Here are some things I don't think will happen in the near-ish future (on the current paradigm):
I believe an adversarial Imitation Game, where the interrogator is aware of both the AI system's LLM-based nature and its failure modes, is unlikely to be consistently beaten in the near future.[2]
Primarily-LLM models, in my view, are highly unlikely to exhibit autopoietic behaviour or develop agentic designs independently (i.e. without prompting/direction by a human controller).
I don't anticipate these models exponentially increasing the rate of scientific research or AI development.[3] They'll more likely serve as tools used by scientists and researchers themselves to frame problems, but new and novel problems will still remain difficult and be bottlenecked by the real world + Hofstadter's law.
I don't anticipate primarily-LLM models becoming good at controlling and manoeuvring robotic bodies in the 3D world. This is especially true in a novel-test-case scenario (if someone could make a physical equivalent of ARC to test this, that'd be great).
This would be even less likely if the scaffolding remained minimal. For instance, if there's no initial sorting code explicitly stating [IF challenge == turing_test GO TO turing_test_game_module].
Finally, as an anti-RSI operationalisation, the idea of LLM-based models assisting in designing and constructing a Dyson Sphere within 15 years seems... particularly far-fetched to me.
I'm not sure if this reply was my best, it felt a little all-over-the-place, but we are touching on some deep and complex topics! So I'll respectfully bow out now, and thanks again for the discussion and for giving me so much to think about. I really appreciate it Ryan :)
Then you get into ideas like embodiment/enactivism etc.
I can think of a bunch of strategies to win here, but I'm not gonna say, so it doesn't end up in GPT-5 or 6's training data!
Of course, with a new breakthrough, all bets could be off, but it's also definitionally impossible to predict those, and it's not robust to draw straight lines on graphs to predict the future if you think breakthroughs will be needed. (Not saying you do this, but some other AIXR people definitely seem to.)
I have thoughts, but a question first: you link a Kambhampati tweet where he says,
...as the context window changes (with additional prompt words), the LLM, by design, switches the CPT used to generate next token - given that all these CPTs have been pre-computed?
What does 'CPT' stand for here? It's not a common ML or computer science acronym that I've been able to find.
Since nobody else has responded, my best guess would be 'conditional probability table'.
I think Ryan's solution shows that the intelligence is coming from him, and not from GPT4o.
If this is true, then substituting in a less capable model should have equally good results; would you predict that to be the case? I claim that plugging in an older/smaller model would produce much worse results, and if that's the case then we should consider a substantial part of the performance to be coming from the model.
This is what Chollet is talking about in the podcast when he says... 'I'm pretty skeptical that we're going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved.'
This seems to me to be Chollet trying to have it both ways. Either a) ARC is an important measure of 'true' intelligence (or at least of the ability to reason over novel problems), and so we should consider LLMs' poor performance on it a sign that they're not general intelligences, or b) ARC isn't a very good measure of true intelligence, in which case LLMs' performance on it isn't very important. Those can't be simultaneously true. I think that nearly everywhere but in the quote, Chollet has claimed (and continues to claim) that a) is true.
I'm quite confused, given the fact that all of the weights in the transformer are frozen after training and RLHF, why it's called learning at all. The model certainly isn't learning anything.
I would frame it as: the model is learning but then forgetting what it's learned (due to its inability to move anything from working/short-term memory to long-term memory). That's something that we see in learning in humans as well (one example: I've learned an enormous number of six-digit confirmation codes, each of which I remember just long enough to enter it into the website that's asking for it), although of course not so consistently.