"EA-Adjacent" now I guess.
🔸 10% Pledger.
Likes pluralist conceptions of the good.
Dislikes Bay Culture being in control of the future.
I'll have to dive into the technical details here I think, but the mystery of in-context learning has certainly shot up my reading list, and I really appreciate that link btw! It seems Blaine has some of the same a-priori scepticism that I do towards it, but the right way for me to proceed is to dive into the empirical side and see if my ideas hold water there.
From the summary page on Open Phil:
In this framework, AGI is developed by improving and scaling up approaches within the current ML paradigm, not by discovering new algorithmic paradigms.
From this presentation about it to GovAI (from April 2023) at 05:10:
So the kinda zoomed out idea behind the Compute-centric framework is that I'm assuming something like the current paradigm is going to lead to human-level AI and further, and I'm assuming that we get there by scaling up and improving the current algorithmic approaches. So it's going to look like better versions of transformers that are more efficient and that allow for larger context windows..."
Both of these seem pretty scaling-maximalist to me, so I don't think the quote seems wrong, at least to me? It'd be pretty hard to make a model which includes the possibility of the paradigm not getting us to AGI and then needing a period of exploration across the field to find the other breakthroughs needed.
The solution would be much worse without careful optimization and wouldn't work at all without gpt4o (or another llm with similar performance).
I can buy that GPT4o would be best, but perhaps other LLMs might reach "ok" scores on ARC-AGI if directly swapped out? I'm not sure what you're referring to by "careful optimization" here though.
There are different analogies here which might be illuminating:
Suppose that you strand a child out in the woods and never teach them anything. I expect they would be much worse at programming. So, some credit for their abilities goes to society and some to their brain.
If you remove my ability to see (or conversely, use fancy tools to make it easier for a blind person to see) this would greatly affect my ability to do ARC-AGI puzzles.
You can build systems around people which remove most of the interesting intelligence from various tasks.
I think what is going on here is analogous to all of these.
On these analogies:
This is an interesting point actually. I suppose credit-assignment for learning is a very difficult problem. In this case though, the stranded child would (hopefully!) survive and make a life for themselves and learn the skills they need to survive. They're active agents using their innate general intelligence to solve novel problems (per Chollet). If I put a hard-drive with GPT4o's weights in the forest, it'll just rust. And that'll happen no matter how big we make that model/hard-drive imo.[1]
Agreed here, it will be very interesting to see how improved multimodality affects ARC-AGI scores. I think that we have interesting cases of humans being able to perform these tasks in their head, presumably without sight? e.g. blind chess players with high ratings, or mathematicians who can reason without sight. I think Chollet's point in the interview is that they seem to be able to parse the JSON inputs fine in various cases, but still can't perform generalisation.
Yep I think this is true, and perhaps my greatest fear from delegating power to complex AI systems. This is an empirical question we'll have to find out: can we simply automate away everything humans do/are needed for through a combination of systems, even if each individual part/model used in said system is not intelligent?
Separately, this tweet is relevant: https://x.com/MaxNadeau_/status/1802774696192246133
Yep saw Max's comments and think he did a great job on X bringing some clarifications. I still think the hard part is the scaffolding. Money is easy for SanFran VCs to provide, and we know they're all fine to scrape-data-first-ask-legal-forgiveness-later.
I think there's a separate point where enough scaffolding + LLM means the resulting AI system is not well described as an LLM anymore. Take the case of CICERO by Meta. Is that a "scaffolded LLM"? I'd rather describe it as a system which incorporates an LLM as a particular part. It's harder to naturally scale such a system in the way that you can with the transformer architecture, by stacking more layers or pre-training for longer on more data.
My intuition here is that scaffolding to make a system work well on ARC-AGI would make it less usable on other tasks, sacrificing generality for specific performance. Perhaps in this case ARC-AGI is best used as a suite of benchmarks, where the same model and scaffolding should be used for each? (Just thinking out loud here)
Final point, I've really appreciated your original work and your comments on Substack/X/here. I do apologise if I didn't make clear which parts were my personal reflections/vibes instead of more technical disagreements on interpretation - these are very complex topics (at least for me) and I'm trying my best to form a good explanation of the various evidence and data we have on this. Regardless of our disagreements on this topic, I've learned a lot :)
Similarly, you can pre-train a model to create weights and get it to a humongous size. But it won't do anything until you ask it to generate a token. At least, that's my intuition. I'm quite sceptical of how pre-training a transformer is going to lead to creating a mesa-optimiser.
Oh yeah this wasn't against you at all! I think you're a great researcher, and an excellent interlocutor, and I learn a lot (and am learning a lot) from both your work and your reactions to my reaction.[1] Point five was very much a reaction against a "vibe" I saw in the wake of your results being published.
Like let's take Buck's tweet for example. We know now that a) your results aren't technically SOTA and b) it's not an LLM solution, it's an LLM + your scaffolding + program search, and I think that's importantly not the same thing.
I sincerely hope my post + comments have been somewhat more stimulating than frustrating for you
At the moment I think ARC-AGI does a good job at showing the limitations of transformer models on simple tasks that they don't come across in their training set. I think if the score was claimed, we'd want to see how it came about. It might be through frontier models demonstrating true understanding, but it might be through shortcut learning/data leakage/an impressive but overly specific and intuitively unsatisfying solution.
If ARC-AGI were to be broken (within the constraints Chollet and Knoop place on it) I'd definitely change my opinions, but what they'd change to depends on how ARC-AGI was solved. That's all I'm trying to say in that section (perhaps poorly).
As in, your crux is that the probability of AGI within the next 50 years is less than 10%?
I'm essentially deeply uncertain about how to answer this question, in a true "Knightian Uncertainty" sense, and I don't know how much it makes sense to use subjective probability calculus. It is also highly dependent on what we mean by AGI. I find many of the arguments I've seen to be a) deference to the subjective probabilities of others or b) extrapolation of straight lines on graphs - neither of which I find highly convincing. (I think your arguments seem stronger and more grounded fwiw)
I think from an x-risk perspective it is quite hard to beat AI risk even on pretty long timelines.
I think this can hold, but it holds not just in light of particular facts about AI progress now but in light of various strong philosophical beliefs about value, what future AI would be like, and how the future would go post the invention of said AI. You may have strong arguments for these, but I find many arguments for the overwhelming importance of AI Safety do a very poor job of grounding them, especially in light of the compelling interventions to do good that exist in the world right now.
You have to reject one of the three. So, if you reject the third (as I do), then you think LLMs do learn at runtime.
Ah sorry, I misread the trilemma, my bad! I think I'd still hold the 3rd to be true (current LLMs never "learn" at runtime), though I'm open to changing my mind on that after looking at further research. I guess I could see ways to reject 1 (e.g. if I copied the answers and just used a lookup table I'd get 100%, but I don't think there's any learning there, so it's certainly feasible for this premise to be false, though agreed it doesn't feel satisfying), or 2 (maybe Chollet would say selection-from-memorised-templates doesn't count as learning, also agreed unsatisfying). It's a good challenge!
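To make the lookup-table point concrete, here's a minimal sketch (the answers dict is entirely made up): it would score 100% on any benchmark it had memorised, yet nothing in it "learns" at runtime.

```python
# A memorised answer key: maps each task id to its output grid.
# (Contents are placeholders - the point is only that retrieval isn't learning.)
answers = {
    "task_001": [[0, 1], [1, 0]],
    "task_002": [[2, 2], [2, 2]],
}

def solve(task_id):
    # Perfect "performance" via pure retrieval: no internal state is updated
    # at runtime, so premise 1 ("high performance implies runtime learning")
    # can in principle be false.
    return answers[task_id]

print(solve("task_001"))  # [[0, 1], [1, 0]]
```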
In RLHF and training, no aspect of the GPU hardware is being updated at all, it's all frozen. So why does that count as learning?
I'm not really referring to hardware here. In pre-training and RLHF the model weights are being changed and updated, and that's where the "learning" (if we want to call it that) comes in - the model is "learning" to store/generate information through some combination of accurately predicting the next token in its training data and satisfying the RL model created from human reward labelling. Which is my issue with calling ICL "learning": since the model weights are fixed, the model isn't learning anything. Similarly, all the activation functions between the layers do not change either. It also doesn't make intuitive sense to me to call the outputs of layers "learning" - the activations are "just matmul", which I know is reductionist, but they aren't a thing that acquires a new state in my mind.
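To make concrete what I mean by the weights being fixed at inference time, here's a toy PyTorch sketch (a single linear layer standing in for a transformer, not a claim about any lab's actual training or serving setup):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for a transformer layer

# Training/RLHF: gradients flow and the parameters are updated.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = model(torch.randn(1, 4)).sum()
loss.backward()
opt.step()  # the weights change here - this is the step I'd call "learning"

# Inference ("in-context learning"): no gradients, no parameter updates.
with torch.no_grad():
    out = model(torch.randn(1, 4))
# The parameters are bit-for-bit identical before and after this call;
# whatever adaptation happens lives only in the activations for that prompt.
```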
But again, this is something I want to do a deep dive into myself, so I accept that my thoughts on ICL might not be very clear
Thanks for sharing this Phil, it's very unfortunate it came out just as I went on holiday! To all readers, this will probably be the major substantive response I make in these comments, and to get the most out of it you'll probably need some technical/background understanding of how AI systems work. I'll tag @Ryan Greenblatt directly so he can see my points, but only the first is really directed at him; the rest are responding to the ideas and interpretations.
First, to Ryan directly, this is really great work! Like, awesome job! My only sadness here is that I get the impression you think this work is kind of a dead-end? On the contrary, I think this is the kind of research programme that could actually lead to updates (either way) across the different factions on AI progress and AI risk. You get mentioned positively on the xrisk-hostile Machine Learning Street Talk about this! Melanie Mitchell is paying attention (and even appeared in your Substack comments)! I feel like the iron is hot here and it's a promising and exciting vein of research![1]
Second, as others have pointed out, the claimed numbers are not SOTA, but that is because the scores are on different test sets, and I think the ARC-AGI team should be more clear about that. But to be clear for all readers, this is what's happened:
Ryan got a model to achieve 50% accuracy on the public evaluation set provided by Chollet in the original repo. Ryan has not got a score on the private test set, because those answers are kept private on Kaggle to prevent data leakage. Note that Ryan's original claims were based on the different sets being IID and of the same difficulty, which is not true. We should expect performance to be lower on the private set.
The current SOTA on the private test set was Cole, Osman, and Hodel with 34%, though apparently they have now reached 39% on the private set. Ryan has noted this, so I assume we'll have clarifications/corrections to that bit of his piece soon.
Therefore Ryan has not achieved SOTA performance on ARC. That doesn't mean his work isn't impressive, but it is not true that GPT4o improved the ARC SOTA by 16% in 6 days.
Also note, from the comments on Substack, that when limited to ~128 sample programs per case, the results were 26% on the held-out test of the training set. It's good, but not state of the art, and one wonders whether the juice is worth the squeeze there, especially if Jianghong Ying's calculations of the tokens-per-case are accurate. We seem to need exponential data to improve results.
Currently, as Ryan notes, his solution is ineligible for the ARC prize as it doesn't meet the various restrictions on runtime/compute/internet connection to enter. While the organisers say that this is meant to encourage efficiency,[2] I suspect it may be more of a security-conscious decision to limit people's access to the private test set. It is worth noting that, as the public training and eval sets are on GitHub (as will be most blog pieces about them, eventually including Ryan's own piece as well as my own), dataset contamination remains an issue to be concerned about.[3]
Third, and most importantly, I think Ryan's solution shows that the intelligence is coming from him, and not from GPT4o. skybrian makes this point in the Substack comments. For example:
Ryan came up with the idea and implementation to use ASCII encoding since the vision capabilities of GPT4o were so unreliable. Ryan did some feature extraction on the ARC problems.
Ryan wrote the prompts and did the prompt engineering in lieu of there being fine-tuning available. He also provides the step-by-step reasoning in his prompts. Those long, carefully crafted prompts seem quite domain/problem-specific, and would probably point more toward ARC's insufficiency as a test for generality than toward an example of general ability in LLMs.
Ryan notes that the additional approaches and tweaks are critical for the performance gain above "just draw more samples". I think that meme was a bit unkind, let alone inaccurate, and I kinda wish it was removed from the piece tbh.
If you check the repo (linked above), it's full of some really cool code to make this solution work, and that's the secret sauce. To my eyes, the hard part here was the scaffolding done by Ryan rather than the pre-training[4] of the LLM (this is another cruxy point I highlighted in my article; I try to sketch roughly what I mean below). I think it's much less conceptually hard to scrape the entire internet and shove it through a transformer architecture. A lot of leg work and cost, sure, but the hard part is the ideas bit, and that's still basically all Ryan-GPT.
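To illustrate (very roughly) what I mean by the scaffolding carrying the weight, here's a heavily simplified sketch of a sample-and-filter pipeline in the spirit of Ryan's approach. The grid rendering and the `sample_candidate_programs` function are my own hypothetical stand-ins, not his actual code - his repo has far more going on (revision steps, feature extraction, careful prompt construction):

```python
def grid_to_text(grid):
    """Render an ARC grid (a list of lists of ints) as plain text rows,
    one character per cell, instead of relying on unreliable vision."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

def solve_task(task, sample_candidate_programs, k=128):
    """Sample many candidate Python programs from an LLM, run each on the
    task's training pairs, and keep only those that reproduce every
    training output.

    `sample_candidate_programs` is a hypothetical stand-in for all the
    prompting and LLM sampling - i.e. the scaffolding."""
    prompt = "\n\n".join(
        f"Input:\n{grid_to_text(pair['input'])}\nOutput:\n{grid_to_text(pair['output'])}"
        for pair in task["train"]
    )
    survivors = []
    for program in sample_candidate_programs(prompt, n=k):
        try:
            if all(program(pair["input"]) == pair["output"] for pair in task["train"]):
                survivors.append(program)
        except Exception:
            continue  # most sampled programs crash or get the training pairs wrong
    # Apply the surviving programs to the test input; some further selection
    # (e.g. majority voting) would be needed to pick a single answer.
    return [program(task["test"][0]["input"]) for program in survivors]
```

Almost everything load-bearing in a pipeline like this - the representation, the prompts, the execute-and-filter loop - lives outside the model, which is why I see the intelligence as sitting mostly in the scaffolding.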
Fourth, I got massively nerdsniped by what "in-context learning" actually is. A lot of the talk about it from a quick search seemed vague, wishy-washy, and highly anthropomorphising. I'm quite confused, given that all of the weights in the transformer are frozen after training and RLHF, why it's called learning at all. The model certainly isn't learning anything. After you ask GPT4o a query you can boot up a new instance and it'll be as clueless as when you started the first session, or you could just flood the context window with enough useless tokens so the original task gets cut off. So, if I accept Ryan's framing of the inconsistent triad, I'd reject the 3rd one, and say that "current LLMs never 'learn' at runtime (e.g. the in-context learning they can do isn't real learning)". I'm going to continue following the "in-context learning" nerdsnipe, but yeah, since we know that the weights are completely fixed and the model isn't learning, what is doing it? And can we think of a better name for it than "in-context learning"?
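As a toy way of putting the statelessness point in code - nothing below is any provider's real API, and `llm_complete` is a purely hypothetical frozen text-in/text-out function:

```python
def chat(llm_complete, history, user_message):
    """All the "adaptation" in a session lives in the transcript we resend.

    `llm_complete` is a hypothetical stateless wrapper around frozen weights:
    nothing inside it changes between calls."""
    history = history + [f"User: {user_message}"]
    reply = llm_complete("\n".join(history))
    return history + [f"Assistant: {reply}"]

# Session A builds up a long transcript; start session B with an empty
# history and the model is exactly as clueless as before session A ever
# happened - whatever was "learned" was only ever sitting in the prompt.
```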
Fifth and finally, I'm slightly disappointed at Buck and Dwarkesh for kinda posing this as a "mic drop" against ARC.[5] Similarly, Zvi seems to dismiss it, though he praises Chollet for making a stand with a benchmark. In contrast, I think that the ability (or not) of models to reason robustly, out-of-distribution, without having the ability to learn from trillions of pre-labelled samples is a pretty big crux for AI Safety's importance. Sure, maybe in a few months we'll see the top score on the ARC Challenge above 85%, but could such a model work in the real world? Is it actually a general intelligence capable of novel or dangerous acts, such as to motivate AI risk? This is what Chollet is talking about in the podcast when he says:
I'm pretty skeptical that we're going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved. If you just train the model on millions or billions of puzzles similar to ARC, you're relying on the ability to have some overlap between the tasks that you train on and the tasks that you're going to see at test time. You're still using memorization.
If you're reading, thanks for making it through this comment! I'd recommend reading Ryan's full post first (which Philb linked above), but there's been a bunch of disparate discussion there, on LessWrong, on HackerNews etc. If you want to pursue what the LLM-reasoning-sceptics think, I'd recommend following/reading Melanie Mitchell and Subbarao Kambhampati. Finally, if you think this topic/problem is worth collaborating on then feel free to reach out to me. I'd love to hear from anyone who thinks it's worth investigating and would want to pool resources.
(Ofc your time is valuable and you should pursue what you think is valuable; I'd just hope this could be the start of a cross-factional, positive-sum research programme, which would be such a breath of fresh air compared to other AI discourse atm)
Ryan estimates he used 1000x the runtime compute per problem compared to Cole et al., and also spent $40,000 in API costs alone (I wonder how much it costs for just one run though?).
In the original interview, Mike mentions that "there is an asterisk on any score that's reported on against the public test set" for this very reason.
H/t to @Max Nadeau for being on top of some of the clarifications on Twitter.
Perhaps I'm misinterpreting, and I am using them as a proxy for the response of AI Safety as a whole, but it's very much the "vibe" I got from those reactions.
Hey Steven! As always I really appreciate your engagement here, and I'm going to have to really simplify, but I really appreciate your links[1] and I'm definitely going to check them out.
I think François is right, but I do think that work on AI safety is overwhelmingly valuable.
Here's an allegory:
I think the most relevant disagreement that we have[2] is the beginning of your allegory. To indulge it, I don't think we have knowledge of the intelligent alien species coming to earth, and to the extent we have a conceptual basis for them, we can't see any signs of them in the sky. Pair this with the EA concern that we should be concerned about the counterfactual impact of our actions, and that there are opportunities to do good right here and now,[3] and it shouldn't be a primary EA concern.
Now, what would make it a primary concern is if Dr S is right and the aliens are spotted and they're on their way, but I don't think he's right. And, to stretch the analogy to breaking point, I'd be very upset if, after I turned my telescope to the co-ordinates Dr S mentions and saw meteors instead of spaceships, significant parts of the EA movement were still wanting more funding to construct the ultimate-anti-alien-space-laser or do alien-defence-research instead of buying bednets.
(When I say "AGI" I think I'm talking about the same thing that you called digital "beings" in this comment.)
A secondary crux I have is that a "digital being" in the sense I describe, and possibly the AGI you think of, will likely exhibit certain autopoietic properties that make it significantly different from either the paperclip maximiser or a "foom-ing" ASI. This is highly speculative though, based on a lot of philosophical intuitions, and I wouldn't want to bet humanity's future on it at all in the case where we did see aliens in the sky.
To be clear, you can definitely find some people in AI safety saying AGI is likely in <5 years, although Ajeya is not one of those people. This is a more extreme claim, and does seem pretty implausible unless LLMs will scale to AGI.
My take on it, though I admit it's driven by selection bias on Twitter, is that many people in the Bay-Social-Scene are buying into the <5 year timelines. Aschenbrenner for sure, Kokotajlo as well, and maybe even Amodei[4] as well? (Edit: Also lots of prominent AI Safety Twitter accounts seem to have bought fully into this worldview, such as the awful "AI Safety Memes" account.) However, I do agree it's not all of AI Safety for sure! I just don't think that, once you take away that urgency and certainty of the problem, it ought to be considered the world's "most pressing problem", at least without further controversial philosophical assumptions.
I remember reading and liking your "LLM plateau-ist" piece.
I can't speak for all the others you mention, but fwiw I do agree with your frustrations at the AI risk discourse on various sides.
I'd argue through increasing human flourishing and reducing the suffering we inflict on animals, but you could sub in your own cause area here, e.g. "preventing nuclear war" if you thought that was both likely and an x-risk.
See the transcript with Dwarkesh at 00:24:26 onwards, where he says that superhuman/transformative AI capabilities will come within "a few years" of the interview's date (so within a few years of summer 2023).
Yeah it's true, I was mostly just responding to the empirical question of how to identify/measure that split on the Forum itself.
As to dealing with the split and what it represents, my best guess is that there is a Bay-concentrated/influenced group of users who have geographically concentrated views, which much of the rest of EA disagrees with, or to varying extents finds their beliefs/behaviour rude, repugnant, or wrong.[1] The longer-term question is whether that group and the rest of EA[2] can cohere together under one banner or not.
I don't know the answer there, but I'd very much prefer it to be discussion and mutual understanding rather than acrimony and mutual downvoting. But I admit I have been acrimonious and have downvoted others on the Forum, so I'm not sure those on the other side to me[3] would think I'm a good choice to start that dialogue.
Perhaps the feeling is mutual? I don't know; certainly I think many members of this culture (not just in EA/Rationalist circles but beyond, in the Bay) find "normie" culture morally wrong and intolerable.
Big simplification I know
For the record, as per my bio, I am a "rest of the world/non-Bay" EA.
Agreed, and I think @Peter Wildeford has pointed that out in recent threads - it's very unlikely to be a "conspiracy" and much more likely that opinions and geographical locations are highly correlated. I can remember some recent comments of mine that swung from slightly upvoted to highly downvoted and back to slightly upvoted.
This might be something that the Forum team is better placed to answer, but if anyone can think of a way to try to tease this out using data on the public API let me know and I can try and investigate it
I wish Clara had pushed Jason more in this interview about what EA is and what Jason's issues with it are in more specific detail. I think he's kind-of attacking an enemy of his own making (linking @jasoncrawford so he can correct me). For example:
He presents a potted version of EA history, which Clara pushes back on/corrects, and Jason never acknowledges that. But to me he was using that timeline as part of the case for "EA was good but has since gone off track" or "EA has underlying epistemic problems which lead it to go off track".
He seems to connect EA to utilitarianism, but never elaborates on his issue with this. I think he's perhaps upset at naïve utilitarianism, but again, many EAs have written against this. And his framing of scepticism about what the long-term future holds as a point of separation from EA is false. Many EAs, including myself and Clara in the interview, feel this way, and Jason doesn't respond to it at all!
One moral point that does come up is the Drowning Child thought experiment. Clara rejects its implications on grounds of empirical effectiveness (which is odd, because I'm sure Singer believes that this is true as well, but the fact that we have identified charities that can save lives makes the analogy hold). I'm much less sure what Jason's disagreement consists in, whether it's from a similar empirical angle or a rejection of moral universalism.
A bunch of the funding for progress studies, and in particular Roots of Progress (Jason's org), seems to have come from EA sources. So this is clearly a case of EA doing the "fund something and see what happens" approach. I guess I don't have a clear sense of where RoP's funding does come from and how it evaluates stuff though.
In practice, I'm not sure that I'd want to say that Progress Studies is the movement of the people and EA is the movement of elites. I think they demographically appeal to very similar types of people, so I'm not sure what that point is meant to prove.
Even though Jason admits he is oversimplifying, I wish he could have provided more receipts. He often talks about what EAs are like, but I don't know if he has any data apart from vibes and intuition.
My impression is that Jason is rhetorically trying to set EA up as a poor alternative to Progress Studies/the Progress movement/whatever so that he can knock it down. (e.g. see this Twitter thread of his for an example - of note, he uses Helen Toner as an example of an EA driven to a terrible decision by EA ideology, whereas it now seems to be a case of playing a high-stakes power struggle and losing. I wonder if he has made a correction.) This article is Jason presenting his take on what the differences are, and I don't think that it's an unbiased one, or one that's devoid of strategic intent.
tl;dr - I don't really recognise the EA Jason is presenting here that much,[1] and I think he's using it deliberately as a foil to increase the stature of the "Progress Community".
Maybe it's a Bay vs UK thing, I don't know.
I think I disagree with this perspective because, to me, the doing is the identity in a certain important sense.
Like, I think every GWWC Pledger should reasonably be expected to be identified as an EA, even if they don't claim the self-identity. If MacAskill's or Moskovitz's behaviour changed 0% apart from no longer self-identifying as EAs, I still think it'd make sense to consider them EAs.
What really annoys me with the "EA = specific EA community" view is takes like this or this - the ideas part of EA is what matters. If CEA and OpenPhil disbanded I'd still be donating to effective charities because of the ideas involved, and the "self-identification/specific community lineage" explanation cannot really explain this imho.
(p.s. not trying to go in too hard on you David, I was torn about whether to respond to this thread or @Karthik Tadepalli's above. Perhaps we should meet and have a chat about it sometime if you think that's productive at all?)
I go on holiday for a few days and like everything community-wise explodes. Current mood.
Edit: I retracted the below because I think it is unkind and wasn't truth-seeking enough. I apologise if I caused too much stress to @Dustin Moskovitz or @Alexander_Berger; even if I have disagreements with GVF/OP about things, I very much appreciate what both of you are doing for the world, let alone "EA" or its surrounding community.
Wait what, we're (or GV is) defunding animal stuff to focus more on AI stuff? That seems really bad to me. I feel like "PR" damage to EA is much more coming from the "AI eschaton" side than the "help the animals" side (and also that interventions on animal welfare are plausibly much more valuable than AI).[1]
I think if you subscribe to a Housing-Theory-Of-Everything or a Lars Doucet Georgist Perspective,[1] then YIMBY stuff might be seen as an unblocker to good political-economic outcomes in everything else.
Which particular resolution criteria do you think it's unreasonable to believe will be met by 2027/2032 (depending on whether it's the weak AGI question or the strong one)?
Two of the four in particular stand out. First, the Turing Test one, exactly for the reason you mention - asking the model to violate the terms of service is surely an easy way to win. That's the resolution criterion, so unless the Metaculus users think that'll be solved in 3 years,[1] the estimates should be higher. Second, the SAT-passing one requires "having less than ten SAT exams as part of the training data", which is very unlikely in current frontier models, and labs probably aren't keen to share what exactly they have trained on.
it is just unclear whether people are forecasting on the actual resolution criteria or on their own idea of what "AGI" is.
No reason to assume an individual Metaculus commentator agrees with the Metaculus timeline, so I don't think that's very fair.
I don't know if it is unfair. This is Metaculus! Premier forecasting website! These people should be reading the resolution criteria and judging their predictions according to them. Just going off personal vibes on how much they "feel the AGI" feels like a sign of epistemic rot to me. I know not every Metaculus user agrees with this, but it is shaped by the aggregate - 2027/2032 are very short timelines, and those are median community predictions. This is my main issue with the Metaculus timelines atm.
I actually think the two Metaculus questions are just bad questions.
I mean, I do agree with you in the sense that they don't fully match AGI, but that's partly because "AGI" covers a bunch of different ideas and concepts. It might well be possible for a system to satisfy these conditions but not replace knowledge workers; perhaps a new market focusing on automation and employment might be better, but that also has its issues with operationalisation.
On top of everything else needed to successfully pass the imitation game
(folding in replies to different sub-comments here)
I think our misunderstanding here is caused by the word "do". Sure, Stephen Hawking couldn't control his limbs, but nevertheless his mind was always working. He kept writing books and papers throughout his life, and his brain was "always on". A transformer model is a set of frozen weights that are only "on" when a prompt is entered. That's what I mean by "it won't do anything".
Hmm, maybe we're differing on what "hard work" means here! Could be a difference between what's expensive, time-consuming, etc. I'm not sure this holds for any reasonable scheme, and I definitely think that you deserve a lot of credit for the work you've done, much more than GPT4o.
Congrats! I saw that result and am impressed! It's definitely clearly SOTA on the ARC-AGI-PUB leaderboard, but the original "34%->50% in 6 days ARC-AGI breakthrough" claim is still incorrect.