Even if you're certain that AGI is only 5 years away and will eradicate all diseases, a lot of children are going to die of malaria in those 5 years. Donating to malaria charities could reduce that number.
Sure, I think I've seen that comment before, and I'm aware Chollet also included loads of caveats in his initial write-up of the o3 results.
But going from zero fluid intelligence to non-zero fluid intelligence seems like it should be considered a very significant milestone! Even if the amount of fluid intelligence is still small.
Previously there was a question around whether the new wave of AI models was capable of any fluid intelligence at all. Now, even someone like Chollet has concluded they are, so it just becomes a question of how easily those capabilities can scale.
That's the way I'm currently thinking about it anyway. Very open to the possibility that the nearness of AGI is still being overhyped.
I think this is a misapplication of Bayes rule.
What matters is not that the 1st scenario is much more likely than the 2nd under the hypothesis that pain is experienced (it clearly is). The relevant question is whether the 1st scenario is much more likely under the hypothesis that pain is experienced than under the hypothesis that pain is not experienced (its relation to the second scenario is irrelevant, a red herring). And whether this is actually the case is much less clear.
This is what your footnote equation says too, so I'm not disagreeing with that, but I think the way you presented the argument in the text hides this, and might lead someone to misunderstand what it is they are being asked to judge is "much more likely".
You can make an evolutionary argument for why we would expect an animal to react "vigorously" to sustaining damage, and it is not clear why this evolutionary explanation requires the pain to be "experienced". So someone could make an argument that the likelihood of scenario 1 is high under both hypotheses, in which case it should only cause a small change in your priors.
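To spell out the Bayes-factor point in symbols (my own notation, not taken from the post): writing $S_1$ for the first scenario,

$$\frac{P(\text{pain} \mid S_1)}{P(\text{no pain} \mid S_1)} \;=\; \frac{P(S_1 \mid \text{pain})}{P(S_1 \mid \text{no pain})} \times \frac{P(\text{pain})}{P(\text{no pain})},$$

so if $P(S_1 \mid \text{pain})$ and $P(S_1 \mid \text{no pain})$ are both high, the likelihood ratio is close to 1 and the posterior odds barely move. The probability of the second scenario never appears anywhere in this.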
I thought the post was really interesting, thank you for sharing it! It has updated me towards thinking that there's a higher chance insects might be sentient. But I think things are still a lot more complicated than suggested by this reply.
Thank you for the detailed reply Jared!
It makes sense that including outcome_2 would risk controlling away much of any effect of veganuary on outcome. And your answers to those pre-empted follow up questions make sense to me as well!
But does that then mean my original concern is still valid..? There is still a possibility that a statistically significant coefficient for veganuary_2 in the model might not be causal, but due to a confounder? Even a confounder that was actually measured, like activism exposure?
This is a fantastic, clearly written post. Thank you for writing it up and sharing!
In the 3 models, why is outcome_2 not included as a predictor?
I'm just trying to wrap my head around how the 3-wave separation works, but can't quite follow how the confounders will be controlled for if the treatment is the only variable included from wave 2.
For example, in the first model:
Suppose "activism" was a confounder for the effect of "veganuary" on "outcome" (so "activism" caused increased "veganuary" exposure, as well as increased "outcome").
Suppose we have 2 participants with identical Wave 1 responses.
Between Wave 1 and Wave 2, the first participant is exposed to "activism", which increases both their "veganuary" and "outcome" values, and this change persists all the way through to Wave 3.
The first participant now has higher outcome_3 and veganuary_2 than the second participant, with all other predictors in the model equal, so this will lead to a positive coefficient for veganuary_2, even though the relationship between veganuary and outcome is not causal.
I can see how this problem is avoided if outcome_2 is included as a predictor instead of (or maybe as well as?) outcome_1. So maybe this is just a typo..? If so, I would be interested in the explanation of whether you need both outcome_1 and outcome_2, or whether just outcome_2 is enough. I'm finding that quite confusing to think about!
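To make sure I'm describing the same thing you have in mind, here is a minimal sketch of the two specifications I'm contrasting. The variable names follow the post, but the synthetic data and the exact covariate lists are just my own illustration of the confounding story above, not your actual models:

```python
# Toy simulation of the scenario above: "activism" between Waves 1 and 2 raises
# both veganuary_2 and the later outcomes, while veganuary has NO causal effect
# on the outcome. All numbers are made up for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000
outcome_1 = rng.normal(size=n)
activism = rng.normal(size=n)                       # exposure between Wave 1 and Wave 2
veganuary_2 = 0.5 * activism + rng.normal(size=n)   # confounded, but not causal for outcome
outcome_2 = outcome_1 + 0.5 * activism + rng.normal(size=n)
outcome_3 = outcome_2 + rng.normal(size=n)

df = pd.DataFrame(dict(outcome_1=outcome_1, outcome_2=outcome_2,
                       outcome_3=outcome_3, veganuary_2=veganuary_2))

# Specification as I read it in the post: only the Wave 1 outcome is controlled for.
# The veganuary_2 coefficient comes out clearly positive despite no causal effect.
print(smf.ols("outcome_3 ~ veganuary_2 + outcome_1", data=df).fit().params)

# Adding outcome_2 as a predictor blocks the path through the between-wave change,
# and the veganuary_2 coefficient shrinks towards zero.
print(smf.ols("outcome_3 ~ veganuary_2 + outcome_2 + outcome_1", data=df).fit().params)
```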
Thanks for sharing the original definition! I didn't realise Turing had defined the parameters so precisely, and that they weren't actually that strict!
I probably need to stop saying that AI hasn't passed the Turing test yet then. I guess it has! You're right that this ends up being an argument over semantics, but seems fair to let Alan Turing define what the term "Turing Test" should mean.
But I do think that the stricter form of the Turing test defined in that Metaculus forecast is still a really useful metric for deciding when AGI has been achieved, whereas this much weaker Turing test probably isn't.
(Also, for what it's worth, the business tasks I have in mind here aren't really "complex", they are the kind of tasks that an average human could quite easily do well on within a 5-minute window, possibly as part of a Turing-test style setup, but LLMs struggle with)
I don't think we should say AI has passed the Turing test until it has passed the test under conditions similar to this:
But I do really like that these researchers have put the test online for people to try!
https://turingtest.live/
I've had one conversation as the interrogator, and I was able to easily pick out the human in 2 questions. My opener was:
"Hi, how many words are there in this sentence?"
The AI said "8", I said "are you sure?", and it reiterated its incorrect answer after claiming to have recounted.
The human said "9", I said "are you sure?", and they said "yes?", indicating confusion and annoyance at being challenged on such an obvious question.
Maybe I was paired with one of the worse LLMs… but unless it's using hidden chain of thought under the hood (which it doesn't sound like it is) then I don't think even GPT 4.5 can accurately perform counting tasks without writing out its full working.
My current job involves trying to get LLMs to automate business tasks, and my impression is that current state of the art models are still a fair way from something which is truly indistinguishable from an average human, even when confronted with relatively simple questions! (Not saying they won't quickly close the gap though, maybe they will!)
Evolution is chaotic and messy, but so is stochastic gradient descent (the word "stochastic" is in the name!). The optimisation function might be clean, but the process we use to search for optimum models is not.
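As a toy illustration of what I mean (nobody's actual training code, just a minimal sketch with made-up data): each update below uses a randomly drawn minibatch, so the path the parameters take is genuinely random, even though the loss being minimised is a clean, fixed function.

```python
# Minimal SGD on a toy linear-regression problem: the randomness of the minibatch
# sampling is the "stochastic" part, so two runs from the same starting point
# wander along different paths before settling near the optimum.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(5)
for step in range(2000):
    idx = rng.integers(0, len(X), size=32)              # random minibatch
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
    w -= 0.01 * grad                                    # one tiny, noisy update

print(w)  # close to the true weights, but the route there was random
```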
If AGI emerges from the field of machine learning in the state it's in today, then it won't be "designed" to pursue a goal, any more than humans were designed. Instead it will emerge from a random process, through billions of tiny updates, and this process will just have been rigged to favour things which do well on some chosen metric.
This seems extremely similar to how humans were created, through evolution by natural selection. In the case of humans, the metric being optimized for was the ability to spread our genes. In AIs, it might be accuracy at predicting the next word, or human helpfulness scores.
The closest things to AGI we have so far do not act with "strict logical efficiency", or always behave rationally. In fact, logic puzzles are one of the things they particularly struggle with!
I voted "disagree" on this, not because I'm highly confident you are wrong, but because I think things are a lot less straightforward than this. A couple of counterpoints that I think clash with this thesis:
Human morality may be a consequence of evolution, but modern "moral" behaviour often involves acting in ways which have no evolutionary advantage. For example, lots of EAs make significant sacrifices to help people on the other side of the world, who are outside their community and will never have a chance to reciprocate, or to help non-human animals who we evolved to eat. I think there are two ways you can take this: (1) the evolutionary explanation of morality is flawed or incomplete, or (2) evolution has given us some generic ability to feel compassion for others which originally helped us to co-operate more effectively, but is now "misfiring" and leading us to e.g. embrace utilitarianism. I think either explanation is good news for morality in AGIs. Moral behaviour may follow naturally from relatively simple ideas or values that we might expect an AGI to have or adopt (especially if we intentionally try to make this happen).
You draw a distinction between AGI which is "programmed with a goal and will optimise towards that goal" and humans who evolved to survive, but actually these processes seem very similar. Evolutionary pressures select for creatures that excel at a single goal (reproducing), in a very similar way to how ML training algorithms like gradient descent select for artificial intelligences that excel at a single goal (minimizing some cost function). But a lot of humans have still ended up adopting goals which don't seem to align with the primary goal (e.g. donating kidneys to strangers, or using contraception), and there's every reason to expect AGI to be the same (I think in AI safety they use the term "mesa-optimization" to describe this phenomenon...?) Now, I think in AI safety this is usually talked about as a bad thing. Maybe AGI could end up being a mesa-optimizer for some bad goal that its designer never considered. But it seems like a lot of your argument rests on there being this big distinction between AI training and evolution. If the two things are in fact very similar, then that again seems to be a reason for some optimism. Humans were created through an optimization procedure that optimized for a primary goal, but we now often act in moral ways, even if this conflicts with that goal. Maybe the same could happen for AGIs!
To be clear, I don't think this is a watertight argument that AGIs will be moral, I think it's an argument for just being really uncertain. For example, maybe utilitarianism is a kind of natural idea that any intelligent being who feels some form of compassion might arrive at (this seems very plausible to me), but maybe a pure utilitarian superintelligence would actually be a bad outcome! Maybe we don't want the universe filled with organisms on heroin! Or for everyone else to be sacrificed to an AGI utility monster.
I can see lots of reasons for worry, but I think there are reasons for optimism too.
I'm feeling inspired by Anneliese Dodds' decision to resign as a government minister over this issue, which is grabbing the headlines today! Before that I'd been feeling very disappointed about the lack of pushback I was seeing in news coverage.
I haven't written my letter to my MP yet, but I've remembered that I am actually a member of the Labour party. Would a letter to my local Labour MP have even more impact if I also cancelled my Labour membership in protest? OK, I might not be a government minister, I'm just an ordinary party member who hasn't attended a party event in years, but still, they get some money from me at the moment!
Or would cancelling the membership mean I have less influence on future issues, and so ultimately be counter-productive? Any thoughts?
In addition, o3 was also trained on the public data of ARC-AI, a dataset comprised of abstract visual reasoning problems in the style of Raven's progressive matrices [52]. When combined with the large amount of targeted research this benchmark has attracted in recent years, the high scores achieved by o3 should not be considered a reliable metric of general reasoning capabilities.
This take seems to contradict Francois Chollet's own write-up of the o3 ARC results, where he describes the results as:
a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs. o3 is a system capable of adapting to tasks it has never encountered before
(taken from your reference 52, emphasis mine)
You could write this off as him wanting to talk up the significance of his own benchmark, but I'm not sure that would be right. He has been very publicly sceptical of the ability of LLMs to scale to general intelligence, so this is a kind of concession from him. And he had already laid the groundwork in his Dwarkesh Patel interview to explain away high ARC performance as cheating if it tackled the problem in the wrong way, cracking it through memorization via an alternative route (e.g. auto-generating millions of ARC-like problems and training on those). He could easily have dismissed the o3 results on those grounds, but chose not to, which made an impression on me (a non-expert trying to decide how to weigh up the opinions of different experts). Presumably he is aware that o3 trained on the public dataset, and doesn't view that as cheating. The public dataset is small, and the problems are explicitly designed to resist memorization, requiring general intelligence. Being told the solution to earlier problems is not supposed to help you solve later problems.
What's your take on this? Do you disagree with the write-up in [52]? Or do you think I'm mischaracterizing his position (there are plenty of caveats outside the bit I selectively quoted as well, so maybe I am)?
The fact that human-level ARC performance could only be achieved at extremely high inference-time compute cost seems significant too. Why would we get inference-time scaling if chain-of-thought consisted of not much more than post-hoc rationalizations, instead of real reasoning?
For context, I used to be pretty sympathetic to the "LLMs do most of the impressive stuff by memorization and are pretty terrible at novel tasks" position, and still think this is a good model for the non-reasoning LLMs, but my views have changed a lot since the reasoning models, particularly because of the ARC results.
This is an interesting analysis!
I agree with MaxRa's point. When I skim-read "Metaculus pro forecasters were better than the bot team, but not with statistical significance" I immediately internalised that the message was "bots are getting almost as good as pros" (a message I probably already got from the post title!) and it was only when I forced myself to slow down and read it more carefully that I realised this is not what this result means (for example, you could have done this study only using a single question, and this stated result could have been true, but likely not tell you much either way about their relative performance). I only then noticed that both main results were null results. I'm then not sure if this actually supports the "Bots are closing the gap" claim or not..?
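To illustrate why I don't think a null result on its own tells us much either way, here is a toy power check. Every number in it is made up for illustration and has nothing to do with the actual study:

```python
# With a modest number of questions, a real skill gap can easily fail to reach
# p < 0.05, so "better but not statistically significant" is compatible both with
# "bots have basically caught up" and with "pros are still clearly better".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_gap = 0.03        # hypothetical average Brier-score advantage for the pros
noise = 0.15           # hypothetical per-question spread of the score difference
n_questions = 40       # hypothetical number of resolved questions

significant = 0
for _ in range(1000):  # repeat the imaginary study 1000 times
    diffs = rng.normal(true_gap, noise, size=n_questions)
    _, p = stats.ttest_1samp(diffs, 0.0)
    significant += p < 0.05
print(f"Power at these made-up settings: {significant / 1000:.0%}")
```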
The histogram plot is really useful, and the points of reference are helpful too. I'd be interested to know what the histogram would look like if you compared pro human forecasters to average human forecasters on a similar set of questions? How big an effect do we see there? Or maybe to get more directly at what I'm wondering: how do bots compare to average human forecasters? Are they better with statistical significance, or not? Has this study already been done?
Thanks for the link, I've just given your previous post a read. It is great! Extremely well written! Thanks for sharing!
I have a few thoughts on it that I thought I'd share. Would be interested to read a reply, but don't worry if it would be too time-consuming.
I agree that your laser example is a good response to the "replace one neuron at a time" argument, and that at least in the context of that argument, computational complexity does matter. You can't replace components of a brain with simulated parts if the simulated parts can't keep up with the rest. If neurons are not individually replaceable, or at least not individually replaceable with something that can match the speed of a real neuron (and I accept this seems possible), then I agree that the "replace one neuron at a time" thought experiment fails.
Computational complexity still seems pretty irrelevant for the other thought experiments: whether we can simulate a whole brain on a computer, and whether we can simulate a brain with a pencil and paper. Sure, it's going to take a very long time to get results, but why does that matter? It's a thought experiment anyway.
I agree with you that the answer to the question "is this system conscious?" should be observer-independent. But I didn't really follow why this belief is incompatible with functionalism?
I like the "replace one neuron at a time" thought experiment, but accept it has flaws. For me, it's the fact that we could in principle simulate a brain on a digital computer and have it behave identically that convinces me of functionalism. I can't grok how some system could behave identically but its thoughts not "exist".
Thanks for the reply, this definitely helps!
The brain operating according to the known laws of physics doesn't imply we can simulate it on a modern computer (assuming you mean a digital computer). A trivial example is certain quantum phenomena. Digital hardware doesn't cut it.
Could you explain what you mean by this..? I wasn't aware that there were any quantum phenomena that could not be simulated on a digital computer? Where do the non-computable functions appear in quantum theory? (My background: I have a PhD in theoretical physics, which certainly doesn't make me an expert on this question, but I'd be very surprised if this was true and I'd never heard about it! And I'd be a bit embarrassed if it was a fact considered "trivial" and I was unaware of it!)
There are quantum processes that canāt be simulated efficiently on a digital computer, but that is a different question.
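To illustrate the distinction I'm drawing (a toy example of my own, not anything from the discussion above): a classical, digital computer can reproduce quantum dynamics exactly, it just gets exponentially slow as the system grows.

```python
# Simulate a single qubit on classical, digital hardware: start in |0>, apply a
# Hadamard gate, and read off the Born-rule measurement probabilities. Nothing
# non-computable is involved; the cost just grows like 2^n for n qubits.
import numpy as np

H = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)          # Hadamard gate
state = np.array([1.0, 0.0], dtype=complex)   # the |0> state
state = H @ state                             # unitary evolution
probs = np.abs(state) ** 2                    # measurement probabilities
print(probs)                                  # -> [0.5 0.5]
```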
I don't think I fully understand exactly what you are arguing for here, but would be interested in asking a few questions to help me understand it better, if you're happy to answer?
If the human brain operates according to the known laws of physics, then in principle we could simulate it on a modern computer, and it would behave identically to the real thing (i.e. would respond in the same way to the same stimuli, and claim to see a purple ball with grandma's face on it if given simulated LSD). Would such a brain simulation have qualia according to your view? Yes, no, or you don't think the brain operates according to known laws of physics?
If (1) is answered no, what would happen if you gradually replaced a biological brain with a simulated brain bit by bit, replacing sections of the cells one at a time with a machine running a simulation of its counterpart? What would that feel like for the person? Their consciousness would slowly be disappearing but they would not outwardly behave any differently, which seems very odd.
If (1) is answered yes, does that mean that whatever this strange property of the EM field is, it will necessarily be possessed by the inner workings of the computer as well, when this simulation is run?
If (3) is answered yes, what if you instead ran the simulation with pencil and paper rather than an electronic computer? Would that simulated brain have qualia? You can execute any computer program with a pencil and paper (using the paper as the memory and doing the necessary instructions yourself with the pencil) if you have enough time. But it seems much clearer here that there will be nothing unusual happening in the EM field when you do this simulation.
If all the fields of physics are made of qualia, then everything is made of qualia, including the electron field, the quark fields, etc?
Ah, that's a really interesting way of looking at it, that you can trade training-compute for inference-compute to only bring forward capabilities that would have come about anyway via simply training larger models. I hadn't quite got this message from your post.
My understanding of Francois Chollet's position (he's where I first heard the comparison of logarithmic inference-time scaling to brute-force search, before I saw Toby's thread) is that RL on chain of thought has unlocked genuinely new capabilities that would have been impossible simply by scaling traditional LLMs (or maybe it has to be chain of thought combined with tree search, but whatever the magic ingredient is, he has acknowledged that o3 has it).
Of course this could just be his way of explaining why the o3 ARC results don't prove his earlier positions wrong. People don't like to admit when they're wrong! But this view still seems plausible to me, it contradicts the "trading off" narrative, and I'd be extremely interested to know which picture is correct. I'll have to read that paper!
But I guess maybe it doesnāt matter a lot in practice, in terms of the impact that reasoning models are capable of having.
This was a thought-provoking and quite scary summary of what reasoning models might mean.
I think this sentence may have a mistake though:
"you can have GPT-o1 think 100-times longer than normal, and get linear increases in accuracy on coding problems."
Doesn't the graph show that the accuracy gains are only logarithmic? The x-axis is a log scale.
This logarithmic relationship between performance and test-time compute is characteristic of brute-force search, and maybe is the one part of this story that means the consequences won't be quite so explosive? Or have I misunderstood?
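Just to spell out what I mean (my own reading of the plot, with illustrative symbols rather than the paper's numbers): if accuracy looks linear when plotted against log-compute, then

$$\text{accuracy}(C) \approx a + b\,\log_{10} C,$$

so multiplying the inference compute $C$ by 100 only adds about $2b$ to the accuracy, and each further fixed-size gain costs roughly ten times more compute than the last.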
It might be fair to say that the o3 improvements are something fundamentally different to simple scaling, and that Chollet is still correct in his "LLMs will not simply scale to AGI" prediction. I didn't mean in my comment to suggest he was wrong about that.
I could imagine someone criticizing him for exaggerating how far away we were from coming up with the necessary new ideas, given the o3 results, but I'm not so interested in the debate about exactly how right or wrong the predictions of this one person were.
The interesting thing for me is: whether he was wrong, or whether he was right but o3 does represent a fundamentally different kind of model, the upshot for how seriously we should take o3 seems the same! It feels like a pretty big deal!
He could have reacted to this news by criticizing the way that o3 achieved its results. He already said in the Dwarkesh Patel interview that someone beating ARC wouldn't necessarily imply progress towards general intelligence if the way they achieved it went against the spirit of the task. When I clicked the link in this post, I thought it likely I was about to read an argument along those lines. But that's not what I got. Instead he was acknowledging that this was important progress.
I'm by no means an expert, but timelines in the 2030s still seem pretty close to me! I'd have thought, based on arguments from people like Chollet, that we might be a bit further off than that (although only with the low confidence of a layperson trying to interpret the competing predictions of experts who seem to radically disagree with each other).
Given all the problems you mention, and the high costs still involved in running this on simple tasks, I agree it still seems many years away. But previously I'd have put a fairly significant probability on AGI not being possible this century (as well as assigning a significant probability to it happening very soon, basically ending up highly uncertain). But it feels like these results make the idea that AGI is still 100 years away seem much less plausible than it was before.
The ARC performance is a huge update for me.
I've previously found Francois Chollet's arguments that LLMs are unlikely to scale to AGI pretty convincing. Mainly because he had created an until-now unbeaten benchmark to back those arguments up.
But reading his linked write-up, he describes this as "not merely an incremental improvement, but a genuine breakthrough". He does not admit he was wrong, but instead paints o3 as something fundamentally different to previous LLM-based AIs, which for the purpose of assessing the significance of o3, amounts to the same thing!
Interesting, thanks!