Your accusation of bad faith is incorrect. You shouldn't be so quick to throw the term "bad faith" around (it means something specific and serious, involving deception or dishonesty) just because you disagree with something. That's a bad habit that closes you off to different perspectives.

I think it's an entirely apt analogy. We do not have an argument from the laws of physics that shows Avi Loeb is wrong about the possible imminent threat from aliens, or the probability of it. The most convincing argument against Loeb's conclusions is about the epistemology of science. That same argument applies, mutatis mutandis, to near-term AGI discourse.

With the work you mentioned, there is often an ambiguity involved. To the extent it's scientifically defensible, it's mostly not about AGI. To the extent it's about AGI, it's mostly not scientifically defensible.
For example, the famous METR graph about the time horizons of tasks AI systems can complete 50% of the time is probably perfectly fine if you only take it for what it is, which is a fairly narrow, heavily caveated series of measurements of current AI systems on artificially simplified benchmark tasks. That's scientifically defensible, but it's not about AGI.

When people outside of METR make an inference from this graph to conclusions about imminent AGI, that is not scientifically defensible. This is not a complaint about METR's research, which is not directly about AGI (at least not in this case), but about the interpretation of it by people outside of METR to draw conclusions the research does not support. That interpretation is just a hand-wavy philosophical argument, not a scientifically defensible piece of research.

Just to be clear, this is not a criticism of METR, but a criticism of people who misinterpret their work and ignore the caveats that people at METR themselves give.
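To make concrete the kind of inference I mean, here is a rough sketch of the extrapolation people perform on that graph. Every number below (the starting horizon, the doubling time, the "AGI-level" target) is an assumption I made up for illustration; none of it is METR's figures or methodology. The point is only to show the shape of the reasoning being called hand-wavy.

```python
# A sketch of the naive trend extrapolation being criticized above.
# All numbers are illustrative assumptions, not METR's figures.
from math import log2

current_horizon_hours = 1.0    # assumed 50%-success time horizon today
doubling_time_months = 7.0     # assumed doubling period for the trend
target_horizon_hours = 160.0   # assumed proxy for "month-long, AGI-level" tasks

doublings_needed = log2(target_horizon_hours / current_horizon_hours)
months_out = doublings_needed * doubling_time_months
print(f"Naive extrapolation: ~{months_out:.0f} months until the assumed target")
# Nothing in the underlying measurements licenses treating this
# straight-line extrapolation as evidence about AGI.
```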
I suppose it's worth asking: what evidence, scientific or otherwise, would convince you that this all has been a mistake? That the belief in a significant probability of near-term AGI actually wasn't well-supported after all?

I can give many possible answers to the opposite question, such as (weighted out of 5 in terms of how important they would be to me deciding that I was wrong):

1. Profitable applications of LLMs or other AI tools that justify current investment levels (3/5)

2. Evidence of significant progress on fundamental research problems such as generalization, data inefficiency, hierarchical planning, continual learning, reliability, and so on (5/5)

3. Any company such as Waymo or Tesla solving Level 4 or 5 autonomy without a human in the loop and without other things that make the problem artificially easy (4/5)

4. Profitable and impressive new applications of humanoid robots in real world applications (4/5)

5. Any sort of significant credible evidence of a major increase in AI capabilities, such as LLMs being able to autonomously and independently come up with new correct ideas in science, technology, engineering, medicine, philosophy, economics, psychology, etc. (not as a tool for human researchers to more easily search the research literature or anything along those lines, but doing the actual creative intellectual act itself) (5/5)

6. A pure reinforcement learning agent learning to play StarCraft II at an above-average level without first bootstrapping via imitation learning, using no more experience to learn this than AlphaStar (3/5)
My list is very similar to yours. I believe items 1, 2, 3, 4, and 5 have already been achieved to substantial degrees and we continue to see progress in the relevant areas on a quarterly basis. I don't know about the status of 6.

For clarity on item 1, AI company revenues in 2025 are on track to cover 2024 costs, so on a product basis, AI models are profitable; it's the cost of new models that pulls annual figures into the red. I think this will stop being true soon, but that's my speculation, not evidence, so I remain open that scaling will continue to make progress towards AGI, potentially soon.
Do you stand by your accusation of bad faith?

Your accusation of bad faith seems to rest on your view that the restraints imposed by the laws of physics on space travel make an alien invasion or attack extremely improbable. Such an event may indeed be extremely improbable, but the laws of physics do not say so.
I have to imagine that you are referring to the speeds of spacecraft and the distances involved. The Milky Way Galaxy is 100,000 light-years in diameter, organized along a plane in a disc shape that is about 1,000 light-years thick. NASA's Parker Solar Probe has travelled at 0.064% of the speed of light. Let's round that down to 0.05% of the speed of light for simplicity. At 0.05% of the speed of light, the Parker Solar Probe could travel between the two farthest points in the Milky Way Galaxy in 200 million years.

That means that if the maximum speed of spacecraft in the galaxy were limited to only the top speed of NASA's fastest space probe today, an alien civilization that reached an advanced stage of science and technology (perhaps including things like AGI, advanced nanotechnology/atomically precise manufacturing, cheap nuclear fusion, interstellar spaceships, and so on) more than 200 million years ago would have had plenty of time to establish a presence in every star system of the Milky Way. At 1% of the speed of light, the window of time shrinks to 10 million years, and so on.

Designs for spacecraft that credible scientists and engineers thought Earth could actually build in the near future include a light-sail-based probe that would supposedly travel at 15-20% of the speed of light. Such a probe could traverse the diameter of the Milky Way in under 1 million years at top speed. Acceleration and deceleration complicate the picture somewhat, but the fundamental idea still holds.

If there are alien civilizations in our galaxy, we don't have any clear, compelling scientific reason to think they wouldn't be many millions of years older than our civilization. The Earth formed 4.5 billion years ago, so if a habitable planet elsewhere in the galaxy formed just 10% sooner and put life on that planet on the same trajectory as on ours, the aliens would be 450 million years ahead of us. Plenty of time to reach everywhere in the galaxy.
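For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope script. It simply restates the figures above (100,000 light-years across, the various speeds as fractions of c, a 10% earlier planet formation) and ignores acceleration, deceleration, and relativistic effects.

```python
# Back-of-the-envelope check of the travel-time figures above.
# Ignores acceleration, deceleration, and relativistic effects.
GALAXY_DIAMETER_LY = 100_000  # Milky Way diameter in light-years

def crossing_time_years(speed_as_fraction_of_c: float) -> float:
    """Years to cross the galaxy at a constant fraction of the speed of light."""
    return GALAXY_DIAMETER_LY / speed_as_fraction_of_c

print(f"{crossing_time_years(0.0005):,.0f} years at 0.05% of c")  # 200,000,000
print(f"{crossing_time_years(0.01):,.0f} years at 1% of c")       # 10,000,000
print(f"{crossing_time_years(0.15):,.0f} years at 15% of c")      # ~666,667 (light-sail concept)

# Head start from a habitable planet forming 10% earlier than Earth (4.5 Gyr ago):
print(f"{0.10 * 4.5e9:,.0f} years")                               # 450,000,000
```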
The Fermi paradox has been considered and discussed by people working in physics, astronomy, rocket/spacecraft engineering, SETI, and related fields for decades. There is no consensus on the correct resolution to the paradox. Certainly, there is no consensus that the laws of physics resolve it.

So, if I'm understanding your reasoning correctly (that surely I must be behaving in a dishonest or deceitful way, i.e. engaging in bad faith, because obviously everyone knows the restraints imposed by the laws of physics on space travel make an alien attack on Earth extremely improbable), then your accusation of bad faith seems to rest on a mistake.
Thanks for giving me the opportunity to get into this; the Fermi paradox is always so much fun to talk about.
"My list is very similar to yours. I believe items 1, 2, 3, 4, and 5 have already been achieved to substantial degrees and we continue to see progress in the relevant areas on a quarterly basis. I don't know about the status of 6."

It's hard to know what "to substantial degrees" means. That sounds very subjective. Without the "to substantial degrees" caveat, it would be easy to prove that 1, 3, 4, and 5 have not been achieved, and fairly straightforward to make a strong case that 2 has not been achieved.

For example, it is simply a fact that Waymo vehicles have a human in the loop (Waymo openly says so), so Waymo has not achieved Level 4-5 autonomy without a human in the loop. Has Waymo achieved Level 4-5 autonomy without humans in the loop "to a substantial degree"? That seems subjective. I don't know what "to a substantial degree" means to you, and it might mean something different to me, or to other people.

Humanoid robots have not achieved any profitable new applications in recent years, as far as I'm aware. Again, I don't know what achieving this "to a substantial degree" might mean to you.

I would be curious to know what progress you think has been made recently on the fundamental research problems I mentioned, or what the closest examples are to LLMs engaging in the sort of creative intellectual act I described. I imagine the examples you have in mind are not something the majority of AI experts would agree fit the descriptions I gave.
"For clarity on item 1, AI company revenues in 2025 are on track to cover 2024 costs, so on a product basis, AI models are profitable; it's the cost of new models that pulls annual figures into the red. I think this will stop being true soon, but that's my speculation, not evidence, so I remain open that scaling will continue to make progress towards AGI, potentially soon."

Distinguish here between gold mining and selling picks and shovels. I'm talking about applications of LLMs and AI tools that are profitable for end users. Nvidia is extremely profitable because it sells GPUs to AI companies. In theory, in a hypothetical scenario, AI companies could become profitable by selling AI models as a service (e.g. API tokens, subscriptions) to businesses. But then would those business customers see any profit from the use of LLMs (or other AI tools)? That's what I'm talking about. Nvidia is selling picks and shovels, and to some extent even the AI companies are selling picks and shovels. Where's the gold?

The six-item list I gave was a list of some things that, each on their own but especially in combination, would go a long way toward convincing me that I'm wrong and my near-term AGI skepticism is a mistake. When you say your list is similar, I'm not quite sure what you mean. Do you mean that if those things didn't happen, that would convince you that the probability or level of credence you assign to near-term AGI is way too high? I was trying to ask you what evidence would convince you that you're wrong.
"Any sort of significant credible evidence of a major increase in AI capabilities, such as LLMs being able to autonomously and independently come up with new correct ideas in science, technology, engineering, medicine, philosophy, economics, psychology, etc."
Just in the spirit of pinning people to concrete claims: would you count progress on Frontier Math 4, like, say, models hitting 40%*, as evidence that this is not so far off for mathematics specifically? (To be clear, I think it is very easy to imagine models that are doing genuinely significant research maths but still can't reliably be a personal assistant, so I am not saying this is strong evidence of near-term AGI or anything like that.) Frontier Math Tier 4 questions allegedly require some degree of "real" mathematical creativity and were designed by actual research mathematicians, including in some cases Terry Tao (EDIT: that is, he supplied some Frontier Math questions; I am not sure if any were Tier 4), so we're not talking cranks here. Epoch claim some of the problems can take experts weeks. If you wouldn't count this as evidence that genuine AI contributions to research mathematics might not be more than 6-7 years off, what, if anything, would you count as evidence of that? If you don't like Frontier Math Tier 4 as an early warning sign, is that because:
1) You think it's not really true that the problems require real creativity, and you don't think "uncreative" ways of solving them will ever get you to being able to do actual research mathematics that could get in good journals.

2) You just don't trust models not to be trained on the test set, because there was a scandal about OpenAI having access to the answers. (Though as I've said, the current state of the art is a Google model.)
3) 40% is too low, something like 90% would be needed for a real early warning sign.
4) In principle, this would be a good early warning sign if, for all we knew, RL scaling could continue for many more orders of magnitude, but since we know it can't continue for more than a few, it isn't, because by the time you're hitting a high level on Frontier Math 4, you're hitting the limits of RL scaling and can't improve further.

Of course, maybe you think the metric is fine, but you just expect progress to stall well before scores are high enough to be an early warning sign of real mathematical creativity, because of limits to RL scaling?

*Current best is some version of Gemini at 18%.
I wonder if you noticed that you changed the question. Did you not notice, or did you change the question deliberately?
What I brought up as a potential form of important evidence for near-term AGI was:
"Any sort of significant credible evidence of a major increase in AI capabilities, such as LLMs being able to autonomously and independently come up with new correct ideas in science, technology, engineering, medicine, philosophy, economics, psychology, etc. (not as a tool for human researchers to more easily search the research literature or anything along those lines, but doing the actual creative intellectual act itself)"
You turned the question into:
"If you wouldn't count this as evidence that genuine AI contributions to research mathematics might not be more than 6-7 years off, what, if anything, would you count as evidence of that?"

Now, rather than asking me about the evidence I use to forecast near-term AGI, you're asking me to forecast the arrival of the evidence I would use for forecasting near-term AGI? Why?
My thought process didn't go beyond "Yarrow seems committed to a very low chance of AI having real, creative research insights in the next few years; here is something that puts some pressure on that". Obviously I agree that when AGI will arrive is a different question from when models will have real insights in research mathematics. Nonetheless I got the feeling, maybe incorrectly, that your strength of conviction about AGI is partly based on things like "models in the current paradigm can't have 'real insight'", so it seemed relevant, even though "real insight in maths is probably coming soon, but AGI likely over 20 years away" is perfectly coherent, and indeed close to my own view.

Anyway, why can't you just answer my question?
I have no idea when AI systems will be able to do math research and generate original, creative ideas autonomously, but it will certainly be very interesting if/when they do.

It seems like there's not much of a connection between the FrontierMath benchmark and this, though. LLMs have been scoring well on question-and-answer benchmarks in multiple domains for years and haven't produced any original, correct ideas yet, as far as I'm aware. So, why would this be different?

LLMs have been scoring above 100 on IQ tests for years and yet can't do most of the things humans who score above 100 on IQ tests can do. If an LLM does well on math problems that are hard for mathematicians or math grad students or whatever, that doesn't necessarily imply it will be able to do the other things, even within the domain of math, that mathematicians or math grad students do.

We have good evidence for this because LLMs as far back as GPT-4, nearly 3 years ago, have done well on a bunch of written tests. Despite there being probably over 1 billion regular users of LLMs and trillions of queries put to LLMs, I'm not aware of any indication of an LLM coming up with a novel, correct idea of any note in any academic or technical field. Is there a reason to think performance on the FrontierMath benchmark would be different from the trend we've already seen with other benchmarks over the last few years?

The FrontierMath problems may indeed require creativity from humans to solve them, but that doesn't necessarily mean solving them is a sign of creativity from LLMs. By analogy, playing grandmaster-level chess may require creativity from humans, but not from computers.

This is related to an old idea in AI called Moravec's paradox, which warns us not to assume that what is hard for humans is hard for computers, or that what is easy for humans is easy for computers.
I guess I feel like, if being able to solve mathematical problems designed by research mathematicians to be similar to the kind of problems they solve in their actual work is not decent evidence that AIs are on track to be able to do original research in mathematics in less than, say, 8 years, then what would you EVER accept as empirical evidence that we are on track for that but not there yet?
Note that I am not saying this should push your overall confidence to over 50% or anything, just that it ought to move you up by a non-trivial amount relative to whatever your credence was before. I am certainly NOT saying that skill on Frontier Math 4 will inevitably transfer to real research mathematics, just that you should think there is a substantial risk that it will.
I am not persuaded by the analogy to IQ test scores, for the following reason. It is far from clear that IQ test items resemble the tasks LLMs can't do (despite scoring 100) anywhere near as closely as the Frontier Math 4 tasks are, at least allegedly, designed to resemble real research questions in mathematics*, because the latter are deliberately designed for similarity, whereas IQ tests are just designed so that skill on them correlates with skill on intellectual tasks in general among humans. (I also think the inference towards "they will be able to DO research math", from progress on Frontier Math 4, is rather less shaky than "they will DO proper research math in the same way as humans". It's not clear to me what tasks actually require "real creativity", if that means a particular reasoning style rather than just the production of novel insights as an end product. I don't think you or anyone else knows this either.) Real math is also uniquely suited to question-and-answer benchmarks, I think, because things really are often posed as extremely well-defined problems with determinate answers, i.e. prove X. Proving things is not literally the only skill mathematicians have, but being able to prove the right stuff is enough to be making a real contribution. In my view that makes claims for construct validity here much more plausible than, say, inferring ChatGPT can be a lawyer if it passes the bar exam.

In general, your argument here seems like it could be deployed against literally any empirical evidence that AIs were approaching being able to do a task, short of them actually performing that task. You can always say "just because in humans, ability to do X is correlated with ability to do Y, doesn't mean the techniques the models are using to do X can do Y with a bit of improvement." And yes, that is always true, that it doesn't *automatically* mean that. But if you allow this to mean that no success on any task ever significantly moves you at all about future real-world progress on intuitively similar but harder tasks, you are basically saying it is impossible to get empirical evidence that progress is coming before it has arrived, which is just pretty suspicious a priori. What you should do, in my view, is think carefully about the construct validity of the particular benchmark in question, and then, roughly, update your view based on how likely you think it is to be basically valid, and what it would mean if it was. You should take into account the risk that success on Frontier Math 4 is giving real signal, not just the risk that it is meaningless.

My personal guess is that it is somewhat meaningful, and we will see the first real AI contributions to maths in 6-7 years, that is, a 60% chance by then of AI proofs important enough for credible mid-ranking journals. (EDIT: I forgot my own forecast here; I expect saturation in about 5 years, so "several" years below is an exaggeration. Nonetheless I expect some gap between Frontier Math 4 being saturated and the first real contributions to research mathematics: I guess 6-9 years until real contributions is more like my forecast than 6-7.) To be clear, I say "somewhat" because this is several years after I expect the benchmark itself to saturate. But I am not shocked if someone thinks "no, it is more likely to be meaningless". But I do think, if you're going to make a strong version of the "it's meaningless" case where you don't see the results as signal to any non-negligible degree, you need more than to just say "some other benchmarks in far less formal domains, apparently far less similar to the real-world tasks being measured, have low construct validity."

In your view, is it possible to design a benchmark that a) does not literally amount to "produce a novel important proof", but b) is such that improvements on it give decent evidence that we are moving towards models being able to do this? If it is possible, how would it differ from Frontier Math 4?

*I am prepared to change my mind on this if a bunch of mathematicians say "no, actually the questions don't look like they were optimized for this."
I am not breaking new ground by saying it would be far more interesting to see an AI system behave like a playful, curious toddler or a playful, curious cat than like a mathematician. That would be a sign of fundamental, paradigm-shifting capabilities improvement and would make me think maybe AGI is coming soon.
I agree that IQ tests were designed for humans, not machines, and that's a reason to think they're poor tests for machines, but what about all the other tests that were designed for machines? GPT-4 scored quite high on a number of LLM benchmarks in March 2023. Has enough time passed that we can say LLM benchmark performance doesn't meaningfully translate into real-world capabilities? Or do we have to reserve judgment for some number of years still?

If your argument is that math as a domain is uniquely well-suited to the talents of LLMs, that could be true. I don't know. Maybe LLMs will become an amazing AI tool for math, similar to AlphaFold for protein structure prediction. That would certainly be interesting, and would be exciting progress for AI.

I would say this argument involves deep, irreducible uncertainty and approaches the level of uncertainty of something like guessing whether the fundamental structure of physical reality matches the fundamental mathematical structure of string theory. I'm not sure it's meaningful to assign probabilities to that.

It also doesn't seem like it would be particularly consequential outside of mathematics, or outside of things that mathematical research directly affects. If benchmark performance in other domains doesn't generalize to research, but benchmark performance in math does generalize to math research, well, then, that affects math research and only math research. Which is really interesting, but would be a breakthrough akin to AlphaFold: consequential for one domain and not others.

You said that my argument against accepting FrontierMath performance as evidence for AIs soon being able to perform original math research is overly general, such that a similar argument could be used against any evidence of progress. But that objection is itself overly general: similar reasoning could be used against any argument for not accepting a particular piece of evidence about current AI capabilities as support for a particular conclusion about AI capabilities forecasting.

I suppose looking at the general contours of arguments from 30,000 feet in the air, rather than at their specifics, and worrying "what if" is not particularly useful.
I guess I still just want to ask: if models hit 80% on Frontier Math by, like, June 2027, how much does that change your opinion on whether models will be capable of "genuine creativity" in at least one domain by 2033? I'm not asking for an exact figure, just a ballpark guess. If the answer is "hardly at all", is there anything, short of a 100% clear example of a novel publishable research insight in some domain, that would change your opinion on when "real creativity" will arrive?
What I just said: AI systems acting like a toddler or a cat would make me think AGI might be developed soon.
I'm not sure FrontierMath is any more meaningful than any other benchmark, including those on which LLMs have already gotten high scores. But I don't know.
I asked about genuine research creativity, not AGI, but I don't think this conversation is going anywhere at this point. It seems obvious to me that "does stuff mathematicians say makes up the building blocks of real research" is meaningful evidence that the chance that models will do research-level maths in the near future is not ultra-low, given that capabilities do increase with time. I don't think this is analogous to IQ tests or the bar exam, and for other benchmarks, I would really need to see what you're claiming is the equivalent of the transfer from Frontier Math 4 to real math: a transfer that looked intuitive but failed.
What percentage probability would you assign to your ability to accurately forecast this particular question?
I'm not sure why you're interested in getting me to forecast this. I haven't ever made any forecasts about AI systems' ability to do math research. I haven't made any statements about AI systems' current math capabilities. I haven't said that evidence of AI systems' ability to do math research would affect how I think about AGI. So, what's the relevance? Does it have a deeper significance, or is it just a random tangent?

If there is a connection to the broader topic of AGI or AI capabilities, I already gave a bunch of examples of evidence I would consider to be relevant and that would change my mind. Math wasn't one of them. I would be happy to think of more examples as well.
I think a potentially good counterexample to your argument about FrontierMath → original math research is natural language processing → replacing human translators. Surely you would agree that LLMs have mastered the basic building blocks of translation? So, 2-3 years after GPT-4, why is demand for human translators still growing? One analysis claims that growth is counterfactually less than it would have been without the increase in the usage of machine translation, but demand is still growing.

I think this points to the difficulty in making these sorts of predictions. If, back in 2015, someone had described to you the capabilities and benchmark performance of GPT-4 in 2023, as well as the rate of scaling of new models and progress on benchmarks, would you have thought that demand for human translators would continue to grow for at least the next 2-3 years?

I don't have any particular point other than that what seems intuitively obvious in the realm of AI capabilities forecasting may in fact be false, and I am skeptical of hazy extrapolations.

The most famous example of a failed prediction of this sort is Geoffrey Hinton's prediction in 2016 that radiologists' jobs would be fully automated by 2021. Almost ten years after this prediction, the number of radiologists is still growing and radiologists' salaries are growing. AI tools that assist in interpreting radiology scans exist, but evidence is mixed on whether they actually help or hinder radiologists (and possibly harm patients).