I wonder if you noticed that you changed the question? Did you not notice or did you change the question deliberately?
What I brought up as a potential form of important evidence for near-term AGI was:
Any sort of significant credible evidence of a major increase in AI capabilities, such as LLMs being able to autonomously and independently come up with new correct ideas in science, technology, engineering, medicine, philosophy, economics, psychology, etc. (not as a tool for human researchers to more easily search the research literature or anything along those lines, but doing the actual creative intellectual act itself)
You turned the question into:
If you wouldn't count this as evidence that genuine AI contributions to research mathematics might not be more than 6-7 years off, what, if anything, would you count as evidence of that?
Now, rather than asking me about the evidence I use to forecast near-term AGI, you're asking me to forecast the arrival of the evidence I would use for forecasting near-term AGI? Why?
My thought process didn't go beyond "Yarrow seems committed to a very low chance of AI having real, creative research insights in the next few years; here is something that puts some pressure on that." Obviously I agree that when AGI will arrive is a different question from when models will have real insights in research mathematics. Nonetheless I got the feeling, maybe incorrectly, that your strength of conviction that AGI is far off is partly based on things like "models in the current paradigm can't have 'real insight'," so it seemed relevant, even though "real insight in maths is probably coming soon, but AGI likely over 20 years away" is perfectly coherent, and indeed close to my own view.

Anyway, why can't you just answer my question?
I have no idea when AI systems will be able to do math research and generate original, creative ideas autonomously, but it will certainly be very interesting if/when they do.
It seems like there's not much of a connection between the FrontierMath benchmark and this, though. LLMs have been scoring well on question-and-answer benchmarks in multiple domains for years and haven't produced any original, correct ideas yet, as far as I'm aware. So, why would this be different?
LLMs have been scoring above 100 on IQ tests for years and yet can't do most of the things humans who score above 100 on IQ tests can do. If an LLM does well on math problems that are hard for mathematicians or math grad students or whatever, that doesn't necessarily imply it will be able to do the other things, even within the domain of math, that mathematicians or math grad students do.
We have good evidence for this because LLMs as far back as GPT-4, nearly 3 years ago, have done well on a bunch of written tests. Despite there being probably over 1 billion regular users of LLMs and trillions of queries put to LLMs, there's no indication I'm aware of that an LLM has come up with a novel, correct idea of any note in any academic or technical field. Is there a reason to think performance on the FrontierMath benchmark would be different from the trend we've already seen with other benchmarks over the last few years?
The FrontierMath problems may indeed require creativity from humans to solve them, but that doesn't necessarily mean solving them is a sign of creativity from LLMs. By analogy, playing grandmaster-level chess may require creativity from humans, but not from computers.
This is related to an old idea in AI called Moravec's paradox, which warns us not to assume that what is hard for humans is hard for computers, or that what is easy for humans is easy for computers.
I guess I feel like: if being able to solve mathematical problems designed by research mathematicians to be similar to the kind of problems they solve in their actual work is not decent evidence that AIs are on track to be able to do original research in mathematics in less than, say, 8 years, then what would you EVER accept as empirical evidence that we are on track for that but not there yet?
Note that I am not saying this should push your overall confidence to over 50% or anything, just that it ought to move you up by a non-trivial amount relative to whatever your credence was before. I am certainly NOT saying that skill on Frontier Math 4 will inevitably transfer to real research mathematics, just that you should think there is a substantial risk that it will.
I am not persuaded by the analogy to IQ test scores, for the following reason. It is far from clear that the tasks LLMs can't do, despite scoring 100 on IQ tests, resemble IQ-test items anything like as closely as the Frontier Math 4 tasks are (at least allegedly) designed to resemble real research questions in mathematics*, because the latter are deliberately designed for similarity, whereas IQ tests are just designed so that skill on them correlates with skill on intellectual tasks in general among humans. (I also think the inference towards "they will be able to DO research math", from progress on Frontier Math 4, is rather less shaky than "they will DO proper research math in the same way as humans". It's not clear to me what tasks actually require "real creativity", if that means a particular reasoning style rather than just the production of novel insights as an end product. I don't think you or anyone else knows this either.) Real math is also uniquely suited to question-and-answer benchmarks, I think, because things really are often posed as extremely well-defined problems with determinate answers, i.e. "prove X". Proving things is not literally the only skill mathematicians have, but being able to prove the right stuff is enough to be making a real contribution. In my view that makes claims for construct validity here much more plausible than, say, inferring ChatGPT can be a lawyer if it passes the bar exam.
In general, your argument here seems like it could be deployed against literally any empirical evidence that AIs were approaching being able to do a task, short of them actually performing that task. You can always say "just because, in humans, ability to do X is correlated with ability to do Y, doesn't mean the techniques the models are using to do X can do Y with a bit of improvement." And yes, that is always true: it doesn't *automatically* mean that. But if you allow this to mean that no success on any task ever significantly moves you at all about future real-world progress on intuitively similar but harder tasks, you are basically saying it is impossible to get empirical evidence that progress is coming before it has arrived, which is just pretty suspicious a priori. What you should do, in my view, is think carefully about the construct validity of the particular benchmark in question, and then, roughly, update your view based on how likely you think it is to be basically valid and what it would mean if it was. You should take into account the risk that success on Frontier Math 4 is giving real signal, not just the risk that it is meaningless.
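To make the kind of update I have in mind concrete, here is a toy sketch in Python. The numbers (p_valid, p_research_if_valid, p_research_if_invalid) are entirely made-up placeholders, not my actual estimates; the point is only the structure: treat the benchmark's construct validity as an uncertain hypothesis and average over it, rather than treating the result as either pure signal or pure noise.

```python
# Toy sketch with made-up placeholder numbers, not real estimates:
# treat "Frontier Math 4 has construct validity" as an uncertain hypothesis
# and marginalize over it when updating on a high benchmark score.

p_valid = 0.5                # hypothetical prior that the benchmark tracks research-relevant skill
p_research_if_valid = 0.6    # P(real math contributions within ~7 years | benchmark valid and saturated)
p_research_if_invalid = 0.1  # P(same outcome | benchmark is essentially meaningless)

# The forecast is a weighted average of the two branches, so neither
# "pure signal" nor "pure noise" dominates unless you are very confident
# in one of them.
p_research = p_valid * p_research_if_valid + (1 - p_valid) * p_research_if_invalid
print(p_research)  # 0.35 with these placeholder numbers
```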
My personal guess is that it is somewhat meaningful, and that we will see the first real AI contributions to maths in 6-7 years; that is, a 60% chance by then of AI proofs important enough for credible mid-ranking journals. To be clear, I say "somewhat" because this is several years after I expect the benchmark itself to saturate. (EDIT: I forgot my own forecast here; I expect saturation in about 5 years, so "several" years is an exaggeration. Nonetheless I expect some gap between Frontier Math 4 being saturated and the first real contributions to research mathematics: I guess 6-9 years until real contributions is more like my forecast than 6-7.) I am not shocked if someone thinks "no, it is more likely to be meaningless." But I do think that if you're going to make a strong version of the "it's meaningless" case, where you don't see the results as signal to any non-negligible degree, you need more than to just say "some other benchmarks, in far less formal domains, apparently far less similar to the real-world tasks being measured, have low construct validity."
In your view, is it possible to design a benchmark that a) does not literally amount to "produce a novel important proof", but b) nonetheless improvements on the benchmark give decent evidence that we are moving towards models being able to do this? If it is possible, how would it differ from Frontier Math 4?
*I am prepared to change my mind on this if a bunch of mathematicians say "no, actually the questions don't look like they were optimized for this."
I am not breaking new ground by saying it would be far more interesting to see an AI system behave like a playful, curious toddler or a playful, curious cat than a mathematician. That would be a sign of fundamental, paradigm-shifting capabilities improvement and would make me think maybe AGI is coming soon.
I agree that IQ tests were designed for humans, not machines, and that's a reason to think they're a poor test for machines, but what about all the other tests that were designed for machines? GPT-4 scored quite high on a number of LLM benchmarks in March 2023. Has enough time passed that we can say LLM benchmark performance doesn't meaningfully translate into real-world capabilities? Or do we have to reserve judgment for some number of years still?
If your argument is that math as a domain is uniquely well-suited to the talents of LLMs, that could be true. I don't know. Maybe LLMs will become an amazing AI tool for math, similar to AlphaFold for protein structure prediction. That would certainly be interesting, and would be exciting progress for AI.
I would say this argument is highly and irreducibly uncertain, approaching the level of uncertainty of something like guessing whether the fundamental structure of physical reality matches the fundamental mathematical structure of string theory. I'm not sure it's meaningful to assign probabilities to that.
It also doesn't seem like it would be particularly consequential outside of mathematics, or outside of things that mathematical research directly affects. If benchmark performance in other domains doesn't generalize to research, but benchmark performance in math does generalize to math research, well, then, that affects math research and only math research. Which is really interesting, but would be a breakthrough akin to AlphaFold: consequential for one domain and not others.
You said that my argument against accepting FrontierMath performance as evidence for AIs soon being able to perform original math research is overly general, such that a similar argument could be used against any evidence of progress. But that objection is itself overly general: similar reasoning could be used against any argument for not accepting a particular piece of evidence about current AI capabilities as support for a particular conclusion about forecasting future AI capabilities.
I suppose looking at the general contours of arguments from 30,000 feet in the air, rather than at their specifics, and worrying "what if" is not particularly useful.
I guess I still just want to ask: if models hit 80% on Frontier Math by, like, June 2027, how much does that change your opinion on whether models will be capable of "genuine creativity" in at least one domain by 2033? I'm not asking for an exact figure, just a ballpark guess. If the answer is "hardly at all," is there anything, short of a 100% clear example of a novel publishable research insight in some domain, that would change your opinion on when "real creativity" will arrive?
What I just said: AI systems acting like a toddler or a cat would make me think AGI might be developed soon.
I'm not sure FrontierMath is any more meaningful than any other benchmark, including those on which LLMs have already gotten high scores. But I don't know.
I asked about genuine research creativity, not AGI, but I don't think this conversation is going anywhere at this point. It seems obvious to me that "does stuff mathematicians say makes up the building blocks of real research" is meaningful evidence that the chance that models will do research-level maths in the near future is not ultra-low, given that capabilities do increase with time. I don't think this is analogous to IQ tests or the bar exam, and for other benchmarks, I would really need to see which transfer you're claiming is the equivalent of the one from Frontier Math 4 to real maths, i.e. a transfer that seemed intuitive but failed.
What percentage probability would you assign to your ability to accurately forecast this particular question?
I'm not sure why you're interested in getting me to forecast this. I haven't ever made any forecasts about AI systems' ability to do math research. I haven't made any statements about AI systems' current math capabilities. I haven't said that evidence of AI systems' ability to do math research would affect how I think about AGI. So, what's the relevance? Does it have a deeper significance, or is it just a random tangent?
If there is a connection to the broader topic of AGI or AI capabilities, I already gave a bunch of examples of evidence I would consider to be relevant and that would change my mind. Math wasn't one of them. I would be happy to think of more examples as well.
I think a potentially good counterexample to your argument about FrontierMath → original math research is natural language processing → replacing human translators. Surely you would agree that LLMs have mastered the basic building blocks of translation? So, 2-3 years after GPT-4, why is demand for human translators still growing? One analysis claims that growth is counterfactually less than it would have been without the increase in the usage of machine translation, but demand is still growing.
I think this points to the difficulty of making these sorts of predictions. If, back in 2015, someone had described to you the capabilities and benchmark performance of GPT-4 in 2023, as well as the rate of scaling of new models and progress on benchmarks, would you have thought that demand for human translators would continue to grow for at least the next 2-3 years?
I don't have any particular point other than that what seems intuitively obvious in the realm of AI capabilities forecasting may in fact be false, and I am skeptical of hazy extrapolations.
The most famous example of a failed prediction of this sort is Geoffrey Hinton's prediction in 2016 that radiologists' jobs would be fully automated by 2021. Almost ten years after this prediction, the number of radiologists is still growing and radiologists' salaries are growing. AI tools that assist in interpreting radiology scans exist, but evidence is mixed on whether they actually help or hinder radiologists (and possibly harm patients).