I am not breaking new ground by saying it would be far more interesting to see an AI system behave like a playful, curious toddler or a playful, curious cat than like a mathematician. That would be a sign of a fundamental, paradigm-shifting improvement in capabilities and would make me think maybe AGI is coming soon.
I agree that IQ tests were designed for humans, not machines, and that's a reason to think they're a poor test for machines, but what about all the other tests that were designed for machines? GPT-4 scored quite high on a number of LLM benchmarks in March 2023. Has enough time passed that we can say LLM benchmark performance doesn't meaningfully translate into real-world capabilities? Or do we have to reserve judgment for some number of years still?
If your argument is that math as a domain is uniquely well-suited to the talents of LLMs, that could be true. I don't know. Maybe LLMs will become an amazing AI tool for math, similar to AlphaFold for protein structure prediction. That would certainly be interesting, and would be exciting progress for AI.
I would say this argument is subject to high, irreducible uncertainty, approaching the uncertainty of something like guessing whether the fundamental structure of physical reality matches the fundamental mathematical structure of string theory. I'm not sure it's meaningful to assign probabilities to that.
It also doesn't seem like it would be particularly consequential outside of mathematics, or outside of things that mathematical research directly affects. If benchmark performance in other domains doesn't generalize to research, but benchmark performance in math does generalize to math research, well, then, that affects math research and only math research. Which is really interesting, but would be a breakthrough akin to AlphaFold: consequential for one domain and not others.
You said that my argument against accepting FrontierMath performance as evidence that AIs will soon be able to perform original math research is overly general, such that a similar argument could be used against any evidence of progress. But that objection is itself overly general: similar reasoning could be used against any argument for not accepting a given piece of evidence about current AI capabilities as support for a given conclusion about AI capabilities forecasting.
I suppose looking at the general contours of arguments from 30,000 feet rather than at their specifics, and worrying "what if," is not particularly useful.
I guess I still just want to ask: if models hit 80% on FrontierMath by, say, June 2027, how much does that change your opinion on whether models will be capable of "genuine creativity" in at least one domain by 2033? I'm not asking for an exact figure, just a ballpark guess. If the answer is "hardly at all," is there anything short of a 100% clear example of a novel, publishable research insight in some domain that would change your opinion on when "real creativity" will arrive?
What I just said: AI systems acting like a toddler or a cat would make me think AGI might be developed soon.
I'm not sure FrontierMath is any more meaningful than any other benchmark, including those on which LLMs have already gotten high scores. But I don't know.
I asked about genuine research creativity, not AGI, but I don't think this conversation is going anywhere at this point. It seems obvious to me that "does stuff mathematicians say makes up the building blocks of real research" is meaningful evidence that the chance that models will do research-level maths in the near future is not ultra-low, given that capabilities do increase with time. I don't think this is analogous to IQ tests or the bar exam, and for other benchmarks, I would really need to see what you're claiming is the equivalent of the transfer from FrontierMath Tier 4 to real math that was intuitive but failed.
What percentage probability would you assign to your ability to accurately forecast this particular question?
I'm not sure why you're interested in getting me to forecast this. I haven't ever made any forecasts about AI systems' ability to do math research. I haven't made any statements about AI systems' current math capabilities. I haven't said that evidence of AI systems' ability to do math research would affect how I think about AGI. So, what's the relevance? Does it have a deeper significance, or is it just a random tangent?
If there is a connection to the broader topic of AGI or AI capabilities, I already gave a bunch of examples of evidence I would consider to be relevant and that would change my mind. Math wasn't one of them. I would be happy to think of more examples as well.
I think a potentially good counterexample to your argument about FrontierMath → original math research is natural language processing → replacing human translators. Surely you would agree that LLMs have mastered the basic building blocks of translation? So, 2-3 years after GPT-4, why is demand for human translators still growing? One analysis claims that growth is counterfactually less than it would have been without the increase in the usage of machine translation, but demand is still growing.
I think this points to the difficulty of making these sorts of predictions. If, back in 2015, someone had described to you the capabilities and benchmark performance of GPT-4 in 2023, as well as the rate of scaling of new models and progress on benchmarks, would you have thought that demand for human translators would continue to grow for at least the next 2-3 years?
I don't have any particular point other than that what seems intuitively obvious in the realm of AI capabilities forecasting may in fact be false, and I am skeptical of hazy extrapolations.
The most famous example of a failed prediction of this sort is Geoffrey Hinton's prediction in 2016 that radiologists' jobs would be fully automated by 2021. Almost ten years after this prediction, the number of radiologists is still growing and radiologists' salaries are growing. AI tools that assist in interpreting radiology scans exist, but evidence is mixed on whether they actually help or hinder radiologists (and possibly harm patients).