Are you referring to the length of tasks that LLMs are able to complete with a 50% success rate? I don't see that as a meaningful indicator of AGI. Indeed, I would say it's practically meaningless. It truly just doesn't make sense as an indicator of progress toward AGI. I find it strange that anyone thinks otherwise. Why should we see that metric as indicating AGI progress any more than, say, the length of LLMs' context windows?
I think a much more meaningful indicator from METR would be the rate at which AI coding assistants speed up coding tasks for human coders. Currently, METR's finding is that they slow them down by 19%. But this is asymmetric. Failing to clear a low bar like being an unambiguously useful coding assistant in such tests is strong evidence against models nearing human-level capabilities, but clearing a low bar is not strong evidence for models nearing human-level capabilities. By analogy, we might take an AI system being bad at chess as evidence that it has much less than human-level general intelligence. But we shouldn't take an AI system (such as Deep Blue or AlphaZero) being really good at chess as evidence that it has human-level or greater general intelligence.
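To make that asymmetry concrete, here is a minimal Bayes-factor sketch in Python. The probabilities are invented purely for illustration (they are my assumptions, not METR's or anyone else's numbers); the point is just that an outcome a near-human-level system would almost certainly produce, but that much weaker systems might often produce too, can only be weak evidence for near-human-level capabilities, while failing to produce it can be strong evidence against.

```python
# Minimal Bayes-factor sketch of the evidential asymmetry.
# All probabilities below are made up for illustration only.
# "Low bar" = being an unambiguously useful coding assistant in a study like METR's.

p_clear_if_near_human = 0.99  # a near-human-level system would almost surely clear the bar
p_clear_if_not = 0.50         # but plenty of narrow, far-from-AGI systems might clear it too

# Likelihood ratios (Bayes factors) for the two possible observations:
bf_clearing = p_clear_if_near_human / p_clear_if_not              # ~2 : weak evidence for
bf_failing = (1 - p_clear_if_near_human) / (1 - p_clear_if_not)   # ~0.02 : ~50-to-1 against

print(f"Clearing the bar: Bayes factor ~{bf_clearing:.1f} (weak evidence for near-human-level capability)")
print(f"Failing the bar:  Bayes factor ~{bf_failing:.2f} (strong evidence against)")
```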
If I wanted to settle for an indirect proxy for progress toward AGI, I could short companies like Nvidia, Microsoft, Google, or Meta (e.g. see my recent question about this), but, of course, those companies' stock prices don't directly measure AGI progress. Conversely, someone who wanted to take the other side of the bet could take a long position in those stocks. But then this isn't much of an improvement on the above. If LLMs became much more useful coding assistants, that could help justify these companies' stock prices, but it wouldn't say much about progress toward AGI. Likewise for other repetitive, text-heavy tasks, like customer support via web chat.
It seems like the flip side should be different: if you do think AGI is very likely to be created within 7 years, shouldn't that imply a long position in stocks like Nvidia, Microsoft, Google, or Meta would be lucrative? In principle, you could believe that LLMs are some number of years away from being able to make a lot of money and at most 7 years away from progressing to AGI, and that the market will give up on LLMs making a lot of money just a few years too soon. But I would find this to be a strange and implausible view.
So, to be clear, you think that if LLMs continue to complete software engineering tasks of exponentially increasing lengths at exponentially decreasing risk of failure, then that tells us nothing about whether LLMs will reach AGI?
I expect most EAs who have enough money to consider investing it to already be investing it in index funds, which, by design, are already long the Magnificent Seven.
I'm not sure if you're asking about the METR graph on task length or about the practical use of AI coding assistants, which the METR study found is currently negative.
If I understand it correctly, the METR graph doesn't measure an exponentially decreasing failure rate, just a fixed 50% failure rate. (There's also a version of the graph with a 20% failure rate, but that's not the one people typically cite.)
I also think automatically graded tasks used in benchmarks don't usually deserve to be called "software engineering" or anything that implies that the actual tasks the LLM is doing are practically useful, economically valuable, or could actually substitute for tasks that humans get paid to do.
I think many of these LLM benchmarks are measuring things so narrow and so toy-like, seemingly selected largely to make the benchmarks easier for LLMs, that they aren't particularly meaningful.
Studies of real-world performance, like METR's study of human coders using an AI coding assistant, are much more interesting and important. Although I find most LLM benchmarks practically meaningless for measuring AGI progress, I think practical performance in economically valuable contexts is much more meaningful.
My point in the above comment was just that an unambiguously useful AI coding assistant would not by itself be strong evidence for near-term AGI. AI systems mastering games like chess and Go is impressive and interesting and probably tells us something about AGI progress, but if someone had pointed to AlphaGo beating Lee Sedol as strong evidence that AGI would be created within 7 years of that point, they would have been wrong.
In other words, progress in AI probably tells us something about AGI progress, but just taking impressive results in AI and saying that implies AGI within 7 years isn't correct, or at least it's unsupported. Why 7 years and not 17 years or 77 years or 177 years?
If you're allowed to assume whatever rate of progress you like, you can support any timeline you like from any evidence you like, but, in my opinion, that's no way to make an argument.
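To illustrate just how much the assumed rate matters, here is a minimal extrapolation sketch in Python. The current task-length horizon, the doubling times, and the task lengths treated as "AGI-level" are all placeholder assumptions chosen for illustration, not figures from METR; the point is only that modest changes to those assumptions move the extrapolated arrival time from a few years to several decades.

```python
# Illustrative only: how sensitive an "AGI arrival" extrapolation is to the assumed
# doubling time and to the task length one decides to treat as the AGI threshold.
# Every number here is a placeholder assumption, not a measured figure.
import math

current_horizon_hours = 1.0  # assumed current 50%-success task-length horizon

for doubling_months in (4, 7, 12, 24):           # assumed trend speeds
    for threshold_hours in (170, 2000, 20000):   # ~1 work-month, ~1 work-year, ~10 work-years
        doublings = math.log2(threshold_hours / current_horizon_hours)
        years = doublings * doubling_months / 12
        print(f"doubling every {doubling_months:>2} months, threshold {threshold_hours:>6} h"
              f" -> ~{years:.1f} years")
```

Depending on which row you believe, the same graph "supports" a timeline anywhere from roughly 2.5 years to nearly 30.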
On the topic of betting and investing, it's true that index funds have exposure to AI, and indeed personally I worry about how much exposure the S&P 500 has (global index funds that include small-cap stocks have less, but I don't know how much less). My argument in the comment above is simply that if someone thought it was rational to bet some amount of money on AGI arriving within 7 years, then surely it would be rational to invest that same amount of money in a 100% concentrated investment in AI and not, say, the S&P 500.