I'm not sure if you're asking about the METR graph on task length or about the practical effect of AI coding assistants, which the METR study found is currently negative.
If I understand it correctly, the METR graph doesn't measure an exponentially decreasing failure rate; it holds the failure rate fixed at 50% and measures how long a task models can complete at that rate. (There's also a version of the graph with a 20% failure rate, i.e. an 80% success rate, but that's not the one people typically cite.)
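For what it's worth, my rough understanding of how that number is produced: METR fits a logistic curve to model success versus human task length and reads off the task length at which predicted success hits a chosen threshold. Here's a minimal sketch of that idea with entirely made-up data, not METR's actual code or data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: task lengths in human-minutes and whether the model
# succeeded on each task.
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
succeeded    = np.array([1, 1, 1, 1, 1,  1,  1,   0,   1,   0])

# Model success probability as a logistic function of log task length.
X = np.log(task_minutes).reshape(-1, 1)
fit = LogisticRegression().fit(X, succeeded)
w, b = fit.coef_[0][0], fit.intercept_[0]

# 50% success rate (the usual graph): the logistic crosses 0.5 where w*x + b = 0.
horizon_50 = np.exp(-b / w)
# 80% success rate (the "20% failure rate" version): crosses 0.8 where w*x + b = ln(4).
horizon_80 = np.exp((np.log(4) - b) / w)

print(f"50% time horizon ~ {horizon_50:.0f} human-minutes")
print(f"80% time horizon ~ {horizon_80:.0f} human-minutes")
```

Either way, it's a fixed success threshold at each point in time, not a failure rate that shrinks over time.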
I also think the automatically graded tasks used in benchmarks don't usually deserve to be called "software engineering" or anything else that implies the tasks the LLM is doing are practically useful, economically valuable, or could actually substitute for tasks that humans get paid to do.
I think many of these LLM benchmarks try to measure such narrow things, on such toy problems (which seem largely selected to make the benchmarks easier for LLMs), that they aren't particularly meaningful.
Studies of real-world performance, like METR's study of human coders using an AI coding assistant, are much more interesting and important. Although I find most LLM benchmarks practically meaningless for measuring AGI progress, I think practical performance in economically valuable contexts is much more meaningful.
My point in the above comment was just that an unambiguously useful AI coding assistant would not by itself be strong evidence for near-term AGI. AI systems mastering games like chess and Go is impressive and interesting and probably tells us something about AGI progress, but if someone had pointed to AlphaGo beating Lee Sedol as strong evidence that AGI would be created within 7 years of that point, they would have been wrong.
In other words, progress in AI probably tells us something about AGI progress, but taking impressive results in AI and saying they imply AGI within 7 years isn't correct, or is at least unsupported. Why 7 years and not 17 years, 77 years, or 177 years?
If you're allowed to assume whatever rate of progress you like, you can support any timeline you like from any evidence you like, but, in my opinion, that's no way to make an argument.
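To make that concrete, here's a toy extrapolation. The doubling times and the threshold ("AGI means time horizons 1,000x longer than today's") are numbers I'm picking purely for illustration; nothing here is an actual forecast:

```python
import numpy as np

# Toy extrapolation: assume the time horizon doubles every d months, and
# assume (arbitrarily) that "AGI" requires horizons 1,000x longer than today's.
for doubling_months in (4, 7, 12, 24, 48):
    years = np.log2(1000) * doubling_months / 12
    print(f"doubling every {doubling_months:>2} months -> ~{years:.0f} years")
```

Pick a different doubling time or a different threshold and you get a completely different timeline, which is exactly the problem.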
On the topic of betting and investing, it's true that index funds have exposure to AI, and indeed I personally worry about how much exposure the S&P 500 has (global index funds that include small-cap stocks have less, but I don't know how much less). My argument in the comment above is simply that if someone thought it was rational to bet some amount of money on AGI arriving within 7 years, then surely it would be rational to invest that same amount of money in a 100% concentrated AI investment rather than in, say, the S&P 500.