DeepMind certainly seems to be saying that AlphaZero is better at searching a more limited set of promising moves than Stockfish, a traditional chess engine (unfortunately they don’t compare it to earlier versions of AlphaGo on this metric).
Only at test time. AlphaZero has much more experience gained from its training phase. (Stockfish has no training phase, though you could think of all of the human domain knowledge encoded in it as a form of “training”.)
AlphaZero went from a bundle of blank learning algorithms to stronger than the best human chess players in history...in less than two hours.
Humans are extremely poorly optimized for playing chess.
I don’t agree with Garfinkel that OpenAI’s analysis should make us more pessimistic about human-level AI timelines. While it makes sense to revise our estimate of AI algorithms downward, it doesn’t follow that we should do the same for our estimate of overall progress in AI. By cortical neuron count, systems like AlphaZero are at about the same level as a blackbird (albeit one that lives for 18 years),[7] so there’s a clear case for future advances being more impressive than current ones as we approach the human level.
Sounds like you are using a model where (our understanding of) current capabilities and rates of progress of AI are not very relevant for determining future capabilities, because we don’t know the absolute quantitative capability corresponding to “human-level AI”. Instead, you model it primarily on the absolute amount of compute needed.
Suppose you did know the absolute capability corresponding to “human-level AI”, e.g. you can say something like “once we are able to solve Atari benchmarks using only 10k samples from the environment, we will have human-level AI”, and you found that metric much more persuasive than the compute used by a human brain. Would you then agree with Garfinkel’s point?
Thanks for the comment! In order:
I think that its performance at test time is one of the more relevant measures: I take grandmasters’ considering fewer moves during a game as evidence that they’ve learned something more of the ‘essence’ of chess than AlphaZero, and I think AlphaZero’s learning was similarly superior to Stockfish’s relatively blind approach. Training time is also an important measure, but that’s why Carey brings up the 300-year AlphaGo Zero milestone.
Indeed we are. And it’s not clear to me that we’re much better optimized for general cognition. We’re extremely bad at doing math that pocket calculators have no problem with, yet it took us a while to build good chess- and Go-playing AI. I worry we have very little idea how hard different cognitive tasks will be for something with a brain-equivalent amount of compute.
I’m focusing on compute partly because it’s the easiest to measure. My understanding (and I think everyone else’s) of AI capabilities is largely shaped by how impressive the results of major papers intuitively seem. And when AI can use something like the amount of compute a human brain has, we should eventually get a similar level of capability, so I think compute is a good yardstick.
I’m not sure I fully understand how the metric would work. For the Atari example, it seems clear to me that we could easily reach it without making a generalizable AI system, or vice versa. I’m not sure what metric could be appropriate—I think we’d have to know a lot more about intelligence. And I don’t know if we’ll need a completely different computing paradigm from ML to learn in a more general way. There might not be a relevant capability level for ML systems that would correspond to human-level AI.
But let’s say that we could come up with a relevant metric. Then I’d agree with Garfinkel, as long as people in the community had known roughly the current state of AI in relation to it and the rate of advance toward it before the release of “AI and Compute”.
Mostly agree with all of this; some nitpicks:
My understanding (and I think everyone else’s) of AI capabilities is largely shaped by how impressive the results of major papers intuitively seem.
I claim that this is not how I think about AI capabilities, and it is not how many AI researchers think about AI capabilities. For a particularly extreme example, the Go-explore paper out of Uber had a very nominally impressive result on Montezuma’s Revenge, but much of the AI community didn’t find it compelling because of the assumptions that their algorithm used.
I’m not sure I fully understand how the metric would work. For the Atari example, it seems clear to me that we could easily reach it without making a generalizable AI system, or vice versa.
Tbc, I definitely did not intend for that to be an actual metric.
But let’s say that we could come up with a relevant metric. Then I’d agree with Garfinkel, as long as people in the community had known roughly the current state of AI in relation to it and the rate of advance toward it before the release of “AI and Compute”.
I would say that I have a set of intuitions and impressions that function as a very weak prediction of what AI will look like in the future, along the lines of that sort of metric. I trust timelines based on extrapolation of progress using these intuitions more than timelines based solely on compute.
To the extent that you hear timeline estimates from people like me who do this sort of “progress extrapolation” who also did not know about how compute has been scaling, you would want to lengthen their timeline estimates. I’m not sure how timeline predictions break down on this axis.
I claim that this is not how I think about AI capabilities, and it is not how many AI researchers think about AI capabilities. For a particularly extreme example, the Go-explore paper out of Uber had a very nominally impressive result on Montezuma’s Revenge, but much of the AI community didn’t find it compelling because of the assumptions that their algorithm used.
Sorry, I meant the results in light of which methods were used, implications for other research, etc. The sentence would better read, “My understanding (and I think everyone else’s) of AI capabilities is largely shaped by how impressive major papers seem.”
Tbc, I definitely did not intend for that to be an actual metric.
Yeah, totally got that—I just think that making a relevant metric would be hard, and we’d have to know a lot that we don’t know now, including whether current ML techniques can ever lead to AGI.
I would say that I have a set of intuitions and impressions that function as a very weak prediction of what AI will look like in the future, along the lines of that sort of metric. I trust timelines based on extrapolation of progress using these intuitions more than timelines based solely on compute.
Interesting. Yeah, I don’t much trust my own intuitions on our current progress. I’d love to have a better understanding of how to evaluate the implications of new developments, but I really can’t do much better than, “GPT-2 impressed me a lot more than AlphaStar.” And just to be 100% clear—I tend to think that the necessary amount of compute is somewhere in the 18-to-300-year range. After we reach it, I’m stuck using my intuition to guess when we’ll have the right algorithms to create AGI.