Without commenting on your wider message, I want to pick on two specific factual claims that you are making.
AlphaZero went from a bundle of blank learning algorithms to stronger than the best human chess players in history...in less than two hours.
Training time of the final program is a deeply misleading metric, as these programs have been through endless reruns and tests to get the setup right. I think it is most honest to count total engineering time.
I know people are wary of Kurzweil, but he does seem to be on fairly solid ground here.
Extrapolating FLOPS is inherently fraught, as is the very idea of FLOPS being a useful unit. The problem is best illustrated by the following CS proverb: “A supercomputer is a device for turning computational complexity into communication complexity.” In particular, estimates for the complexity of imitating a small, mostly separate, part of a brain don’t linearly scale to estimates of imitating the much more interconnected whole.
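To make that concrete, here is a toy sketch (every number in it is made up, and it is not meant to be anatomically realistic): it just counts how many connections end up crossing machine boundaries when a network is spread over a cluster, for locally wired versus globally wired connectivity.

```python
import random

random.seed(0)

N_NEURONS = 10_000                 # toy numbers, not a brain estimate
N_MACHINES = 100
SYNAPSES_PER_NEURON = 100
BLOCK = N_NEURONS // N_MACHINES    # contiguous block of neurons per machine

def machine(neuron):
    return neuron // BLOCK

def cross_fraction(targets_of):
    """Fraction of connections whose endpoints sit on different machines."""
    cross = total = 0
    for src in range(N_NEURONS):
        for dst in targets_of(src):
            total += 1
            cross += machine(src) != machine(dst)
    return cross / total

def local_wiring(src):
    # "Mostly separate part": connections stay within a small neighbourhood.
    return ((src + random.randint(1, 10)) % N_NEURONS
            for _ in range(SYNAPSES_PER_NEURON))

def global_wiring(src):
    # "Interconnected whole": connections go anywhere.
    return (random.randrange(N_NEURONS)
            for _ in range(SYNAPSES_PER_NEURON))

print(f"local wiring:  {cross_fraction(local_wiring):.1%} of connections cross machines")
print(f"global wiring: {cross_fraction(global_wiring):.1%} of connections cross machines")
```

With local wiring only a few percent of connections cross machine boundaries; with global wiring nearly all of them do, and every one of those is a network message rather than a memory access.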
I don’t think I quite follow your criticism of FLOP/s; can you say more about why you think it’s not a useful unit? It seems like you’re saying that a linear extrapolation of FLOP/s isn’t accurate to estimate the compute requirements of larger models. (I know there are a variety of criticisms that can be made, but I’m interested in better understanding your point above)
The issue is that FLOPS cannot accurately represent computing power across different computing architectures, in particular between a single CPU and a computing cluster. As an example, let’s compare 1 computer of 100 MFLOPS with a cluster of 1000 computers of 1 MFLOPS each. The latter option has 10 times as many FLOPS, but there is a wide variety of computational problems on which the former will always be much faster. This means that FLOPS don’t meaningfully tell you which option is better; it will always depend on how well the problem you want to solve maps onto your hardware.
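A back-of-envelope version of that example, assuming the workload is one long chain of dependent operations (the workload itself is invented):

```python
# Toy model: a chain of 10^9 floating-point operations where each step
# depends on the result of the previous one, so nothing can run in parallel.
ops = 1e9

single_speed = 100e6        # one machine at 100 MFLOPS
cluster_node_speed = 1e6    # 1000 machines at 1 MFLOPS each, i.e. 10x the total FLOPS

# The single machine just runs the chain back to back.
t_single = ops / single_speed               # 10 seconds

# On the cluster, the dependency chain pins the work to one 1 MFLOPS node;
# the other 999 nodes sit idle, and the total FLOPS figure never matters.
t_cluster = ops / cluster_node_speed        # 1000 seconds

print(f"single 100 MFLOPS machine: {t_single:.0f} s")
print(f"1000 x 1 MFLOPS cluster:   {t_cluster:.0f} s")
```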
In large-scale computing, the bottleneck is often the communication speed of the network. If the calculations you have to do don’t decompose neatly into mostly independent tasks, the different computers have to communicate a lot, which slows everything down. Adding more FLOPS (more computers) won’t prevent that in the slightest.
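Here is a toy scaling model of that effect (every constant is invented; it is only meant to show the shape of the problem, not to describe any real machine):

```python
# Toy model of a tightly coupled job: each timestep does a fixed amount of
# arithmetic split across the machines, but each machine must also exchange
# data with the others before the next step can start.
def step_time(n_machines,
              flop_per_step=1e9,       # arithmetic per timestep (assumed)
              flops_per_machine=1e9,   # speed of a single machine (assumed)
              bytes_per_exchange=1e6,  # data each machine sends per step (assumed)
              bandwidth=1e9,           # network bandwidth, bytes/s (assumed)
              latency=50e-6):          # per-message latency, seconds (assumed)
    compute = flop_per_step / (n_machines * flops_per_machine)
    if n_machines == 1:
        return compute                 # nothing to communicate
    # Crude all-to-all model: roughly one message per peer, plus the data itself.
    communicate = latency * n_machines + bytes_per_exchange / bandwidth
    return compute + communicate

for n in (1, 10, 100, 1000):
    print(f"{n:5d} machines: {step_time(n) * 1e3:8.2f} ms per step")
```

In this model the step time shrinks up to a point and then grows again: past some cluster size, the extra FLOPS buy less than the extra coordination costs.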
You cannot extrapolate FLOPS estimates without justifying why the communication overhead doesn’t make the estimated quantity meaningless on parallel hardware.
I remember looking into communication speed, but unfortunately I can’t find the sources I found last time! As I recall, when I checked, the communication figures weren’t meaningfully different from the processing speed figures.
Edit: found it! AI Impacts on TEPS (traversed edges per second): https://aiimpacts.org/brain-performance-in-teps/
Yeah, basically computers are closer in communication speed to a human brain than they are in processing speed. Which makes intuitive sense—they can transfer information at the speed of light, while brains are stuck sending chemical signals in many (all?) cases.
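For a sense of the raw propagation times involved (very rough, order-of-magnitude figures, ignoring switching, routing, and synaptic delays on both sides):

```python
# Rough signal-propagation times over ~10 cm; all speeds are approximate.
distance = 0.1   # metres, roughly the span of a human brain

speeds = {
    "electrical/optical interconnect": 2e8,    # m/s, a large fraction of light speed
    "fast myelinated axon":            100.0,  # m/s, approximate upper end
    "slow unmyelinated axon":          1.0,    # m/s, approximate
}

for label, v in speeds.items():
    print(f"{label:32s}: {distance / v:.1e} s to cover 10 cm")
```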
2nd edit: On your earlier point about training time vs. total engineering time...“Most honest” isn’t really the issue. It’s what you care about: training time illustrates that once an AI system is built, it can quickly surpass human-level performance. Then the AI will keep improving, leaving us in the dust (although the applicability of current algorithms to more complex tasks is unclear). Total engineering time would show that these are massive projects which take time to develop...which is also true.