Machine learning involves repetitive operations which can be processed simultaneously (parallelization)
I agree, but of course Amdahl’s Law remains in effect.
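To make that concrete, here is a minimal sketch of Amdahl's Law (the parallel fraction and unit counts below are illustrative, not measurements): whatever serial portion of a workload remains caps the speedup you can get from adding parallel hardware.

```python
# Amdahl's Law: speedup from parallelizing a fraction p of the work across n units.
# The serial fraction (1 - p) bounds the achievable speedup at 1 / (1 - p).

def amdahl_speedup(p: float, n: int) -> float:
    """Overall speedup when a fraction p of the runtime is parallelized over n units."""
    return 1.0 / ((1.0 - p) + p / n)

if __name__ == "__main__":
    # Illustrative numbers only: even with 95% of the work parallelized,
    # the speedup saturates around 20x no matter how many units you add.
    for n in (8, 64, 1024):
        print(f"p=0.95, n={n:5d}: speedup = {amdahl_speedup(0.95, n):.1f}x")
```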
The goal of hardware optimization is often parallization (sic)
Generally, when designing hardware, increased throughput or reduced latency (for some representative set of workloads) are the main goals. Parallelization is one particular technique that can help achieve those goals, but there are many other ideas/techniques/optimizations that one can apply.
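As a toy illustration of why throughput and latency are separate targets (all timing numbers below are made up, not taken from any real device), consider batching: a larger batch amortizes fixed per-launch overhead, which raises throughput but makes each request wait on the whole batch.

```python
# Toy model (illustrative numbers): a device processes a batch of b requests in
# t(b) = fixed_overhead + b * per_item_cost seconds. Bigger batches amortize the
# fixed overhead (better throughput) but every request waits for the whole batch
# (worse latency).

FIXED_OVERHEAD_S = 0.010   # hypothetical launch/setup cost per batch
PER_ITEM_COST_S = 0.001    # hypothetical per-request compute cost

def batch_time(b: int) -> float:
    return FIXED_OVERHEAD_S + b * PER_ITEM_COST_S

for b in (1, 8, 64):
    t = batch_time(b)
    print(f"batch={b:3d}: latency = {t*1000:6.1f} ms, throughput = {b / t:7.1f} req/s")
```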
The widespread development of machine learning hardware started in the early-to-mid 2010s, and a significant increase in investment and progress occurred in the late 2010s
Sure… I mean, deep learning wasn't even a thing until 2012. I think the important concept here is that hardware designs have a long time horizon (generally 2-3 years): it takes that long to do a clean-sheet design, and if you're spending millions of dollars to design/tape out/manufacture a new chip, you need to be convinced that the workload is real and that people will still be using it years from now when you're trying to sell your new chip.
CUDA optimization, or optimization of low-level instruction sets for machine learning operations (kernels), generated significant improvements but has exhausted its low-hanging fruit
Like the other commenter, this could be true but I’m not sure what the argument is for this. And again, it depends on the workload. My recollection is that even early versions of cuDNN (circa 2015) were good enough that you got >90% of the max floating point performance on at least some of the CNN workloads common at that time (of course transformers weren’t invented yet).
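For context, a claim like ">90% of peak" usually comes from a back-of-envelope calculation: count the FLOPs in the layer, time the kernel, and divide by the device's peak FLOP/s. The sketch below uses entirely hypothetical numbers (layer shape, runtime, and peak), not actual cuDNN measurements.

```python
# Back-of-envelope utilization estimate for a conv layer (all numbers hypothetical).
# FLOPs of a direct convolution: 2 * N * K * C * R * S * P * Q
# (each multiply-accumulate counted as 2 ops).

N, C, K = 32, 256, 256        # batch, in-channels, out-channels (made-up layer)
R, S = 3, 3                   # filter height/width
P, Q = 28, 28                 # output height/width

flops = 2 * N * K * C * R * S * P * Q

measured_runtime_s = 5.4e-3   # hypothetical measured kernel time
peak_flops = 6.0e12           # hypothetical device peak (FLOP/s)

achieved = flops / measured_runtime_s
print(f"achieved: {achieved/1e12:.2f} TFLOP/s "
      f"({100 * achieved / peak_flops:.0f}% of peak)")
```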
The development of specialized hardware and instruction sets for certain kernels leads to fracturing and incentivizes incremental development, since newer kernels will be unoptimized and consequently slower
This could be true, I suppose. But I'm doubtful, because those hardware designs are being produced by companies that have studied the workloads and are convinced they can do better. If anything, competition may incentivize all hardware manufacturers to spend more time optimizing kernel performance than they otherwise would.