intermediate programs (interpreters, compilers, assemblers) are used to translate human programming languages into increasingly repetitive and specific languages until they become hardware-readable machine code. This translation is typically done through strict, unambiguous rules, which is good from an organizational and cleanliness perspective, but often results in code which consumes orders of magnitude more low-level instructions (and consequently, time) than if they were hand-translated by a human. This problem is amplified when those compilers do not understand that they are optimizing for machine learning: compilation protocols optimized to render graphics, or worse for CPUs, are far slower.
This is at best an imperfect description of how compilers work. I’m not sure what you mean by “repetitive”, but yeah, the purpose is to translate high-level languages to machine code. However:
Hardware does not care about code organization and cleanliness, nor does the compiler. When designing a compiler/hardware stack the principal metrics are correctness and performance. (Performance is very important, but in relative terms is a distant second to correctness.)
The number of instructions in a program, assembly or otherwise, is not equivalent to runtime. As a trivial example, “while(1)” is a short program with infinite runtime. Some optimizations such as loop unrolling increase instruction count while reducing runtime.
Such optimizations are trivial for a compiler, and tricky but possible for a human to get right.
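To make the unrolling point concrete, here's a minimal hand-written sketch (the function names and the unroll factor of 4 are just for illustration, not what any particular compiler emits): the unrolled version has several times the instructions in the loop body but does only about a quarter of the branch and counter-update work.

```c
/* Straightforward sum over an array. */
int sum(const int *x, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Roughly what unroll-by-4 looks like, written back out as C: more
   instructions per trip through the loop body, fewer branches overall. */
int sum_unrolled(const int *x, int n) {
    int acc = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        acc += x[i];
        acc += x[i + 1];
        acc += x[i + 2];
        acc += x[i + 3];
    }
    for (; i < n; i++)   /* remainder loop when n isn't a multiple of 4 */
        acc += x[i];
    return acc;
}
```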
“often results in code which consumes orders of magnitude more low-level instructions”: not sure what this means. Compilers are pretty efficient, and you can play around with source code and see the actual assembly pretty easily (e.g. Godbolt is good for this). There’s no significant amount of dead code being produced in the common case.
(Of course the raw number of instructions increases from C or whatever language, this is simply how RISC-like assembly works. “int C = A + B;” turns into “Load A. Load B. Add A and B. Allocate C on the stack. Write the computed value to C’s memory location.”)
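As a rough sketch of that lowering (hand-written, RISC-like pseudo-assembly in the comment, not the output of any specific compiler or target):

```c
int add_example(int a, int b) {
    int c = a + b;
    /* Unoptimized, this lowers to something like:
         lw   t0, [a]       ; load a
         lw   t1, [b]       ; load b
         add  t2, t0, t1    ; add in registers
         sw   t2, [c]       ; store to c's stack slot */
    return c;
}
```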
Humans can sometimes beat the compiler (particularly for tight loops), but compilers in 2023 are really good. I think the senior/junior engineer vs compiler example is wrong. I would say (for a modest loop or critical function): the senior engineer (who has much more experience and knowledge of which tools, metrics, and techniques to use) can gain modest improvement by spending significant time. The junior engineer would probably spend even more time for only a slight improvement.
“This problem is amplified when those compilers do not understand that they are optimizing for machine learning”: Compilers never know the purpose of the code they are optimizing; as you say they are following rule-based optimizations based on various forms of analysis. In LLVM this is basically analysis passes which produce data for optimization passes. For something like PyTorch, “compilation” means PyTorch is analyzing the operation graph you created and mapping it to kernel operations which can be performed on your GPU.
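To make the first point concrete: from the compiler’s perspective an “ML kernel” is just another numeric loop nest; the same aliasing, bounds, and vectorization analyses fire whether this matrix multiply feeds a neural network or anything else (hand-written sketch, names are illustrative):

```c
/* The compiler has no idea this is "for machine learning". It sees a loop
   nest over floats and applies the same rule-based analyses/optimizations
   it would apply to any other numeric code. */
void matmul(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
}
```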
“compilation protocols optimized to render graphics, or worse for CPUs, are far slower”: I don’t understand what you mean by this. What is a compilation protocol for graphics? Can you explain in terms of common compiler/ML tools? (E.g. LLVM MLIR, PyTorch, CUDA?)
I honestly don’t understand how the power plant/flashlight analogy corresponds to compilers. Are you saying this maps to something like LLVM analysis and optimization passes? If so, this is wrong; running multiple passes with different optimizations increases performance. Running many optimization passes was historically (i.e. circa early 2000s) hard for compilers to do, but (LLVM author) Chris Lattner’s key idea was to perform all the optimizations on a simple intermediate representation (IR) before lowering to machine code.
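If the analogy was meant to be about passes, here’s a toy illustration (hand-written C, not actual LLVM IR or pass output) of the kind of small, composable transformations that separate passes apply to the IR before lowering:

```c
/* Each comment names a pass that would typically fire on the IR form of this
   code; the function and variable names are made up for illustration. */
int scaled_sum(int n) {
    const int k = 4;       /* constant propagation: uses of k become 4 */
    int unused = n * 17;   /* dead code elimination: the value is never read,
                              so the multiply is removed (the compiler also
                              warns about the unused variable) */
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += k * i;      /* loop passes: strength reduction, maybe unrolling */
    return acc;
}
```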
Machine learning involves repetitive operations which can be processed simultaneously (parallelization)
I agree, but of course Amdahl’s Law remains in effect.
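For reference, Amdahl’s Law: if a fraction $p$ of the runtime can be parallelized across $s$ workers, the best overall speedup is

$$\text{speedup}(s) = \frac{1}{(1 - p) + p/s},$$

so even with $p = 0.95$ and unbounded parallelism the ceiling is $1/0.05 = 20\times$; the serial fraction dominates surprisingly fast.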
The goal of hardware optimization is often parallization (sic)
Generally, when designing hardware, increased throughput or reduced latency (for some representative set of workloads) are the main goals. Parallelization is one particular technique that can help achieve those goals, but there are many ideas/techniques/optimizations one can apply.
The widespread development of machine learning hardware started in mid-early 2010s and a significant advance in investment and progress occurred in the late 2010s
Sure… I mean, deep learning wasn’t even a thing until 2012. I think the important concept here is that hardware designs have a long time horizon (generally 2-3 years): it takes that long to do a clean-sheet design, and if you’re spending millions of dollars to design/tape out/manufacture a new chip, you need to be convinced that the workload is real and that people will still be using it years from now when you’re trying to sell your new chip.
CUDA optimization, or optimization of low-level instruction sets for machine learning operations (kernels), generated significant improvements but has exhausted its low-hanging fruit
Like the other commenter, I think this could be true, but I’m not sure what the argument for it is. And again, it depends on the workload. My recollection is that even early versions of cuDNN (circa 2015) were good enough that you got >90% of the max floating point performance on at least some of the CNN workloads common at that time (of course transformers weren’t invented yet).
The development of specialized hardware and instruction sets for certain kernels leads to fracturing and incentivizes incremental development, since newer kernels will be unoptimized and consequently slower
This could be true, I suppose. But I’m doubtful because those hardware designs are being produced by companies that have studied the workloads and are convinced they can do better. If anything competition may incentivize all hardware manufacturers to spend more time optimizing kernel performance than they otherwise would.