Nate / Eliezer / others I’ve seen arguing for a sharp left turn appeal to an evolution → human capabilities analogy and say that evolution’s outer optimization process built a much faster human inner optimization process whose capability gains vastly outstripped those that evolution built into humans. They seem to expect a similar thing to happen with SGD creating some inner thing which is not SGD and gains capabilities much faster than SGD can “insert” them into the AI. Then, just like human civilization exploded in capabilities over a tiny evolutionary timeframe, so too will AIs explode in capabilities over a tiny “SGD timeframe”.
I think this is very wrong, and that “evolution → human capabilities” is a very bad reference class to make predictions about “AI training → AI capabilities”. We don’t train our AIs via an outer optimizer over possible inner learning processes, where each inner learning process is initialized from scratch, takes billions of inner learning steps before the outer optimization process takes a single step, and is then deleted after that single outer step. Obviously, such a “two layer” training process would experience a “sharp left turn” once each inner learner became capable of building off the progress made by the previous inner learners (which happened in humans via culture / technological progress from one generation to another).
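To make that hypothetical concrete, here’s a minimal toy sketch (my own illustration, not anything from the original argument) of such a “two layer” setup: an outer optimizer searches over inner learning processes, each inner learner starts from scratch, runs many inner steps per single outer step, and is then thrown away, with only its final score feeding back into the outer loop. The task, learning rule, and step counts are arbitrary stand-ins.

```python
import random

def run_inner_learner(learning_rate, inner_steps=10_000):
    """Inner learning process: plain gradient descent on a toy quadratic,
    always initialized from scratch. Everything it learns is discarded;
    the outer loop only ever sees the final loss."""
    theta = 10.0                       # fresh initialization every time
    for _ in range(inner_steps):
        grad = 2 * (theta - 3.0)       # d/dtheta of (theta - 3)^2
        theta -= learning_rate * grad
    return (theta - 3.0) ** 2          # final loss = fitness signal

def outer_optimizer(outer_steps=50):
    """Outer optimization process (evolution stand-in): mutate the inner
    learner's "genome" (here just a learning rate) and keep mutations
    that improve the inner learner's final score. Thousands of inner
    steps are spent for every single outer step."""
    genome, best_loss = 1e-4, float("inf")
    for _ in range(outer_steps):
        candidate = min(genome * random.uniform(0.5, 2.0), 0.9)  # one outer step
        loss = run_inner_learner(candidate)   # inner learner lives and dies here
        if loss < best_loss:
            genome, best_loss = candidate, loss
    return genome, best_loss

if __name__ == "__main__":
    print(outer_optimizer())
```

Nothing any individual inner learner figures out ever accumulates; the only channel for progress is the outer loop’s one-parameter “genome”, which is exactly the structural flaw the argument below points at.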
However, this “sharp left turn” does not occur because the inner learning process is inherently better / more foomy / etc. than the outer optimizer. It happens because you devoted billions of times more resources to the inner learning processes, but then deleted each inner learner after a short amount of time. Once the inner learning processes become capable enough to pass their knowledge along to their successors, you get what looks like a sharp left turn. But that sharp left turn only happens because the inner learners have found a kludgy workaround past the crippling flaw where they all get deleted shortly after initialization.
In my frame, we’ve already figured out and applied the “sharp left turn” to our AI systems, in that we don’t waste our compute on massive amounts of incredibly inefficient neural architecture search or hyperparameter tuning[1]. We know that, for a given compute budget, the best way to spend it on capabilities is to train a single big model in accordance with the empirical scaling laws discovered in the Chinchilla paper, not to split the compute budget across millions of different training runs for vastly tinier models with slightly different architectures / training processes. The marginal return on architecture tweaking is much lower than the return to direct scaling.
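To put rough numbers on that, here’s a back-of-the-envelope sketch using the commonly cited Chinchilla approximations: roughly 6·N·D training FLOPs for N parameters and D tokens, and roughly 20 tokens per parameter at the compute-optimal point. The constants are approximations, and the 1e24 FLOP budget is a hypothetical figure chosen purely for illustration.

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Approximate compute-optimal model size N (params) and data D (tokens)
    for a training budget C, using C ~ 6*N*D and D ~ tokens_per_param * N."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    budget = 1e24  # FLOPs; hypothetical budget for illustration
    # Option A: spend it all on one compute-optimal run.
    n_single, d_single = chinchilla_optimal(budget)
    # Option B: split the same budget across 1,000,000 tiny search runs.
    n_tiny, d_tiny = chinchilla_optimal(budget / 1e6)
    print(f"single run:    {n_single:.2e} params, {d_single:.2e} tokens")
    print(f"each tiny run: {n_tiny:.2e} params, {d_tiny:.2e} tokens")
```

Splitting the budget a million ways shrinks each run by a factor of about 1,000 in both parameters and tokens, which is the sense in which massive architecture / hyperparameter search is a poor use of compute relative to one big run.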
(Also, we don’t delete our AIs well before they’re fully trained and then start again from scratch with the same number of parameters. I feel a little silly emphasizing this point so often, but I think it really does get to the crux of the matter. Evolution’s sharp left turn happened because evolution spent compute in a shockingly inefficient manner for increasing capabilities. Once you condition on this specific failure mode of evolution, there really is nothing else to be explained here, and no reason to suppose some general tendency towards “sharpness” in inner capability gains.)
[1] It can be useful to do hyperparameter tuning on smaller versions of the model you’re training. My point is that relatively little of your compute budget should go into such tweaking.