AI alignment: Evaluate the extent to which large language models have natural abstractions
Artificial Intelligence
The natural abstraction hypothesis holds that neural networks will learn abstractions very similar to human concepts, because these concepts decompose reality better than the alternatives. If it held in practice, it would imply that large neural networks (and large language models in particular, since they are trained on natural language) would learn faithful models of human values. It would also bound the difficulty of translating between the model's ontology and the human ontology in Eliciting Latent Knowledge (ELK), so that the hard case of ELK would not arise in practice. If the hypothesis turns out to be true at relevant scales, we could sidestep a large part of the alignment problem; if it is false, we would know to avoid a class of approaches that are doomed to fail.
We’d like to see work that gathers evidence on whether the natural abstraction hypothesis holds in practice and how this scales with model size, with a focus on interpretability of model latents and on experiments in toy environments that test whether human simulators are favored in practice. Work on modifying model architectures to encourage natural abstractions would also be helpful toward this end.
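One simple way to start probing model latents for a human concept is a linear probe. The sketch below is illustrative only and is not a method prescribed by this proposal. It assumes the latents are available as a NumPy array and that each example has a binary concept label. The function names (`fit_linear_probe`, `probe_accuracy`, `synthetic_latents`) and the synthetic data are hypothetical stand-ins for real model activations. A shuffled-label control is included, since a probe result means little without a baseline.

```python
import numpy as np


def fit_linear_probe(latents, labels, l2=1e-2):
    """Fit a ridge-regression probe that predicts a binary concept from latents."""
    # Append a constant column so the probe can learn a bias term.
    X = np.hstack([latents, np.ones((latents.shape[0], 1))])
    y = 2.0 * labels - 1.0  # map {0, 1} labels to {-1, +1} targets
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)


def probe_accuracy(weights, latents, labels):
    """Fraction of examples whose concept label the probe recovers."""
    X = np.hstack([latents, np.ones((latents.shape[0], 1))])
    preds = (X @ weights > 0).astype(int)
    return float((preds == labels).mean())


def synthetic_latents(n, dim, rng, signal=3.0):
    """Hypothetical stand-in for model activations.

    Gaussian noise plus a single 'concept direction' along axis 0.
    Real experiments would replace this with activations from a model.
    """
    labels = rng.integers(0, 2, size=n)
    direction = np.zeros(dim)
    direction[0] = 1.0
    latents = rng.normal(size=(n, dim)) + signal * np.outer(2 * labels - 1, direction)
    return latents, labels


rng = np.random.default_rng(0)
train_x, train_y = synthetic_latents(2000, 32, rng)
test_x, test_y = synthetic_latents(1000, 32, rng)

# Probe trained on the true concept labels.
w = fit_linear_probe(train_x, train_y)
acc = probe_accuracy(w, test_x, test_y)

# Control: the same probe trained on shuffled labels should do much worse.
w_ctrl = fit_linear_probe(train_x, rng.permutation(train_y))
acc_ctrl = probe_accuracy(w_ctrl, test_x, test_y)

print(f"probe accuracy: {acc:.3f}, shuffled-label control: {acc_ctrl:.3f}")
```

High held-out accuracy relative to the control would suggest the concept is linearly decodable from the latents. That is weak evidence at best: it does not show that the model uses the concept, nor that the abstraction matches the human one. Repeating the measurement across model sizes is one way to look at the scaling question raised above.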