Getting former hiring managers from quant firms to help with alignment hiring
Artificial Intelligence, Empowering Exceptional People
Despite having lots of funding, alignment seems to not have been very successful at attracting top talent to date. Quant firms, on the other hand, have become known for very successfully acquiring talent and putting them to work on difficult conceptual and engineering problems. Although buy-in to alignment before one can contribute is often cited as a reason, this is, if anything, even more of a problem for quant firms, since very few people are inherently interested in quant trading as an end. As such, importing some of this know how could help substantially improve alignment hiring and onboarding efficiency.
AI alignment: Evaluate the extent to which large language models have natural abstractions
Artificial Intelligence
The natural abstraction hypothesis is the hypothesis that neural networks will learn abstractions very similar to human concepts because these concepts are a better decomposition of reality than the alternatives. If it were true in practice, it would imply that large NNs (and large LMs in particular, due to being trained on natural language) would learn faithful models of human values, as well as bound the difficulty of translating between the model and human ontologies in ELK, avoiding the hard case of ELK in practice. If it turns out that the natural abstraction hypothesis is true at relevant scales, this would allow us to sidestep a large part of the alignment problem, and if it is false then this allows us to know to avoid a class of approaches that would be doomed to fail.
We’d like to see work towards gathering evidence on whether natural abstractions holds in practice and how this scales with model size, with a focus on interpretability of model latents, and experiments in toy environments that test whether human simulators are favored in practice. Work towards modifying model architectures to encourage natural abstractions would also be helpful towards this end.