Misaligned AGI will turn us off before we can turn it off.
Leo Gao
“As AI systems get more powerful, we exit the regime where models fail to understand what we want, and enter the regime where they know exactly what we want and yet pursue their own goals anyways, while tricking us into thinking they aren’t until it’s too late.” (source: https://t.co/s3fbTdv29V)
Historical investigation of the relationship between incremental improvements and paradigm shifts
Artificial Intelligence
One major question that heavily influences the choice of alignment research directions is the degree to which incremental improvements are necessary for major paradigm shifts. Because the field of alignment is largely preparadigmatic, there is a good chance that substantial progress towards aligning superhuman AI systems will require a paradigm shift rather than merely incremental improvements. The answer to this question determines whether the best approach to alignment is to choose metrics and push for incremental progress on alignment research questions, to mostly fund long shots, or something else entirely. Research in this direction would entail combing through historical materials in the field of AI, and in other scientific domains more broadly, to better understand the contexts in which past paradigm shifts occurred, and putting together a report summarizing the findings.
Some possible ways-the-world-could-be include:
Incremental improvements have negligible impact on when paradigm shifts happen and could be eliminated entirely without delaying them. All or the vast majority of incremental work is identifiable from the start as low-risk, low-reward, and potentially paradigm-shifting work is identifiable from the start as high-risk, high-reward.
Incremental improvements serve to increase attention to the field and thus the amount of funding for the field as a whole, thereby proportionally increasing the absolute number of people working on paradigmatic directions; however, funding those working on potential paradigm shifts directly would yield the same paradigm shifts at the same time.
Incremental improvements are necessary to convince risk-averse funding sources to keep funding a line of work, since putting money into something for years with no visible output is unpopular with many funders; this forces researchers to divert some fraction of their time to funder-legible incremental improvements.
Most paradigm shifts arise from attempts to make incremental improvements that accidentally uncover something deeper in the process. It is difficult to tell before embarking on a project whether it will only yield an incremental improvement, no improvement at all, or a paradigm shift.
Most paradigm shifts cannot occur until incremental improvements lay the foundation for the paradigm shift to happen, no matter how much effort is put into trying to recognize paradigm shifts.
Creating materials for alignment onboarding
Artificial Intelligence
At present, the pipeline from AI capabilities researcher to AI alignment researcher is not very user friendly. While a few people like Rob Miles and Richard Ngo have produced excellent onboarding materials, this niche is still fairly underserved compared to onboarding in many other fields. Creating more materials has the advantage that different people find different formats helpful, so a wider selection increases the likelihood that something works for any given person. While there are many possible angles for onboarding, several avenues stand out as promising due to successes in other fields:
High production value videos (similar to 3blue1brown, Kurzgesagt)
Course-like lectures and quizzes (similar to Khan Academy)
Interactive learning apps (similar to Brilliant)
Getting former hiring managers from quant firms to help with alignment hiring
Artificial Intelligence, Empowering Exceptional People
Despite having lots of funding, alignment seems not to have been very successful at attracting top talent to date. Quant firms, on the other hand, have become known for very successfully acquiring talent and putting it to work on difficult conceptual and engineering problems. Although the need for buy-in to alignment before one can contribute is often cited as a reason for the difficulty, this is, if anything, even more of a problem for quant firms, since very few people are inherently interested in quant trading as an end in itself. As such, importing some of this know-how could substantially improve alignment hiring and onboarding efficiency.
AI alignment: Evaluate the extent to which large language models have natural abstractions
Artificial Intelligence
The natural abstraction hypothesis is the hypothesis that neural networks will learn abstractions very similar to human concepts, because these concepts are a better decomposition of reality than the alternatives. If it holds in practice, it would imply that large NNs (and large LMs in particular, due to being trained on natural language) learn faithful models of human values, and it would bound the difficulty of translating between the model's ontology and the human ontology in ELK, avoiding ELK's hard case in practice. If the natural abstraction hypothesis turns out to be true at the relevant scales, this would let us sidestep a large part of the alignment problem; if it is false, we would know to avoid a class of approaches that would be doomed to fail.
We’d like to see work that gathers evidence on whether natural abstractions hold in practice and how this scales with model size, with a focus on the interpretability of model latents, and experiments in toy environments that test whether human simulators are favored in practice. Work on modifying model architectures to encourage natural abstractions would also be helpful towards this end.
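As a purely illustrative starting point, here is a minimal sketch of a latent-probing experiment along these lines, assuming PyTorch, Hugging Face transformers, and scikit-learn are available. It checks whether a simple human concept is linearly decodable from the hidden states of language models at two scales; the model names, toy dataset, pooling scheme, and probed layer are assumptions chosen for illustration, not a proposed benchmark.

# Sketch: probe whether a human concept ("describes an animal") is linearly
# decodable from LM hidden states, and compare across two model sizes.
# A real experiment would use far more data, held-out evaluation sentences,
# multiple layers, and multiple concepts.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative dataset: sentences labeled by whether they describe an animal.
texts = [
    "The cat slept on the warm windowsill.",
    "A dog barked at the passing cars.",
    "The horse galloped across the field.",
    "The committee approved the new budget.",
    "She installed the update on her laptop.",
    "The bridge was closed for repairs.",
]
labels = np.array([1, 1, 1, 0, 0, 0])

def mean_hidden_states(model_name, sentences):
    """Return mean-pooled final-layer hidden states for each sentence."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    feats = []
    with torch.no_grad():
        for s in sentences:
            inputs = tokenizer(s, return_tensors="pt")
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, d_model)
            feats.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.stack(feats)

# Compare how linearly decodable the concept is at two model scales.
for name in ["gpt2", "gpt2-medium"]:
    X = mean_hidden_states(name, texts)
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    print(name, "probe train accuracy:", probe.score(X, labels))

If concepts like this become more cleanly and more robustly decodable as models scale, that is weak evidence in the direction of natural abstractions; the interesting versions of this experiment involve richer concepts, held-out data, and checks that the probe is reading off the model's own abstraction rather than surface features.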
How much work would it be to make something like this as a Python library, and how much would that reduce its usefulness? I think this is really cool and have been looking for something like this, but I am multiple times more likely to use something if it’s a Python library as opposed to a brand new language, and I assume others think similarly.