What relatively well-scoped research activities are particularly likely to be useful for longtermism-oriented AI alignment? [...] (3) Activity that is likely to be relevant for the hardest and most important parts of the problem, while also being the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers’ unarticulated intuitions to too great a degree, etc.)
I’m planning to spend some time working on this question, or rather part of it. In particular, I’m going to explore the argument that interpretability research falls into this category, with some attention to which specific aspects or angles of interpretability research seem most useful.
Since I don’t plan to spend much time thoroughly examining other research directions besides interpretability, I don’t expect to have a complete comparative answer to the question. But by answering the question for interpretability, I hope to at least put together a fairly comprehensive argument for (or perhaps against; we’ll see after I look at the evidence!) interpretability research that could be used by those considering it as a target for their funding or their time. I also hope that someone trying to answer the larger question could then use my work on interpretability as part of a comparative analysis across different research activities.
If someone is already working on this particular question and I’m duplicating effort, please let me know and perhaps we can sync up. Otherwise, I hope to have something to show on this question in a few to several weeks!
My first 2 posts for this project went live on the Alignment Forum today:
1. Introduction to the sequence: Interpretability Research for the Most Important Century
2. (main post) Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios