New series of posts answering one of Holden’s “Important, actionable research questions”

In February, Holden Karnofsky published "Important, actionable research questions for the most important century."

For the past couple of months, I've been working to answer one of Holden's questions:

“What relatively well-scoped research activities are particularly likely to be useful for longtermism-oriented AI alignment?”

To answer this question, I've started a series of posts exploring the argument that interpretability (that is, research into better understanding what is happening inside machine learning systems) is a high-leverage research activity for solving the AI alignment problem.

I just published the first two posts on the Alignment Forum/LessWrong:

1. Introduction to the sequence: Interpretability Research for the Most Important Century
2. Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios (the main post)

There will be at least one more post in the series, but post #2 in particular contains a substantial amount of my research on this topic.