Not a direct answer, but you might find the Interpretability (ML & AI) tag on LW relevant. That’s where I found Neel Nanda’s "A Longlist of Theories of Impact for Interpretability" (published March 2022, so it may be quite outdated) and Charbel-Raphaël’s "Against Almost Every Theory of Impact of Interpretability", which responds to it (published August 2023, so much more current).