Does anyone have a resource that maps out different types/subtypes of AI interpretability work?
E.g. besides mechanistic interpretability and concept-based interpretability, what other types are there, and how are they categorised?
Late to the party here but I’d check out Räuker et al. (2023), which provides one taxonomy of AI interpretability work.
Brilliant, thank you. One of the very long lists of interp work on the forum seemed to have everything as mech interp (or possibly I just don’t recognize the alternative keywords). Does the EA AI safety community feel particularly strongly about mech interp, or is my sample size just too small?
Not an expert, but I think your impression is correct. See this post, for example (I recommend the whole sequence).
Not a direct answer, but you might find the Interpretability (ML & AI) tag on LW relevant. That’s where I found Neel Nanda’s longlist of interpretability theories of impact (published Mar-22 so it may be quite outdated), and Charbel-Raphaël’s Against Almost Every Theory of Impact of Interpretability responding to it (published Aug-23, so much more current).