Late to the party here but I’d check out Räuker et al. (2023), which provides one taxonomy of AI interpretability work.
Brilliant, thank you. One of the very long lists of interp work on the forum seemed to label everything as mech interp (or possibly I just don’t recognize the alternative keywords). Does the EA AI safety community feel particularly strongly about mech interp, or is my sample size just too small?
Not an expert, but I think your impression is correct. See this post, for example (I recommend the whole sequence).