VictorW comments on VictorW’s Quick takes

VictorW 3 Dec 2023 6:10 UTC
2 points
Does anyone have a resource that maps out different types/subtypes of AI interpretability work?

E.g. beyond mechanistic interpretability and concept-based interpretability, what other types are there, and how are they categorised?
- ag4000 6 Dec 2023 21:37 UTC
  4 points
  Late to the party here, but I’d check out Räuker et al. (2023), which provides one taxonomy of AI interpretability work.
  - VictorW 7 Dec 2023 0:05 UTC
    1 point
    Brilliant, thank you. One of the very long lists of interp work on the forum seemed to categorise everything as mech interp (or possibly I just don’t recognise the keywords for the alternatives). Does the EA AI safety community feel particularly strongly about mech interp, or is my sample size just too small?
    - ag4000 7 Dec 2023 0:23 UTC
      1 point
      Not an expert, but I think your impression is correct. See this post, for example (I recommend the whole sequence).
- Mo Putera 3 Dec 2023 11:49 UTC
  2 points
  Not a direct answer, but you might find the Interpretability (ML & AI) tag on LW relevant. That’s where I found Neel Nanda’s longlist of interpretability theories of impact (published Mar-22, so it may be quite outdated) and Charbel-Raphaël’s Against Almost Every Theory of Impact of Interpretability, which responds to it (published Aug-23, so much more current).