AI interpretability

Last edit: 9 May 2022 10:40 UTC by Leo

Interpretability is the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.[1]

Present-day machine learning systems are typically not very transparent or interpretable: you can use a model’s output, but the model can’t tell you why it produced that output. This makes it hard to determine the cause of biases in ML models.[1]

Interpretability is a focus of the work of Chris Olah and Anthropic, though most AI alignment organisations, such as Redwood Research,[2] work on interpretability to some extent.

Related entries

AI risk | AI safety | Artificial intelligence

  1. ^

     Multicore (2020) Transparency / Interpretability (ML & AI), AI Alignment Forum, August 1.

  2. ^

Chris Olah on working at top AI labs without an undergrad degree

80000_Hours, 10 Sep 2021 20:46 UTC
15 points, 0 comments, 73 min read, EA link

The case for becoming a black-box investigator of language models

Buck, 6 May 2022 14:37 UTC
82 points, 7 comments, 3 min read, EA link

Chris Olah on what the hell is going on inside neural networks

80000_Hours, 4 Aug 2021 15:13 UTC
5 points, 0 comments, 133 min read, EA link

Our Current Directions in Mechanistic Interpretability Research (AI Alignment Speaker Series)

Group Organizer, 8 Apr 2022 17:08 UTC
3 points, 0 comments, 1 min read, EA link

New series of posts answering one of Holden’s “Important, actionable research questions”

Evan R. Murphy, 12 May 2022 21:22 UTC
9 points, 0 comments, 1 min read, EA link