
AI interpretability

Last edit: 9 May 2022 10:40 UTC by Leo

Interpretability is the extent to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.[1]

Present-day machine learning systems are typically not very transparent or interpretable: you can use a model's outputs, but the model cannot tell you why it produced them. This makes it hard to determine the cause of biases in ML models.[1]
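As a loose illustration of this point (not from the cited source; the model and probe below are arbitrary choices), here is a minimal PyTorch sketch: a small network returns a prediction with no accompanying rationale, and about the simplest thing an outside observer can do is probe it, for example with input-gradient saliency.

```python
import torch
import torch.nn as nn

# An arbitrary, untrained network standing in for an opaque ML model.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))

x = torch.randn(1, 4, requires_grad=True)
logits = model(x)
prediction = logits.argmax(dim=1).item()
print("prediction:", prediction)  # the model gives an answer, but no reason for it

# A crude interpretability probe: input-gradient saliency, i.e. how sensitive
# the chosen logit is to each input feature around this particular input.
logits[0, prediction].backward()
print("saliency:", x.grad)  # larger magnitudes ~ features that mattered more locally
```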

Interpretability is a focus of Chris Olah's and Anthropic's work, though most AI alignment organisations, such as Redwood Research, work on interpretability to some extent.[2]

Related entries

AI risk | AI safety | Artificial intelligence

  1. ^

    Multicore (2020) Transparency / Interpretability (ML & AI), AI Alignment Forum, August 1.

  2. ^

Chris Olah on working at top AI labs without an undergrad degree

80000_Hours · 10 Sep 2021 20:46 UTC
15 points
0 comments · 75 min read · EA link

Interpreting Neural Networks through the Polytope Lens

Sid Black · 23 Sep 2022 18:03 UTC
35 points
0 comments · 1 min read · EA link

Sentience in Machines—How Do We Test for This Objectively?

Mayowa Osibodu · 20 Mar 2023 5:20 UTC
10 points
0 comments · 2 min read · EA link
(www.researchgate.net)

PhD Position: AI Interpretability in Berlin, Germany

Martian Moonshine · 22 Apr 2023 18:57 UTC
24 points
0 comments · 1 min read · EA link
(stephanw.net)

Against LLM Reductionism

Erich_Grunewald · 8 Mar 2023 15:52 UTC
32 points
3 comments · 1 min read · EA link

Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend)

Remmelt · 19 Dec 2022 12:02 UTC
17 points
3 comments · 10 min read · EA link

Why and When Interpretability Work is Dangerous

NicholasKross · 28 May 2023 0:27 UTC
6 points
0 comments · 1 min read · EA link

Announcing Apollo Research

mariushobbhahn · 30 May 2023 16:17 UTC
156 points
5 comments · 1 min read · EA link

A Barebones Guide to Mechanistic Interpretability Prerequisites

Neel Nanda · 29 Nov 2022 18:43 UTC
54 points
1 comment · 3 min read · EA link
(neelnanda.io)

Concrete Steps to Get Started in Transformer Mechanistic Interpretability

Neel Nanda · 26 Dec 2022 13:00 UTC
18 points
0 comments · 12 min read · EA link

The case for becoming a black-box investigator of language models

Buck · 6 May 2022 14:37 UTC
90 points
7 comments · 3 min read · EA link

[MLSN #8]: Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming

ThomasW · 20 Feb 2023 16:06 UTC
25 points
0 comments · 4 min read · EA link
(newsletter.mlsafety.org)

The limited upside of interpretability

Peter S. Park · 15 Nov 2022 20:22 UTC
23 points
3 comments · 10 min read · EA link

High-level hopes for AI alignment

Holden Karnofsky · 20 Dec 2022 2:11 UTC
118 points
14 comments · 19 min read · EA link
(www.cold-takes.com)

Join the AI governance and interpretability hackathons!

Esben Kran · 23 Mar 2023 14:39 UTC
33 points
1 comment · 5 min read · EA link
(alignmentjam.com)

If interpretability research goes well, it may get dangerous

So8res · 3 Apr 2023 21:48 UTC
33 points
0 comments · 1 min read · EA link

Safety-First Agents/Architectures Are a Promising Path to Safe AGI

Brendon_Wong · 6 Aug 2023 8:00 UTC
6 points
0 comments · 12 min read · EA link

Call for Pythia-style foundation model suite for alignment research

Lucretia · 1 May 2023 20:26 UTC
10 points
0 comments · 1 min read · EA link

Public Call for Interest in Mathematical Alignment

Davidmanheim · 22 Nov 2023 13:22 UTC
27 points
3 comments · 1 min read · EA link

AI Alignment Research Engineer Accelerator (ARENA): call for applicants

TheMcDouglas · 7 Nov 2023 9:43 UTC
46 points
3 comments · 10 min read · EA link

Assessment of AI safety agendas: think about the downside risk

Roman Leventov · 19 Dec 2023 9:02 UTC
5 points
0 comments · 1 min read · EA link

ML4Good UK—Applications Open

Nia · 2 Jan 2024 18:20 UTC
21 points
0 comments · 1 min read · EA link

A Selection of Randomly Selected SAE Features

TheMcDouglas · 1 Apr 2024 9:09 UTC
25 points
2 comments · 1 min read · EA link

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers

Neel Nanda · 18 Oct 2022 21:23 UTC
19 points
0 comments · 12 min read · EA link
(www.neelnanda.io)

Safety of Self-Assembled Neuromorphic Hardware

Can Rager · 26 Dec 2022 19:10 UTC
8 points
1 comment · 10 min read · EA link

New series of posts answering one of Holden’s “Important, actionable research questions”

Evan R. Murphy · 12 May 2022 21:22 UTC
9 points
0 comments · 1 min read · EA link

Chris Olah on what the hell is going on inside neural networks

80000_Hours · 4 Aug 2021 15:13 UTC
5 points
0 comments · 135 min read · EA link

Mathematical Circuits in Neural Networks

Sean Osier · 22 Sep 2022 2:32 UTC
23 points
2 comments · 1 min read · EA link
(www.youtube.com)

Black Box Investigations Research Hackathon

Esben Kran · 15 Sep 2022 10:09 UTC
23 points
0 comments · 2 min read · EA link

Join the interpretability research hackathon

Esben Kran · 28 Oct 2022 16:26 UTC
48 points
0 comments · 5 min read · EA link

Our Current Directions in Mechanistic Interpretability Research (AI Alignment Speaker Series)

Group Organizer · 8 Apr 2022 17:08 UTC
3 points
0 comments · 1 min read · EA link

AI alignment as a translation problem

Roman Leventov · 5 Feb 2024 14:14 UTC
3 points
1 comment · 1 min read · EA link

Introducing Leap Labs, an AI interpretability startup

Jessica Rumbelow · 6 Mar 2023 17:37 UTC
9 points
0 comments · 1 min read · EA link
(www.lesswrong.com)

(Intro/1) - My Understandings of Mechanistic Interpretability Notebook

Yadav · 2 Jul 2023 15:21 UTC
9 points
0 comments · 2 min read · EA link

Concrete open problems in mechanistic interpretability: a technical overview

Neel Nanda · 6 Jul 2023 11:35 UTC
26 points
1 comment · 29 min read · EA link

Don’t Dismiss Simple Alignment Approaches

Chris Leong · 21 Oct 2023 12:31 UTC
12 points
0 comments · 1 min read · EA link

Announcing Timaeus

Stan van Wingerden · 22 Oct 2023 13:32 UTC
78 points
0 comments · 5 min read · EA link
(www.lesswrong.com)