
AI interpretability

Last edit: May 9, 2022, 10:40 AM by Leo

Interpretability is the extent to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.[1]

Present-day machine learning systems are typically not very transparent or interpretable: you can use a model's output, but the model cannot explain why it produced that output. This makes it hard, for example, to determine the cause of biases in ML models.[1]
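To make the opacity concrete, here is a minimal, hypothetical sketch (not from the cited source) using PyTorch: the toy classifier below is an illustrative stand-in for a real model, and its names and architecture are assumptions for the example. The prediction is usable, but the only record of "why" is a set of unlabelled weight matrices and activation vectors, which is exactly what interpretability research tries to explain.

```python
# Minimal illustrative sketch (hypothetical toy model, not from the article):
# a small PyTorch classifier whose output is usable even though its internals
# are not human-readable on their own.
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small untrained network standing in for a real model: 4 features -> 2 classes.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

x = torch.randn(1, 4)               # an arbitrary input
logits = model(x)                   # the output we can act on...
print("predicted class:", logits.argmax(dim=-1).item())

# ...but the "reasoning" behind it is only visible as raw parameters and
# intermediate activations, which carry no obvious human-readable meaning.
hidden = model[1](model[0](x))      # activations after the first ReLU
print("first-layer weights:\n", model[0].weight.data)
print("hidden activations:\n", hidden)
```

Mechanistic interpretability work, several examples of which are tagged below, tries to assign meaning to exactly these kinds of internal weights and activations.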

Interpretability is a focus of Chris Olah's and Anthropic's work, though most AI alignment organisations, such as Redwood Research, work on interpretability to some extent.[2]

Related entries

AI risk | AI safety | Artificial intelligence

  1. ^ Multicore (2020) Transparency / Interpretability (ML & AI), AI Alignment Forum, August 1.

  2. ^

Posts tagged AI interpretability

Chris Olah on working at top AI labs without an undergrad degree
80000_Hours · Sep 10, 2021, 8:46 PM · 15 points · 0 comments · 73 min read · EA link

Sentience in Machines—How Do We Test for This Objectively?
Mayowa Osibodu · Mar 20, 2023, 5:20 AM · 10 points · 0 comments · 2 min read · EA link (www.researchgate.net)

PhD Position: AI Interpretability in Berlin, Germany
Martian Moonshine · Apr 22, 2023, 6:57 PM · 24 points · 0 comments · 1 min read · EA link (stephanw.net)

Interpreting Neural Networks through the Polytope Lens
Sid Black · Sep 23, 2022, 6:03 PM · 35 points · 0 comments · 1 min read · EA link

Rational Animations’ intro to mechanistic interpretability
Writer · Jun 14, 2024, 4:10 PM · 21 points · 1 comment · 1 min read · EA link (youtu.be)

Why and When Interpretability Work is Dangerous
Nicholas / Heather Kross · May 28, 2023, 12:27 AM · 6 points · 0 comments · 1 min read · EA link

Announcing Apollo Research
mariushobbhahn · May 30, 2023, 4:17 PM · 158 points · 5 comments · 1 min read · EA link

Concrete Steps to Get Started in Transformer Mechanistic Interpretability
Neel Nanda · Dec 26, 2022, 1:00 PM · 18 points · 0 comments · 12 min read · EA link

[MLSN #8]: Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming
TW123 · Feb 20, 2023, 4:06 PM · 25 points · 0 comments · 4 min read · EA link (newsletter.mlsafety.org)

The limited upside of interpretability
Peter S. Park · Nov 15, 2022, 8:22 PM · 23 points · 3 comments · 10 min read · EA link

Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend)
Remmelt · Dec 19, 2022, 12:02 PM · 17 points · 3 comments · 1 min read · EA link

The case for becoming a black-box investigator of language models
Buck · May 6, 2022, 2:37 PM · 90 points · 7 comments · 3 min read · EA link

High-level hopes for AI alignment
Holden Karnofsky · Dec 20, 2022, 2:11 AM · 123 points · 14 comments · 19 min read · EA link (www.cold-takes.com)

Against LLM Reductionism
Erich_Grunewald 🔸 · Mar 8, 2023, 3:52 PM · 32 points · 4 comments · 1 min read · EA link

A Barebones Guide to Mechanistic Interpretability Prerequisites
Neel Nanda · Nov 29, 2022, 6:43 PM · 54 points · 1 comment · 3 min read · EA link (neelnanda.io)

Announcing Timaeus
Stan van Wingerden · Oct 22, 2023, 1:32 PM · 79 points · 0 comments · 5 min read · EA link (www.lesswrong.com)

Join the AI governance and interpretability hackathons!
Esben Kran · Mar 23, 2023, 2:39 PM · 33 points · 1 comment · 5 min read · EA link (alignmentjam.com)

If interpretability research goes well, it may get dangerous
So8res · Apr 3, 2023, 9:48 PM · 33 points · 0 comments · 1 min read · EA link

Safety-First Agents/Architectures Are a Promising Path to Safe AGI
Brendon_Wong · Aug 6, 2023, 8:00 AM · 6 points · 0 comments · 12 min read · EA link

Call for Pythia-style foundation model suite for alignment research
Lucretia · May 1, 2023, 8:26 PM · 10 points · 0 comments · 1 min read · EA link

Public Call for Interest in Mathematical Alignment
Davidmanheim · Nov 22, 2023, 1:22 PM · 27 points · 3 comments · 1 min read · EA link

AI Alignment Research Engineer Accelerator (ARENA): call for applicants
TheMcDouglas · Nov 7, 2023, 9:43 AM · 46 points · 3 comments · 10 min read · EA link

Assessment of AI safety agendas: think about the downside risk
Roman Leventov · Dec 19, 2023, 9:02 AM · 6 points · 0 comments · 1 min read · EA link

ML4Good UK—Applications Open
Nia · Jan 2, 2024, 6:20 PM · 21 points · 0 comments · 1 min read · EA link

A Selection of Randomly Selected SAE Features
TheMcDouglas · Apr 1, 2024, 9:09 AM · 25 points · 2 comments · 1 min read · EA link

AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0
JamesFox · Jul 6, 2024, 11:51 AM · 7 points · 0 comments · 5 min read · EA link

ML4Good Brasil—Applications Open
Nia · May 3, 2024, 10:39 AM · 28 points · 1 comment · 1 min read · EA link

5 ways to improve CoT faithfulness
CBiddulph · Oct 8, 2024, 4:17 AM · 8 points · 0 comments · 1 min read · EA link

MATS Applications + Research Directions I’m Currently Excited About
Neel Nanda · Feb 6, 2025, 11:03 AM · 23 points · 3 comments · 1 min read · EA link

Takes on “Alignment Faking in Large Language Models”
Joe_Carlsmith · Dec 18, 2024, 6:22 PM · 72 points · 1 comment · 1 min read · EA link

A Rocket–Interpretability Analogy
plex · Oct 21, 2024, 1:55 PM · 13 points · 1 comment · 1 min read · EA link

Reinforcement Learning: A Non-Technical Primer on o1 and DeepSeek-R1
AlexChalk · Feb 9, 2025, 11:58 PM · 4 points · 0 comments · 9 min read · EA link (alexchalk.net)

Beyond Meta: Large Concept Models Will Win
Anthony Repetto · Dec 30, 2024, 12:57 AM · 3 points · 0 comments · 3 min read · EA link

What are polysemantic neurons?
Vishakha Agrawal · Jan 8, 2025, 7:39 AM · 4 points · 0 comments · 2 min read · EA link (aisafety.info)

Implications of the inference scaling paradigm for AI safety
Ryan Kidd · Jan 15, 2025, 12:59 AM · 46 points · 5 comments · 1 min read · EA link

Motivation control
Joe_Carlsmith · Oct 30, 2024, 5:15 PM · 18 points · 0 comments · 1 min read · EA link

Benchmark Performance is a Poor Measure of Generalisable AI Reasoning Capabilities
James Fodor · Feb 21, 2025, 4:25 AM · 12 points · 3 comments · 24 min read · EA link

Worries about latent reasoning in LLMs
CBiddulph · Jan 20, 2025, 9:09 AM · 20 points · 1 comment · 1 min read · EA link

[Question] What should I read about defining AI “hallucination?”
James-Hartree-Law · Jan 23, 2025, 1:00 AM · 2 points · 0 comments · 1 min read · EA link

Give Neo a Chance
ank · Mar 6, 2025, 2:35 PM · 1 point · 3 comments · 7 min read · EA link

LLM chatbots have ~half of the kinds of “consciousness” that humans believe in. Humans should avoid going crazy about that.
Andrew Critch · Nov 22, 2024, 3:26 AM · 11 points · 3 comments · 1 min read · EA link

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers
Neel Nanda · Oct 18, 2022, 9:23 PM · 19 points · 0 comments · 12 min read · EA link (www.neelnanda.io)

Safety of Self-Assembled Neuromorphic Hardware
Can Rager · Dec 26, 2022, 7:10 PM · 8 points · 1 comment · 10 min read · EA link

New series of posts answering one of Holden’s “Important, actionable research questions”
Evan R. Murphy · May 12, 2022, 9:22 PM · 9 points · 0 comments · 1 min read · EA link

Chris Olah on what the hell is going on inside neural networks
80000_Hours · Aug 4, 2021, 3:13 PM · 5 points · 0 comments · 133 min read · EA link

Mathematical Circuits in Neural Networks
Sean Osier · Sep 22, 2022, 2:32 AM · 23 points · 2 comments · 1 min read · EA link (www.youtube.com)

Black Box Investigations Research Hackathon
Esben Kran · Sep 15, 2022, 10:09 AM · 23 points · 0 comments · 2 min read · EA link

Join the interpretability research hackathon
Esben Kran · Oct 28, 2022, 4:26 PM · 48 points · 0 comments · 5 min read · EA link

Our Current Directions in Mechanistic Interpretability Research (AI Alignment Speaker Series)
Group Organizer · Apr 8, 2022, 5:08 PM · 3 points · 0 comments · 1 min read · EA link

Navigating AI Safety: Exploring Transparency with CCACS – A Comprehensible Architecture for Discussion
Ihor Ivliev · Mar 12, 2025, 5:51 PM · 2 points · 0 comments · 2 min read · EA link

Solving adversarial attacks in computer vision as a baby version of general AI alignment
Stanislav Fort · Aug 31, 2024, 4:15 PM · 3 points · 1 comment · 7 min read · EA link

AI alignment as a translation problem
Roman Leventov · Feb 5, 2024, 2:14 PM · 3 points · 1 comment · 1 min read · EA link

Introducing Leap Labs, an AI interpretability startup
Jessica Rumbelow · Mar 6, 2023, 5:37 PM · 11 points · 0 comments · 1 min read · EA link (www.lesswrong.com)

Concrete open problems in mechanistic interpretability: a technical overview
Neel Nanda · Jul 6, 2023, 11:35 AM · 27 points · 1 comment · 29 min read · EA link

Don’t Dismiss Simple Alignment Approaches
Chris Leong · Oct 21, 2023, 12:31 PM · 12 points · 0 comments · 1 min read · EA link