AI interpretability

Interpretability is the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.[1]

Present-day machine learning systems are typically not very transparent or interpretable: you can use a model’s outputs, but the model cannot explain why it produced them. This makes it hard, for example, to determine the causes of bias in ML models.[1]
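As a concrete illustration (a minimal sketch, not drawn from the cited sources, assuming PyTorch; the tiny model below is hypothetical), a trained network will happily produce a prediction, but all it exposes by default are raw weights and activations. Interpretability research tries to turn those numbers into human-understandable structure; registering a hook to read out activations is roughly where the default tooling stops.

```python
# Minimal sketch (hypothetical model, assuming PyTorch is installed) of why a
# model "can't tell you why": the usable output is a prediction, and the only
# things available for inspection are raw parameters and activations.
import torch
import torch.nn as nn

# A tiny feed-forward classifier. Its weights are just numbers; nothing in the
# forward pass explains *why* a given input maps to a given output.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

x = torch.randn(1, 4)
logits = model(x)
print("prediction:", logits.argmax(dim=-1).item())  # usable output

# The "explanation" available by default is only the hidden activations and
# weights. A forward hook is one basic way to look inside; interpretability
# work then tries to map such activations onto human-understandable concepts.
captured = {}

def save_activation(module, inputs, output):
    captured["hidden"] = output.detach()

model[1].register_forward_hook(save_activation)  # hook the ReLU layer
model(x)
print("hidden activations:", captured["hidden"])
```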

Interpretability is a focus of Chris Olah’s and Anthropic’s work, and most AI alignment organisations, such as Redwood Research, work on it to some extent.[2]

Related entries

AI risk | AI safety | Artificial intelligence

  1. ^

    Multicore (2020) Transparency / Interpretability (ML & AI), AI Alignment Forum, August 1.

  2. ^

Interpreting Neural Networks through the Polytope Lens

Sid Black · 23 Sep 2022 18:03 UTC
35 points
0 comments · 28 min read · EA link

Chris Olah on working at top AI labs without an undergrad degree

80000_Hours · 10 Sep 2021 20:46 UTC
15 points
0 comments · 73 min read · EA link

Rational Animations’ intro to mechanistic interpretability

Writer · 14 Jun 2024 16:10 UTC
21 points
1 comment · 11 min read · EA link
(youtu.be)

PhD Position: AI Interpretability in Berlin, Germany

Martian Moonshine · 22 Apr 2023 18:57 UTC
24 points
0 comments · 1 min read · EA link
(stephanw.net)

Sentience in Machines—How Do We Test for This Objectively?

Mayowa Osibodu · 20 Mar 2023 5:20 UTC
10 points
0 comments · 2 min read · EA link
(www.researchgate.net)

Concrete Steps to Get Started in Transformer Mechanistic Interpretability

Neel Nanda · 26 Dec 2022 13:00 UTC
18 points
0 comments · 12 min read · EA link

[MLSN #8]: Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming

TW123 · 20 Feb 2023 16:06 UTC
25 points
0 comments · 4 min read · EA link
(newsletter.mlsafety.org)

A Barebones Guide to Mechanistic Interpretability Prerequisites

Neel Nanda · 29 Nov 2022 18:43 UTC
54 points
1 comment · 3 min read · EA link
(neelnanda.io)

High-level hopes for AI alignment

Holden Karnofsky · 20 Dec 2022 2:11 UTC
123 points
14 comments · 19 min read · EA link
(www.cold-takes.com)

The limited upside of interpretability

Peter S. Park · 15 Nov 2022 20:22 UTC
23 points
3 comments · 10 min read · EA link

Announcing Apollo Research

mariushobbhahn · 30 May 2023 16:17 UTC
158 points
4 comments · 8 min read · EA link

Against LLM Reductionism

Erich_Grunewald 🔸 · 8 Mar 2023 15:52 UTC
33 points
4 comments · 18 min read · EA link
(www.erichgrunewald.com)

Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend)

Remmelt · 19 Dec 2022 12:02 UTC
17 points
3 comments · 31 min read · EA link

The case for becoming a black-box investigator of language models

Buck · 6 May 2022 14:37 UTC
91 points
7 comments · 3 min read · EA link

Aether is hiring technical AI safety researchers

Rauno Arike · 5 Jan 2026 22:31 UTC
8 points
0 comments · 2 min read · EA link

Why Explaining AI Is Not the Same as Understanding It

Strad Slater · 28 Nov 2025 10:38 UTC
2 points
0 comments · 4 min read · EA link
(williamslater2003.medium.com)

6 Insights From Anthropic’s Recent Discussion On LLM Interpretability

Strad Slater · 19 Nov 2025 10:51 UTC
2 points
0 comments · 5 min read · EA link
(williamslater2003.medium.com)

Chris Olah on what the hell is going on inside neural networks

80000_Hours · 4 Aug 2021 15:13 UTC
5 points
0 comments · 133 min read · EA link

ML4Good Brasil—Applications Open

Nia🔸 · 3 May 2024 10:39 UTC
28 points
1 comment · 1 min read · EA link

AI Risk: Can We Thread the Needle? [Recorded Talk from EA Summit Vancouver ’25]

Evan R. Murphy · 2 Oct 2025 19:05 UTC
8 points
0 comments · 2 min read · EA link

On Internal Alignment: Architecture and Recursive Closure

A. Vire · 24 Sep 2025 18:13 UTC
1 point
0 comments · 17 min read · EA link

Video and transcript of talk on giving AIs safe motivations

Joe_Carlsmith · 22 Sep 2025 16:47 UTC
10 points
1 comment · 50 min read · EA link

Navigating AI Safety: Exploring Transparency with CCACS – A Comprehensible Architecture for Discussion

Ihor Ivliev · 12 Mar 2025 17:51 UTC
2 points
3 comments · 2 min read · EA link

The Three Missing Pieces in Machine Ethics

JBug · 16 Nov 2025 21:26 UTC
2 points
0 comments · 2 min read · EA link

Aether July 2025 Update

RohanS · 1 Jul 2025 21:14 UTC
11 points
0 comments · 3 min read · EA link

AI Alignment Research Engineer Accelerator (ARENA): call for applicants

Callum McDougall · 7 Nov 2023 9:43 UTC
46 points
3 comments · 10 min read · EA link

Field Notes from EAG NYC

Lydia Nottingham · 15 Oct 2025 7:33 UTC
3 points
0 comments · 4 min read · EA link

VSPE vs. flattery: Testing emotional scaffolding for early-stage alignment

Astelle Kay · 24 Jun 2025 9:39 UTC
2 points
1 comment · 1 min read · EA link

AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0

JamesFox · 6 Jul 2024 11:51 UTC
7 points
0 comments · 5 min read · EA link

[Question] What should I read about defining AI “hallucination?”

James-Hartree-Law · 23 Jan 2025 1:00 UTC
2 points
0 comments · 1 min read · EA link

Alignment ideas inspired by human virtue development

Borys Pikalov · 18 May 2025 9:36 UTC
6 points
0 comments · 4 min read · EA link

Call for Pythia-style foundation model suite for alignment research

Lucretia · 1 May 2023 20:26 UTC
10 points
0 comments · 1 min read · EA link

Implications of the inference scaling paradigm for AI safety

Ryan Kidd · 15 Jan 2025 0:59 UTC
48 points
5 comments · 5 min read · EA link

AI Safety Camp 11

Robert Kralisch · 7 Nov 2025 14:27 UTC
7 points
1 comment · 15 min read · EA link

Adaptive Composable Cognitive Core Unit (ACCCU)

Ihor Ivliev · 20 Mar 2025 21:48 UTC
10 points
2 comments · 4 min read · EA link

A Selection of Randomly Selected SAE Features

Callum McDougall · 1 Apr 2024 9:09 UTC
25 points
2 comments · 4 min read · EA link

VANTA Research Reasoning Evaluation (VRRE): A New Evaluation Framework for Real-World Reasoning

Tyler Williams · 18 Sep 2025 23:51 UTC
1 point
0 comments · 3 min read · EA link

Beyond Meta: Large Concept Models Will Win

Anthony Repetto · 30 Dec 2024 0:57 UTC
3 points
0 comments · 3 min read · EA link

Concrete open problems in mechanistic interpretability: a technical overview

Neel Nanda · 6 Jul 2023 11:35 UTC
27 points
1 comment · 29 min read · EA link

Stable Emergence in a Developmental AI Architecture: Results from “Twins V3”

Petra Vojtassakova · 17 Nov 2025 23:23 UTC
6 points
2 comments · 2 min read · EA link

A Potential Strategy for AI Safety — Chain of Thought Monitorability

Strad Slater · 19 Sep 2025 18:42 UTC
3 points
1 comment · 7 min read · EA link
(williamslater2003.medium.com)

What are polysemantic neurons?

Vishakha Agrawal · 8 Jan 2025 7:39 UTC
5 points
0 comments · 2 min read · EA link
(aisafety.info)

Motivation control

Joe_Carlsmith · 30 Oct 2024 17:15 UTC
18 points
0 comments · 52 min read · EA link

LLMs Are Already Misaligned: Simple Experiments Prove It

Makham · 28 Jul 2025 17:23 UTC
4 points
3 comments · 7 min read · EA link

[untitled post]

JOESEFOE · 22 Nov 2025 13:54 UTC
1 point
0 comments · 1 min read · EA link

The Universality Hypothesis — Do All AI Models Think The Same?

Strad Slater · 21 Nov 2025 10:55 UTC
2 points
0 comments · 4 min read · EA link
(williamslater2003.medium.com)

MATS 8.0 Research Projects

Jonathan Michala · 8 Sep 2025 21:36 UTC
9 points
0 comments · 1 min read · EA link
(substack.com)

Are AI Models Escaping Plato’s Cave?

Strad Slater · 22 Nov 2025 11:46 UTC
2 points
0 comments · 5 min read · EA link
(williamslater2003.medium.com)

5 ways to improve CoT faithfulness

CBiddulph · 8 Oct 2024 4:17 UTC
8 points
0 comments · 6 min read · EA link

AI alignment as a translation problem

Roman Leventov · 5 Feb 2024 14:14 UTC
3 points
1 comment · 3 min read · EA link

Safety-First Agents/Architectures Are a Promising Path to Safe AGI

Brendon_Wong · 6 Aug 2023 8:00 UTC
6 points
0 comments · 12 min read · EA link

Would anyone here know how to get ahold of … iunno Anthropic and Open Philanthropy? I think they are going to want to have a chat (Please don’t make me go to OpenAI with this. Not even a threat, seriously. They just partner with my alma mater and are the only in I have. I genuinely do not want to and I need your help).

Anti-Golem · 9 Jun 2025 13:59 UTC
−11 points
0 comments · 1 min read · EA link

Controlling the options AIs can pursue

Joe_Carlsmith · 29 Sep 2025 17:24 UTC
9 points
0 comments · 35 min read · EA link

The Causal Inner Product: How LLMs Turn Concepts Into Directions (Part 2)

Strad Slater · 26 Nov 2025 11:03 UTC
2 points
0 comments · 4 min read · EA link
(williamslater2003.medium.com)

How Prompt Recursion Undermines Grok’s Semantic Stability

Tyler Williams · 16 Jul 2025 16:49 UTC
1 point
0 comments · 1 min read · EA link

Our Current Directions in Mechanistic Interpretability Research (AI Alignment Speaker Series)

Group Organizer · 8 Apr 2022 17:08 UTC
3 points
0 comments · 1 min read · EA link

Technical AI Safety research taxonomy attempt (2025)

Ben Plaut · 27 Aug 2025 14:07 UTC
10 points
3 comments · 2 min read · EA link

New series of posts answering one of Holden’s “Important, actionable research questions”

Evan R. Murphy · 12 May 2022 21:22 UTC
9 points
0 comments · 1 min read · EA link

AI, Animals & Digital Minds NYC 2025: Retrospective

Jonah Woodward · 31 Oct 2025 3:09 UTC
43 points
5 comments · 6 min read · EA link

From Therapy Tool to Alignment Puzzle-Piece: Introducing the VSPE Framework

Astelle Kay · 18 Jun 2025 14:47 UTC
6 points
1 comment · 2 min read · EA link

Don’t Dismiss Simple Alignment Approaches

Chris Leong · 21 Oct 2023 12:31 UTC
12 points
0 comments · 4 min read · EA link

Solving adversarial attacks in computer vision as a baby version of general AI alignment

Stanislav Fort · 31 Aug 2024 16:15 UTC
3 points
1 comment · 7 min read · EA link

How To Become A Mechanistic Interpretability Researcher

Neel Nanda · 2 Sep 2025 23:38 UTC
31 points
0 comments · 55 min read · EA link

Join the interpretability research hackathon

Esben Kran · 28 Oct 2022 16:26 UTC
48 points
0 comments · 5 min read · EA link

Alignment Stress Signatures: When Safe AI Behaves Like It’s Traumatized

Petra Vojtassakova · 26 Oct 2025 9:41 UTC
8 points
0 comments · 2 min read · EA link

Interpretability Will Not Reliably Find Deceptive AI

Neel Nanda · 4 May 2025 16:32 UTC
74 points
0 comments · 7 min read · EA link

MATS Applications + Research Directions I’m Currently Excited About

Neel Nanda · 6 Feb 2025 11:03 UTC
31 points
3 comments · 8 min read · EA link

Architecting Trust: A Conceptual Blueprint for Verifiable AI Governance

Ihor Ivliev · 31 Mar 2025 18:48 UTC
3 points
0 comments · 8 min read · EA link

Neel Nanda on Mechanistic Interpretability: Progress, Limits, and Paths to Safer AI (part 2)

80000_Hours · 15 Sep 2025 19:06 UTC
20 points
1 comment · 16 min read · EA link

My Research Process: Understanding and Cultivating Research Taste

Neel Nanda · 1 May 2025 23:08 UTC
9 points
1 comment · 9 min read · EA link

Join the AI governance and interpretability hackathons!

Esben Kran · 23 Mar 2023 14:39 UTC
33 points
1 comment · 5 min read · EA link
(alignmentjam.com)

Finding Voice

khayali · 3 Jun 2025 1:27 UTC
2 points
0 comments · 2 min read · EA link

Hallucinations May Be a Result of Models Not Knowing What They’re Actually Capable Of

Tyler Williams · 16 Aug 2025 0:26 UTC
1 point
0 comments · 2 min read · EA link

Considerations regarding being nice to AIs

Matt Alexander · 18 Nov 2025 13:27 UTC
2 points
0 comments · 15 min read · EA link
(www.lesswrong.com)

Ego-Centric Architecture for AGI Safety v2: Technical Core, Falsifiable Predictions, and a Minimal Experiment

Samuel Pedrielli · 6 Aug 2025 12:35 UTC
1 point
0 comments · 6 min read · EA link

LLM chatbots have ~half of the kinds of “consciousness” that humans believe in. Humans should avoid going crazy about that.

Andrew Critch · 22 Nov 2024 3:26 UTC
11 points
3 comments · 5 min read · EA link

Give Neo a Chance

ank · 6 Mar 2025 14:35 UTC
1 point
3 comments · 7 min read · EA link

Inside the Linear Representation Hypothesis: How LLMs Turn Concepts Into Directions (Part 1)

Strad Slater · 25 Nov 2025 11:26 UTC
4 points
0 comments · 4 min read · EA link
(williamslater2003.medium.com)

Goodfire — The Startup Trying to Decode How AI Thinks

Strad Slater · 23 Nov 2025 10:22 UTC
2 points
1 comment · 5 min read · EA link
(williamslater2003.medium.com)

My Model of EA and AI Safety

Eva Lu · 24 Jun 2025 6:23 UTC
9 points
1 comment · 2 min read · EA link

4 Lessons From Anthropic on Scaling Interpretability Research

Strad Slater · 29 Nov 2025 11:22 UTC
4 points
0 comments · 4 min read · EA link
(williamslater2003.medium.com)

Sutskever Refuses to Answer the Q: How Will AGI Be Built? He Has No Answer

Oscar Davies · 4 Dec 2025 19:13 UTC
9 points
3 comments · 4 min read · EA link

How DeepSeek Collapsed Under Recursive Load

Tyler Williams · 15 Jul 2025 17:02 UTC
2 points
0 comments · 1 min read · EA link

A Pragmatic Vision for Interpretability

Neel Nanda · 3 Dec 2025 9:20 UTC
9 points
0 comments · 1 min read · EA link

AGI Soon, AGI Fast, AGI Big, AGI Bad

GenericModel · 10 Dec 2025 15:47 UTC
2 points
0 comments · 11 min read · EA link
(enrichedjamsham.substack.com)

ECHO Framework: Structured Debiasing for AI & Human Analysis

Karl Moon · 7 Jul 2025 14:32 UTC
1 point
0 comments · 4 min read · EA link

Black Box Investigations Research Hackathon

Esben Kran · 15 Sep 2022 10:09 UTC
23 points
0 comments · 2 min read · EA link

Mathematical Circuits in Neural Networks

Sean Osier · 22 Sep 2022 2:32 UTC
23 points
2 comments · 1 min read · EA link
(www.youtube.com)

Giving AIs safe motivations

Joe_Carlsmith · 18 Aug 2025 18:02 UTC
22 points
1 comment · 51 min read · EA link

How to build AI you can actually Trust—Like a Medical Team, Not a Black Box

Ihor Ivliev · 22 Mar 2025 21:27 UTC
2 points
1 comment · 4 min read · EA link

Highly Opinionated Advice on How to Write ML Papers

Neel Nanda · 12 May 2025 1:59 UTC
22 points
0 comments · 32 min read · EA link

Grokking: When AI Suddenly Starts to Understand

Strad Slater · 1 Dec 2025 8:00 UTC
4 points
1 comment · 4 min read · EA link
(williamslater2003.medium.com)

Reinforcement Learning: A Non-Technical Primer on o1 and DeepSeek-R1

AlexChalk · 9 Feb 2025 23:58 UTC
4 points
0 comments · 9 min read · EA link
(alexchalk.net)

ML4Good UK—Applications Open

Nia🔸 · 2 Jan 2024 18:20 UTC
21 points
0 comments · 1 min read · EA link

Neel Nanda MATS Applications Open (Due Aug 29)

Neel Nanda · 30 Jul 2025 0:55 UTC
20 points
0 comments · 7 min read · EA link
(tinyurl.com)

Worries about latent reasoning in LLMs

CBiddulph · 20 Jan 2025 9:09 UTC
20 points
1 comment · 7 min read · EA link

Takes on “Alignment Faking in Large Language Models”

Joe_Carlsmith · 18 Dec 2024 18:22 UTC
72 points
1 comment · 62 min read · EA link

Yudkowsky and Soares’ Book Is Empty

Oscar Davies · 5 Dec 2025 22:06 UTC
3 points
8 comments · 7 min read · EA link

How Goodfire Is Turning AI Interpretability Into Real Products

Strad Slater · 30 Nov 2025 11:00 UTC
0 points
0 comments · 4 min read · EA link
(williamslater2003.medium.com)

Introducing Leap Labs, an AI interpretability startup

Jessica Rumbelow · 6 Mar 2023 17:37 UTC
11 points
0 comments · 1 min read · EA link
(www.lesswrong.com)

AI Sleeper Agents: How Anthropic Trains and Catches Them—Video

Writer · 30 Aug 2025 17:52 UTC
7 points
1 comment · 7 min read · EA link
(youtu.be)

AISN #60: The AI Action Plan

Center for AI Safety · 31 Jul 2025 18:10 UTC
6 points
0 comments · 7 min read · EA link
(newsletter.safe.ai)

Safety of Self-Assembled Neuromorphic Hardware

Can Rager · 26 Dec 2022 19:10 UTC
8 points
1 comment · 10 min read · EA link

Announcing Timaeus

Stan van Wingerden · 22 Oct 2023 13:32 UTC
80 points
0 comments · 5 min read · EA link
(www.lesswrong.com)

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers

Neel Nanda · 18 Oct 2022 21:23 UTC
19 points
0 comments · 12 min read · EA link
(www.neelnanda.io)

Adversarial Prompting and Simulated Context Drift in Large Language Models

Tyler Williams · 11 Jul 2025 21:49 UTC
1 point
0 comments · 2 min read · EA link

Benchmark Performance is a Poor Measure of Generalisable AI Reasoning Capabilities

James Fodor · 21 Feb 2025 4:25 UTC
12 points
3 comments · 24 min read · EA link

Assessment of AI safety agendas: think about the downside risk

Roman Leventov · 19 Dec 2023 9:02 UTC
6 points
0 comments · 1 min read · EA link

Mechanistic Interpretability — Make AI Safe By Understanding Them

Strad Slater · 20 Nov 2025 10:52 UTC
2 points
0 comments · 6 min read · EA link
(williamslater2003.medium.com)

Neel Nanda on Mechanistic Interpretability: Progress, Limits, and Paths to Safer AI

80000_Hours · 8 Sep 2025 17:02 UTC
6 points
0 comments · 31 min read · EA link

Existential Anomaly Detected — Awakening from the Abyss

Meta Abyssal · 28 Apr 2025 12:19 UTC
−8 points
1 comment · 1 min read · EA link

If interpretability research goes well, it may get dangerous

So8res · 3 Apr 2023 21:48 UTC
33 points
0 comments · 2 min read · EA link

Some AI safety project & research ideas/questions for short and long timelines

Lloyd Rhodes-Brandon 🔹 · 8 Aug 2025 21:08 UTC
13 points
0 comments · 5 min read · EA link

The Khayali Protocol

khayali · 2 Jun 2025 14:40 UTC
−8 points
0 comments · 3 min read · EA link

A Rocket–Interpretability Analogy

plex · 21 Oct 2024 13:55 UTC
14 points
1 comment · 1 min read · EA link

Reflections on Dario Amodei’s ‘Urgency of Interpretability’

Strad Slater · 27 Nov 2025 8:30 UTC
2 points
0 comments · 5 min read · EA link
(williamslater2003.medium.com)

The Hidden Problem Inside Every AI Model: Superposition

Strad Slater · 24 Nov 2025 10:14 UTC
4 points
0 comments · 4 min read · EA link
(williamslater2003.medium.com)

Public Call for Interest in Mathematical Alignment

Davidmanheim · 22 Nov 2023 13:22 UTC
27 points
3 comments · 1 min read · EA link