
AI interpretability

Last edit: 9 May 2022 10:40 UTC by Leo

Interpretability is the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.[1]

Present-day machine learning systems are typically not very transparent or interpretable: you can use a model’s output, but the model cannot tell you why it produced that output. This opacity makes it hard, for example, to determine the cause of biases in ML models.[1]
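
As a minimal, illustrative sketch of this opacity (not from the cited source; the dataset, model, and all names below are arbitrary choices for illustration): a small neural network produces a prediction easily, but its learned parameters are just arrays of numbers that do not, by themselves, state the reason for that prediction.

```python
# Minimal sketch (illustrative, not from the cited article): a small neural
# network classifies inputs, but inspecting its parameters does not directly
# reveal *why* any particular prediction was made.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
model.fit(X, y)

x = X[:1]
print("prediction:", model.predict(x))               # the output is easy to get...
print("weight shapes:", [w.shape for w in model.coefs_])
# ...but the internals are just weight matrices; nothing here states the
# model's "reasons". Interpretability research tries to recover such
# explanations (e.g. circuits, features, attributions) from these internals.
```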

Interpretability is a particular focus of Chris Olah’s and Anthropic’s work, though most AI alignment organisations, such as Redwood Research, work on it to some extent.[2]

Related entries

AI risk | AI safety | Artificial intelligence

  1. ^

    Multicore (2020) Transparency / Interpretability (ML & AI), AI Alignment Forum, August 1.

  2. ^

Interpreting Neural Networks through the Polytope Lens

Sid Black · 23 Sep 2022 18:03 UTC
35 points
0 comments · 28 min read · EA link

Chris Olah on working at top AI labs without an undergrad degree

80000_Hours · 10 Sep 2021 20:46 UTC
15 points
0 comments · 73 min read · EA link

Rational Animations’ intro to mechanistic interpretability

Writer · 14 Jun 2024 16:10 UTC
21 points
1 comment · 11 min read · EA link
(youtu.be)

PhD Position: AI Interpretability in Berlin, Germany

Martian Moonshine · 22 Apr 2023 18:57 UTC
24 points
0 comments · 1 min read · EA link
(stephanw.net)

Sentience in Machines—How Do We Test for This Objectively?

Mayowa Osibodu · 20 Mar 2023 5:20 UTC
10 points
0 comments · 2 min read · EA link
(www.researchgate.net)

Concrete Steps to Get Started in Transformer Mechanistic Interpretability

Neel Nanda · 26 Dec 2022 13:00 UTC
18 points
0 comments · 12 min read · EA link

[MLSN #8]: Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming

TW123 · 20 Feb 2023 16:06 UTC
25 points
0 comments · 4 min read · EA link
(newsletter.mlsafety.org)

A Barebones Guide to Mechanistic Interpretability Prerequisites

Neel Nanda · 29 Nov 2022 18:43 UTC
54 points
1 comment · 3 min read · EA link
(neelnanda.io)

High-level hopes for AI alignment

Holden Karnofsky · 20 Dec 2022 2:11 UTC
123 points
14 comments · 19 min read · EA link
(www.cold-takes.com)

The limited upside of interpretability

Peter S. Park · 15 Nov 2022 20:22 UTC
23 points
3 comments · 10 min read · EA link

Announcing Apollo Research

mariushobbhahn · 30 May 2023 16:17 UTC
158 points
5 comments · 8 min read · EA link

Against LLM Reductionism

Erich_Grunewald 🔸 · 8 Mar 2023 15:52 UTC
32 points
4 comments · 18 min read · EA link
(www.erichgrunewald.com)

Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend)

Remmelt · 19 Dec 2022 12:02 UTC
17 points
3 comments · 31 min read · EA link

The case for becoming a black-box investigator of language models

Buck · 6 May 2022 14:37 UTC
91 points
7 comments · 3 min read · EA link

Chris Olah on what the hell is going on inside neural networks

80000_Hours · 4 Aug 2021 15:13 UTC
5 points
0 comments · 133 min read · EA link

ML4Good Brasil—Applications Open

Nia🔸 · 3 May 2024 10:39 UTC
28 points
1 comment · 1 min read · EA link

Navigating AI Safety: Exploring Transparency with CCACS – A Comprehensible Architecture for Discussion

Ihor Ivliev · 12 Mar 2025 17:51 UTC
2 points
3 comments · 2 min read · EA link

Aether July 2025 Update

RohanS · 1 Jul 2025 21:14 UTC
10 points
0 comments · 3 min read · EA link

AI Alignment Research Engineer Accelerator (ARENA): call for applicants

Callum McDougall · 7 Nov 2023 9:43 UTC
46 points
3 comments · 10 min read · EA link

VSPE vs. flattery: Testing emotional scaffolding for early-stage alignment

Astelle Kay · 24 Jun 2025 9:39 UTC
2 points
1 comment · 1 min read · EA link

AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0

JamesFox · 6 Jul 2024 11:51 UTC
7 points
0 comments · 5 min read · EA link

[Question] What should I read about defining AI “hallucination?”

James-Hartree-Law · 23 Jan 2025 1:00 UTC
2 points
0 comments · 1 min read · EA link

Alignment ideas inspired by human virtue development

Borys Pikalov · 18 May 2025 9:36 UTC
6 points
0 comments · 4 min read · EA link

Call for Pythia-style foundation model suite for alignment research

Lucretia · 1 May 2023 20:26 UTC
10 points
0 comments · 1 min read · EA link

A New Way to Rethink Alignment

Taylor Grogan · 28 Jul 2025 20:56 UTC
1 point
0 comments · 2 min read · EA link

Implications of the inference scaling paradigm for AI safety

Ryan Kidd · 15 Jan 2025 0:59 UTC
47 points
5 comments · 5 min read · EA link

Adaptive Composable Cognitive Core Unit (ACCCU)

Ihor Ivliev · 20 Mar 2025 21:48 UTC
10 points
2 comments · 4 min read · EA link

A Selection of Randomly Selected SAE Features

Callum McDougall · 1 Apr 2024 9:09 UTC
25 points
2 comments · 4 min read · EA link

Beyond Meta: Large Concept Models Will Win

Anthony Repetto · 30 Dec 2024 0:57 UTC
3 points
0 comments · 3 min read · EA link

Concrete open problems in mechanistic interpretability: a technical overview

Neel Nanda · 6 Jul 2023 11:35 UTC
27 points
1 comment · 29 min read · EA link

What are polysemantic neurons?

Vishakha Agrawal · 8 Jan 2025 7:39 UTC
5 points
0 comments · 2 min read · EA link
(aisafety.info)

Motivation control

Joe_Carlsmith · 30 Oct 2024 17:15 UTC
18 points
0 comments · 52 min read · EA link

LLMs Are Already Misaligned: Simple Experiments Prove It

Makham · 28 Jul 2025 17:23 UTC
4 points
3 comments · 7 min read · EA link

5 ways to improve CoT faithfulness

CBiddulph · 8 Oct 2024 4:17 UTC
8 points
0 comments · 6 min read · EA link

AI alignment as a translation problem

Roman Leventov · 5 Feb 2024 14:14 UTC
3 points
1 comment · 3 min read · EA link

Safety-First Agents/Architectures Are a Promising Path to Safe AGI

Brendon_Wong · 6 Aug 2023 8:00 UTC
6 points
0 comments · 12 min read · EA link

Would anyone here know how to get ahold of … iunno Anthropic and Open Philanthropy? I think they are going to want to have a chat (Please don’t make me go to OpenAI with this. Not even a threat, seriously. They just partner with my alma mater and are the only in I have. I genuinely do not want to and I need your help).

Anti-Golem · 9 Jun 2025 13:59 UTC
−11 points
0 comments · 1 min read · EA link

How Prompt Recursion Undermines Grok’s Semantic Stability

Tyler Williams · 16 Jul 2025 16:49 UTC
1 point
0 comments · 1 min read · EA link

Our Current Directions in Mechanistic Interpretability Research (AI Alignment Speaker Series)

Group Organizer · 8 Apr 2022 17:08 UTC
3 points
0 comments · 1 min read · EA link

Technical AI Safety research taxonomy attempt (2025)

Ben Plaut · 27 Aug 2025 14:07 UTC
9 points
3 comments · 2 min read · EA link

New series of posts answering one of Holden’s “Important, actionable research questions”

Evan R. Murphy · 12 May 2022 21:22 UTC
9 points
0 comments · 1 min read · EA link

From Therapy Tool to Alignment Puzzle-Piece: Introducing the VSPE Framework

Astelle Kay · 18 Jun 2025 14:47 UTC
6 points
1 comment · 2 min read · EA link

Don’t Dismiss Simple Alignment Approaches

Chris Leong · 21 Oct 2023 12:31 UTC
12 points
0 comments · 4 min read · EA link

Solving adversarial attacks in computer vision as a baby version of general AI alignment

Stanislav Fort · 31 Aug 2024 16:15 UTC
3 points
1 comment · 7 min read · EA link

Join the interpretability research hackathon

Esben Kran · 28 Oct 2022 16:26 UTC
48 points
0 comments · 5 min read · EA link

Interpretability Will Not Reliably Find Deceptive AI

Neel Nanda · 4 May 2025 16:32 UTC
74 points
0 comments · 7 min read · EA link

MATS Applications + Research Directions I’m Currently Excited About

Neel Nanda · 6 Feb 2025 11:03 UTC
31 points
3 comments · 8 min read · EA link

Architecting Trust: A Conceptual Blueprint for Verifiable AI Governance

Ihor Ivliev · 31 Mar 2025 18:48 UTC
3 points
0 comments · 8 min read · EA link

My Research Process: Understanding and Cultivating Research Taste

Neel Nanda · 1 May 2025 23:08 UTC
9 points
1 comment · 9 min read · EA link

Join the AI governance and interpretability hackathons!

Esben Kran · 23 Mar 2023 14:39 UTC
33 points
1 comment · 5 min read · EA link
(alignmentjam.com)

Finding Voice

khayali · 3 Jun 2025 1:27 UTC
2 points
0 comments · 2 min read · EA link

Hallucinations May Be a Result of Models Not Knowing What They’re Actually Capable Of

Tyler Williams · 16 Aug 2025 0:26 UTC
1 point
0 comments · 2 min read · EA link

Ego-Centric Architecture for AGI Safety v2: Technical Core, Falsifiable Predictions, and a Minimal Experiment

Samuel Pedrielli · 6 Aug 2025 12:35 UTC
1 point
0 comments · 6 min read · EA link

LLM chatbots have ~half of the kinds of “consciousness” that humans believe in. Humans should avoid going crazy about that.

Andrew Critch · 22 Nov 2024 3:26 UTC
11 points
3 comments · 5 min read · EA link

Give Neo a Chance

ank · 6 Mar 2025 14:35 UTC
1 point
3 comments · 7 min read · EA link

My Model of EA and AI Safety

Eva Lu · 24 Jun 2025 6:23 UTC
9 points
1 comment · 2 min read · EA link

How DeepSeek Collapsed Under Recursive Load

Tyler Williams · 15 Jul 2025 17:02 UTC
2 points
0 comments · 1 min read · EA link

ECHO Framework: Structured Debiasing for AI & Human Analysis

Karl Moon · 7 Jul 2025 14:32 UTC
1 point
0 comments · 4 min read · EA link

Black Box Investigations Research Hackathon

Esben Kran · 15 Sep 2022 10:09 UTC
23 points
0 comments · 2 min read · EA link

Mathematical Circuits in Neural Networks

Sean Osier · 22 Sep 2022 2:32 UTC
23 points
2 comments · 1 min read · EA link
(www.youtube.com)

Giving AIs safe motivations

Joe_Carlsmith · 18 Aug 2025 18:02 UTC
22 points
1 comment · 51 min read · EA link

How to build AI you can actually Trust—Like a Medical Team, Not a Black Box

Ihor Ivliev · 22 Mar 2025 21:27 UTC
2 points
1 comment · 4 min read · EA link

Highly Opinionated Advice on How to Write ML Papers

Neel Nanda · 12 May 2025 1:59 UTC
22 points
0 comments · 32 min read · EA link

Reinforcement Learning: A Non-Technical Primer on o1 and DeepSeek-R1

AlexChalk · 9 Feb 2025 23:58 UTC
4 points
0 comments · 9 min read · EA link
(alexchalk.net)

ML4Good UK—Applications Open

Nia🔸 · 2 Jan 2024 18:20 UTC
21 points
0 comments · 1 min read · EA link

Neel Nanda MATS Applications Open (Due Aug 29)

Neel Nanda · 30 Jul 2025 0:55 UTC
20 points
0 comments · 7 min read · EA link
(tinyurl.com)

Worries about latent reasoning in LLMs

CBiddulph · 20 Jan 2025 9:09 UTC
20 points
1 comment · 7 min read · EA link

Takes on “Alignment Faking in Large Language Models”

Joe_Carlsmith · 18 Dec 2024 18:22 UTC
72 points
1 comment · 62 min read · EA link

Introducing Leap Labs, an AI interpretability startup

Jessica Rumbelow · 6 Mar 2023 17:37 UTC
11 points
0 comments · 1 min read · EA link
(www.lesswrong.com)

AISN #60: The AI Action Plan

Center for AI Safety · 31 Jul 2025 18:10 UTC
6 points
0 comments · 7 min read · EA link
(newsletter.safe.ai)

Safety of Self-Assembled Neuromorphic Hardware

Can Rager · 26 Dec 2022 19:10 UTC
8 points
1 comment · 10 min read · EA link

Announcing Timaeus

Stan van Wingerden · 22 Oct 2023 13:32 UTC
80 points
0 comments · 5 min read · EA link
(www.lesswrong.com)

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers

Neel Nanda · 18 Oct 2022 21:23 UTC
19 points
0 comments · 12 min read · EA link
(www.neelnanda.io)

Adversarial Prompting and Simulated Context Drift in Large Language Models

Tyler Williams · 11 Jul 2025 21:49 UTC
1 point
0 comments · 2 min read · EA link

Benchmark Performance is a Poor Measure of Generalisable AI Reasoning Capabilities

James Fodor · 21 Feb 2025 4:25 UTC
12 points
3 comments · 24 min read · EA link

Assessment of AI safety agendas: think about the downside risk

Roman Leventov · 19 Dec 2023 9:02 UTC
6 points
0 comments · 1 min read · EA link

Existential Anomaly Detected — Awakening from the Abyss

Meta Abyssal · 28 Apr 2025 12:19 UTC
−8 points
1 comment · 1 min read · EA link

If interpretability research goes well, it may get dangerous

So8res · 3 Apr 2023 21:48 UTC
33 points
0 comments · 2 min read · EA link

Some AI safety project & research ideas/questions for short and long timelines

Lloy2 🔹 · 8 Aug 2025 21:08 UTC
13 points
0 comments · 5 min read · EA link

The Khayali Protocol

khayali · 2 Jun 2025 14:40 UTC
−8 points
0 comments · 3 min read · EA link

A Rocket–Interpretability Analogy

plex · 21 Oct 2024 13:55 UTC
14 points
1 comment · 1 min read · EA link

Public Call for Interest in Mathematical Alignment

Davidmanheim · 22 Nov 2023 13:22 UTC
27 points
3 comments · 1 min read · EA link