AI interpretability

Interpretability is the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.[1]

Present-day machine learning systems are typically not very transparent or interpretable: you can use a model’s outputs, but the model cannot explain why it produced them. This makes it hard, for example, to determine the causes of bias in ML models.[1]
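As a concrete illustration (a minimal sketch, not drawn from the cited sources, assuming PyTorch; the tiny model below is hypothetical), a trained network will happily produce a prediction, but all it exposes by default are raw weights and activations. Interpretability research tries to turn those numbers into human-understandable structure; registering a hook to read out activations is roughly where the default tooling stops.

```python
# Minimal sketch (hypothetical model, assuming PyTorch is installed) of why a
# model "can't tell you why": the usable output is a prediction, and the only
# things available for inspection are raw parameters and activations.
import torch
import torch.nn as nn

# A tiny feed-forward classifier. Its weights are just numbers; nothing in the
# forward pass explains *why* a given input maps to a given output.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

x = torch.randn(1, 4)
logits = model(x)
print("prediction:", logits.argmax(dim=-1).item())  # usable output

# The "explanation" available by default is only the hidden activations and
# weights. A forward hook is one basic way to look inside; interpretability
# work then tries to map such activations onto human-understandable concepts.
captured = {}

def save_activation(module, inputs, output):
    captured["hidden"] = output.detach()

model[1].register_forward_hook(save_activation)  # hook the ReLU layer
model(x)
print("hidden activations:", captured["hidden"])
```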

Interpretability is a focus of Chris Olah’s and Anthropic’s work, and most AI alignment organisations, such as Redwood Research, work on it to some extent.[2]

Related entries

AI risk | AI safety | Artificial intelligence

  1. ^

    Multicore (2020) Transparency / Interpretability (ML & AI), AI Alignment Forum, August 1.

  2. ^

Interpreting Neural Networks through the Polytope Lens

Sid Black · 23 Sep 2022 18:03 UTC
35 points
0 comments · 28 min read · EA link

Chris Olah on working at top AI labs without an undergrad degree

80000_Hours · 10 Sep 2021 20:46 UTC
15 points
0 comments · 73 min read · EA link

Rational Animations’ intro to mechanistic interpretability

Writer · 14 Jun 2024 16:10 UTC
21 points
1 comment · 11 min read · EA link
(youtu.be)

PhD Position: AI Interpretability in Berlin, Germany

Martian Moonshine · 22 Apr 2023 18:57 UTC
24 points
0 comments · 1 min read · EA link
(stephanw.net)

Sentience in Machines—How Do We Test for This Objectively?

Mayowa Osibodu · 20 Mar 2023 5:20 UTC
10 points
0 comments · 2 min read · EA link
(www.researchgate.net)

Concrete Steps to Get Started in Transformer Mechanistic Interpretability

Neel Nanda · 26 Dec 2022 13:00 UTC
18 points
0 comments · 12 min read · EA link

[MLSN #8]: Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming

TW123 · 20 Feb 2023 16:06 UTC
25 points
0 comments · 4 min read · EA link
(newsletter.mlsafety.org)

A Barebones Guide to Mechanistic Interpretability Prerequisites

Neel Nanda · 29 Nov 2022 18:43 UTC
54 points
1 comment · 3 min read · EA link
(neelnanda.io)

High-level hopes for AI alignment

Holden Karnofsky · 20 Dec 2022 2:11 UTC
123 points
14 comments · 19 min read · EA link
(www.cold-takes.com)

The limited upside of interpretability

Peter S. Park · 15 Nov 2022 20:22 UTC
23 points
3 comments · 10 min read · EA link

Announcing Apollo Research

mariushobbhahn · 30 May 2023 16:17 UTC
158 points
4 comments · 8 min read · EA link

Against LLM Reductionism

Erich_Grunewald 🔸 · 8 Mar 2023 15:52 UTC
33 points
4 comments · 18 min read · EA link
(www.erichgrunewald.com)

Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend)

Remmelt · 19 Dec 2022 12:02 UTC
17 points
3 comments · 31 min read · EA link

The case for becoming a black-box investigator of language models

Buck · 6 May 2022 14:37 UTC
91 points
7 comments · 3 min read · EA link

Aether is hiring technical AI safety researchers

Rauno Arike · 5 Jan 2026 22:31 UTC
8 points
0 comments · 2 min read · EA link

Why Explaining AI Is Not the Same as Understanding It

Strad Slater · 28 Nov 2025 10:38 UTC
2 points
0 comments · 4 min read · EA link
(williamslater2003.medium.com)

6 Insights From Anthropic’s Recent Discussion On LLM Interpretability

Strad Slater · 19 Nov 2025 10:51 UTC
2 points
0 comments · 5 min read · EA link
(williamslater2003.medium.com)

Chris Olah on what the hell is going on inside neural networks

80000_Hours · 4 Aug 2021 15:13 UTC
5 points
0 comments · 133 min read · EA link

ML4Good Brasil—Applications Open

Nia🔸 · 3 May 2024 10:39 UTC
28 points
1 comment · 1 min read · EA link

AI Risk: Can We Thread the Needle? [Recorded Talk from EA Summit Vancouver ’25]

Evan R. Murphy · 2 Oct 2025 19:05 UTC
8 points
0 comments · 2 min read · EA link

On Internal Alignment: Architecture and Recursive Closure

A. Vire · 24 Sep 2025 18:13 UTC
1 point
0 comments · 17 min read · EA link

Video and transcript of talk on giving AIs safe motivations

Joe_Carlsmith · 22 Sep 2025 16:47 UTC
10 points
1 comment · 50 min read · EA link

Navigating AI Safety: Exploring Transparency with CCACS – A Comprehensible Architecture for Discussion

Ihor Ivliev · 12 Mar 2025 17:51 UTC
2 points
3 comments · 2 min read · EA link

The Three Missing Pieces in Machine Ethics

JBug · 16 Nov 2025 21:26 UTC
2 points
0 comments · 2 min read · EA link

Aether July 2025 Update

RohanS · 1 Jul 2025 21:14 UTC
11 points
0 comments · 3 min read · EA link

AI Alignment Research Engineer Accelerator (ARENA): call for applicants

Callum McDougall · 7 Nov 2023 9:43 UTC
46 points
3 comments · 10 min read · EA link

Field Notes from EAG NYC

Lydia Nottingham · 15 Oct 2025 7:33 UTC
3 points
0 comments · 4 min read · EA link

VSPE vs. flattery: Testing emotional scaffolding for early-stage alignment

Astelle Kay · 24 Jun 2025 9:39 UTC
2 points
1 comment · 1 min read · EA link

AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0

JamesFox · 6 Jul 2024 11:51 UTC
7 points
0 comments · 5 min read · EA link

[Question] What should I read about defining AI “hallucination?”

James-Hartree-Law · 23 Jan 2025 1:00 UTC
2 points
0 comments · 1 min read · EA link

Alignment ideas inspired by human virtue development

Borys Pikalov · 18 May 2025 9:36 UTC
6 points
0 comments · 4 min read · EA link

Call for Pythia-style foundation model suite for alignment research

Lucretia · 1 May 2023 20:26 UTC
10 points
0 comments · 1 min read · EA link

Implications of the inference scaling paradigm for AI safety

Ryan Kidd · 15 Jan 2025 0:59 UTC
48 points
5 comments · 5 min read · EA link

AI Safety Camp 11

Robert Kralisch · 7 Nov 2025 14:27 UTC
7 points
1 comment · 15 min read · EA link

Adaptive Composable Cognitive Core Unit (ACCCU)

Ihor Ivliev · 20 Mar 2025 21:48 UTC
10 points
2 comments · 4 min read · EA link

A Selection of Randomly Selected SAE Features

Callum McDougall · 1 Apr 2024 9:09 UTC
25 points
2 comments · 4 min read · EA link

VANTA Research Reasoning Evaluation (VRRE): A New Evaluation Framework for Real-World Reasoning

Tyler Williams · 18 Sep 2025 23:51 UTC
1 point
0 comments · 3 min read · EA link

Beyond Meta: Large Concept Models Will Win

Anthony Repetto · 30 Dec 2024 0:57 UTC
3 points
0 comments · 3 min read · EA link

Concrete open problems in mechanistic interpretability: a technical overview

Neel Nanda · 6 Jul 2023 11:35 UTC
27 points
1 comment · 29 min read · EA link

Stable Emergence in a Developmental AI Architecture: Results from “Twins V3”

Petra Vojtassakova · 17 Nov 2025 23:23 UTC
6 points
2 comments · 2 min read · EA link

A Potential Strategy for AI Safety — Chain of Thought Monitorability

Strad Slater · 19 Sep 2025 18:42 UTC
3 points
1 comment · 7 min read · EA link
(williamslater2003.medium.com)

What are polysemantic neurons?

Vishakha Agrawal · 8 Jan 2025 7:39 UTC
5 points
0 comments · 2 min read · EA link
(aisafety.info)

Motivation control

Joe_Carlsmith · 30 Oct 2024 17:15 UTC
18 points
0 comments · 52 min read · EA link

LLMs Are Already Misaligned: Simple Experiments Prove It

Makham · 28 Jul 2025 17:23 UTC
4 points
3 comments · 7 min read · EA link

[untitled post]

JOESEFOE · 22 Nov 2025 13:54 UTC
1 point
0 comments · 1 min read · EA link

The Universality Hypothesis — Do All AI Models Think The Same?

Strad Slater · 21 Nov 2025 10:55 UTC
2 points
0 comments · 4 min read · EA link
(williamslater2003.medium.com)

MATS 8.0 Research Projects

Jonathan Michala · 8 Sep 2025 21:36 UTC
9 points
0 comments · 1 min read · EA link
(substack.com)

Are AI Models Escaping Plato’s Cave?

Strad Slater · 22 Nov 2025 11:46 UTC
2 points
0 comments · 5 min read · EA link
(williamslater2003.medium.com)

5 ways to improve CoT faithfulness

CBiddulph · 8 Oct 2024 4:17 UTC
8 points
0 comments · 6 min read · EA link

AI alignment as a translation problem

Roman Leventov · 5 Feb 2024 14:14 UTC
3 points
1 comment · 3 min read · EA link

Safety-First Agents/Architectures Are a Promising Path to Safe AGI

Brendon_Wong · 6 Aug 2023 8:00 UTC
6 points
0 comments · 12 min read · EA link

Would anyone here know how to get ahold of … iunno Anthropic and Open Philanthropy? I think they are going to want to have a chat (Please don’t make me go to OpenAI with this. Not even a threat, seriously. They just partner with my alma mater and are the only in I have. I genuinely do not want to and I need your help).

Anti-Golem · 9 Jun 2025 13:59 UTC
−11 points
0 comments · 1 min read · EA link

Controlling the options AIs can pursue

Joe_Carlsmith · 29 Sep 2025 17:24 UTC
9 points
0 comments · 35 min read · EA link

The Causal Inner Product: How LLMs Turn Concepts Into Directions (Part 2)

Strad Slater · 26 Nov 2025 11:03 UTC
2 points
0 comments · 4 min read · EA link
(williamslater2003.medium.com)

How Prompt Recursion Undermines Grok’s Semantic Stability

Tyler Williams · 16 Jul 2025 16:49 UTC
1 point
0 comments · 1 min read · EA link

Our Current Directions in Mechanistic Interpretability Research (AI Alignment Speaker Series)

Group Organizer · 8 Apr 2022 17:08 UTC
3 points
0 comments · 1 min read · EA link

Technical AI Safety research taxonomy attempt (2025)

Ben Plaut · 27 Aug 2025 14:07 UTC
10 points
3 comments · 2 min read · EA link

New series of posts answering one of Holden’s “Important, actionable research questions”

Evan R. Murphy · 12 May 2022 21:22 UTC
9 points
0 comments · 1 min read · EA link

AI, Animals & Digital Minds NYC 2025: Retrospective

Jonah Woodward · 31 Oct 2025 3:09 UTC
43 points
5 comments · 6 min read · EA link

From Therapy Tool to Alignment Puzzle-Piece: Introducing the VSPE Framework

Astelle Kay · 18 Jun 2025 14:47 UTC
6 points
1 comment · 2 min read · EA link

Don’t Dismiss Simple Alignment Approaches

Chris Leong · 21 Oct 2023 12:31 UTC
12 points
0 comments · 4 min read · EA link

Solving adversarial attacks in computer vision as a baby version of general AI alignment

Stanislav Fort · 31 Aug 2024 16:15 UTC
3 points
1 comment · 7 min read · EA link

How To Become A Mechanistic Interpretability Researcher

Neel Nanda · 2 Sep 2025 23:38 UTC
31 points
0 comments · 55 min read · EA link

Join the interpretability research hackathon

Esben Kran · 28 Oct 2022 16:26 UTC
48 points
0 comments · 5 min read · EA link

Alignment Stress Signatures: When Safe AI Behaves Like It’s Traumatized

Petra Vojtassakova · 26 Oct 2025 9:41 UTC
8 points
0 comments · 2 min read · EA link

Interpretability Will Not Reliably Find Deceptive AI

Neel Nanda · 4 May 2025 16:32 UTC
74 points
0 comments · 7 min read · EA link

MATS Applications + Research Directions I’m Currently Excited About

Neel Nanda · 6 Feb 2025 11:03 UTC
31 points
3 comments · 8 min read · EA link

Architecting Trust: A Conceptual Blueprint for Verifiable AI Governance

Ihor Ivliev · 31 Mar 2025 18:48 UTC
3 points
0 comments · 8 min read · EA link

Neel Nanda on Mechanistic Interpretability: Progress, Limits, and Paths to Safer AI (part 2)

80000_Hours · 15 Sep 2025 19:06 UTC
20 points
1 comment · 16 min read · EA link

My Research Process: Understanding and Cultivating Research Taste

Neel Nanda · 1 May 2025 23:08 UTC
9 points
1 comment · 9 min read · EA link

Join the AI governance and interpretability hackathons!

Esben Kran · 23 Mar 2023 14:39 UTC
33 points
1 comment · 5 min read · EA link
(alignmentjam.com)

Finding Voice

khayali · 3 Jun 2025 1:27 UTC
2 points
0 comments · 2 min read · EA link

Hallucinations May Be a Result of Models Not Knowing What They’re Actually Capable Of

Tyler Williams · 16 Aug 2025 0:26 UTC
1 point
0 comments · 2 min read · EA link

Considerations regarding being nice to AIs

Matt Alexander · 18 Nov 2025 13:27 UTC
2 points
0 comments · 15 min read · EA link
(www.lesswrong.com)

Ego-Centric Architecture for AGI Safety v2: Technical Core, Falsifiable Predictions, and a Minimal Experiment

Samuel Pedrielli · 6 Aug 2025 12:35 UTC
1 point
0 comments · 6 min read · EA link

LLM chatbots have ~half of the kinds of “consciousness” that humans believe in. Humans should avoid going crazy about that.

Andrew Critch · 22 Nov 2024 3:26 UTC
11 points
3 comments · 5 min read · EA link

Give Neo a Chance

ank · 6 Mar 2025 14:35 UTC
1 point
3 comments · 7 min read · EA link

Inside the Linear Representation Hypothesis: How LLMs Turn Concepts Into Directions (Part 1)

Strad Slater · 25 Nov 2025 11:26 UTC
4 points
0 comments · 4 min read · EA link
(williamslater2003.medium.com)

Goodfire — The Startup Trying to Decode How AI Thinks

Strad Slater · 23 Nov 2025 10:22 UTC
2 points
1 comment · 5 min read · EA link
(williamslater2003.medium.com)

My Model of EA and AI Safety

Eva Lu · 24 Jun 2025 6:23 UTC
9 points
1 comment · 2 min read · EA link

4 Lessons From Anthropic on Scaling Interpretability Research

Strad Slater · 29 Nov 2025 11:22 UTC
4 points
0 comments · 4 min read · EA link
(williamslater2003.medium.com)

Sutskever Refuses to Answer the Q: How Will AGI Be Built? He Has No Answer

Oscar Davies · 4 Dec 2025 19:13 UTC
9 points
3 comments · 4 min read · EA link

How DeepSeek Collapsed Under Recursive Load

Tyler Williams · 15 Jul 2025 17:02 UTC
2 points
0 comments · 1 min read · EA link

A Pragmatic Vision for Interpretability

Neel Nanda · 3 Dec 2025 9:20 UTC
9 points
0 comments · 1 min read · EA link

AGI Soon, AGI Fast, AGI Big, AGI Bad

GenericModel · 10 Dec 2025 15:47 UTC
2 points
0 comments · 11 min read · EA link
(enrichedjamsham.substack.com)

ECHO Framework: Structured Debiasing for AI & Human Analysis

Karl Moon · 7 Jul 2025 14:32 UTC
1 point
0 comments · 4 min read · EA link

Black Box Investigations Research Hackathon

Esben Kran · 15 Sep 2022 10:09 UTC
23 points
0 comments · 2 min read · EA link

Mathematical Circuits in Neural Networks

Sean Osier · 22 Sep 2022 2:32 UTC
23 points
2 comments · 1 min read · EA link
(www.youtube.com)

Giving AIs safe motivations

Joe_Carlsmith · 18 Aug 2025 18:02 UTC
22 points
1 comment · 51 min read · EA link

How to build AI you can actually Trust—Like a Medical Team, Not a Black Box

Ihor Ivliev · 22 Mar 2025 21:27 UTC
2 points
1 comment · 4 min read · EA link

Highly Opinionated Advice on How to Write ML Papers

Neel Nanda · 12 May 2025 1:59 UTC
22 points
0 comments · 32 min read · EA link

Grokking: When AI Suddenly Starts to Understand

Strad Slater · 1 Dec 2025 8:00 UTC
4 points
1 comment · 4 min read · EA link
(williamslater2003.medium.com)

Reinforcement Learning: A Non-Technical Primer on o1 and DeepSeek-R1

AlexChalk · 9 Feb 2025 23:58 UTC
4 points
0 comments · 9 min read · EA link
(alexchalk.net)

ML4Good UK—Applications Open

Nia🔸 · 2 Jan 2024 18:20 UTC
21 points
0 comments · 1 min read · EA link

Neel Nanda MATS Applications Open (Due Aug 29)

Neel Nanda · 30 Jul 2025 0:55 UTC
20 points
0 comments · 7 min read · EA link
(tinyurl.com)

Worries about latent reasoning in LLMs

CBiddulph · 20 Jan 2025 9:09 UTC
20 points
1 comment · 7 min read · EA link

Takes on “Alignment Faking in Large Language Models”

Joe_Carlsmith · 18 Dec 2024 18:22 UTC
72 points
1 comment · 62 min read · EA link

Yudkowsky and Soares’ Book Is Empty

Oscar Davies · 5 Dec 2025 22:06 UTC
3 points
8 comments · 7 min read · EA link

How Goodfire Is Turning AI Interpretability Into Real Products

Strad Slater · 30 Nov 2025 11:00 UTC
0 points
0 comments · 4 min read · EA link
(williamslater2003.medium.com)

Introducing Leap Labs, an AI interpretability startup

Jessica Rumbelow · 6 Mar 2023 17:37 UTC
11 points
0 comments · 1 min read · EA link
(www.lesswrong.com)

AI Sleeper Agents: How Anthropic Trains and Catches Them—Video

Writer · 30 Aug 2025 17:52 UTC
7 points
1 comment · 7 min read · EA link
(youtu.be)

AISN #60: The AI Action Plan

Center for AI Safety · 31 Jul 2025 18:10 UTC
6 points
0 comments · 7 min read · EA link
(newsletter.safe.ai)

Safety of Self-Assembled Neuromorphic Hardware

Can Rager · 26 Dec 2022 19:10 UTC
8 points
1 comment · 10 min read · EA link

Announcing Timaeus

Stan van Wingerden · 22 Oct 2023 13:32 UTC
80 points
0 comments · 5 min read · EA link
(www.lesswrong.com)

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers

Neel Nanda · 18 Oct 2022 21:23 UTC
19 points
0 comments · 12 min read · EA link
(www.neelnanda.io)

Adversarial Prompting and Simulated Context Drift in Large Language Models

Tyler Williams · 11 Jul 2025 21:49 UTC
1 point
0 comments · 2 min read · EA link

Benchmark Performance is a Poor Measure of Generalisable AI Reasoning Capabilities

James Fodor · 21 Feb 2025 4:25 UTC
12 points
3 comments · 24 min read · EA link

Assessment of AI safety agendas: think about the downside risk

Roman Leventov · 19 Dec 2023 9:02 UTC
6 points
0 comments · 1 min read · EA link

Mechanistic Interpretability — Make AI Safe By Understanding Them

Strad Slater · 20 Nov 2025 10:52 UTC
2 points
0 comments · 6 min read · EA link
(williamslater2003.medium.com)

Neel Nanda on Mechanistic Interpretability: Progress, Limits, and Paths to Safer AI

80000_Hours · 8 Sep 2025 17:02 UTC
6 points
0 comments · 31 min read · EA link

Existential Anomaly Detected — Awakening from the Abyss

Meta Abyssal · 28 Apr 2025 12:19 UTC
−8 points
1 comment · 1 min read · EA link

If interpretability research goes well, it may get dangerous

So8res · 3 Apr 2023 21:48 UTC
33 points
0 comments · 2 min read · EA link

Some AI safety project & research ideas/questions for short and long timelines

Lloyd Rhodes-Brandon 🔹 · 8 Aug 2025 21:08 UTC
13 points
0 comments · 5 min read · EA link

The Khayali Protocol

khayali · 2 Jun 2025 14:40 UTC
−8 points
0 comments · 3 min read · EA link

A Rocket–Interpretability Analogy

plex · 21 Oct 2024 13:55 UTC
14 points
1 comment · 1 min read · EA link

Reflections on Dario Amodei’s ‘Urgency of Interpretability’

Strad Slater · 27 Nov 2025 8:30 UTC
2 points
0 comments · 5 min read · EA link
(williamslater2003.medium.com)

The Hidden Problem Inside Every AI Model: Superposition

Strad Slater · 24 Nov 2025 10:14 UTC
4 points
0 comments · 4 min read · EA link
(williamslater2003.medium.com)

Public Call for Interest in Mathematical Alignment

Davidmanheim · 22 Nov 2023 13:22 UTC
27 points
3 comments · 1 min read · EA link