
AI interpretability

Last edit: May 9, 2022, 10:40 AM by Leo

Interpretability is the extent to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.[1]

Present-day machine learning systems are typically not very transparent or interpretable: you can use a model's output, but the model cannot explain why it produced that output. This makes it hard, for example, to determine the cause of biases in ML models.[1]
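To make the opacity concrete, here is a minimal, hypothetical sketch (not from the cited source) using PyTorch: the toy classifier below is an illustrative stand-in for a real model, and its names and architecture are assumptions for the example. The prediction is usable, but the only record of "why" is a set of unlabelled weight matrices and activation vectors, which is exactly what interpretability research tries to explain.

```python
# Minimal illustrative sketch (hypothetical toy model, not from the article):
# a small PyTorch classifier whose output is usable even though its internals
# are not human-readable on their own.
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small untrained network standing in for a real model: 4 features -> 2 classes.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

x = torch.randn(1, 4)               # an arbitrary input
logits = model(x)                   # the output we can act on...
print("predicted class:", logits.argmax(dim=-1).item())

# ...but the "reasoning" behind it is only visible as raw parameters and
# intermediate activations, which carry no obvious human-readable meaning.
hidden = model[1](model[0](x))      # activations after the first ReLU
print("first-layer weights:\n", model[0].weight.data)
print("hidden activations:\n", hidden)
```

Mechanistic interpretability work, several examples of which are tagged below, tries to assign meaning to exactly these kinds of internal weights and activations.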

Interpretability is a focus of Chris Olah's and Anthropic's work, though most AI alignment organisations, such as Redwood Research, work on interpretability to some extent.[2]

Related entries

AI risk | AI safety | Artificial intelligence

  1. ^ Multicore (2020) Transparency / Interpretability (ML & AI), AI Alignment Forum, August 1.

  2. ^

Posts tagged AI interpretability

Chris Olah on working at top AI labs without an undergrad degree
80000_Hours · Sep 10, 2021, 8:46 PM · 15 points · 0 comments · 73 min read · EA link

Sentience in Machines—How Do We Test for This Objectively?
Mayowa Osibodu · Mar 20, 2023, 5:20 AM · 10 points · 0 comments · 2 min read · EA link (www.researchgate.net)

PhD Position: AI Interpretability in Berlin, Germany
Martian Moonshine · Apr 22, 2023, 6:57 PM · 24 points · 0 comments · 1 min read · EA link (stephanw.net)

Interpreting Neural Networks through the Polytope Lens
Sid Black · Sep 23, 2022, 6:03 PM · 35 points · 0 comments · 1 min read · EA link

Rational Animations’ intro to mechanistic interpretability
Writer · Jun 14, 2024, 4:10 PM · 21 points · 1 comment · 1 min read · EA link (youtu.be)

Why and When Interpretability Work is Dangerous
Nicholas / Heather Kross · May 28, 2023, 12:27 AM · 6 points · 0 comments · 1 min read · EA link

Announcing Apollo Research
mariushobbhahn · May 30, 2023, 4:17 PM · 158 points · 5 comments · 1 min read · EA link

Concrete Steps to Get Started in Transformer Mechanistic Interpretability
Neel Nanda · Dec 26, 2022, 1:00 PM · 18 points · 0 comments · 12 min read · EA link

[MLSN #8]: Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming
TW123 · Feb 20, 2023, 4:06 PM · 25 points · 0 comments · 4 min read · EA link (newsletter.mlsafety.org)

The limited upside of interpretability
Peter S. Park · Nov 15, 2022, 8:22 PM · 23 points · 3 comments · 10 min read · EA link

Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend)
Remmelt · Dec 19, 2022, 12:02 PM · 17 points · 3 comments · 1 min read · EA link

The case for becoming a black-box investigator of language models
Buck · May 6, 2022, 2:37 PM · 90 points · 7 comments · 3 min read · EA link

High-level hopes for AI alignment
Holden Karnofsky · Dec 20, 2022, 2:11 AM · 123 points · 14 comments · 19 min read · EA link (www.cold-takes.com)

Against LLM Reductionism
Erich_Grunewald 🔸 · Mar 8, 2023, 3:52 PM · 32 points · 4 comments · 1 min read · EA link

A Barebones Guide to Mechanistic Interpretability Prerequisites
Neel Nanda · Nov 29, 2022, 6:43 PM · 54 points · 1 comment · 3 min read · EA link (neelnanda.io)

Announcing Timaeus
Stan van Wingerden · Oct 22, 2023, 1:32 PM · 79 points · 0 comments · 5 min read · EA link (www.lesswrong.com)

Join the AI governance and interpretability hackathons!
Esben Kran · Mar 23, 2023, 2:39 PM · 33 points · 1 comment · 5 min read · EA link (alignmentjam.com)

If interpretability research goes well, it may get dangerous
So8res · Apr 3, 2023, 9:48 PM · 33 points · 0 comments · 1 min read · EA link

Safety-First Agents/Architectures Are a Promising Path to Safe AGI
Brendon_Wong · Aug 6, 2023, 8:00 AM · 6 points · 0 comments · 12 min read · EA link

Call for Pythia-style foundation model suite for alignment research
Lucretia · May 1, 2023, 8:26 PM · 10 points · 0 comments · 1 min read · EA link

Public Call for Interest in Mathematical Alignment
Davidmanheim · Nov 22, 2023, 1:22 PM · 27 points · 3 comments · 1 min read · EA link

AI Alignment Research Engineer Accelerator (ARENA): call for applicants
TheMcDouglas · Nov 7, 2023, 9:43 AM · 46 points · 3 comments · 10 min read · EA link

Assessment of AI safety agendas: think about the downside risk
Roman Leventov · Dec 19, 2023, 9:02 AM · 6 points · 0 comments · 1 min read · EA link

ML4Good UK—Applications Open
Nia · Jan 2, 2024, 6:20 PM · 21 points · 0 comments · 1 min read · EA link

A Selection of Randomly Selected SAE Features
TheMcDouglas · Apr 1, 2024, 9:09 AM · 25 points · 2 comments · 1 min read · EA link

AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0
JamesFox · Jul 6, 2024, 11:51 AM · 7 points · 0 comments · 5 min read · EA link

ML4Good Brasil—Applications Open
Nia · May 3, 2024, 10:39 AM · 28 points · 1 comment · 1 min read · EA link

5 ways to improve CoT faithfulness
CBiddulph · Oct 8, 2024, 4:17 AM · 8 points · 0 comments · 1 min read · EA link

MATS Applications + Research Directions I’m Currently Excited About
Neel Nanda · Feb 6, 2025, 11:03 AM · 23 points · 3 comments · 1 min read · EA link

Takes on “Alignment Faking in Large Language Models”
Joe_Carlsmith · Dec 18, 2024, 6:22 PM · 72 points · 1 comment · 1 min read · EA link

A Rocket–Interpretability Analogy
plex · Oct 21, 2024, 1:55 PM · 13 points · 1 comment · 1 min read · EA link

Reinforcement Learning: A Non-Technical Primer on o1 and DeepSeek-R1
AlexChalk · Feb 9, 2025, 11:58 PM · 4 points · 0 comments · 9 min read · EA link (alexchalk.net)

Beyond Meta: Large Concept Models Will Win
Anthony Repetto · Dec 30, 2024, 12:57 AM · 3 points · 0 comments · 3 min read · EA link

What are polysemantic neurons?
Vishakha Agrawal · Jan 8, 2025, 7:39 AM · 4 points · 0 comments · 2 min read · EA link (aisafety.info)

Implications of the inference scaling paradigm for AI safety
Ryan Kidd · Jan 15, 2025, 12:59 AM · 46 points · 5 comments · 1 min read · EA link

Motivation control
Joe_Carlsmith · Oct 30, 2024, 5:15 PM · 18 points · 0 comments · 1 min read · EA link

Benchmark Performance is a Poor Measure of Generalisable AI Reasoning Capabilities
James Fodor · Feb 21, 2025, 4:25 AM · 12 points · 3 comments · 24 min read · EA link

Worries about latent reasoning in LLMs
CBiddulph · Jan 20, 2025, 9:09 AM · 20 points · 1 comment · 1 min read · EA link

[Question] What should I read about defining AI “hallucination?”
James-Hartree-Law · Jan 23, 2025, 1:00 AM · 2 points · 0 comments · 1 min read · EA link

Give Neo a Chance
ank · Mar 6, 2025, 2:35 PM · 1 point · 3 comments · 7 min read · EA link

LLM chatbots have ~half of the kinds of “consciousness” that humans believe in. Humans should avoid going crazy about that.
Andrew Critch · Nov 22, 2024, 3:26 AM · 11 points · 3 comments · 1 min read · EA link

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers
Neel Nanda · Oct 18, 2022, 9:23 PM · 19 points · 0 comments · 12 min read · EA link (www.neelnanda.io)

Safety of Self-Assembled Neuromorphic Hardware
Can Rager · Dec 26, 2022, 7:10 PM · 8 points · 1 comment · 10 min read · EA link

New series of posts answering one of Holden’s “Important, actionable research questions”
Evan R. Murphy · May 12, 2022, 9:22 PM · 9 points · 0 comments · 1 min read · EA link

Chris Olah on what the hell is going on inside neural networks
80000_Hours · Aug 4, 2021, 3:13 PM · 5 points · 0 comments · 133 min read · EA link

Mathematical Circuits in Neural Networks
Sean Osier · Sep 22, 2022, 2:32 AM · 23 points · 2 comments · 1 min read · EA link (www.youtube.com)

Black Box Investigations Research Hackathon
Esben Kran · Sep 15, 2022, 10:09 AM · 23 points · 0 comments · 2 min read · EA link

Join the interpretability research hackathon
Esben Kran · Oct 28, 2022, 4:26 PM · 48 points · 0 comments · 5 min read · EA link

Our Current Directions in Mechanistic Interpretability Research (AI Alignment Speaker Series)
Group Organizer · Apr 8, 2022, 5:08 PM · 3 points · 0 comments · 1 min read · EA link

Navigating AI Safety: Exploring Transparency with CCACS – A Comprehensible Architecture for Discussion
Ihor Ivliev · Mar 12, 2025, 5:51 PM · 2 points · 0 comments · 2 min read · EA link

Solving adversarial attacks in computer vision as a baby version of general AI alignment
Stanislav Fort · Aug 31, 2024, 4:15 PM · 3 points · 1 comment · 7 min read · EA link

AI alignment as a translation problem
Roman Leventov · Feb 5, 2024, 2:14 PM · 3 points · 1 comment · 1 min read · EA link

Introducing Leap Labs, an AI interpretability startup
Jessica Rumbelow · Mar 6, 2023, 5:37 PM · 11 points · 0 comments · 1 min read · EA link (www.lesswrong.com)

Concrete open problems in mechanistic interpretability: a technical overview
Neel Nanda · Jul 6, 2023, 11:35 AM · 27 points · 1 comment · 29 min read · EA link

Don’t Dismiss Simple Alignment Approaches
Chris Leong · Oct 21, 2023, 12:31 PM · 12 points · 0 comments · 1 min read · EA link