Mechanistic Interpretability Demo

Join us for a one-hour presentation by the author of this post: How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!

You’ll get the most out of Stefan’s talk if you meet the following prerequisites:

  1. Understanding the Transformer architecture: Know what the residual stream is, how attention layers and MLPs work, and how logits & predictions work. For later sections, familiarity with multi-head attention is useful. Here’s a link to Neel’s glossary, which provides excellent explanations of most terms I might use!
    If you’re not familiar with Transformers, you can check out Step 2 (6) in Neel’s guide or any of the other explanations online; I recommend Jay Alammar’s The Illustrated Transformer and/or Milan Straka’s lecture series.

  2. A general overview of Mechanistic Interpretability is helpful: see e.g. any of Neel’s talks, or look at the results in the IOI paper / walkthrough.

  3. Basic Python: Familiarity with arrays (as in NumPy or PyTorch, for indexing) is useful, but no PyTorch knowledge is explicitly required! See the short sketch after this list for the level of array familiarity assumed.
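To give a rough sense of the array familiarity assumed in point 3, here is a minimal NumPy sketch; the array names and shapes are made up purely for illustration and are not taken from the talk itself:

```python
import numpy as np

# A toy "logits" array: batch of 2 prompts, 5 positions, vocabulary of 10 tokens
logits = np.random.rand(2, 5, 10)

# Index the logits at the last position of the first prompt
last_position_logits = logits[0, -1, :]  # shape (10,)

# The predicted token is the index with the highest logit
predicted_token = last_position_logits.argmax()

print(last_position_logits.shape, predicted_token)
```

If indexing like `logits[0, -1, :]` and calling `.argmax()` feel comfortable, you have all the Python background you need.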
