Mechanistic Interpretability — Make AI Safe By Understanding Them
Quick Intro: My name is Strad, and I'm a new grad working in tech who wants to learn and write more about AI safety and how technology will affect our future. I'm challenging myself to write a short article a day to get back into writing. I'd love any feedback on the article and any advice on writing in this field!
Large language models (LLMs) such as ChatGPT and Claude make billions of decisions every day. Millions of people converse with these tools daily, putting constant pressure on them to work out the best possible responses. The problem is, we don't understand how these tools actually make these decisions.
LLMs start off as neural networks with billions of neurons, and, through a training process, they learn to adjust these neurons and their connections in order to generate useful responses. This is different from normal programming in that no human sets these neurons manually. Because of this, the mechanisms by which these neurons produce useful responses are quite unclear to us.
Mechanistic Interpretability (MI) is a field of research that aims to understand these mechanisms. Much as neuroscientists try to reverse engineer the brain by mapping neural activity to human behavior, MI research tries to do the same with LLMs.
For example, we might ask: where in the LLM is the concept of a dog stored? Or how does the model figure out when to use addition versus subtraction in a math problem? More broadly, what algorithms or "thinking" processes do LLMs use to figure out what response to generate?
Figuring this out might be crucial for keeping AI models safe as they continue to outpace our own abilities. Since we plan to offload many of our responsibilities onto these models in the future, understanding how they reason would go a long way toward letting us trust them with those tasks.
So, given how important and useful MI research could be, how do we actually begin to understand such a complex technology?
The Building Blocks of Mechanistic Interpretability
The two basic building blocks in the study of MI are features and circuits. Features are the concepts that an LLM knows and stores in its neurons. For example, an LLM might store the concept of "dogs" in one neuron and the concept of "Paris" in another. These neurons, together with the concepts they store, are features.
Circuits are the connections between certain neurons that act as an algorithm or “thinking” process. For example, imagine you have separate neurons representing concepts such as windows, wheels, and doors. The specific combination of these three neurons might connect to another neuron that represents the concept of a car. This connection between the neurons is a circuit. It represents an algorithm that essentially converts the concepts of windows, wheels and doors into the concept of a car.
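To make this concrete, here is a toy numerical sketch of the car "circuit" described above. It pretends each concept lives in a single neuron, which real LLMs do not do; the weights and bias are made up purely for illustration.

```python
# Toy "circuit": a car neuron that fires only when the window, wheel,
# and door neurons are all active at the same time.
import numpy as np

feature_activations = np.array([1.0, 1.0, 1.0])  # [windows, wheels, doors]
weights = np.array([1.0, 1.0, 1.0])              # connections into the "car" neuron
bias = -2.5

# ReLU-style activation: positive only when roughly all three inputs fire.
car_activation = max(0.0, weights @ feature_activations + bias)
print(car_activation)  # 0.5 -> the "car" concept is active
```

If only two of the three input features fire, the weighted sum falls below the bias and the "car" neuron stays silent, which is the sense in which the connection pattern encodes a small algorithm.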
The goal of MI is, essentially, to map all the neurons in a model to their associated features and circuits. With that map, we could interpret how a model is reasoning just by looking at which neurons activate during a given task.
How to Find Features and Circuits in LLMs
There are a few general methods used to map groupings of neurons in LLMs to concepts and algorithms.
Probing is a technique for observing the model's neurons while it executes a specific task. It uses a small, separate model (often just a linear classifier) to test whether the activations of a section of neurons in the LLM can predict a specific property, such as whether an image contains a window. If the small model achieves high accuracy predicting that property from those activation patterns alone, then that section of neurons likely stores the information used to determine it.
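As a rough illustration, here is a minimal linear probe trained on made-up data. The "activations" and the "property" label are placeholders standing in for real model internals, not outputs from any actual LLM.

```python
# Minimal linear probe sketch: can a simple classifier predict a property
# of the input from a layer's activation vectors?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 512))    # 1000 inputs, 512-dim activations (placeholder)
labels = (activations[:, 7] > 0).astype(int)  # toy "property" hidden in the activations

X_train, X_test, y_train, y_test = train_test_split(activations, labels, random_state=0)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# High held-out accuracy suggests this layer's activations encode the property.
print("probe accuracy:", probe.score(X_test, y_test))
```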
Another way to analyze models is activation patching, which directly alters activations in the model to see how doing so affects the output. The method involves running an LLM on two inputs: one where a specified property is observed and one where it isn't. The activation pattern of the section of neurons being tested is recorded in both runs. Then the activation pattern from the run that showed the property is inserted into the one that didn't. If this causes the property to appear in the patched run, then that section of neurons is likely carrying the information for the property.
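Here is a simplified sketch of the idea on a toy PyTorch network rather than a real LLM. The model, the chosen layer, and the two inputs are stand-ins, but the record-then-overwrite pattern is the core of the method.

```python
# Activation patching sketch: record activations from a "clean" run,
# then overwrite the same layer during a "corrupted" run.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
layer = model[1]  # the activations we want to patch (after the ReLU)

clean_input = torch.randn(1, 16)      # stand-in for an input showing the property
corrupted_input = torch.randn(1, 16)  # stand-in for an input without it

# Step 1: record the clean run's activations at the chosen layer.
cache = {}
handle = layer.register_forward_hook(lambda m, i, out: cache.update(act=out.detach()))
model(clean_input)
handle.remove()

# Step 2: rerun the corrupted input, overwriting that layer with the clean activations.
handle = layer.register_forward_hook(lambda m, i, out: cache["act"])
patched_output = model(corrupted_input)
handle.remove()

# If patching this layer restores the clean behavior in the output,
# the layer likely carries the information for the property being tested.
print(patched_output)
```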
Once features are found within an LLM, researchers can begin circuit analysis, which works to identify which features are connected to each other and how they relate. One simple signal is to see which features' neurons tend to activate together, and then map their connections to one another.
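A toy version of that co-activation signal might look like the following, using made-up feature activations. Real circuit analysis relies on more careful causal methods (such as the patching above), but this shows the basic statistic.

```python
# Toy co-activation check: which (hypothetical) features tend to fire together?
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_features = 500, 6
feature_activations = rng.normal(size=(n_prompts, n_features))
# Make features 0 and 3 co-activate, as if one feeds into the other.
feature_activations[:, 3] += 0.9 * feature_activations[:, 0]

corr = np.corrcoef(feature_activations, rowvar=False)
print(np.round(corr, 2))  # large off-diagonal entries hint at connected features
```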
Obstacles to Mechanistic Interpretability Research
We have been talking as if LLMs store one concept per neuron. While this would be ideal for MI research, it is far from the case. LLMs exhibit polysemanticity, where a single neuron responds to multiple different concepts, and superposition, where the model represents more concepts than it has neurons by spreading each concept across combinations of neurons. These properties make MI research much more difficult, since the meanings of different concepts are tangled up across the neurons rather than being nicely separated.
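One way to build intuition for superposition is to note that a model can approximately represent more concepts than it has neurons by assigning each concept a direction that overlaps several neurons. Here is a tiny, contrived illustration of three "concepts" sharing two dimensions; it is a geometric cartoon, not a claim about any real model.

```python
# Three concept directions squeezed into a 2-dimensional space must overlap,
# so no single dimension ("neuron") cleanly corresponds to one concept.
import numpy as np

angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
concept_directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # 3 concepts, 2 dims

print(concept_directions)                          # each concept uses both dimensions
print(concept_directions @ concept_directions.T)   # nonzero dot products = interference
```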
Another obstacle to MI research is scalability. Most successful interpretability research has been done on smaller, less capable models than those at the frontier of the big AI companies. As models get larger, there are so many more neurons, and therefore possible combinations of neurons, to analyze that it seems unrealistic for any group of humans to parse through it all. On top of that, there is a vast number of possible concepts a model could know.
Given the complexity of mapping neurons to their tangled representation of concepts, and the sheer scale of neurons and concepts to be analyzed, it seems that advancements in MI research will likely require help from automation and even AI itself.
One potential solution currently being researched is the sparse autoencoder (SAE). At a high level, this is a small neural network that learns to rewrite the LLM's activations into a more interpretable form, where concepts are more clearly separated from one another. Because it is learned automatically, it also helps with scalability.
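For a sense of the architecture, here is a minimal SAE sketch in PyTorch, trained on placeholder activations. The layer sizes and the sparsity penalty weight are arbitrary choices for illustration, not settings from any particular paper.

```python
# Minimal sparse autoencoder: expand activations into many features,
# keep most of them at zero, and reconstruct the original activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # maps activations to many sparse features
        self.decoder = nn.Linear(d_hidden, d_model)  # reconstructs the original activations

    def forward(self, x):
        features = torch.relu(self.encoder(x))
        return self.decoder(features), features

sae = SparseAutoencoder(d_model=512, d_hidden=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

activations = torch.randn(64, 512)  # placeholder batch of LLM activations
for _ in range(100):
    reconstruction, features = sae(activations)
    # Reconstruction loss keeps the rewrite faithful; the L1 term pushes most
    # features to zero so each remaining one is easier to interpret on its own.
    loss = ((reconstruction - activations) ** 2).mean() + 1e-3 * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```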
The Importance and Benefits of Mechanistic Interpretability
One of the most important impacts of MI research would be its ability to help with AI safety and alignment. This is a big reason why Anthropic, an AI safety-focused company, contributes heavily to research in this field. (They recently had a discussion with their interpretability team, which I summarized into 6 key insights that can be found here.)
By truly understanding an AI's reasoning process, we could quickly detect whether it has developed any misaligned goals. The hope is that, even if an AI tries to hide its misaligned behaviors from its outputs, it wouldn't be able to hide the internal reasoning that contains its true intention to deceive. An understanding of this internal reasoning might therefore act as a useful failsafe in the case of an AI that vastly exceeds our own intelligence.
By knowing which part of the AI caused these misaligned goals to arise, we could also potentially remove them by altering that part. This practice of altering the inner workings of an AI to change its behavior is known as steering, and it has obvious benefits that go beyond AI safety.
We could amplify positive behaviors in a model by altering the parts responsible for them, and if there are bugs in the model, we could turn them off once we know where in the model they actually lie.
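One common form of steering in the research literature is adding a "steering vector" to a layer's activations at inference time. The sketch below does this on a toy network; the model, layer, and vector are placeholders rather than anything extracted from a real LLM.

```python
# Steering sketch: nudge a layer's activations along a direction associated
# with a behavior, amplifying or suppressing it at inference time.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
layer = model[1]

steering_vector = torch.randn(32)  # hypothetical direction for the behavior of interest
strength = 2.0                     # positive to amplify, negative to suppress

def steer(module, inputs, output):
    # Returning a modified output from a forward hook overrides the layer's activations.
    return output + strength * steering_vector

handle = layer.register_forward_hook(steer)
steered_output = model(torch.randn(1, 16))
handle.remove()
print(steered_output)
```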
Potential Problems in The Field of Mechanistic Interpretability
Due to the problems of superposition and polysemanticity, it is very possible for a researcher to think they have found a proper mapping of neurons to a concept, only to find that the mapping falls apart when new test cases are introduced. This, combined with the pressure to report positive safety results, makes it possible that reported successes in the field are sometimes cherry-picked.
One way people are combating this is by creating more quantitative evaluations of interpretability that build in sanity checks and counterexamples to better test the robustness of a proposed mapping.
Another problem in the field is the amount of expertise needed to work effectively on MI relative to the number of people who have that expertise. MI research is quite interdisciplinary, drawing on knowledge and skills from fields such as neuroscience, machine learning, software engineering, and math. Finding people with a sufficient combination of these skills can be challenging. This is a tough problem for the field to have, especially given how time-intensive progress already is due to the scalability issues described above.
There is also the worry that there might be inherent limits to how much we can truly understand a model whose intelligence far surpasses our own. In the same way an ant could never understand the concepts of physics or politics that we have discovered, it is possible that a superintelligence would reason and think in ways we could never comprehend. This might make it difficult to map its neurons to concepts in any useful way.
Even if there are no inherent limits and we are able to effectively interpret any AI's reasoning, there is still the problem of dual use. MI research can certainly help make AI safe, but it could also be used for nefarious purposes. In the same way we could alter parts of an AI to decrease misaligned behavior, a bad actor could do the reverse and exacerbate those behaviors.
So, while MI research is a very appealing way to foster safe AI systems, it's worth keeping in mind both its potential downsides for safety and the current challenges limiting its progress.
Takeaway
As we head into a future where these models take on more and more responsibility and control over society, it makes sense to work toward understanding them better. Mechanistic interpretability is a very exciting field of research for exactly that reason. However, it is still in its infancy.
Current research has given us a better idea of how models think, but we are still far from the complete understanding of AI systems that would be needed to see long-term safety benefits. So, while the road ahead is a challenging one, it is definitely a worthwhile one for our future.