The Universality Hypothesis — Do All AI Models Think The Same?

Quick Intro: My name is Strad and I am a new grad working in tech, wanting to learn and write more about AI safety and how tech will affect our future. I’m trying to challenge myself to write a short article a day to get back into writing. I would love any feedback on the article and any advice on writing in this field!

In order for Large Language Models (LLMs) such as ChatGPT or Claude to produce useful outputs, they need to go through a training process that alters their inner circuitry of neurons. This alteration is what allows them to convert inputs from a user into a useful response.

Within this formation of neurons lie the processes an LLM uses to reason. We currently do not have a good understanding of these processes. As a result, the field of mechanistic interpretability (MI) emerged to develop methods for understanding how LLMs reason. An overview of the field can be found in my article on the topic here.

As progress has been made in the field, an interesting hypothesis has arisen: the “Universality Hypothesis.” The feasibility of MI research becomes much greater if this hypothesis turns out to be true.

What is the Universality Hypothesis?

The Universality Hypothesis predicts that different AI models tend to converge on the same reasoning processes during training. To better understand what that means, some basics of mechanistic interpretability are useful.

The two basic building blocks of MI research are features and circuits. Features are essentially the concepts (e.g. dogs, cars, Paris) stored in the model’s neurons. Circuits are the connections between these features, which represent processes of thinking (e.g. the combination of features representing windows, doors, and wheels connecting to the feature representing cars).
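
To make this concrete, here is a toy sketch (in Python, with made-up numbers and feature names) of one common framing: a feature as a direction in a layer’s activation space, and a circuit as a weighted combination of features feeding into another feature. Nothing below comes from a real model; it is only meant to show the shape of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activation space: a hypothetical model layer with 64 neurons.
d_model = 64

# A "feature" can be modeled as a direction in activation space.
# These directions are random stand-ins, not real learned features.
feature_wheel = rng.normal(size=d_model)
feature_window = rng.normal(size=d_model)
feature_door = rng.normal(size=d_model)

def feature_activation(hidden_state, feature_direction):
    """How strongly a hidden state expresses a feature (dot product)."""
    return float(hidden_state @ feature_direction)

# A "circuit" can be sketched as weights that combine low-level features
# into a higher-level one (here: wheels + windows + doors -> "car").
circuit_weights = {"wheel": 0.5, "window": 0.3, "door": 0.2}

# A made-up hidden state that expresses all three part features.
hidden_state = 0.8 * feature_wheel + 0.6 * feature_window + 0.4 * feature_door

car_score = (
    circuit_weights["wheel"] * feature_activation(hidden_state, feature_wheel)
    + circuit_weights["window"] * feature_activation(hidden_state, feature_window)
    + circuit_weights["door"] * feature_activation(hidden_state, feature_door)
)
print(f"toy 'car' circuit activation: {car_score:.2f}")
```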

The Universality Hypothesis essentially predicts that the same type of features and circuits show up across different models.

“Different” models can mean a lot of things here. Do these concepts generalize across models trained on different datasets? Do they generalize across models of different sizes? Do they generalize across different model architectures? What about models that work with different languages, or even different modalities such as audio and video?

All of these questions try to determine whether there are general principles that models converge on for solving problems, as well as how much these principles hold as the variance between models increases.
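
One way researchers probe these questions in practice is to run the same inputs through two models and quantitatively compare the activations they produce. Below is a minimal sketch using linear CKA (centered kernel alignment), one common representation-similarity metric; the two “models” here are just random toy networks, so the output is purely illustrative.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two activation matrices (n_samples x n_features)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
inputs = rng.normal(size=(500, 32))          # the same stimuli shown to both "models"

# Stand-ins for two independently trained models: random one-layer networks.
W_a, W_b = rng.normal(size=(32, 64)), rng.normal(size=(32, 64))
acts_a = np.tanh(inputs @ W_a)               # activations from model A
acts_b = np.tanh(inputs @ W_b)               # activations from model B

print(f"CKA(model A, model B) = {linear_cka(acts_a, acts_b):.3f}")
print(f"CKA(model A, model A) = {linear_cka(acts_a, acts_a):.3f}")  # sanity check: 1.0
```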

Why is Universality Important?

The truth of the Universality Hypothesis largely shapes the strategy for MI research. If it’s true, then the principles we learn from studying one model would transfer to the study of other models.

It is similar to how universal principles in biology allow us to study a large variety of different species. While species vary, they are all broadly made up of the same components such as cells and organs. Species also share a lot of the same processes such as protein synthesis, neural signaling, etc. This universality allows us to take knowledge learned from studying one species and apply it to the study of another species. The same could be true for AI models.

Of course, the Universality Hypothesis could be wrong. It could be the case that different models develop drastically different concepts and processes for achieving the same goal. If this is the case, then the scope of models we can reliably interpret becomes much smaller, since each new model would require its own theory of reasoning. It’d be like having to create a different theory of biology for every species.

In this scenario, MI researchers would have to be a lot more strategic about which models they choose to study as the time investment for studying each model would be very high. This becomes especially important when thinking about AI safety, which is one of the driving motivations for the pursuit of MI research.

We want to understand how models reason so we can better detect and stop misalignment. If we have to create a whole new theory of interpretability for each model, then determining which models are most capable of large-scale misalignment becomes a much greater priority in MI research.

Evidence for the Universality Hypothesis

Given that universality could be a game changer for MI research, is there any evidence that the hypothesis is true? While there is no definitive answer yet, there is a good deal of promising research suggesting it is at least sometimes true at lower levels.

For example, some of the filters and strategies useful for classifying images (Gabor filters, high-low frequency detectors, curve detectors) have been found to be learned and used by many different architectures.
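
As a rough illustration of how such reuse can be checked, the sketch below constructs a Gabor filter (an oriented stripe/edge detector) and measures how closely a first-layer filter resembles the best-matching Gabor kernel. The “learned” filter here is synthetic; a real check would pull filters out of a trained vision model.

```python
import numpy as np

def gabor_kernel(size=9, theta=0.0, sigma=2.0, lambd=4.0, gamma=0.5, psi=0.0):
    """Build a small Gabor filter kernel (an oriented edge/stripe detector)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t) ** 2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_t / lambd + psi)
    return envelope * carrier

def gabor_similarity(learned_filter, thetas=np.linspace(0, np.pi, 8, endpoint=False)):
    """Best absolute correlation between a learned filter and a bank of Gabor kernels."""
    f = (learned_filter - learned_filter.mean()).ravel()
    f /= np.linalg.norm(f) + 1e-8
    best = 0.0
    for theta in thetas:
        g = gabor_kernel(size=learned_filter.shape[0], theta=theta)
        g = (g - g.mean()).ravel()
        g /= np.linalg.norm(g) + 1e-8
        best = max(best, abs(float(f @ g)))
    return best

# Stand-in for a first-layer conv filter pulled out of a trained vision model.
rng = np.random.default_rng(0)
learned = gabor_kernel(theta=np.pi / 4) + 0.1 * rng.normal(size=(9, 9))
print(f"Gabor similarity of the 'learned' filter: {gabor_similarity(learned):.2f}")
```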

One recent study even found that certain neuron activation patterns appear across image models built for different purposes. For example, features found in a generation model were also found in a classification model.

Even more exciting, the researchers were able to take these neurons in an image generation model and match them to the corresponding neurons in an image classification model, effectively allowing the generation model to “see” an image through the conceptual lens of the classification model. The fact that two models with very different purposes were able to use the same feature suggests the feature’s universal importance to image processing tasks.
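
A simplified way to picture this kind of matching is “stitching”: record both models’ activations on the same images, then fit a linear map from one activation space into the other and see how much it explains. The sketch below uses synthetic activations as stand-ins for the two models; it is not the cited study’s actual method, just the general idea.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, d_gen, d_cls = 1000, 48, 64

# Stand-ins for activations recorded from two different models on the SAME images:
# a "generator" layer and a "classifier" layer, both synthetic but sharing structure.
shared_factors = rng.normal(size=(n_images, 16))   # pretend shared concepts
acts_gen = shared_factors @ rng.normal(size=(16, d_gen)) + 0.1 * rng.normal(size=(n_images, d_gen))
acts_cls = shared_factors @ rng.normal(size=(16, d_cls)) + 0.1 * rng.normal(size=(n_images, d_cls))

# "Stitching": fit a linear map that translates generator activations into
# the classifier's activation space (ordinary least squares).
W, *_ = np.linalg.lstsq(acts_gen, acts_cls, rcond=None)
pred_cls = acts_gen @ W

# How well the translated activations match the classifier's real ones.
r2 = 1 - np.sum((acts_cls - pred_cls) ** 2) / np.sum((acts_cls - acts_cls.mean(0)) ** 2)
print(f"variance explained by the stitch: {r2:.2f}")
```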

One paper makes the argument that even models working with different modalities (audio, video, text) are starting to converge in their reasoning processes. The authors argue that representations of the same concept in models trained on different modalities are growing more similar. For example, the internal representation one model has for an image of a red ball would be similar to another model’s internal representation of the text “red ball.” The paper even goes as far as to say that it’s possible these models are all converging towards a shared representation of reality.
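
Since an image model and a text model live in activation spaces of different sizes, one common way to compare them is indirect: check whether the two models arrange the same set of concepts similarly relative to one another. The sketch below correlates pairwise similarity matrices from a hypothetical image model and text model; the embeddings are synthetic stand-ins, not outputs of real models.

```python
import numpy as np

def pairwise_cosine(embeddings):
    """Pairwise cosine similarity matrix for a set of concept embeddings."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

def representational_alignment(emb_a, emb_b):
    """Correlate how two models arrange the SAME concepts relative to each other."""
    sim_a, sim_b = pairwise_cosine(emb_a), pairwise_cosine(emb_b)
    iu = np.triu_indices_from(sim_a, k=1)            # off-diagonal entries only
    return float(np.corrcoef(sim_a[iu], sim_b[iu])[0, 1])

rng = np.random.default_rng(0)
n_concepts = 50                                       # e.g. "red ball", "dog", "Paris", ...

# Stand-ins for embeddings of those concepts from an image model and a text model.
shared_structure = rng.normal(size=(n_concepts, 8))   # a common "shape" of the concepts
image_embs = shared_structure @ rng.normal(size=(8, 512)) + 0.5 * rng.normal(size=(n_concepts, 512))
text_embs = shared_structure @ rng.normal(size=(8, 768)) + 0.5 * rng.normal(size=(n_concepts, 768))

print(f"cross-modal alignment: {representational_alignment(image_embs, text_embs):.2f}")
```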

Limitations on The Universality Hypothesis

Despite some of the promising results described above, it is still unclear whether the Universality Hypothesis holds. It is likely that the answer will not be a clear yes or no, but rather somewhere in the middle.

For example, one study distinguishes between weak and strong universality. Weak universality is when only high-level processes generalize across multiple models. Strong universality is when both the high-level processes and the low-level processes that make them up generalize.

For example, in image classifiers, a high-level process could be the detection of cars in an image. This process could be built out of many low-level processes such as edge detectors, curve detectors, color gradients, or small-part detectors (wheels, windows, etc.). Weak universality would entail that different models share the high-level process of detecting cars but differ in the combination of low-level processes they use to detect them.
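
In code terms, weak universality might look something like the toy sketch below: two hypothetical models agree on the high-level judgment (“a car is present”) while reaching it through different combinations of low-level detectors. The detectors, weights, and numbers are all made up for illustration.

```python
# Two hypothetical models that both implement the high-level process
# "detect a car", built from different low-level detectors.

def model_a_car_score(cues):
    # Model A leans on part detectors: wheels and windows.
    return 0.6 * cues["wheels"] + 0.4 * cues["windows"]

def model_b_car_score(cues):
    # Model B leans on shape cues: curves (wheel arches) and edges (body panels).
    return 0.5 * cues["curves"] + 0.5 * cues["edges"]

# Made-up low-level detector outputs for one image.
cues = {"wheels": 0.9, "windows": 0.8, "curves": 0.85, "edges": 0.9}

# Weak universality: both models agree a car is present...
print(model_a_car_score(cues), model_b_car_score(cues))
# ...but the circuits behind that agreement differ, so strong universality fails.
```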

The study tested for weak vs. strong universality on various architectures trained to solve problems in mathematical representation theory. While weak universality could be seen across the architectures (they used the same general algorithm to solve the problems), strong universality could not (each implemented that algorithm with a different combination of lower-level methods).

More broadly, limitations in MI research itself can hold back progress on determining the truth of the Universality Hypothesis. Understanding how models reason is necessary in order to compare them and see whether or not they share reasoning processes. Since we don’t have the capabilities to fully understand the inner workings of AI models yet, we lack the proper prerequisites to fully test the Universality Hypothesis.

Takeaway

While the final verdict on the Universality Hypothesis is still out, it’s clear that at least some aspects of model reasoning seem to generalize across different models. This gives hope that we could reap some of the benefits universality would bring to the field of mechanistic interpretability. In the meantime, we will have to keep working to better understand these models, and note any evidence of universality when we see it.
