The Causal Inner Product: How LLMs Turn Concepts Into Directions (Part 2)

Link post

Quick Intro: My name is Strad and I am a new grad working in tech who wants to learn and write more about AI safety and how technology will affect our future. I’m trying to challenge myself to write a short article a day to get back into writing. I would love any feedback on the article and any advice on writing in this field!

In order to understand how an LLM thinks, we need to have a grasp of the “structure” they use to encode the meaning of important concepts. One hypothesis that gets at a possible understanding of this structure is called the “Linear Representation Hypothesis.” This article is the second in a two-part series giving a high-level overview of a foundational paper on the LRH called “The Linear Representation Hypothesis and the Geometry of Large Language Models.”

In part one, we discussed the 3 forms of the LRH, the experiments done in the paper to demonstrate each form, and the fact that the paper’s researchers were able to mathematically unify the 3 forms. We also went over the embedding and unembedding spaces and how they help LLMs use linear representations to produce proper outputs based on their inputs.

This article goes over the problem alluded to at the end of part one, along with the solution the researchers came up with to solve it: the problem of choosing an appropriate inner product for comparing vectors in the embedding space with vectors in the unembedding space. As a reminder, an inner product is a calculation between two vectors that measures how aligned they are.

In the experiment testing for the orthogonality of unrelated concepts, the researchers tested the Euclidean dot product and found it worked well for the LLaMA-2 model. However, this orthogonality broke down when tested on the Gemma 2B model. This suggests that the Euclidean dot product, at least in the case of Gemma 2B, does not accurately assess the alignment between an embedding vector and an unembedding vector.
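As a toy sketch of the alignment check described above (the vectors and dimensions here are made up for illustration; real model vectors have thousands of dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding vector (the model's internal state) and a
# hypothetical unembedding vector (an output direction for some token).
embedding = rng.normal(size=4)
unembedding = rng.normal(size=4)

# The Euclidean dot product as an alignment score: large positive values
# mean the vectors point in similar directions; values near zero mean
# they are close to orthogonal.
alignment = float(np.dot(embedding, unembedding))
print(alignment)
```

The orthogonality experiments amount to computing scores like this between vectors for unrelated concepts and checking whether they come out near zero.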

Why the Euclidean Dot Product Doesn’t Work

The problem with using the Euclidean dot product lies in how the coordinate systems for the embedding and unembedding spaces are structured. These two spaces have a property that allows their coordinate systems to undergo invertible linear transformations, such as stretching or rotating the axes (and, as a result, all the direction vectors inside them), without changing the actual outputs that the spaces produce.

This is because the meaning of the concepts in these spaces relies on their relative positions to one another. So, if the whole space is transformed in the same way, then all of the relative positionings between direction vectors stay consistent, allowing their representation of concepts, and the resulting model outputs, to remain unchanged.
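A small sketch can make this concrete. If every embedding vector is transformed by some invertible matrix A and every unembedding vector by the inverse transpose of A, the dot products between the two spaces (which determine the model's output logits) are untouched, even though dot products within a space change. The vectors below are random stand-ins, not real model vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# Toy embedding vector and two toy unembedding vectors.
h = rng.normal(size=d)
lam1 = rng.normal(size=d)
lam2 = rng.normal(size=d)

# An arbitrary invertible linear transformation of the space.
A = rng.normal(size=(d, d))

# Transform embeddings by A and unembeddings by A^{-T}.
h_new = A @ h
lam1_new = np.linalg.inv(A).T @ lam1
lam2_new = np.linalg.inv(A).T @ lam2

# The cross-space dot products (the logits) are preserved exactly ...
print(np.allclose(lam1 @ h, lam1_new @ h_new))  # True

# ... but Euclidean dot products *within* a space are not, so Euclidean
# orthogonality between two unembedding vectors is not a stable property.
print(np.isclose(lam1 @ lam2, lam1_new @ lam2_new))  # typically False
```

This is why a Euclidean dot product between two concept directions can look orthogonal in one model's learned coordinates and not in another's, as seen with LLaMA-2 versus Gemma 2B.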

The other important point is that the coordinate systems of the embedding and unembedding spaces don’t have to be the same, and likely are not. This is what makes the Euclidean dot product fail as an appropriate measure of alignment. For the Euclidean dot product to work, the two vectors being compared have to be measured with respect to the same coordinate system.

Using the Euclidean dot product on vectors from different coordinate systems is like trying to calculate the difference in feet between an object measured in feet and another measured in centimeters. The units are mismatched, so the raw calculation is meaningless.

Causal Inner Product — A Better Alternative

The problem with the Euclidean dot product is that it doesn’t properly translate between two different coordinate systems. If the researchers could create an inner product that behaves as if the vectors were expressed in a shared coordinate system, then the alignment between them could be properly calculated. However, translating between the coordinate systems of the embedding and unembedding space is difficult since the coordinate systems themselves are unknown.

The researchers, though, came up with an ingenious way to get around this. By imposing rules on the relative positions of certain vectors within a space, you can restrict the set of coordinate transformations that are valid for that space. If you create enough of these rules, the number of possible coordinate transformations that satisfy them shrinks dramatically.

So without knowing the actual coordinate systems for the embedding and unembedding spaces, the researchers were able to apply rules to the vectors within each space, forcing them to behave as if they were expressed in aligned coordinate systems. It’s important to note that in the actual calculations, no new coordinate system is ever constructed. The inner product effectively “stretches” the space so that calculations behave as if they were done in this aligned coordinate system.
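The "stretching" idea can be seen in miniature. Any positive-definite matrix M defines a valid inner product ⟨a, b⟩ = aᵀMb, and factoring M = LᵀL shows this is identical to an ordinary dot product after both vectors are stretched by L; the matrix does the work, and no new coordinate system is ever explicitly built. The matrices here are random placeholders, not the one from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

a = rng.normal(size=d)
b = rng.normal(size=d)

# Build an arbitrary positive-definite matrix M = L^T L.
L = rng.normal(size=(d, d))
M = L.T @ L

# The inner product a^T M b equals the plain dot product of the
# "stretched" vectors La and Lb -- the change of coordinates is
# implicit in M, never performed as a separate step.
lhs = a @ M @ b
rhs = (L @ a) @ (L @ b)
print(np.isclose(lhs, rhs))  # True
```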

So what were the rules the researchers came up with? It turns out that requiring causally separable concepts to be orthogonal provides enough restrictions on the possible coordinate transformations that the embedding and unembedding vectors align under one transformation. Concepts are considered causally separable if one concept can be changed without affecting the other. For example, changing the language of a word doesn’t change whether the word relates to a man or a woman, so language and gender would be causally separable concepts in this case.

How to Create the Causal Inner Product

The task of finding all these causally separable concepts manually would take a lot of time and effort. However, the researchers were able to find a workaround using a statistical tool that highly correlates with the separability of two concepts.

The researchers observed that if two directions in the unembedding space are statistically independent, meaning that changes along one direction do not predict changes along the other, then these directions can be treated as causally separable. This statistical independence could be measured using a covariance matrix of the unembedding vectors, since low covariance between two directions suggests a higher chance of independence between the corresponding concepts.

Inverting this covariance matrix yields an inner product that naturally makes statistically independent directions orthogonal. Since independent directions correspond to causally separable concepts, this inner product is in line with what an effective causal inner product should do.
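Here is a minimal toy version of that recipe. Assume some latent coordinates with independent components are observed only through an unknown skewed basis B (both invented for this sketch). Estimating the covariance of the observed vectors and inverting it gives an inner product under which the images of the independent axes come out orthogonal, even when their Euclidean dot product does not:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 100_000

# Latent coordinates: independent components with different scales.
z = rng.normal(size=(n, d)) * np.array([1.0, 2.0, 0.5, 3.0])

# Observed unembedding-like vectors live in a skewed, unknown basis B.
B = rng.normal(size=(d, d))
gamma = z @ B.T

# Estimate the covariance of the observed vectors and invert it:
# this inverse plays the role of the causal inner product matrix.
cov = np.cov(gamma, rowvar=False)
M = np.linalg.inv(cov)

# Directions corresponding to two independent latent axes ...
u1 = B[:, 0]
u2 = B[:, 1]

# ... are (up to sampling error) orthogonal under this inner product,
# while their Euclidean dot product is generally far from zero.
print(u1 @ M @ u2)  # close to 0
print(u1 @ u2)
```

This mirrors, in spirit, how low covariance between directions is turned into orthogonality under the constructed inner product; the paper's actual construction operates on a model's real unembedding vectors rather than synthetic data.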

Does the Causal Inner Product Work?

The orthogonality experiment that the Euclidean dot product failed was rerun using the causal inner product. In this case, the orthogonality between causally separable concepts held for both the LLaMA-2 model and the Gemma 2B model. This suggests that the causal inner product works as a better alignment measure between the embedding and unembedding spaces than the Euclidean dot product.

Also, as referenced in part one, the 3 other experiments that showed strong support for the LRH were done using the causal inner product, which further supports its appropriateness as an alignment measurement tool.

Takeaway

This paper helped unify the many ideas people had about linear representation in LLMs while also providing an effective tool for taking advantage of this property. For these reasons, it stands as a foundational paper in the field of interpretability.

By peering into the hidden geometry of meaning inside LLMs, the paper helped move the field one step closer to fully understanding the inner workings inside AI models.
