Don’t Dismiss Simple Alignment Approaches

When I first started in alignment, there was a belief that you needed to be a genius to make progress. With the rise of interpretability, this expectation has moderated somewhat, but I think it still has influence and often leads people to overcomplicate their approaches.

Many recent alignment breakthroughs have been found by keeping it simple[1]:

Lie Detection:
Contrast-consistent search (better known as discovering latent knowledge[2]) reads the truth out of a model without looking at its outputs, simply by using a linear probe with only two terms in its loss function[3].
Pacchiardi, Owain and others use a logistic regression classifier[4] to predict whether a model was truthful from its answers to further questions. They were able to demonstrate this using three different kinds of questions and the results even generalised across models. This technique seems like it could be useful in practice, even if it doesn’t scale all the way to superintelligence.
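
The two-part objective behind contrast-consistent search (see footnote 3) is small enough to write out. What follows is a minimal sketch based on that description, not the paper's code: the function names `probe` and `ccs_loss` are my own, and a real implementation would normalise the activations and train the probe weights with gradient descent.

```python
import math

def probe(hidden, weights, bias):
    """Linear probe with a sigmoid: maps one hidden state to a probability."""
    logit = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def ccs_loss(p_pos, p_neg):
    """Mean two-term loss over pairs of probe outputs.

    p_pos[i]: probe probability for statement i phrased as true.
    p_neg[i]: probe probability for the same statement phrased as false.
    """
    total = 0.0
    for pp, pn in zip(p_pos, p_neg):
        consistency = (pp - (1.0 - pn)) ** 2  # pushes pp + pn toward 1
        confidence = min(pp, pn) ** 2         # penalises hovering near 0.5
        total += consistency + confidence
    return total / len(p_pos)
```

The second term is what rules out the degenerate answer: a probe that outputs 0.5 for everything is perfectly consistent but still pays a confidence penalty of 0.25 per pair.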

Superposition: Dictionary learning makes substantial progress on superposition by simply using a multi-layer perceptron with one hidden layer as a sparse auto-encoder, with an L2 reconstruction loss and an L1 loss on the hidden layer activations. Anthropic actually chose to avoid more sophisticated methods, as they were worried they might recover features that the model doesn’t actually utilise.
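
To show how little machinery this involves, here is a rough sketch of the sparse auto-encoder loss for a single activation vector. The names (`sae_loss`, `l1_coeff`, the weight arguments) are illustrative assumptions rather than Anthropic's actual code, and I assume a ReLU hidden layer; real training would batch this and optimise the weights.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def sae_loss(x, w_enc, b_enc, w_dec, b_dec, l1_coeff):
    """L2 reconstruction loss plus an L1 penalty on the hidden activations."""
    pre_activation = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    hidden = [max(0.0, a) for a in pre_activation]  # ReLU (assumed)
    recon = [a + b for a, b in zip(matvec(w_dec, hidden), b_dec)]
    l2 = sum((r - xi) ** 2 for r, xi in zip(recon, x))
    l1 = sum(abs(h) for h in hidden)
    return l2 + l1_coeff * l1, hidden, recon
```

The hidden layer is typically wider than the input, so the L1 term is what forces each input to be explained by only a few active hidden units.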

Model-steering: Activation Engineering: Activation addition allows us to steer models by simply performing a forward pass for the concept we want to activate, saving the activations of a particular layer as a vector and adding them to that layer in future forward passes. See also Turntrout’s prior work on steering a mouse by subtracting a cheese vector[5] and inference-time intervention[6].[7]
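
The recipe is easier to see on a toy model. The two-layer "network" below is entirely made up for illustration (the functions `layer_one`, `layer_two` and `forward` are hypothetical, not from any of the linked work); the point is only the three steps: run the concept input, save one layer's activations, add them back in on later passes.

```python
def layer_one(x):
    """Toy first layer with a ReLU (hypothetical weights)."""
    return [max(0.0, 2.0 * x[0] - x[1]), max(0.0, x[0] + x[1])]

def layer_two(h):
    """Toy readout layer."""
    return h[0] - h[1]

def forward(x, steering_vector=None, coeff=1.0):
    """Forward pass, optionally adding a saved vector to layer one's output."""
    h = layer_one(x)
    if steering_vector is not None:
        h = [hi + coeff * si for hi, si in zip(h, steering_vector)]
    return layer_two(h)

# Steps 1 and 2: forward pass on the concept input, saving the layer activations.
steering_vector = layer_one([1.0, 0.0])

# Step 3: add the saved activations during a later forward pass.
baseline = forward([0.0, 1.0])
steered = forward([0.0, 1.0], steering_vector, coeff=1.0)
```

The cheese-vector variant in footnote 5 differs in that the vector is a difference between two saved activations, and it is subtracted rather than added.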

World model location: Neel Nanda found a linear representation in Othello-GPT. This one was a bit trickier: a previous paper had failed to recover it using linear probes. However, Neel managed to do this by searching for a representation of “my color” and by applying a bunch of other tweaks. See also Wes Gurnee’s work with Max Tegmark, which used linear probes to find linear representations of latitude and longitude, in addition to one representing time[8].

Scalable Interpretability: OpenAI used a language model with few-shot prompting to label neurons based on their activations on some relevant examples.

Understanding Implications[9]: For a long time it appeared as though it would be challenging to prevent an AI from taking our instructions too literally. Essentially what seems to have happened is that researchers decided to just use RL on a generative base model and the problem solved itself. I think it’s quite notable that OpenAI, as far as I can tell, basically one-shotted this with InstructGPT[10]. Reinforcement Learning from Human Feedback has some complexity (proximal policy optimization isn’t simple), but it wasn’t a new algorithm, and even if they hadn’t gotten it right the first time, I’d bet that we were always going to end up there by iterating on a baseline.

I don’t want to dismiss the challenges of such research. Producing a result requires getting a lot of details right. Actually carrying out one of these projects would involve a massive amount of work. Even picking the right question to investigate requires a lot of skill. However, simpler techniques have gone further for these problems than I would have originally expected.

I think it’s worthwhile speculating why reasonably simple solutions have worked for these problems, and checking whether there are any similarities between them.

Firstly, I think it’s worth noting that all of these techniques except for the last two look for linear representations in neural networks. Beren has argued that there is strong evidence of deep learning models being almost linear. In retrospect, it seems intuitive to me that neural networks would store sparse features in linear combinations, rather than with some more complicated form of compression, as this makes it easy to write to and read from these features[11].
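
The footnoted point about near-orthogonal features can be checked numerically. The sketch below draws more random unit vectors than there are dimensions and measures how much they overlap. Using random directions is my own simplifying assumption (trained networks don't pick feature directions at random), and `overlap_stats` is a hypothetical helper.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Sample a direction uniformly at random on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def overlap_stats(num_features, dim, seed=0):
    """Return (mean, max) absolute cosine similarity over all feature pairs."""
    rng = random.Random(seed)
    feats = [random_unit_vector(dim, rng) for _ in range(num_features)]
    overlaps = []
    for i in range(num_features):
        for j in range(i + 1, num_features):
            dot = sum(a * b for a, b in zip(feats[i], feats[j]))
            overlaps.append(abs(dot))
    return sum(overlaps) / len(overlaps), max(overlaps)

# 160 feature directions in only 128 dimensions: they cannot all be orthogonal.
mean_overlap, max_overlap = overlap_stats(num_features=160, dim=128)
```

With exactly orthogonal features the count is capped at the number of dimensions, whereas here the typical overlap between two directions stays small even past that cap, which is what makes a linear read-out of each feature only mildly noisy.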

In contrast, the last two techniques listed rely on current AI models being very powerful and quite steerable. I admit that two examples aren’t very many, but I expect we’ll see more in this vein soon.

Looking more broadly, I think it’s worth noting that all of these results are in empirical alignment research. I don’t think this is a coincidence. Whilst there may be simple solutions to some of the problems of agent foundations, if they were easy to find, they probably would have been found before. In contrast, the last two techniques listed above directly rely on a certain level of capabilities, so they could only have been found recently. It’s arguable that linear representation techniques indirectly rely upon a certain level of capabilities: insofar as it is beneficial for a model to have a linear representation of particular features, the more powerful our optimization techniques, the more likely our trained models are to produce such a representation.

An Aside on Contrast-Consistent Search:

When I first encountered this technique, I was awed by it and I felt like I had no conception of how anyone would ever think of something like this. While I still acknowledge the brilliance of this work, it no longer feels completely unthinkable.

I assume Collin already had some intuitions that neural networks were vastly more linear than you might expect. If you buy this and you’re trying to read the truth directly out of the network’s activations, it’s then entirely natural to look for a linear direction.

Linear probes were already an existing technique, so if you knew of them and were making a shortlist of techniques to investigate, they would make it onto that list.

Suppose you end up in the position of asking “how can I use a linear probe to find a direction corresponding to truth in a neural network?” It’d then be natural to ask what some of the properties of truth are, which would then lead directly to the consistency property Collin leveraged.

So, smart, brilliant even, but not the kind of thing that is completely unthinkable.

And even just knowing to look for linearities would be a massive head start.

Final thoughts:

I suspect that all of these approaches are still very far away from where we need to be. I consider them substantial advances nonetheless for two key reasons: having a baseline helps people choose an appropriate level of ambition, and also makes it easier to empirically discover the key issues in solving a problem more fully.

I’m hoping that increasing awareness that relatively simple techniques have been successful spurs more research progress, by giving people the confidence to actually try to make progress and increasing people’s motivation to explore these kinds of approaches for longer before they toss in the towel.

It may very well be that we’re just in a particular stage of the field where there’s all this low-hanging fruit to pick. Perhaps we’ll soon end up in a situation where we need people to look for brilliant conceptual breakthroughs in order to make further progress. However, I think it would be a mistake to just assume people have already looked very hard for simple paths forward.

  1. ^

    I thought about also including this paper on representation engineering which describes, among other things, Linear Artificial Tomography. However, I haven’t had time to finish skimming it yet.

  2. ^

    Technically, “Discovering Latent Knowledge” is the name of the paper and of a problem. However, if I go to people and say “Have you heard of Contrast-Consistent Search?”, the answer is typically no, but when I ask about “Discovering Latent Knowledge”, a lot of the time it turns out that they’re actually familiar with the paper.

  3. ^

    One component pushes the probability of a statement being true and a statement being false toward adding up to one. A second component pushes the model away from producing probabilities close to 0.5.

  4. ^

    While logistic regression classifiers are non-linear in their inputs, they assume that the log-odds are linear in the inputs.

  5. ^

    They generate two observations that are the same except for one possessing the cheese.

  6. ^

    They train a linear probe with sigmoid activation to discover the truthful direction. Then they intervene by adding a multiple of this to the activation of the layer, scaling by a constant and the standard deviation.

  7. ^

    For some useful potential applications, see Nina’s work on using activation addition to reduce sycophancy and to improve honesty and red-teaming.

  8. ^

    This paper seems like it was mostly intended for outreach purposes. Unfortunately, the framing was a tiny bit off, in that some people felt that this paper/the publicity around it was overclaiming. That said, if this had been handled a bit better, it could have led to significant improvement in the debate. One point I want to emphasize is that you can spend forever debating whether current neural networks have world models and what a world model even means. Or you can just go and make direct empirical progress, and maybe that doesn’t convince people, but it will at least push them to clarify what they’re looking for.

  9. ^

    You may object that RLHF is mostly capabilities. I also tend to think about it as being primarily a capabilities advance, but it is an advance in alignment as well.

  10. ^

    I’ve heard that the innovation in ChatGPT was more about the user interface and finetuning for that than producing a more powerful model.

  11. ^

    While orthogonal features are the simplest, you can only include one feature per dimension, whilst you can fit many more near orthogonal features.

Crossposted from LessWrong (134 points, 9 comments)