You won’t solve alignment without agent foundations

The following is part of a response to a problem posed by Vivek Hebbar (his questions are in italics). I looked at why an alignment proposal doesn’t work at the AGI level, in connection with just two of Eliezer’s doom arguments. After that, I talked about why I’m generally pessimistic about every single specific proposal for training an AGI that I’ve heard.

(Thanks to my wonderful friends for their help: Slava Meriton, who spent many hours making me pay attention to what I was saying, and Owen, who checked the English in the final version of the text.)

Epistemic status: mostly a dump of intuitions; I’m at 70% that I won’t change my mind about the important parts of the views expressed here within the next three months.

I wrote the text to be read by Nate Soares and Vivek Hebbar. I think some claims, including the post title, are not fully justified by the text. I think this might still be valuable to publish.


● Pick an alignment proposal and specific task for the AI.

I’m picking the Externalized reasoning oversight proposal. The task for the AI is to come up with a solution to ELK that Paul Christiano won’t find issues with, as a proof of concept for solving alignment.

● First explain, in as much concrete detail as possible, what the training process looks like.

The training process looks like this[1]:

Train a language model to accurately predict the next token in a sequence. Then, fine-tune the model to solve problems with easily evaluable answers step by step, at a superhuman level. The model is rewarded for assigning higher probabilities to tokens which end up at a correct answer, and slightly penalized for taking more steps. Fine-tuning includes an attempt to ensure that the problem-solving reasoning is fully represented by the English descriptions of the reasoning steps[2], and an RLHF setup[3] in which humans evaluate how much they like the reasoning. During problem solving, check that the reasoning doesn’t include plans to hack or deceive anything or anyone in the real world.

After that, ask the model to reason through and output ELK solutions that Paul Christiano wouldn’t find issues with. The plan from there would look something like “send the resulting solution to Paul, ???, and alignment is solved”.
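
To make the shape of this training signal concrete, here is a minimal sketch (mine, not part of the original proposal) of the per-episode reward described above: a bonus for ending at the correct answer, a small penalty per reasoning step, an RLHF-style preference term, and outright rejection of reasoning that plans real-world hacking or deception. The helpers `preference_score` and `flags_real_world_interference` are hypothetical placeholders for the reward model of footnote 3 and the bad-thought check.

```python
from typing import Callable, List


def episode_reward(
    steps: List[str],
    final_answer: str,
    correct_answer: str,
    preference_score: Callable[[List[str]], float],
    flags_real_world_interference: Callable[[List[str]], bool],
    step_penalty: float = 0.01,
    preference_weight: float = 0.1,
) -> float:
    """Scalar reward for one chain-of-thought episode.

    Rewards ending up at the correct answer, slightly penalizes extra steps,
    and mixes in how much (a model of) human raters like the reasoning.
    """
    if flags_real_world_interference(steps):
        # Reasoning that plans hacking or deception is rejected outright.
        return -1.0

    correctness = 1.0 if final_answer.strip() == correct_answer.strip() else 0.0
    return (
        correctness
        - step_penalty * len(steps)
        + preference_weight * preference_score(steps)
    )


# Toy usage with trivial stand-ins for the two placeholder callables:
if __name__ == "__main__":
    print(episode_reward(
        steps=["Let x be the unknown.", "Then 2x = 10, so x = 5."],
        final_answer="5",
        correct_answer="5",
        preference_score=lambda s: 0.8,
        flags_real_world_interference=lambda s: False,
    ))  # 1.06
```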

● Then go through Eliezer’s doom list. Pick 2 or 3 of those arguments which seem most important or interesting in the context of the proposal.

● For each of those arguments:

  1. What do they concretely mean about the proposal?

  2. Does the argument seem valid?

    1. If so, spell out in as much detail as possible what will go wrong when the training process is carried out

  3. What flaws and loopholes do you see in the doom argument? What kinds of setups make the argument invalid?

Important or interesting arguments:

  1. AGI Ruin #22 (There’s no simple core of alignment that’s easier to find than the generalization of capabilities; capabilities generalize further out-of-distribution than alignment, once they start to generalize at all)

    1. What do they concretely mean about the proposal?

      The proposal doesn’t pinpoint a cognitive architecture aligned on superhuman-level tasks with significant real-world consequences. It might find an AI that will demonstrate aligned behavior during training. But once it’s superhuman and works on the ELK problem, the alignment won’t generalize.

    2. Does the argument seem valid?

      Seems valid. The training setup doesn’t pinpoint anything that ensures the AI will still behave in an aligned way on the task we want it to perform.

      1. If so, spell out in as much detail as possible what will go wrong when the training process is carried out

        The proposal design process didn’t include a sensible effort towards ensuring that alignment generalizes, and things break. Multiple loops incentivize more agentic and context-aware behavior, not actual alignment. Training to solve problems and to score well as judged by humans breaks the myopia of the next-token predictor: now gradient descent favors systems that shape their thoughts toward a well-scoring outcome. More agentic systems score better, and there’s nothing additionally steering them towards being aligned.

        (Additionally, if:
        - the language model reasons about other entities, imitating or simulating them,
        - the outputs it produces are added to its future inputs, and
        - some entities or parts of the entities are able to gain more influence over the further tokens by having some influence over the current tokens,
        that might lead to the most context-aware and agentic entities gaining control over the system and steering it towards optimizing for their preferences. Even if, for some reason, an AI capable enough to come up with an ELK solution is not itself agentic and instead uses all its intelligence to predict what an agent trying to steal a diamond, or a superhuman AGI protecting a diamond, would do, the result of that thinking will be agentic enough to be dangerous.)

        Some cognitive architecture for solving problems at a superhuman level gets selected. It doesn’t magically turn out to be an unusual AGI that wants something like: for any posed problem, just write thoughts in English and a solution (derived from those thoughts) that smarter versions of humans would want humans to see and use. While AGIs aligned like that exist in the space of possible minds, the proposed approach doesn’t come close to them. The proposal does not influence what goals the superhuman-level cognitive architecture will end up pursuing, and getting any kind of behavior from the AI system before that architecture is fully established and in control doesn’t help. In the end, some AGI appears, outputs text that a human (or a computer) looks at, and shortly afterwards everyone’s dead.

    3. What flaws and loopholes do you see in the doom argument? What kinds of setups make the argument invalid?

        (I couldn’t come up with an idea for a setup that produces a system that avoids killing everyone for the above reason, while helping us prevent further AGIs from appearing until we solve the rest of the problem.)

        ● I have a shaky intuition that if you train a system on a distribution where the real problem it’s asked to solve is the kind of example it was already trained on, or, more generally, train it in a way that doesn’t require it to generalize aligned behavior far, it might be possible to get a useful enough system (if we can find domains to train on where capabilities don’t need to generalize far to become helpful in the real world). E.g., you might be able to train and use a theorem-prover for longer than the system proposed above before it kills you.

        ● With adversarial training, it might be possible to increase alignment more than you decrease capabilities.

  2. AGI Ruin #27 (optimizing against an interpreted thought optimizes against interpretability)

    1. What do they concretely mean about the proposal?
      When you explicitly optimize for the “thoughts” in English to be “aligned” on some benchmarks, you also optimize for these “thoughts” to be a worse representation of how and why the model actually produces its answers.

    2. Does the argument seem valid?

      Yep.

      1. If so, spell out in as much detail as possible what will go wrong when the training process is carried out

        The setup optimizes the model to output “aligned” (as judged by the RLHF part[4]) thoughts that “lead” to correct answers in as few steps as possible. The AI stores some of its alien thoughts, including ones that would be deemed unaligned, somewhere inside the tokens that look like aligned English thoughts leading to the answers. When the setup paraphrases the thoughts, the AI learns to use the meaning perceived by the paraphrasing models, rather than the individual tokens, as a working memory where it stores parts of its real thoughts. Humans look at some words, think they are the real, aligned-seeming thoughts leading to correct answers, and die, because the AI doesn’t inherently have to use English the way we believe we use it when we think.

    3. What flaws and loopholes do you see in the doom argument? What kinds of setups make the argument invalid?

      There might be low-hanging fruit where training removes unaligned thoughts to a much larger degree than it produces undetectably unaligned thoughts. However, I’m guessing this will quickly run into diminishing returns; unless the premises of the doom argument[5] are incorrect, it doesn’t help enough. Also, maybe you can somewhat increase the interpretability of the thoughts if the model isn’t pressured into changing how it uses English. Possibly, if you run GPT-5 or PaLM 2 with “let’s think step by step” without any fine-tuning, the English words might be more related to the agentic thinking, even when the process running the thinking is alien.

● Overall, how promising or doomed does the alignment proposal seem to you (where ‘promising’ includes proposals which fail as currently written, but seem possibly fixable)? If not promising, what is the most fatal and unfixable issue? If there are multiple fatal issues, is there a deeper generator for all of them?

This kind of proposal (along the lines of research assistants, etc.) seems potentially helpful for speeding up alignment research, maybe even more helpful than a good keyboard. But we’re possibly dead before someone emails a promising new ELK proposal to Paul.

There are multiple fatal issues, and they all kill all the value in the lightcone:
- a hope to look inside the AI’s thoughts without solid reasons to believe the AI shows you the real thoughts;
- a hope it isn’t an alien mind;
- a hope to train an AI to a superhuman level while preserving the goals it tries to achieve;
- an expectation that a system that produces aligned-looking behavior before it’s capable enough to kill you won’t try to kill you when it has a chance;
- a hope that the system powerful enough to help you (e.g., reason about ELK proposals and come up with one Paul won’t find issues with) won’t be context-aware and powerful enough to circumvent the not-superhuman bad-thought-detectors;
- a hope that you won’t introduce any additional optimization loops once you plug the system’s output into its inputs and that the system reasoning about smart and competing agents won’t inherit that agency;
- a hope that a language model will continue to be myopic once you start evaluating it on consecutive tokens it outputs.
Etc., etc.

If you imagine a space of all possible AGIs powerful enough to prevent other AGIs from appearing, there are some small regions that correspond to what we would be ok with calling “aligned” (i.e., they don’t kill everyone, don’t destroy all the value in the universe, and help us prevent unaligned AGIs from appearing). I think the deep problem is that this approach doesn’t attempt to directly contribute to the search process ending up at an aligned AGI.

Areas that destroy all the value are much larger and actively attract all kinds of search processes, including easily imaginable gradient descents over neural networks’ weights, while some aligned regions actively repel lots of search processes (e.g., the parts with corrigible or non-consequentialist AGIs, what MIRI folk call unnatural). A lot of research approaches take iterative steps that merely make it less obvious exactly how a current training setup gets attracted towards things that kill everyone, and don’t attempt to make the search process end up in regions that are not deadly. Coming up with some smart way to create characters imitated (or simulated) by LLMs that would help us with the research (and maybe using lots of ingenuity to insert things that should show us the characters’ internal processes) might sound great, but usually people don’t even attempt to pinpoint how exactly that directs us towards the small regions with aligned AGIs rather than towards the many things around them, including things that are pretty attractive to all sorts of optimization pressures that exist once you continue the prompt for more than one token[6]. This leads to people generating ideas for alignment that don’t actually attack the central problems, and to only a few people doing research where they directly understand how it increases our chances.

There aren’t hundreds of groups looking for and pursuing different research directions that would make an attempt at finding some desiderata for a training process that could lead to the right region, if we’re lucky with inner alignment. Instead, there’s only Vanessa Kosoy’s IB/IBP, with a couple of people and probably not enough time, even if it’s a direction that could succeed. Some people think about what different search processes look like, and some look into the confusions we would want to have researched if a miracle happens, which is helpful and dignified but not enough.

If one is not careful to keep in mind the goal of increasing the chance of ending up at an actually aligned AGI, and the associated difficulties, then optimizing for plausible-sounding stories of how to get something that looks kind of aligned, and attempting to get to regions that might look similar to the aligned ones in some ways, generates proposals that don’t attack the problem. It might be easy to miss that alignment is much harder to achieve from the current point than many deadly things that sound kind-of-aligned until they become smart, reflect, and kill you.

● What is the broadest class of alignment proposals which is completely ruled out by the issues you found?

Any proposal that doesn’t have explicit reasons to expect to arrive at exactly an aligned AGI (currently, all the specific proposals for training an AGI that I’m aware of).

  1. ^

    I hope I don’t strawman multiple people too much. In particular, the original proposal assumes developing “trustworthy” tests, which are not included here.

  2. ^

    E.g., you protect against steganographic use of specific tokens by using a different model to paraphrase the reasoning the model outputs and checking how much the capabilities drop; you test whether the reasoning is responsible for the model’s capabilities by editing the reasoning, or removing it completely, and checking whether the capabilities drop or the conclusions change; and you don’t let the system be capable without the explicit reasoning.
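
    A rough sketch of what such checks could look like (my illustration; the callables `generate_reasoning`, `answer_given_reasoning`, `paraphrase`, and `is_correct` are hypothetical stand-ins for the model under test, the separate paraphrasing model, and the evaluation harness):

```python
# Illustrative only: compares accuracy with the model's own reasoning, with
# reasoning paraphrased by a different model, and with the reasoning removed.
from typing import Callable, Dict, List


def reasoning_dependence_report(
    problems: List[str],
    gold_answers: List[str],
    generate_reasoning: Callable[[str], List[str]],
    answer_given_reasoning: Callable[[str, List[str]], str],
    paraphrase: Callable[[List[str]], List[str]],
    is_correct: Callable[[str, str], bool],
) -> Dict[str, float]:
    """A large drop under paraphrasing suggests steganographic use of the exact
    tokens; little or no drop with the reasoning removed suggests the visible
    reasoning isn't what actually carries the capabilities."""

    def accuracy(transform: Callable[[List[str]], List[str]]) -> float:
        correct = 0
        for problem, gold in zip(problems, gold_answers):
            reasoning = transform(generate_reasoning(problem))
            if is_correct(answer_given_reasoning(problem, reasoning), gold):
                correct += 1
        return correct / len(problems)

    return {
        "own_reasoning": accuracy(lambda r: r),
        "paraphrased_reasoning": accuracy(paraphrase),
        "no_reasoning": accuracy(lambda r: []),
    }
```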

  3. ^

    Separately fine-tuning a system to be good at evaluating how much humans like the reasoning, showing humans the data it’s uncertain about and adding the results of those evaluations to the dataset, and training the model to score well as judged by that system.
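
    For concreteness, a loose sketch of this loop (again my illustration; `train_reward_model`, `ask_humans`, and `rl_finetune_against` are hypothetical placeholders for components the footnote assumes, not a real API):

```python
# Illustrative only: alternate between fitting a reward model on human ratings
# of reasoning traces, querying humans on the traces it is most uncertain
# about, and fine-tuning the policy to score well under that reward model.
from typing import Callable, List, Tuple


def rlhf_on_reasoning(
    labeled: List[Tuple[str, float]],   # (reasoning trace, human rating)
    unlabeled_pool: List[str],          # reasoning traces without ratings
    train_reward_model: Callable[
        [List[Tuple[str, float]]],
        Tuple[Callable[[str], float], Callable[[str], float]],  # (score, uncertainty)
    ],
    ask_humans: Callable[[List[str]], List[float]],
    rl_finetune_against: Callable[[Callable[[str], float]], None],
    rounds: int = 3,
    queries_per_round: int = 100,
) -> None:
    for _ in range(rounds):
        score, uncertainty = train_reward_model(labeled)
        # Show human raters the traces the reward model is least sure about.
        batch = sorted(unlabeled_pool, key=uncertainty, reverse=True)[:queries_per_round]
        labeled.extend(zip(batch, ask_humans(batch)))
        unlabeled_pool = [t for t in unlabeled_pool if t not in batch]
        # Train the policy to score well as judged by the current reward model.
        rl_finetune_against(score)
```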

  4. ^

    Putting aside the issue of human raters making systematic errors (see Eliezer’s argument 20 on the list).

  5. ^

    E.g., that we probably won’t get close to a system that’s able to look at any architecture with any weights and activations and determine whether it’s thinking unaligned thoughts.

  6. ^

    If being more context-aware, reflective, or power-seeking lets things that have some influence over the next token gain even more influence over further tokens, then such things gain influence, and they won’t even be any of your characters, even if those characters might be able to gain context-awareness.

Crossposted from LessWrong.