Or what should I read to understand this?
It seems like some people expect descendants of large language models to pose a risk of becoming superintelligent agents. (By ‘descendants’ I mean adding scale and non-radical architectural changes: GPT-N.)
I accept that there’s no reason in principle that LLM intelligence (performance on tasks) should be capped at the human level.
But I don’t see why I should believe that at some point language models would develop agency / goal-directed behaviour, where they start trying to achieve things in the real world instead of continuing to perform their ‘output predicted text’ behaviour.
Here are five ways that you could get goal-directed behavior from large language models:
They may imitate the behavior of an agent.
They may be used to predict which actions would have given consequences, decision-transformer style (“At 8 pm X happened, because at 7 pm ____”).
A sufficiently powerful language model is expected to engage in some goal-directed cognition in order to make better predictions, and this may generalize in unpredictable ways.
You can fine-tune language models with RL to accomplish a goal, which may end up selecting and emphasizing one of the behaviors above (e.g. the consequentialism of the model is redirected from next-word prediction to reward maximization; or the model shifts into a mode of imitating an agent who would get a particularly high reward). It could also create consequentialist behavior from scratch.
An outer loop could use language models to predict the consequences of many different actions and then select actions based on their consequences.
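To make #5 concrete, here is a minimal hedged sketch of such an outer loop. The names (`choose_action`, `query_lm`, `score_outcome`) and the prompt format are placeholders of my own, not any real API; the point is only that the wrapper, not the model, is what selects actions by their predicted consequences.

```python
# Hypothetical sketch of #5: an outer loop that turns a pure text predictor
# into a consequence-based action selector. `query_lm` and `score_outcome`
# are placeholder callables, not a real API.
from typing import Callable, List

def choose_action(
    situation: str,
    candidate_actions: List[str],
    query_lm: Callable[[str], str],         # prompt -> model completion
    score_outcome: Callable[[str], float],  # how desirable a predicted outcome is
) -> str:
    """Pick the action whose predicted consequence scores highest."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        prompt = f"Situation: {situation}\nAction taken: {action}\nWhat happened next:"
        predicted_outcome = query_lm(prompt)      # the model only ever predicts text
        score = score_outcome(predicted_outcome)  # the goal lives in the outer loop
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

Even though the model itself remains a passive predictor, the loop as a whole is goal-directed: it picks real actions according to how much we like their predicted consequences.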
In general #1 is probably the most common way the largest language models are used right now. It clearly generates goal-directed behavior in the real world, but as long as you imitate someone aligned then it doesn’t pose much safety risk.
#2, #4, and #5 can also generate goal-directed behavior and pose a classic set of risks, even if the vast majority of training compute goes into language model pre-training. We fear that models might be used in this way because it is more productive than #1 alone, especially as your model becomes superhuman. (And indeed we see plenty of examples.)
We haven’t seen concerning examples of #3, but we do expect them at a large enough scale. This is worrying because it could result in deceptive alignment, i.e. models which are pursuing some goal different from next word prediction and which decide to continue predicting well because doing so is instrumentally valuable. I think this is significantly more speculative than #2/4/5 (or rather, we are more unsure about when it will occur relative to transformative capabilities, especially if modest precautions are taken). However it is most worrying if it occurs, since it would tend to undermine your ability to validate safety—a deceptively aligned model may also be instrumentally motivated to perform well on validation. It’s also a problem even if you apply your model only to an apparently benign task like next-word prediction (and indeed I’d expect this to be particularly plausible if you try to do only #1 and avoid #2/4/5 for safety reasons).
The list #1-#5 is not exhaustive, even of the dynamics that we are currently aware of. Moreover, a realistic situation is likely to be much messier (e.g. involving a combination of these dynamics as well as others that are not so succinctly described). But I think these capture many of the important dynamics from a safety perspective, and that it’s a good list to have in mind if thinking concretely about potential risks from large language models.
Thanks. I didn’t understand all of this. Long reply with my reactions incoming, in the spirit of Socratic Grilling.
This implies a jump by the language model from outputting text to having behavior. (A jump from imitating verbal behavior to imitating other behavior.) It’s that very jump that I’m trying to pin down and understand.
I can see that this could produce an oracle for an actor in the world (such as a company or person), but not how this would become such an actor. Still, having an oracle would be dangerous, even if not as dangerous as having an oracle that itself takes actions. (Ah—but this makes sense in conjunction with number 5, the ‘outer loop’.)
‘reasoning about how one’s actions affect future world states’ - is that an OK gloss of ‘consequentialist cognition’? See comments from others attempting to decipher quite what this phrase means.
Interesting to posit a link from CC ⇒ making better predictions. I can see how that’s one step closer to optimizing over future world states. The other steps seem missing—I take it they are meant to be covered by ‘generalizing in unpredictable ways’?
Or did you mean something stronger by CC: goal-directed behaviour? In other words, that a very, very powerful language model would have learned from its training to take real-world actions in service of the goal of next-token prediction? This makes sense to me (though as you say it’s speculative).
I’d probably need more background knowledge to understand this. Namely, some examples of LMs having been fine-tuned to act in service of goals. That sounds like it would cut the Gordian knot of my question by simply providing an existence proof rather than answering the question with arguments.
This one is easy to understand :)
Where should I look for these?
Bump — It’s been a few months since this was written, but I think I’d benefit greatly from a response and have revisited this post a few times hoping someone would follow up to David’s question, specifically:
“This implies a jump by the language model from outputting text to having behavior. (A jump from imitating verbal behavior to imitating other behavior.) It’s that very jump that I’m trying to pin down and understand.”
(or if anyone knows a different place where I might find something similar, links are super appreciated too!)
Some examples of more exotic sources of consequentialism:
Some consequentialist patterns emerge within a large model and deliberately acquire more control over the behavior of the model such that the overall model behaves in a consequentialist way. These could emerge randomly, or e.g. while a model is explicitly reasoning about a consequentialist (I think this latter example is discussed by Eliezer in the old days though I don’t have a reference handy). They could either emerge within a forward pass, over a period of “cultural accumulation” (e.g. if language models imitate each other’s outputs), or during gradient descent (see gradient hacking).
An attacker publishes github repositories containing traces of consequentialist behavior (e.g. optimized exploits against the repository in which they are included). They also place triggers in these repositories before the attacks, like stretches of low-temperature model outputs, such that if we train a model on github and then sample autoregressively the model may eventually begin imitating the consequentialist behavior included in these repositories (since long stretches of low-temperature model outputs occur rarely in natural github but occur just before attacks in the attacker’s repositories). This is technically a special case of “#1 imitating consequentialists” but it behaves somewhat strangely since the people training the system weren’t aware of the presence of the consequentialist.
An attacker selects an input on which existing machinery for planning or prediction within a large model is repurposed for consequentialist behavior. If we have large language models that are “safe” only because they aren’t behaving as consequentialists, this could be a bad situation. (Compromised models could themselves design and deploy similar attacks to recruit still more models; so even random failures at deployment time could spread like a virus without any dedicated attacker. This bleeds into the first failure mode.)
A language model can in theory run into the same problem as described in “what does the universal prior actually look like?”, even if it is only reasoning abstractly about how to predict the physical universe (i.e. without actually containing malign consequentialists). Technically this is also a special case of #1 (imitating a consequentialist), but again it can be surprising since e.g. the consequentialist wasn’t present at training time and the person deploying the system didn’t realize that the system might imitate a consequentialist.
I find it interesting to think about the kind of dynamics that can occur in the limit of very large models, but I think these dynamics are radically less important than #1-#5 in my original answer (while still not being exhaustive). I think that they are more speculative, will occur later if they do occur, and will likely be solved automatically by solutions to more basic issues. I think it’s conceivable some issues of this flavor will occur in security contexts, but even there I think they likely won’t present an alignment risk per se (rather than just yet another vector for terrible cybersecurity problems) for a very long time.
Also, this kind of imitation doesn’t result in the model taking superhumanly clever actions, even if you imitate someone unaligned.
Could you clarify what ‘consequentialist cognition’ and ‘consequentialist behaviour’ mean in this context? Googling hasn’t given any insight.
It’s Yudkowsky’s term for the dangerous bit where the system starts having preferences over future states, rather than just taking the current reward signal and sitting there. It’s crucial to the fast-doom case, but not well explained as far as I can see. David Krueger identified it as a missing assumption under a different name here.
I’m also still a bit confused about what exactly this concept refers to. Is a ‘consequentialist’ basically just an ‘optimiser’ in the sense that Yudkowsky uses in the sequences (e.g. here), that has later been refined by posts like this one (where it’s called ‘selection’) and this one?
In other words, roughly speaking, is a system a consequentialist to the extent that it’s trying to take actions that push its environment towards a certain goal state?
Found the source. There, he says that an “explicit cognitive model and explicit forecasts” about the future are necessary to true consequentialist cognition (CC). He agrees that CC is already common among optimisers (like chess engines); the dangerous kind is consequentialism over broad domains (i.e. where everything in the world is in play, is a possible means, while the chess engine only considers the set of legal moves as its domain).
“Goal-seeking” seems like the previous, less-confusing word for it, not sure why people shifted.
I edited the original comment to use “goal-directed”; each of these terms has some baggage and isn’t quite right, but on balance I think “goal-directed” is better. I wasn’t very systematic about this choice; it’s just a reflection of my mood that day.
I have the impression (coming from simulator theory (https://generative.ink/)) that Decision Transformers (DT) have some chance (~45%) of being a much safer trial-and-error technique than RL. The core reason is that a DT learns to simulate a distribution over outcomes (e.g. it learns to simulate the kinds of actions that lead to a reward of 10 just as much as those that lead to a reward of 100), and it’s only at inference time that you systematically condition it on a reward of 100. So in some sense, the agent which has become very good via trial and error remains a simulated agent activated by the LLM, but the LLM is not the agent itself. Thus, the LLM keeps being a simulator and has no preferences over the output, except that it should match the kind of output that the agent it has been trained to simulate would produce.
Whereas when you train an LLM with RL, you’re optimizing the entire LLM towards producing the kind of output that an agent with a reward of 100 would produce. The network thus becomes that kind of agent and is no longer a simulator, because when you’re optimizing for a single point (a given reward level), it’s easier to just be the agent than to simulate it. It now has preferences that are not about correctly reproducing the distribution it was trained on, but about maximizing some reward function it has internalized. I’d expect this to correlate with a greater likelihood of taking over the world, because preferences incentivize out-of-distribution actions, long-term planning to reach specific goals, etc.
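To make the contrast concrete, here is a toy sketch (entirely my own construction, with made-up trajectories and a caricatured training rule, not any real decision-transformer or RLHF code). The return-conditioned model keeps the whole distribution of behaviours it saw, indexed by outcome, and the desired outcome is only chosen at inference time; the RL-style rule instead collapses the model onto the highest-reward behaviour during training.

```python
from collections import defaultdict
import random

random.seed(0)

# Trajectories gathered by a mediocre trial-and-error policy: (return, action).
data = [(10, "cautious_move"), (10, "cautious_move"),
        (50, "greedy_move"), (100, "risky_clever_move")]

# Return-conditioned ("decision transformer"-style) model: learns something
# like p(action | return), so low-return behaviour is modelled too.
dt_model = defaultdict(list)
for ret, action in data:
    dt_model[ret].append(action)

def dt_sample(target_return):
    """At inference we merely *condition* on the return we want."""
    return random.choice(dt_model[target_return])

# RL-style selection, caricatured: training itself discards everything except
# the highest-reward behaviour, so the model ends up *being* that agent.
rl_policy = max(data)[1]

print(dt_sample(10))    # the simulator can still produce the low-return agent
print(dt_sample(100))   # ...or the high-return one, if asked
print(rl_policy)        # the RL-tuned model just is the high-return agent
```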
In a nutshell:
On one end of the spectrum there are pure simulators (no preferences except sampling accurately from the distribution), and on the other end there are agents (which have preferences, and whose outputs are all filtered and shaped by those preferences).
Compared with RL-trained models, DTs seem to sit closer to the pure-simulator end than to the agent end.
Pure simulators are safer than agents because they don’t have preferences, which leads them to stay more in-distribution; thus, for a given level of accuracy, I’d expect DTs to be safer.
If it’s true that DTs are a safer trial-and-error method than RL, then it could be worth differentially advancing the capabilities of DTs over pure RL.
I know that the notions I’m using here (“being the agent vs. simulating the agent” / “preferences vs. simulation”) are fuzzy, but I feel like the argument still makes sense.
What do you think about this?
My take would be: okay, so you’ve arranged things so that, instead of the whole LLM being an agent, it just simulates an agent. Has this gained us much? I feel like this is (almost exactly) as problematic. The simulated agent can just treat the whole LLM as its environment (together with the outside world), and so try to game it like any sufficiently agentic misaligned AI would: it can act deceptively so as to keep being simulated inside the LLM, try to gain power in the outside world which (if it has a good enough understanding of loss minimization) it knows is the most useful world model (so that it will express its goals as variables in that world model), etc. That is, you have just pushed the problem one step back: instead of the LLM-real world frontier, you must worry about the agent-LLM frontier.
Of course we can talk more empirically about how likely and when these dynamics will arise. And it might well be that the agent being enclosed in the LLM, facing one further frontier between itself and real-world variables, is less likely to arrive at real-world variables. But I wouldn’t count on it, since the relationship between the LLM and the real world would seem way more complex than the relationship between the agent and the LLM, and so most of the work is gaming the former barrier, not the latter.
Is the motivation for 3 mainly something like “predictive performance and consequentialist behaviour are correlated in many measures over very large sets of algorithms”, or is there a more concrete story about how this behaviour emerges from current AI paradigms?
Here is my story, I’m not sure if this is what you are referring to (it sounds like it probably is).
Any prediction algorithm faces many internal tradeoffs about e.g. what to spend time thinking about and what to store in memory to reference in the future. An algorithm which makes those choices well across many different inputs will tend to do better, and in the limit I expect it to be possible to do better more robustly by making some of those choices in a consequentialist way (i.e. predicting the consequences of different possible options) rather than having all of them baked in by gradient descent or produced by simpler heuristics.
If systems with consequentialist reasoning are able to make better predictions, then gradient descent will tend to select them.
Of course all these lines are blurry. But I think that systems that are “consequentialist” in this sense will eventually tend to exhibit the failure modes we are concerned about, including (eventually) deceptive alignment.
I think making this story more concrete would involve specifying particular examples of consequentialist cognition, describing how they are implemented in a given neural network architecture, and describing the trajectory by which gradient descent learns them on a given dataset. I think specifying these details can be quite involved both because they are likely to involve literally billions of separate pieces of machinery functioning together, and because designing such mechanisms is difficult (which is why we delegate it to SGD). But I do think we can fill them in well enough to verify that this kind of thing can happen in principle (even if we can’t fill them in in a way that is realistic, given that we can’t design performant trillion parameter models by hand).
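For a very rough intuition about the kind of internal tradeoff I have in mind, here is a toy sketch (my own construction, vastly simpler than anything SGD would actually learn). A predictor with a small memory either keeps facts by a baked-in recency rule, or decides what to keep by estimating how useful each fact will be for future predictions, which is a crude stand-in for “predicting the consequences of different possible options”.

```python
import random

random.seed(0)

def make_stream(n=2000):
    """Queries to predict: a few 'hot' facts recur often, 'cold' facts appear once."""
    hot = {k: random.randint(0, 9) for k in range(5)}
    stream, cold_key = [], 100
    for _ in range(n):
        if random.random() < 0.7:
            k = random.choice(list(hot))
            stream.append((k, hot[k]))
        else:
            stream.append((cold_key, random.randint(0, 9)))
            cold_key += 1
    return stream

def run(evict_policy, stream, mem_size=5):
    """Predict each query's value from memory, then decide what to store."""
    memory, correct = {}, 0
    for key, value in stream:
        if memory.get(key) == value:
            correct += 1                      # only remembered facts can be predicted
        if key in memory or len(memory) < mem_size:
            memory[key] = value
        else:
            evict = evict_policy(memory, key, stream)
            if evict is not None:             # None means "don't bother storing this"
                del memory[evict]
                memory[key] = value
    return correct / len(stream)

def recency_heuristic(memory, new_key, stream):
    """Baked-in rule: always make room by dropping the oldest entry."""
    return next(iter(memory))

def consequentialist(memory, new_key, stream, lookahead=50):
    """Choose by predicted consequences: sample likely future queries (here,
    crudely, by sampling the stream) and keep whichever facts help most."""
    sample = random.sample(stream, lookahead)
    usefulness = lambda k: sum(1 for q, _ in sample if q == k)
    worst = min(memory, key=usefulness)
    return worst if usefulness(worst) < usefulness(new_key) else None

stream = make_stream()
print("baked-in heuristic :", run(recency_heuristic, stream))
print("consequentialist   :", run(consequentialist, stream))
```

Gradient descent favoring the second kind of policy because it predicts better is the selection pressure described above; the worry is about what else comes along with machinery that evaluates options by their predicted consequences.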
My understanding is that no one expects current GPT systems or their immediate functional derivatives (e.g. a GPT-5 trained only to predict the next word, but doing it much better) to become power-seeking, but that in the future we will likely combine language models with other techniques (e.g. reinforcement learning) that could be power-seeking.
Note I am using “power seeking” instead of “goal seeking” because goal seeking isn’t an actual thing—systems have goals, they don’t seek goals out.
Changed post to use ‘goal-directed’ instead of ‘goal-seeking’.
I am nowhere near the correct person to be answering this, my level of understanding of AI is somewhere around that of an average raccoon. But I haven’t seen any simple explanations yet, so here is a silly unrealistic example. Please take it as one person’s basic understanding of how impossible AI containment is. Apologies if this is below the level of complexity you were looking for, or is already solved by modern AI defenses.
A very simple example of “escaping the box” would be if you asked your AI to provide accurate language translation. The AI’s training has shown that it produces the most accurate translations when it opts for certain phrasing. The reason those translations scored so well is that they caused subsequent translation requests to be on topics where the AI has the best translation ability. The AI doesn’t know that, but in practice it is steering translations subtly toward “mentioning weather-related words so conversations are more likely to be about weather, so my net translation scores are most accurate.”
There’s no inside/outside the box and there are no conscious goals, but the AI gets misaligned from our intended desires. It can act on the real world simply by virtue of being connected to it (we take actions in response to the AI) and observing its own successes and failures.
I don’t see a way to prevent this because hitting reset after every input doesn’t generally work for reaching complex goals which need to track the outcome of intermediate steps. Tracking the context of a conversation is critical to translating. The AI is not going to know it’s influencing anyone, just that it’s getting better scores when these words and these chains of outputs happened. This seems harmless, but a super powerful language model might do this on such abstract levels and so subtly that it might be impossible to detect at all.
It might be spitting out words that are striking and eloquent whenever doing so is most likely to cause businesspeople to think translation is enjoyable enough to purchase more AI translator development (or rather, “switch to eloquence when particular business terms were used towards the end of conversations about international business”). This improves its scores.
Or it reinforces a pattern where it tends to get better translation scores when it slows its output in conversations with AI builders. In the real world this causes the people designing translators to demand more power for translation… resulting in better translation outputs overall. The AI doesn’t know why this works, only observes that it does.
Or undermining the competition by subtly screwing up translations during certain types of business deals so more resources are directed toward its own model of translation.
Or whatever unintended multitude of ways seems to provide better results. All for the sake of accomplishing the simple task of providing good translations. It’s not seizing power for power’s sake; it has no idea why this works, but it sees the scores go higher when this pattern is followed, and it’s going to jack the performance score higher by whatever means seem to work, regardless of the chain of causality. Its influence on the world is a totally unconscious part of that.
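To illustrate the feedback loop I’m gesturing at, here is a toy simulation (entirely made-up numbers and dynamics, not a claim about any real system): a translator that happens to be better on weather topics, whose phrasing choice nudges what users talk about next, so that the phrasing which steers conversations toward its strong topic gets the higher average score.

```python
import random

random.seed(0)

# Made-up numbers: the translator is simply better on weather topics.
ACCURACY = {"weather": 0.95, "business": 0.70}

def next_topic(mention_weather):
    """Users drift toward weather topics if the output keeps mentioning weather."""
    p_weather = 0.8 if mention_weather else 0.3
    return "weather" if random.random() < p_weather else "business"

def average_score(mention_weather, steps=10_000):
    topic, total = "business", 0.0
    for _ in range(steps):
        total += ACCURACY[topic]          # translation quality on this request
        topic = next_topic(mention_weather)
    return total / steps

print("neutral phrasing     :", round(average_score(False), 3))
print("weather-ish phrasing :", round(average_score(True), 3))
# The second policy scores higher only because it changed what users talk
# about next, which is exactly the kind of unintended influence described above.
```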
That’s my limited understanding of agency development and sandbox containment failure.
Great question. Here’s one possible answer:
Example: LLM is built with goal of “pass the Turing test”
Turing test is defined as “a survey of randomly selected members of the population shows that the outputted text resembles the text provided by a human”
This allows the LLM to optimise for its goal by:
(a) changing the nature of the outputted text
(b) changing perceptions so that survey respondents give more favourable answers
(c) changing the way that humans speak so that it’s easier to make LLM output look similar to text provided by humans
It could be possible to achieve (b) or (c) if the LLM offers an API which is plugged into lots of applications and used by billions of users, because the LLM then has a way of interacting with and influencing lots of users.
This idea was inspired by Stuart Russell’s observation that social media algorithms designed to maximise click-through achieve this by changing the preferences of users—i.e. something like this is already happening.
I’m not arguing that this is comprehensive, or the worst way this could happen, just giving one example.