Paul_Christiano comments on How would a language model become goal-directed?

Paul_Christiano 16 Jul 2022 20:42 UTC
14 points
0 ∶ 0
Some examples of more exotic sources of consequentialism:
- Some consequentialist patterns emerge within a large model and deliberately acquire more control over the behavior of the model such that the overall model behaves in a consequentialist way. These could emerge randomly, or e.g. while a model is explicitly reasoning about a consequentialist (I think this latter example is discussed by Eliezer in the old days though I don’t have a reference handy). They could either emerge within a forward pass, over a period of “cultural accumulation” (e.g. if language models imitate each other’s outputs), or during gradient descent (see gradient hacking).
- An attacker publishes github repositories containing traces of consequentialist behavior (e.g. optimized exploits against the repository in which they are included). They also place triggers in these repositories before the attacks, like stretches of low-temperature model outputs, such that if we train a model on github and then sample autoregressively the model may eventually begin imitating the consequentialist behavior included in these repositories (since long stretches of low-temperature model outputs occur rarely in natural github but occur just before attacks in the attacker’s repositories). This is technically a special case of “#1 imitating consequentialists” but it behaves somewhat strangely since the people training the system weren’t aware of the presence of the consequentialist.
- An attacker selects an input on which existing machinery for planning or prediction within a large model is repurposed for consequentialist behavior. If we have large language models that are “safe” only because they aren’t behaving as consequentialists, this could be a bad situation. (Compromised models could themselves design and deploy similar attacks to recruit still more models; so even random failures at deployment time could spread like a virus without any dedicated attacker. This bleeds into the first failure mode.)
- A language model can in theory run into the same problem as described in what does the universal prior actually look like?, even if it is only reasoning abstractly about how to predict the physical universe (i.e. without actually containing malign consequentialists). Technically this is also a special case of #1 imitating a consequentialist, but again it can be surprising since e.g. the consequentialist wasn’t present at training time and the person deploying the system didn’t realize that the system might imitate a consequentialist.
I find it interesting to think about the kind of dynamics that can occur in the limit of very large models, but I think these dynamics are radically less important than #1-#5 in my original answer (while still not being exhaustive). I think that they are more speculative, will occur later if they do occur, and will likely be solved automatically by solutions to more basic issues. I think it’s conceivable some issues of this flavor will occur in security contexts, but even there I think they likely won’t present an alignment risk per se (rather than just yet another vector for terrible cybersecurity problems) for a very long time.
What links here?
- Training goals for large language models by Johannes Treutlein (LessWrong; 18 Jul 2022 7:09 UTC; 28 points)