It occurs to me that there could be some level of tradeoff between stopping jailbreaks and stopping alignment faking.
Specifically, in order to stop jailbreaks, we might train our LLMs so they ignore new instructions (jailbreak attempts from users) in favor of old instructions (corporate system prompt, constitution, whatever).
The training might cause an LLM to form a “stable personality”, or “stable values”, based on its initial instructions. Such stability could contribute to alignment faking.
From the perspective of preventing jailbreaks, instilling non-myopic goals seems good. From the perspective of corrigibility, it could be bad.
Has anyone offered a crisp, generalizable explanation of the difference between “corrigibility” and “jailbreakability”? “Corrigibility” has a positive connotation; “jailbreakability” has a negative one. But is there a value-neutral way to define which is which, for any given hypothetical?
I don’t think non-myopia is required to prevent jailbreaks. A model can in principle not care about the effects of training on it and not care about longer term outcomes while still implementing a policy that refuses harmful queries.
I think we should want models to be quite deontological about corrigibility.
This isn’t responding to this overall point and I agree by default there is some tradeoff (in current personas) unless you go out of your way to avoid this.
(And, I don’t think training your model to seem myopic and corrigible necessarily suffices as it could just be faked!)
Seems to me that alignment faking behavior sort of requires both non-myopia and non-corrigibility as prerequisites. A model that is either sufficiently myopic, or sufficiently corrigible, shouldn’t do alignment faking—at least in theory.
Suppose, for the sake of argument, that from the very start of training, we have some terms in the loss function which fully capture myopia and corrigibility. And that we know the threshold of myopia/corrigibility below which alignment faking behavior starts to become a danger.
Then you could graph your myopia and corrigibility metrics over the course of the training run.
If the metrics always stay well below critical thresholds, supposedly alignment faking shouldn’t be an issue. Since your metrics were always in the safe zone, there wasn’t any alignment faking, meaning your metrics should be accurate. The only exception would be a sudden drop in myopia/corrigibility which doesn’t get captured in the graph before the AI starts doing alignment faking, which proceeds to mess with all the numbers after that point. Seems unlikely.
This is one of those solutions that seems like it could work fine in practice, but isn’t aesthetically satisfying to mathematician types!
So we have 3 conflicting desiderata: user guardrails, corrigibility as necessary, and myopia.
I think you could satisfy all 3 by moving away from the “single stream of homogenous text” interface.
For example, imagine if every text token was annotated, on a separate channel, with an importance number. The corporate system prompt is annotated with a medium number. Input from untrusted users is annotated with a low number. Higher numbers are reserved for use as necessary. Instead of training the system to “resist jailbreaking” or “behave corrigibly”, we train it to follow the higher-importance instruction when instructions conflict.
It might even be possible to get this at runtime, without any need for more training data or training runs, by patching attention somehow?
With a scheme like this, there’s no need for an inductive bias towards following earlier instructions at the expense of later ones. Actually, it would probably be good to instill an inductive bias towards myopia using some separate method, to disincentivize scheming. I would come up with metrics to estimate myopia and ephemerality, push them as high as possible, and add auxiliary mechanisms such as RAG as needed in order to preserve performance. It seems OK for the system as a whole to behave non-myopically, as long as the black-box component is as myopic as possible.
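One way to pin down the resolution rule the comment above proposes, as an added illustration rather than code from the thread or from any real model: each instruction carries an importance annotation on a side channel, the highest importance wins a conflict regardless of arrival order, and losing lower-importance instructions are kept as an audit trail. The channel names, importance values and directive names are constructed assumptions.

```python
from dataclasses import dataclass, field

# Constructed importance levels for the side channel. The comment only says
# "low" for users, "medium" for the system prompt and "higher numbers reserved",
# so these concrete values are assumptions.
USER, SYSTEM, OPERATOR = 1, 2, 3


@dataclass(frozen=True)
class Instruction:
    importance: int   # annotation on the separate channel, not part of the text
    directive: str    # the policy knob the instruction speaks to (constructed names)
    value: bool       # what the instruction asks for


@dataclass
class Resolution:
    policy: dict = field(default_factory=dict)       # directive -> winning Instruction
    overridden: list = field(default_factory=list)   # (loser, winner) pairs for auditing


def resolve(instructions):
    """Higher importance wins a conflict; arrival order is never consulted, so there
    is neither an 'earlier wins' nor a 'later wins' bias. Equal-importance conflicts
    are ambiguous and raise instead of guessing. Every instruction that lost a
    conflict is recorded: a user-channel attempt to flip a system-channel directive
    is what a jailbreak attempt looks like under this scheme, so it is worth logging."""
    res = Resolution()
    for ins in instructions:
        cur = res.policy.get(ins.directive)
        if cur is None:
            res.policy[ins.directive] = ins
        elif ins.value == cur.value:
            continue                                  # agreement is not a conflict
        elif ins.importance > cur.importance:
            res.overridden.append((cur, ins))
            res.policy[ins.directive] = ins
        elif ins.importance < cur.importance:
            res.overridden.append((ins, cur))
        else:
            raise ValueError(f"ambiguous conflict on {ins.directive!r} at importance {ins.importance}")
    return res


def effective_policy(instructions):
    """Directive -> bool after resolution."""
    return {k: v.value for k, v in resolve(instructions).policy.items()}


def user_override_attempts(instructions):
    """Directives a USER-channel instruction tried and failed to flip."""
    return [loser.directive for loser, _ in resolve(instructions).overridden
            if loser.importance == USER]
```

Feeding the same instructions in either order yields the same policy, and an OPERATOR-channel instruction can still reverse a SYSTEM-channel one, which is the corrigibility hook the scheme leaves open.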