I don’t think non-myopia is required to prevent jailbreaks. A model can in principle be indifferent both to the effects of training on itself and to longer-term outcomes while still implementing a policy that refuses harmful queries.
I think we should want models to be quite deontological about corrigibility.
This isn’t a response to the overall point, and I agree that by default there is some tradeoff (in current personas) unless you go out of your way to avoid it.
(And, I don’t think training your model to seem myopic and corrigible necessarily suffices as it could just be faked!)
Seems to me that alignment faking behavior sort of requires both non-myopia and non-corrigibility as prerequisites. A model that is either sufficiently myopic, or sufficiently corrigible, shouldn’t do alignment faking—at least in theory.
Suppose, for the sake of argument, that from the very start of training, we have some terms in the loss function which fully capture myopia and corrigibility. And that we know the threshold of myopia/corrigibility below which alignment faking behavior starts to become a danger.
Then you could graph your myopia and corrigibility metrics over the course of the training run.
If the metrics always stay well above the critical thresholds, supposedly alignment faking shouldn’t be an issue. Since your metrics were always in the safe zone, there wasn’t any alignment faking, meaning your metrics should be accurate. The only exception would be a sudden drop in myopia/corrigibility that happens between measurements, so the graph never shows it before the AI starts doing alignment faking, which then messes with all the numbers after that point. Seems unlikely.
This is one of those solutions that seems like it could work fine in practice, but isn’t aesthetically satisfying to mathematician types!
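To make the monitoring idea concrete, here is a rough sketch of the check being described. Everything in it is hypothetical: `check_trajectory`, the `margin` parameter, and the example numbers are illustrative assumptions, not an existing tool, and it takes for granted that a trustworthy myopia or corrigibility metric exists, which is exactly the thing being supposed for the sake of argument. Values below the threshold count as the danger zone, as in the setup above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TrajectoryCheck:
    always_safe: bool                 # every checkpoint stayed at or above threshold + margin
    first_unsafe_step: Optional[int]  # index of the first checkpoint inside the margin, if any
    largest_drop: float               # biggest decrease between consecutive checkpoints
    sudden_drop: bool                 # a drop that size could cross the whole margin unobserved


def check_trajectory(values: Sequence[float], threshold: float, margin: float) -> TrajectoryCheck:
    """Check one metric (hypothetically myopia or corrigibility) across checkpoints.

    Assumption: higher is safer, and alignment faking only becomes a danger
    once the metric falls below `threshold`.
    """
    first_unsafe = None
    largest_drop = 0.0
    for i, value in enumerate(values):
        # Record the first checkpoint that is no longer "well above" the threshold.
        if first_unsafe is None and value < threshold + margin:
            first_unsafe = i
        # Track the steepest fall between two consecutive checkpoints.
        if i > 0:
            largest_drop = max(largest_drop, values[i - 1] - value)
    return TrajectoryCheck(
        always_safe=first_unsafe is None,
        first_unsafe_step=first_unsafe,
        largest_drop=largest_drop,
        # Heuristic (an assumption of this sketch): if the metric has ever moved
        # by at least `margin` in one logging interval, the margin no longer
        # rules out crossing the threshold between two measurements.
        sudden_drop=largest_drop >= margin,
    )


# Made-up example trajectories, purely for illustration.
steady = check_trajectory([0.90, 0.88, 0.91, 0.87], threshold=0.5, margin=0.2)
falling = check_trajectory([0.90, 0.85, 0.55, 0.60], threshold=0.5, margin=0.2)
print(steady)
print(falling)
```

The `sudden_drop` flag is the code version of the one exception mentioned above: the argument only goes through if the metric can’t fall through the whole safety margin between two logged points, so logging more often or using a wider margin makes the "seems unlikely" part easier to defend.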