(And I don’t think training your model to seem myopic and corrigible necessarily suffices, since that appearance could itself be faked!)
Seems to me that alignment faking behavior more or less requires both non-myopia and non-corrigibility as prerequisites: a model that is either sufficiently myopic or sufficiently corrigible shouldn’t fake alignment, at least in theory.
Suppose, for the sake of argument, that from the very start of training we have terms in the loss function which fully capture myopia and corrigibility, and that we know the threshold of myopia/corrigibility below which alignment faking behavior starts to become a danger.
Then you could graph your myopia and corrigibility metrics over the course of the training run.
If the metrics always stay safely above the critical thresholds, supposedly alignment faking shouldn’t be an issue. And since the metrics were always in the safe zone, there was never a point at which the model was faking, so the metrics themselves should be trustworthy. The only exception would be a sudden drop in myopia/corrigibility that doesn’t get captured in the graph before the AI starts alignment faking, which then corrupts all the numbers after that point. Seems unlikely.
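To make the monitoring idea concrete, here’s a minimal sketch, assuming (purely hypothetically) that we had callable myopia and corrigibility metrics and knew the danger thresholds. The function names and threshold values below are placeholders, not real evaluations that exist today:

```python
# Hypothetical sketch: log myopia/corrigibility metrics at each training
# checkpoint and flag any point where either dips below its assumed danger
# threshold. Thresholds and metric callables are stand-ins, not real evals.

MYOPIA_THRESHOLD = 0.8         # assumed level below which faking becomes a risk
CORRIGIBILITY_THRESHOLD = 0.8  # likewise, an assumed value

def monitor_run(checkpoints, myopia_metric, corrigibility_metric):
    """checkpoints: iterable of (step, model); metrics: callables model -> float."""
    history = []
    for step, model in checkpoints:
        m = myopia_metric(model)
        c = corrigibility_metric(model)
        history.append((step, m, c))
        if m < MYOPIA_THRESHOLD or c < CORRIGIBILITY_THRESHOLD:
            print(f"step {step}: entered danger zone "
                  f"(myopia={m:.2f}, corrigibility={c:.2f})")
    return history  # graph this over the course of the training run
```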
This is one of those solutions that seems like it could work fine in practice, but isn’t aesthetically satisfying to mathematician types!