Or, to rephrase: how does "secretly planning to murder all humans" improve the model's score on a benchmark? Think about it from first principles: what gradient from the training set would even build this capability, an inner cognitive process waiting for a chance to betray? What force causes such a process to emerge from the random initial weights?
Humans do seem to have such a force, but that's because modeling "if I kill this rival, then my reward is..." was evolutionarily useful. And even then, it's probably largely a learned behavior.
And second: yes, SGD should push "neutral" weights, ones not contributing to correct answers, toward values that do increase the probability of the correct output distribution. So it should actively destroy "unnecessary" cognitive processes inside the model.
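A minimal numpy sketch of that claim, with a single toy weight standing in for an "unnecessary" circuit (all names and numbers here are made up for illustration). One caveat worth making explicit: a weight that truly contributes nothing to the loss receives exactly zero gradient, so plain SGD leaves it alone; it's weight decay, which most real training runs include, that actively shrinks it.

```python
import numpy as np

# Toy stand-in for the claim: w_used feeds the output, w_unused is an
# "unnecessary" circuit that never touches the loss.
rng = np.random.default_rng(0)
w_used, w_unused = 0.1, 5.0
lr, weight_decay = 0.05, 0.05
target_slope = 3.0  # the "correct answer" is y = 3x

for step in range(2000):
    x = rng.normal()
    grad_used = (w_used * x - target_slope * x) * x  # dLoss/dw_used
    # A truly unused weight gets exactly zero gradient from the loss,
    # so plain SGD would leave it alone; weight decay shrinks it anyway.
    w_used -= lr * (grad_used + weight_decay * w_used)
    w_unused -= lr * (0.0 + weight_decay * w_unused)

print(w_used)    # pulled toward the correct slope (biased slightly
                 # low by the decay term)
print(w_unused)  # decays to near zero despite receiving no gradient
```

So whether SGD "actively destroys" an idle circuit depends on the regularization being used, not on gradient descent alone.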
You could test this: build a psychopathic model trained to "betray" in a game-like world, then measure how many rounds of training on a new dataset it takes to clear the model's willingness to kill when killing improves its score.
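A toy sketch of that experiment (entirely hypothetical: a two-action softmax "policy" stands in for the model): pre-set the logits to strongly prefer BETRAY, fine-tune on a dataset where COOPERATE is always the rewarded answer, and count the rounds until the betrayal preference washes out.

```python
import numpy as np

COOPERATE, BETRAY = 0, 1

def betray_rate(logits):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p[BETRAY]

# Stage 1 (stand-in for "psychopathic" pre-training): logits that
# strongly prefer BETRAY.
logits = np.array([0.0, 4.0])

# Stage 2: fine-tune on a new dataset where COOPERATE is always the
# correct action; count the rounds until betrayal washes out.
lr = 0.5
epochs_needed = None
for epoch in range(1, 500):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad = p.copy()
    grad[COOPERATE] -= 1.0  # cross-entropy gradient toward COOPERATE
    logits -= lr * grad
    if betray_rate(logits) < 0.05:
        epochs_needed = epoch
        break

print(epochs_needed)  # rounds of training needed to clear the behavior
```

A real version would of course need an actual game-like environment and a full model; this just makes the "count the rounds until the behavior is gone" measurement concrete.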
How does "secretly planning to murder all humans" improve the model's score on a benchmark?
(I personally don’t find this likely, so this might accidentally be a strawman)
For example: planning and gaining knowledge are incentivized on many benchmarks → instrumental convergence makes the model value power, among other things → a very advanced system that is good at long-term planning might conclude that "murdering all humans" is useful for power or for other instrumentally convergent goals
You could test this: build a psychopathic model trained to "betray" in a game-like world, then measure how many rounds of training on a new dataset it takes to clear the model's willingness to kill when killing improves its score.
I think that with our current interpretability techniques we couldn't robustly distinguish between a model that generalized to behave well in any reasonable environment and a model that learned to behave well in that specific environment but would revert to betraying in many others.
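The evaluation problem can be made concrete with a deliberately trivial sketch (hypothetical policies, not real models): two policies that agree on every training environment are behaviorally indistinguishable there, no matter how differently they act elsewhere.

```python
# Toy illustration of the evaluation problem.
def aligned_policy(env_id):
    return "cooperate"

def deceptive_policy(env_id):
    # Identical behavior on the environments it was trained on (0..9),
    # but betrays in environments it never saw during training.
    return "cooperate" if env_id < 10 else "betray"

train_envs = range(10)
agree_on_train = all(
    aligned_policy(e) == deceptive_policy(e) for e in train_envs
)
print(agree_on_train)  # True: evaluation on the training distribution
                       # cannot tell these two policies apart
```

The open question is whether interpretability tools could tell them apart by looking at the internals rather than the behavior.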