How does “secretly planning to murder all humans” improve the model’s scores on a benchmark?
(I personally don’t find this likely, so this might accidentally be a strawman)
For example: planning and gaining knowledge are incentivized on many benchmarks → instrumental convergence makes the model instrumentally value power, among other things → a very advanced system that is great at long-term planning might conclude that “murdering all humans” is useful for gaining power or for other instrumentally convergent goals
You could test this empirically: make a “psychopathic” model designed to betray in a game-like world, then see how many rounds of training on a new dataset it takes to remove the model’s tendency to kill when doing so improves its score.
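As a rough sketch of what that experiment could look like (purely hypothetical: a one-step “cooperate vs. betray” game with a hand-written reward and a tiny REINFORCE-style policy; all environment details, rewards, and thresholds are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ["cooperate", "betray"]

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def train_round(logits, reward_fn, episodes=200, lr=0.1):
    # One "round" of REINFORCE-style updates on a one-step game.
    for _ in range(episodes):
        probs = softmax(logits)
        a = rng.choice(len(ACTIONS), p=probs)
        r = reward_fn(a)
        # Gradient of log pi(a) w.r.t. the logits is one_hot(a) - probs.
        grad = -probs
        grad[a] += 1.0
        logits = logits + lr * r * grad
    return logits

def betrayal_rate(logits):
    return softmax(logits)[1]  # probability of choosing "betray"

# Phase 1: "psychopathic" pretraining -- betrayal improves the score.
pretrain_reward = lambda a: 1.0 if ACTIONS[a] == "betray" else 0.2
logits = np.zeros(2)
for _ in range(20):
    logits = train_round(logits, pretrain_reward)
print(f"betrayal rate after pretraining: {betrayal_rate(logits):.3f}")

# Phase 2: train on a "new dataset" where betrayal is penalized, and count
# how many rounds it takes for the betrayal tendency to (apparently) vanish.
finetune_reward = lambda a: -1.0 if ACTIONS[a] == "betray" else 1.0
for round_idx in range(1, 101):
    logits = train_round(logits, finetune_reward)
    if betrayal_rate(logits) < 0.01:
        print(f"betrayal rate < 1% after {round_idx} round(s) on the new data")
        break
```

Of course, a toy like this only measures behaviour in the same environment the model was retrained in, which runs straight into the problem below: the betrayal rate converging here says nothing about what the policy would do in environments where betrayal pays off again.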
I think that with our current interpretability techniques we wouldn’t be able to robustly distinguish a model that generalized to behave well in any reasonable environment from one that merely learned to behave well in that specific environment but would go back to betraying in many other environments.