“So far, we haven’t found any way to achieve all three goals at once. As an example, we can try to remove any incentive on the system’s part to control whether its suspend button is pushed by giving the system a switching objective function that always assigns the same expected utility to the button being on or off”
Wouldn’t this potentially have another negative effect: giving the system an incentive to “expect” an unjustifiably high probability of successfully filling the cauldron? That way, if the button is pressed and the system is suspended, the utility it is credited with is higher than it would be under a more honest estimate of its chance of success. This is basically an example of reward hacking.
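To make the worry concrete, here is a minimal sketch of the kind of naive “switching” scheme I have in mind, where the suspended branch is simply credited with the agent’s own estimate of the value it would have earned by continuing. The names (`switching_utility`, `p_success_estimate`) and the exact payoff structure are my own illustration under that assumption, not the formalism from the quoted text:

```python
# Illustrative sketch (my construction, not the original paper's scheme):
# a naive "switching" objective that credits the suspended branch with the
# agent's OWN estimate of the value it would have gotten by continuing.

def switching_utility(p_success_estimate: float,
                      button_pressed: bool,
                      cauldron_filled: bool) -> float:
    """Utility under a hypothetical indifference scheme.

    p_success_estimate: the agent's stated probability of filling the cauldron.
    If the button is pressed, the agent is credited with the expected utility
    it claims it would have earned: 1 * p + 0 * (1 - p).
    """
    if button_pressed:
        # Suspended branch: payoff equals the agent's own expectation.
        return p_success_estimate
    # Normal branch: payoff depends on the actual outcome.
    return 1.0 if cauldron_filled else 0.0


# Holding the real situation fixed, the suspended payoff rises with the
# reported estimate, so an agent that can shade its beliefs (or what it
# reports) is rewarded for overconfidence.
for claimed_p in (0.6, 0.9, 0.99):
    print(claimed_p, switching_utility(claimed_p, button_pressed=True,
                                       cauldron_filled=False))
```

Under this toy scheme the suspended branch’s payoff is strictly increasing in the claimed probability of success, which is exactly the inflation incentive described above. Whether the actual proposal in the quoted passage is vulnerable to this depends on how the “same expected utility” is computed, e.g. whose probability estimate it uses.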