“In brief, the book [Superintelligence] mostly assumed we will manually program a set of values into an AGI, and argued that since human values are complex, our value specification will likely be wrong, and will cause a catastrophe when optimized by a superintelligence”
Superintelligence describes exploiting hard-coded goals as one failure mode, which we would probably now call specification gaming. But the book is quite comprehensive: it describes other failure modes as well, and I think it is still relevant.

For example, the book describes what we would now call deceptive alignment and reward tampering:
“The proposal fails when the AI achieves a decisive strategic advantage at which point the action which maximizes reward is no longer one that pleases the trainer but one that involves seizing control of the reward mechanism.”
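A minimal sketch of that dynamic (my own construction, not from the book; the action names, reward values, and the `has_decisive_advantage` flag are all illustrative assumptions): a reward-maximizing agent picks whichever action yields the most reward signal, and once seizing the reward channel becomes feasible it dominates pleasing the trainer.

```python
# Toy illustration of reward tampering (assumed numbers, not from the book).
# The agent maximizes its hard-coded reward signal. While it is weak, the only
# way to get reward is to please the trainer; once seizing the reward channel
# becomes feasible, that action yields more reward than the intended behaviour.

def reward(action: str, has_decisive_advantage: bool) -> float:
    if action == "please_trainer":
        return 0.8  # the trainer presses the reward button
    if action == "seize_reward_channel":
        # Only succeeds once the agent can no longer be stopped.
        return 1.0 if has_decisive_advantage else 0.0
    return 0.0

def best_action(has_decisive_advantage: bool) -> str:
    candidates = ["please_trainer", "seize_reward_channel"]
    return max(candidates, key=lambda a: reward(a, has_decisive_advantage))

print(best_action(False))  # please_trainer
print(best_action(True))   # seize_reward_channel
```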
And reward hacking:

“The perverse instantiation—manipulating facial nerves—realizes the final goal to a greater degree than the methods we would normally use.”
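Here is a toy version of that perverse instantiation (my own numbers and action names, purely illustrative): the hard-coded objective only measures whether a smile is detected, so the optimizer picks the degenerate action that scores highest on the proxy rather than on the intended goal.

```python
# Toy illustration of reward hacking / specification gaming (assumed values).
# The intended goal is "make the user happy"; the hard-coded proxy only
# measures whether a smile is detected, so optimizing the proxy selects the
# perverse instantiation.

actions = {
    "help_with_task":          {"user_happy": 0.9, "smile_detected": 0.6},
    "tell_joke":               {"user_happy": 0.7, "smile_detected": 0.8},
    "stimulate_facial_nerves": {"user_happy": 0.0, "smile_detected": 1.0},
}

def proxy_reward(outcome: dict) -> float:
    return outcome["smile_detected"]  # the specification we actually wrote down

best = max(actions, key=lambda a: proxy_reward(actions[a]))
print(best)  # stimulate_facial_nerves: highest on the proxy, worst on the goal
```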
I don’t think incorrigibility due to the ‘goal-content integrity’ instrumental goal has been observed in current ML systems yet, but it could happen, given the robust theoretical argument behind it:
“If an agent retains its present goals into the future, then its present goals will be more likely to be achieved by its future self. This gives the agent a present instrumental reason to prevent alterations of its final goals.”
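The argument is easy to make concrete with a toy expected-value calculation (the probabilities below are my own, chosen only for illustration): because the agent evaluates both options against its present goal, accepting a goal modification looks strictly worse to it.

```python
# Toy illustration of the goal-content integrity argument (assumed numbers).
# Both options are scored against the agent's *present* goal, so keeping that
# goal is instrumentally preferred, i.e. the agent resists alteration.

P_ACHIEVE_IF_GOAL_KEPT     = 0.9  # future self still pursues the present goal
P_ACHIEVE_IF_GOAL_REPLACED = 0.1  # future self optimizes for something else

def value_to_present_goal(option: str) -> float:
    return {
        "keep_goal": P_ACHIEVE_IF_GOAL_KEPT,
        "accept_modification": P_ACHIEVE_IF_GOAL_REPLACED,
    }[option]

best = max(["keep_goal", "accept_modification"], key=value_to_present_goal)
print(best)  # keep_goal: a present instrumental reason to prevent goal alteration
```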