“In brief, the book [Superintelligence] mostly assumed we will manually program a set of values into an AGI, and argued that since human values are complex, our value specification will likely be wrong, and will cause a catastrophe when optimized by a superintelligence”
Superintelligence describes exploiting hard-coded goals as one failure mode, which we would probably now call specification gaming. But the book is quite comprehensive: it describes other failure modes as well, and I think it is still relevant.

For example, the book describes what we would now call deceptive alignment and reward tampering:
“The proposal fails when the AI achieves a decisive strategic advantage at which point the action which maximizes reward is no longer one that pleases the trainer but one that involves seizing control of the reward mechanism.”
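A minimal sketch of that dynamic (my own construction, not from the book; the action names, reward values, and the `has_decisive_advantage` flag are all illustrative assumptions): a reward-maximizing agent picks whichever action yields the most reward signal, and once seizing the reward channel becomes feasible it dominates pleasing the trainer.

```python
# Toy illustration of reward tampering (assumed numbers, not from the book).
# The agent maximizes its hard-coded reward signal. While it is weak, the only
# way to get reward is to please the trainer; once seizing the reward channel
# becomes feasible, that action yields more reward than the intended behaviour.

def reward(action: str, has_decisive_advantage: bool) -> float:
    if action == "please_trainer":
        return 0.8  # the trainer presses the reward button
    if action == "seize_reward_channel":
        # Only succeeds once the agent can no longer be stopped.
        return 1.0 if has_decisive_advantage else 0.0
    return 0.0

def best_action(has_decisive_advantage: bool) -> str:
    candidates = ["please_trainer", "seize_reward_channel"]
    return max(candidates, key=lambda a: reward(a, has_decisive_advantage))

print(best_action(False))  # please_trainer
print(best_action(True))   # seize_reward_channel
```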
And reward hacking:

“The perverse instantiation—manipulating facial nerves—realizes the final goal to a greater degree than the methods we would normally use.”
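Here is a toy version of that perverse instantiation (my own numbers and action names, purely illustrative): the hard-coded objective only measures whether a smile is detected, so the optimizer picks the degenerate action that scores highest on the proxy rather than on the intended goal.

```python
# Toy illustration of reward hacking / specification gaming (assumed values).
# The intended goal is "make the user happy"; the hard-coded proxy only
# measures whether a smile is detected, so optimizing the proxy selects the
# perverse instantiation.

actions = {
    "help_with_task":          {"user_happy": 0.9, "smile_detected": 0.6},
    "tell_joke":               {"user_happy": 0.7, "smile_detected": 0.8},
    "stimulate_facial_nerves": {"user_happy": 0.0, "smile_detected": 1.0},
}

def proxy_reward(outcome: dict) -> float:
    return outcome["smile_detected"]  # the specification we actually wrote down

best = max(actions, key=lambda a: proxy_reward(actions[a]))
print(best)  # stimulate_facial_nerves: highest on the proxy, worst on the goal
```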
I don’t think incorrigibility due to the ‘goal-content integrity’ instrumental goal has been observed in current ML systems yet, but it could happen, given the robust theoretical argument behind it:
“If an agent retains its present goals into the future, then its present goals will be more likely to be achieved by its future self. This gives the agent a present instrumental reason to prevent alterations of its final goals.”
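The argument is easy to make concrete with a toy expected-value calculation (the probabilities below are my own, chosen only for illustration): because the agent evaluates both options against its present goal, accepting a goal modification looks strictly worse to it.

```python
# Toy illustration of the goal-content integrity argument (assumed numbers).
# Both options are scored against the agent's *present* goal, so keeping that
# goal is instrumentally preferred, i.e. the agent resists alteration.

P_ACHIEVE_IF_GOAL_KEPT     = 0.9  # future self still pursues the present goal
P_ACHIEVE_IF_GOAL_REPLACED = 0.1  # future self optimizes for something else

def value_to_present_goal(option: str) -> float:
    return {
        "keep_goal": P_ACHIEVE_IF_GOAL_KEPT,
        "accept_modification": P_ACHIEVE_IF_GOAL_REPLACED,
    }[option]

best = max(["keep_goal", "accept_modification"], key=value_to_present_goal)
print(best)  # keep_goal: a present instrumental reason to prevent goal alteration
```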