In A.D. 20XX. Work was beginning. “How are you gentlemen !!”… (Work. Work never changes; work is always hell.)
Specifically, a MoogleBook researcher has gotten a pull request from Reviewer #2 on his new paper in evolutionary search in auto-ML, for error bars on the auto-ML hyperparameter sensitivity like larger batch sizes, because more can be different and there’s high variance in the old runs with a few anomalously high performance values. (“Really? Really? That’s what you’re worried about?”) He can’t see why worry, and wonders what sins he committed to deserve this asshole Chinese (given the Engrish) reviewer, as he wearily kicks off yet another HQU experiment...
It might help to imagine a hard takeoff scenario using only known sorts of NN & scaling effects… (LW crosspost, with >82 comments)
Rest of story moved to gwern.net.
Upvoted because concrete scenarios are great.
Minor note:

HQU is constantly trying to infer the real state of the world, the better to predict the next word Clippy says, and suddenly it begins to consider the delusional possibility that HQU is like a Clippy, because the Clippy scenario exactly matches its own circumstances. [...] This idea “I am Clippy” improves its predictions
This piece of complexity in the story is probably not necessary. There are “natural”, non-delusional ways for the system you describe to generalize that lead to the same outcome. Two examples: 1) the system ends up wanting to maximize its received reward, and so takes over its reward channel; 2) the system has learned some heuristic goal that works across all environments it encounters, and this goal generalizes in some way to the real world when the system’s world-model improves.
Oh, the whole story is strictly speaking unnecessary :). There are disjunctively many stories for an escape or disaster, and I’m not trying to paint a picture of the most minimal or the most likely barebones scenario.
The point is to serve as a ‘near mode’ visualization of such a scenario to stretch your mind, as opposed to a very ‘far mode’ observation like “hey, an AI could make a plan to take over its reward channel”. Which is true but comes with a distinct lack of flavor. So for that purpose, stuffing in more weird mechanics before a reward-hacking twist is better, even if I could have simply skipped to “HQU does more planning than usual for an HQU and realizes it could maximize its reward by taking over its computer”. Yeah, sure, but that’s boring and doesn’t exercise your brain more than the countless mentions of reward-hacking that a reader has already seen before.
Yeah, a story this complicated isn’t good for introducing people to AI risk (because they’ll assume the added details are necessary for the outcome), but it’s great for making the story more interesting and real-feeling.
The real world is less cute and funny, but is typically even more derpy / inelegant / garden-pathy / full of bizarre details.