Humans need not be around to give a penalty at inference time, just as GPT-4 is not penalized by individual humans at that point; the reward is learned/programmed beforehand. Even if all humans were asleep or dead today, GPT could still run inference according to the reward we preprogrammed. These models are not doing pure online learning.
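To make the online-vs-offline distinction concrete, here is a minimal sketch in Python (all names are hypothetical toys, not GPT-4's actual training setup): humans only enter the picture when the reward is learned, so nothing at inference time depends on a human being present to hand out a penalty.

```python
# Hypothetical toy sketch of "reward is learned offline, not given online":
# humans contribute preference data up front; inference needs no human.

def train_reward_model(human_comparisons):
    """Offline step: learn a scalar 'reward' from human preference data."""
    scores = {}
    for preferred, rejected in human_comparisons:
        scores[preferred] = scores.get(preferred, 0) + 1
        scores[rejected] = scores.get(rejected, 0) - 1
    return lambda output: scores.get(output, 0)

def train_policy(reward_model, candidate_behaviors):
    """Training step: the only 'penalty' comes from the frozen, learned reward model."""
    return max(candidate_behaviors, key=reward_model)

def run_inference(trained_behavior):
    """Deployment: no reward is recomputed and no human is in the loop;
    the model just acts on whatever it internalized during training."""
    return trained_behavior

comparisons = [("helpful answer", "harmful answer"), ("helpful answer", "refusal")]
reward_model = train_reward_model(comparisons)   # humans involved only here
behavior = train_policy(reward_model, ["helpful answer", "harmful answer", "refusal"])
print(run_inference(behavior))                   # no human needed at this point
```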
I was also confused by this at first. But I don't think Rob is saying "an AI that learned 'don't kill everyone' during training would immediately start killing everyone as soon as it can get away with it"; I think he's saying "even if an AI picks up what seems like a 'don't kill everyone' heuristic during training, that doesn't mean this heuristic will always hold out-of-distribution". In particular, undergoing training is a different environment than being deployed, so picking up a "don't kill everyone in training (but do whatever when deployed)" heuristic scores just as well during training as "don't kill everyone ever", but the former leaves the AI more freedom to pursue its other objectives once deployed.
(I'm hoping Rob can correct me if I'm wrong and/or you can reply if I'm mistaken, per Cunningham's Law.)