Humans need not be around to give a penalty at inference time: GPT-4 isn't being penalized by individual humans on the fly; rather, the reward was learned / programmed during training. Even if all humans were asleep or dead today, GPT could still run inference according to the reward we preprogrammed. These models are not doing pure online learning.
I was also confused by this at first. But I don’t think Rob is saying “an AI that learned ‘don’t kill everyone’ during training would immediately start killing everyone as soon as it can get away with it”. I think he’s saying “even if an AI picks up what seems like a ‘don’t kill everyone’ heuristic during training, that doesn’t mean this heuristic will always hold out-of-distribution”. In particular, training and deployment are different environments, so a “don’t kill everyone in training (but do whatever when deployed)” heuristic scores just as well during training as “don’t kill everyone ever”, but the former leaves the AI more freedom to pursue its other objectives once deployed.
(I’m hoping Rob can correct me if I’m wrong and/or you can reply if I’m mistaken, per Cunningham’s Law.)