In particular, I am persuaded by the argument that, because evaluation is usually easier than generation, it should be feasible to accurately evaluate whether a slightly-smarter-than-human AI is taking unethical actions, allowing us to shape its rewards during training accordingly. After we’ve aligned a model that’s merely slightly smarter than humans, we can use it to help us align even smarter AIs, and so on, plausibly implying that alignment will scale to indefinitely higher levels of intelligence, without necessarily breaking down at any physically realistic point.
This reasoning seems to imply that you could use GPT-2 to oversee GPT-4 by bootstrapping through a chain of models at scales between GPT-2 and GPT-4. However, this isn’t true: the weak-to-strong generalization paper finds that this doesn’t work, and indeed bootstrapping like this doesn’t help at all for ChatGPT reward modeling (I believe it helps on chess puzzles and on nothing else they investigate).
I think this sort of bootstrapping argument might work if we could ensure that each model in the chain was sufficiently aligned and sufficiently capable that it would carefully reason about what humans would want if they were more knowledgeable, and then rate outputs on that basis. However, I don’t think GPT-4 is either aligned enough or capable enough for us to see this behavior. And I still think it’s unlikely to work even under these generous assumptions (though I won’t argue for this here).