I think this post provides some pretty useful arguments about the downsides of pausing AI development. I feel noticeably more pessimistic about a pause going well, having read this.
However, I don't agree with some of the arguments about alignment optimism, and I think they're a fair bit weaker than the arguments about the pause itself.
"When it comes to AIs, we are the innate reward system"
Sure, we can use RLHF and related techniques to steer AI behavior. Further, unlike in most cases in biology, ANN updates do act directly on the whole model, without noise, etc.
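To make the "updates act on the whole model" point concrete, here's a minimal sketch of the kind of update I have in mind (a toy REINFORCE-style reward-weighted step in PyTorch; the network, sizes, and reward scores are all made up, and this is a big simplification of what RLHF actually does): the gradient step adjusts every parameter of the network toward the reward signal at once, rather than acting through the noisy, indirect channels that shape biological reward systems.

```python
import torch
import torch.nn as nn

# Toy policy network standing in for a language model (hypothetical sizes).
policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

contexts = torch.randn(8, 16)   # a batch of stand-in "prompts"
dist = torch.distributions.Categorical(logits=policy(contexts))
actions = dist.sample()         # stand-in "responses"
rewards = torch.randn(8)        # stand-in reward-model scores

# REINFORCE-style surrogate loss: increase log-prob of high-reward actions.
loss = -(dist.log_prob(actions) * rewards).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()  # one step: an exact gradient of the reward signal reaches every parameter

# Every parameter received a gradient from the reward signal in this single update.
assert all(p.grad is not None for p in policy.parameters())
```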
But the worries about what happens as AIs get predictably harder to evaluate, reaching superhuman performance on more and more tasks, are still very real given all of this! You mention scalable oversight research, so it's clear you're aware this is an open problem, but I don't think this post emphasises enough that most alignment work recognises a pretty big difference between aligning subhuman systems and aligning superhuman systems, which limits how much optimism you can get from GPT-4 seeming basically aligned. I think it's possible that, with tons of compute and aligned weaker AIs (as you touch upon), we can generalize to an aligned GPT-5, GPT-6, etc. But this feels like a pretty different paradigm from the various analogies to the natural world and from the current state of alignment!