Looks like outer alignment is actually more difficult than I thought. Sherjil Ozair, a former DeepMind employee, writes:
“From my experience doing early RLHF work for Gemini, larger models exploit the reward model more. You need to constantly keep collecting more preferences and retraining reward models to make it not exploitable. Otherwise you get nonsensical responses which have exploited the idiosyncrasy of your preferences data. There is a reason few labs have done RLHF successfully”
In other words, even though we look at things like ChatGPT and go, “Wow, this is surprisingly aligned, I guess alignment is easier than we thought”, we don’t see all of the hard work that had to go into making it aligned. And perhaps as AIs become more powerful, the amount of work required to align them will exceed what is humanly possible.
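
The dynamic in that quote is essentially reward-model overoptimization (a Goodhart effect): the policy learns to score well on the learned proxy precisely where the proxy diverges from what raters actually want, and the stopgap is to keep collecting preferences and retraining the reward model. Here is a minimal toy sketch of that effect; it has nothing to do with Gemini’s actual pipeline, and the linear proxy, the quadratic “nonsense” penalty, and the optimization-strength knob are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: "responses" are points in a 16-dimensional feature space.
# True reward is what human raters actually want; the reward model is a
# proxy fit from a limited number of noisy pairwise preference labels.
DIM = 16
true_w = rng.normal(size=DIM)
true_w /= np.linalg.norm(true_w)

def true_reward(x):
    # Raters like the true_w direction but dislike extreme, "nonsensical"
    # responses, modeled here as a quadratic penalty on magnitude.
    return x @ true_w - 0.05 * np.sum(x**2)

def fit_reward_model(n_prefs):
    # Fit a purely linear proxy from n_prefs noisy comparisons. (Real reward
    # models use a Bradley-Terry loss; this crude averaging of
    # preferred-minus-rejected feature differences keeps the sketch short.)
    diffs = []
    for _ in range(n_prefs):
        a, b = rng.normal(size=(2, DIM))
        a_wins = true_reward(a) + rng.normal(scale=0.5) > true_reward(b) + rng.normal(scale=0.5)
        diffs.append(a - b if a_wins else b - a)
    w_hat = np.mean(diffs, axis=0)
    return w_hat / np.linalg.norm(w_hat)

def optimize_against(w_hat, strength):
    # Stand-in for policy optimization: push responses in whatever direction
    # the proxy rewards, harder for a more capable optimizer.
    return strength * w_hat

for n_prefs in (50, 500, 5000):
    w_hat = fit_reward_model(n_prefs)
    for strength in (1.0, 5.0, 25.0):
        x = optimize_against(w_hat, strength)
        print(f"prefs={n_prefs:5d}  strength={strength:5.1f}  "
              f"proxy reward={x @ w_hat:7.2f}  true reward={true_reward(x):7.2f}")
```

Optimizing harder against a proxy fit from too few comparisons sends proxy reward up while true reward collapses; more preference data makes the proxy track the true reward better, so the same amount of optimization does less damage. That is roughly the treadmill the quote describes, and it gets worse as the optimizer gets stronger.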