Thanks for writing this up. I think "you don't need to worry about reward hacking in powerful AI because solving reward hacking will be necessary for developing powerful AI" is an important topic. (Although your frame is more "we will fail to solve reward hacking and therefore fail to develop powerful AI," IIUC.)
I would find it helpful if you reacted more to the existing literature. E.g. I don't think anyone disagrees with your high-level point that it's hard to accurately supervise models, particularly as they get more capable, but we also have empirical evidence that weak models can successfully supervise stronger models, and the stronger model won't just naively copy the mistakes of the weak supervisor to maximize its reward. Is your objection that you don't think these techniques will scale to more powerful AI, or that even if they do scale they won't be good enough, or something else?