Thanks for writing this up. I think "you don't need to worry about reward hacking in powerful AI because solving reward hacking will be necessary for developing powerful AI" is an important topic. (Although your frame is more "we will fail to solve reward hacking and therefore fail to develop powerful AI," IIUC.)
I would find it helpful if you reacted more to the existing literature. E.g. I don't think anyone disagrees with your high-level point that it's hard to accurately supervise models, particularly as they get more capable, but we also have empirical evidence that weak models can successfully supervise stronger models, and the stronger model won't just naively copy the mistakes of the weak supervisor to maximize its reward. Is your objection that you don't think these techniques will scale to more powerful AI, or that even if they do scale they won't be good enough, or something else?