Generalizing is one thing, but how can scalable alignment ever be watertight? Have you seen all the GPT-4 jailbreaks!? How can every single one be patched using this paradigm? There needs to be an ever-decreasing number of possible failure modes as power level increases, to the limit of zero failure modes for a superintelligent AI. I don’t see how scalable alignment can possibly work that well.
OpenAI says in their GPT-4 release announcement that “GPT-4 responds to sensitive requests (e.g., medical advice and self-harm) in accordance with our policies 29% more often.” That is a 29% relative improvement in policy compliance, not an elimination of harm. This is the opposite of reassuring when thinking about x-risk.
(And all this is not even addressing inner alignment!)