What kind of a breakthrough are you envisaging? How do we get from here to 100% watertight alignment of an arbitrarily capable AGI?
Scalable alignment is the most promising approach we have for aligning a smarter intelligence.
Now, Pretraining from Human Feedback showed that, at least for one subproblem of alignment (outer alignment), we managed to make the AI more aligned as it got more data.
If this generalizes, it’s huge news: it implies we can at least align an AI’s goals with human goals as we scale up the data, which means scalable alignment isn’t as doomed as we thought.
Re ARC Evals: on the flip side, they aren’t factoring in humans actively doing things that make the situation worse. ChatGPT plugins, AutoGPT, BabyAGI, ChaosGPT, etc. all show that this is highly likely to happen!
The point is that ARC Evals will mostly be an upper bound, not a lower bound, because it makes very optimistic assumptions about the AI under test. Maybe they’re right, but their results are closer to a maximum for an AI’s capabilities than a minimum. The most likely outcome is therefore reduced real-world impact, though there’s some chance the evals end up close to what happens in real life.
It’s possible for an early AI takeoff to happen within 2 years; I just don’t consider that very likely right now.
Generalizing is one thing, but how can scalable alignment ever be watertight? Have you seen all the GPT-4 jailbreaks!? How can every single one be patched using this paradigm? The number of possible failure modes needs to keep shrinking as the power level increases, approaching zero failure modes in the limit for a superintelligent AI. I don’t see how scalable alignment can possibly work that well.
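To make the "limit of 0 failure modes" point concrete, here is a quick back-of-the-envelope sketch (my own illustration, not from the thread): if each interaction independently has a small probability p of a jailbreak-style failure, the chance of at least one failure across n interactions is 1 − (1 − p)^n, which climbs toward certainty at deployment scale unless p itself shrinks toward zero.

```python
# Illustrative arithmetic: why per-interaction failure rates must approach
# zero as deployment scale grows. All numbers below are hypothetical.

def p_any_failure(p: float, n: int) -> float:
    """Probability of at least one failure in n independent interactions,
    each with per-interaction failure probability p."""
    return 1 - (1 - p) ** n

# Even a seemingly tiny per-query failure rate compounds at scale.
for p in (1e-3, 1e-6, 1e-9):
    print(f"p = {p:g}: chance of >=1 failure over 1e9 queries "
          f"= {p_any_failure(p, 10**9):.4f}")
```

At a billion queries, even p = 10⁻⁹ still gives roughly a 63% chance of at least one failure, which is the intuition behind demanding ever-fewer failure modes as capability rises.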
OpenAI says in their GPT-4 release announcement that “GPT-4 responds to sensitive requests (e.g., medical advice and self-harm) in accordance with our policies 29% more often.” That is a 29% relative improvement in compliance, not an elimination of failures. This is the opposite of reassuring when thinking about x-risk.
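A quick sketch of what that statistic leaves behind (the baseline compliance rates below are hypothetical, since OpenAI didn’t publish them): a 29% relative improvement over an imperfect baseline can still leave a substantial residual failure rate.

```python
# Hedged illustration: residual failure rate after a 29% relative
# improvement in policy-compliant responses. Baselines are made up.

def residual_failure(baseline_compliance: float, rel_improvement: float = 0.29) -> float:
    """Failure rate remaining after a relative improvement in compliance,
    capped so compliance cannot exceed 100%."""
    improved = baseline_compliance * (1 + rel_improvement)
    return 1 - min(improved, 1.0)

for baseline in (0.60, 0.70, 0.77):
    print(f"hypothetical baseline {baseline:.0%} compliance -> "
          f"residual failures {residual_failure(baseline):.1%}")
```

Unless the baseline was already very high, "29% more often" is nowhere near the watertight regime the x-risk argument requires.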
(And all this is not even addressing inner alignment!)