What kind of a breakthrough are you envisaging? How do we get from here to 100% watertight alignment of an arbitrarily capable AGI? Climate change is very different in that all the emissions reductions and clean tech development can add up to solving the problem. AI Alignment is much more all or nothing. For the analogy to hold, it would be like emissions rising on a Moore’s Law (or faster) trajectory, and the threshold for runaway climate change falling each year (cf. algorithm improvements / hardware overhang), to a point where even a single startup company’s emissions (OpenAI; X.AI) could cause the end of the world.
Re ARC Evals, on the flip side, they aren’t factoring in humans doing things that make things worse: ChatGPT plugins, AutoGPT, BabyAGI, ChaosGPT etc. all show that this is highly likely to happen!
We may never get a Fire Alarm of sufficient intensity to jolt everyone into high gear. But I think GPT-4 is it for me and many others. I think this is a Risk Aware Moment (Ram).
What kind of a breakthrough are you envisaging? How do we get from here to 100% watertight alignment of an arbitrarily capable AGI?
Scalable alignment is the main route to aligning a smarter intelligence.
Now, Pretraining from Human Feedback showed that at least for one of the subproblems of alignment, outer alignment, we managed to make the AI more aligned as it gets more data.
If this generalizes, it’s huge news, as it implies we can at least align an AI’s goals with human goals as we get more data. This matters because it means that scalable alignment isn’t as doomed as we thought.
Re ARC Evals, on the flip side, they aren’t factoring in humans doing things that make things worse: ChatGPT plugins, AutoGPT, BabyAGI, ChaosGPT etc. all show that this is highly likely to happen!
The point is that ARC Evals will mostly be an upper bound, not a lower bound, given that they generally make very optimistic assumptions for the AI under testing. Maybe they’re right, but the key here is that their results are closer to a maximum for an AI’s capabilities than a minimum, which means that the most likely outcome is a smaller real-world impact, though there’s a possibility that the ARC Evals are close to what happens in real life.
It’s possible for an early takeoff of AI to happen in 2 years; I just don’t consider that possibility very likely right now.
Generalizing is one thing, but how can scalable alignment ever be watertight? Have you seen all the GPT-4 jailbreaks!? How can every single one be patched using this paradigm? There needs to be an ever decreasing number of possible failure modes, as power level increases, to the limit of 0 failure modes for a superintelligent AI. I don’t see how scalable alignment can possibly work that well.
OpenAI says in their GPT-4 release announcement that “GPT-4 responds to sensitive requests (e.g., medical advice and self-harm) in accordance with our policies 29% more often.” A 29% reduction of harm. This is the opposite of reassuring when thinking about x-risk.
(And all this is not even addressing inner alignment!)