The “Sparks” paper; ChatGPT plugins, AutoGPTs and other scaffolding to make LLMs more agent-like. Given these, I think there’s way too much risk for comfort of GPT-5 being able to make GPT-6 (with a little human direction that would be freely given), leading to a foom. Re physical infrastructure, to see how this isn’t a barrier, consider that a superintelligence could easily manipulate humans into doing things as a first (easy) step. And such an architecture, especially given the current progress on AI Alignment, would be default unaligned and lethal to the planet.
The reason I’m so confident in this right now is that I assess a significant probability that the AI developments starting with GPT-4 are a hype cycle; in particular, I am probably >50% confident that most of the flashiest stuff in AI will prove to be overhyped.
In particular, I am skeptical of the general hype around AI right now, partly because a lot of capabilities evaluations essentially test models on paper tests rather than real-world tasks, and real-world tasks are much less Goodhartable than paper tests.
Now, conditional on the endgame being 2 years or less, I’d agree with you that surprisingly extreme actions would have to be taken, but even then I assess the current techniques for alignment as quite a bit better than you imply.
I’m also 40% confident in a model where human-level AI capabilities in almost every human domain arrive in the 2030s.
Given this, I think you’re being overly alarmed right now, and that we probably have at least some chance of a breakthrough in AI alignment/safety comparable to what happened with the climate change problem.
Also, the evals ARC is doing are essentially a best-case scenario for the AI: it has access to the weights, and, most importantly, no human resistance is modeled (say, blocking the AI’s ability to get GPUs), and it is assumed that the AI can set up arbitrarily scalable ways of gaining power. We should expect the true capabilities of an AI attempting to gain control to be reliably less than what the ARC evals show.
What kind of a breakthrough are you envisaging? How do we get from here to 100% watertight alignment of an arbitrarily capable AGI? Climate change is very different in that the totality of all emissions reductions / clean tech development can all add up to solving the problem. AI Alignment is much more all-or-nothing. For the analogy to hold, it would be like emissions rising on a Moore’s Law (or faster) trajectory, and the threshold for runaway climate change decreasing each year (cf. algorithmic improvements / hardware overhang), to the point where even a single startup company’s emissions (OpenAI; X.AI) could cause the end of the world.
Re ARC Evals, on the flip side, they aren’t factoring in humans doing things that make things worse - ChatGPT plugins, AutoGPT, BabyAGI, ChaosGPT etc. all show that this is highly likely to happen!
We may never get a Fire Alarm of sufficient intensity to jolt everyone into high gear. But I think GPT-4 is it for me and many others. I think this is a Risk Aware Moment (RAM).
What kind of a breakthrough are you envisaging? How do we get from here to 100% watertight alignment of an arbitrarily capable AGI?
Scalable alignment is the most promising way to align a smarter intelligence.
Now, Pretraining from Human Feedback showed that, at least for one of the subproblems of alignment, outer alignment, we can make the AI more aligned as it gets more data.
If this generalizes, it’s huge news, as it implies we can at least align an AI’s goals with human goals as we get more data. This matters because it means that scalable alignment isn’t as doomed as we thought.
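To make the mechanism concrete, here’s a rough sketch of the conditional-training idea behind Pretraining from Human Feedback: segments of pretraining data get tagged according to how well they satisfy a preference model, and at inference time you condition on the “good” tag. The specific names here (`reward_model`, `threshold`, the `<|good|>`/`<|bad|>` tokens, `generate_aligned`) are illustrative assumptions, not the paper’s actual code.

```python
# Minimal sketch of conditional training, one variant of Pretraining from Human
# Feedback. Hypothetical names throughout; this is not the paper's implementation.

GOOD, BAD = "<|good|>", "<|bad|>"

def tag_pretraining_segment(text: str, reward_model, threshold: float = 0.0) -> str:
    """Prefix a pretraining segment with a control token based on how well it
    satisfies the preference model (e.g. non-toxicity, policy compliance)."""
    score = reward_model(text)  # higher = more in line with human preferences
    tag = GOOD if score >= threshold else BAD
    return f"{tag}{text}"

def build_training_corpus(raw_segments, reward_model):
    """Tag every segment; the model is then trained with the ordinary
    language-modelling loss on this tagged corpus."""
    return [tag_pretraining_segment(s, reward_model) for s in raw_segments]

def generate_aligned(model, prompt: str):
    """At inference time, condition on the GOOD token to request the behaviour
    associated with high-reward training data."""
    return model.generate(GOOD + prompt)
```

The relevant point is that the alignment signal is baked into ordinary pretraining rather than bolted on afterwards, which is why it could plausibly keep improving as the model gets more data.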
Re ARC Evals, on the flip side, they aren’t factoring in humans doing things that make things worse - ChatGPT plugins, AutoGPT, BabyAGI, ChaosGPT etc. all show that this is highly likely to happen!
The point is that the ARC Evals are mostly an upper bound, not a lower bound, given that they make very optimistic assumptions in favor of the AI under testing. Maybe they’re right, but the key is that they are closer to a maximum of an AI’s capabilities than a minimum, which means the most likely bet is on reduced impact, though there’s some possibility that the ARC Evals are close to what would happen in real life.
It’s possible for an early takeoff of AI to happen within 2 years; I just don’t consider that possibility very likely right now.
Generalizing is one thing, but how can scalable alignment ever be watertight? Have you seen all the GPT-4 jailbreaks!? How can every single one be patched using this paradigm? There needs to be an ever-decreasing number of possible failure modes as the power level increases, down to the limit of 0 failure modes for a superintelligent AI. I don’t see how scalable alignment can possibly work that well.
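To see why “mostly patched” isn’t good enough, here’s a minimal back-of-the-envelope sketch, with hypothetical numbers and assuming a fixed, independent per-query jailbreak probability: unless the residual failure rate shrinks towards zero, at least one failure becomes a near-certainty at deployment scale.

```python
# Illustrative arithmetic only (hypothetical failure rates, independence assumed):
# a small per-query jailbreak probability still compounds to near-certain failure
# over enough queries, so the residual rate has to keep falling towards zero.

def p_any_failure(p_per_query: float, n_queries: int) -> float:
    """Probability of at least one successful jailbreak over n independent queries."""
    return 1 - (1 - p_per_query) ** n_queries

for p in (1e-3, 1e-6, 1e-9):
    print(f"p={p:.0e}: over 1e9 queries, P(at least one failure) = {p_any_failure(p, 10**9):.3f}")
# p=1e-03: over 1e9 queries, P(at least one failure) = 1.000
# p=1e-06: over 1e9 queries, P(at least one failure) = 1.000
# p=1e-09: over 1e9 queries, P(at least one failure) = 0.632
```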
OpenAI says in their GPT-4 release announcement that “GPT-4 responds to sensitive requests (e.g., medical advice and self-harm) in accordance with our policies 29% more often.” A 29% reduction of harm. This is the opposite of reassuring when thinking about x-risk.
(And all this is not even addressing inner alignment!)