AI NotKillEveryoneism is the first-order approximation of x-risk work.
I think we probably will manage to make enough AI alignment progress to avoid extinction. AI capabilities advancement seems to be on a relatively good path (less foomy) and AI Safety work is starting to make real progress for avoiding the worst outcomes (although a new RL paradigm or illegible/unfaithful CoT could make this scarier).
Yet gradual disempowerment risks seem extremely hard to mitigate, very important and pretty neglected. The AI Alignment/Safety bar for good outcomes could be significantly higher than avoiding extinction.
Most fundamentally, human welfare currently seems highly contingent on our productivity, and decoupling the two could be very hard.
AI Safety work is starting to make real progress for avoiding the worst outcomes
What makes you think this? Every existing technique is statistical in nature (due to the nature of the deep learning paradigm), and none is even approaching three 9s of safety, whereas we need something like 13 9s if we are going to survive more than a few years of ASI.
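To make the arithmetic behind the 9s concrete, here is a minimal sketch. It assumes (my simplification, not something established in this thread) that an ASI takes some large number of independent safety-critical actions, each of which fails catastrophically with probability 10^-k for k 9s of safety. The function name and the 1e12 event count are purely illustrative.

```python
import math

def survival_probability(nines: float, n_events: float) -> float:
    """Chance of zero catastrophic failures across n_events independent
    events, each failing with probability 10**-nines (assumed model)."""
    p_fail = 10.0 ** (-nines)
    # log1p keeps precision when p_fail is tiny; exp underflows to 0.0
    # gracefully when the exponent is hugely negative.
    return math.exp(n_events * math.log1p(-p_fail))

if __name__ == "__main__":
    # Hypothetical workload of 1e12 safety-critical actions.
    for nines in (3, 6, 9, 13):
        print(f"{nines} nines -> P(survive) = {survival_probability(nines, 1e12):.4f}")
```

Under these assumptions, three 9s gives essentially zero chance of getting through 1e12 events, while 13 9s gives roughly a 90% chance. The conclusion is very sensitive to the assumed number of events and to the independence assumption.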
AI capabilities advancement seems to be on a relatively good path (less foomy)
I also don’t see how it’s less foomy. SWE-bench and ML researcher automation are still improving. What happens when the models are drop-in replacements for top researchers?
Yet gradual disempowerment risks seem extremely hard to mitigate
What is the eventual end result after total disempowerment? Extinction, right?
What makes you think this? Every existing technique is statistical in nature (due to the nature of the deep learning paradigm), and none is even approaching three 9s of safety, whereas we need something like 13 9s if we are going to survive more than a few years of ASI.
I also don’t see how it’s less foomy. SWE-bench and ML researcher automation are still improving. What happens when the models are drop-in replacements for top researchers?
The gap between weak AGI and strong AGI/ASI timeline predictions seems to have ticked up a bit. It doesn’t seem like intra-token reasoning/capabilities are scaling as hard as I’d previously feared. The models themselves are not getting so scarily capable and agentic in each forward pass; instead, we are increasingly eliciting those capabilities/agency in context, with the models remaining myopic and largely indifferent.
If the new paradigm holds, with a significant focus on scaling inference, it seems both less aggressive (in terms of scaling intelligence) and more conducive to ‘passing’ safety.
The current paradigm likely places a much lower burden on hard interpretability than I expected ~1 year ago; it feels much more like a verification problem than a full solve. With current rates of interpretability progress (and AI accelerating safety ~in line with capabilities), we could actually be able to verify that a CoT is faithful and legible, and that might be ~sufficient.
Agreed. I still think there’s a reasonable chance that ML research does fall within the set of capabilities that quickly reach superhuman levels, and foom is still on the cards. Also, more RL in general is just inherently quite scary.
The 9s-of-safety framing makes sense from a control perspective, but I think there’s another angle: the possibility of a model that is aligned enough to actually not want to pursue human extinction.
What is the eventual end result after total disempowerment? Extinction, right?
Potentially, but I think there’s still room for scenarios where humans are broadly disempowered yet not extinct: worlds where we get a passing grade on safety, where we effectively avoid strongly agentic systems and achieve sufficient alignment such that human lives are valued, but fall short of the full fidelity necessary for a flourishing future.
Still, this point has updated me slightly; I’ve reduced my disagreement.
My model looks something like this:
There are a bunch of increasingly hard questions on the Alignment Test. We need to get enough of the core questions right to avoid the ASI → everyone quickly dies scenario. This is the ‘passing grade’. There are some bonus/extra credit questions that we need to also get right to get an A (a flourishing future).
We don’t know exactly which questions will be included or in which section. We also don’t know the thresholds for these grades and we are (rightly) focusing the vast majority of our efforts on the expected fundamental questions to maximise our chance of the passing grade.
Relative to ~1 year ago, the ‘passing grade’ for alignment feels a bit easier and we’ve got a bit more study time. I’ve also become aware of just how much more difficult the A grade might be, and that a pass might not be very valuable at all. I don’t think anything has changed there; I was just somewhat ignorant of risks from gradual disempowerment.
It might make sense to dedicate, say, 5-20% of our effort to studying for questions we expect in the bonus/extra credit section. I think we currently do less than that (perhaps 1-5%). So I think the vast majority of the effort should be spent on avoiding extinction, but I’m less sure about effort at the margin.
There are a bunch of increasingly hard questions on the Alignment Test. We need to get enough of the core questions right to avoid the ASI → everyone quickly dies scenario. This is the ‘passing grade’. There are some bonus/extra credit questions that we need to also get right to get an A (a flourishing future).
I think the bonus/extra credit questions are part of the main test—if you don’t get them right everyone still dies, but maybe a bit more slowly.
All the doom flows through the cracks of imperfect alignment/control. And we can asymptote toward, but never reach, existential safety[1].
Of course this applies to all other x-risks too. It’s just that ASI x-risk is very near term and acute (in absolute terms, and relative to all the others), and we aren’t even starting in earnest with the asymptoting yet (and likely won’t if we don’t get a Pause).
Digital sentience could also dominate this equation.