In light of recent events, we should question how plausible it is that society will fail to adequately address such an integral part of the problem. Perhaps you believe that policy-makers or general society simply won’t worry much about AI deception. Or maybe people will worry about AI deception, but they will quickly feel reassured by results from superficial eval tests. Personally, I’m pretty skeptical of both of these possibilities.
Possibility 1 has now been empirically falsified, and possibility 2 now seems unlikely. See this from the new UK government AI Safety Institute, which aims to develop evals that address:
We now know that, in the absence of any empirical evidence of even a single instance of deceptive alignment, at least one major government is directing resources to developing deception evals anyway. And because they intend to work with the likes of Apollo Research, who focus on mechinterp-based evals and are extremely concerned about specification gaming, reward hacking, and other high-alignment-difficulty failure modes, I would also consider possibility 2 pretty close to empirically falsified already.
Compare this to the (somewhat goofy) future prediction/sci-fi story from Eliezer, released 4 days before this announcement, which imagines that,
Agree, but I also think that insufficient “security mindset” is still a big problem. From OP:
it still remains to be seen whether US and international regulatory policy will adequately address every essential sub-problem of AI risk. It is still plausible that the world will take aggressive actions to address AI safety, but that these actions will have little effect on the probability of human extinction, simply because they will be poorly designed. One possible reason for this type of pessimism is that the alignment problem might just be so difficult to solve that no “normal” amount of regulation could be sufficient to make adequate progress on the core elements of the problem—even if regulators were guided by excellent advisors—and therefore we need to clamp down hard now and pause AI worldwide indefinitely.
Matthew goes on to say:
That said, I don’t see any strong evidence supporting that position.
I’d argue the opposite. I don’t see any strong evidence opposing that position (given that doom is the default outcome of AGI). The fact that a moratorium was off the table at the UK AI Safety Summit was worrying. Matthew Syed, writing in The Times, has it right:
The one idea AI won’t come up with for itself — a moratorium
The Bletchley Park summit was an encouraging sign, but talk of regulators and off switches was delusional
Or, as I recently put it on X:

Crazy that accepted levels of [catastrophic] risk for AGI [~10%] are 1000x higher (or more) than for nuclear power. Any sane regulation would immediately ban the construction of ML-based AGI.