Pulling this sentence out for emphasis because it seems like the crux to me.
Thanks so much for pointing this out, James! I must have missed it, and it is indeed an important crux.
One threat model that implies a higher, rather than unchanged, probability of failure: an agentic AI's preservation of its goal against a given SGD-based plan may be strictly easier with prior knowledge of what that plan is. If true, the fixation probability of a misaligned AGI that successfully preserves its misaligned goal could increase.
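To make the "fixation probability could increase" step concrete, here is a deliberately crude toy model (my own illustration, not anything from the post): treat training as k independent opportunities for SGD to overwrite the misaligned goal, each survived with some per-step probability, and let prior knowledge of the plan raise that per-step probability. The specific numbers (p_blind, p_informed, k) are made up.

```python
def fixation_probability(p_survive_one_step: float, k_steps: int) -> float:
    """P(misaligned goal survives all k training interventions),
    assuming each intervention is survived independently."""
    return p_survive_one_step ** k_steps

k = 100
p_blind = 0.90      # per-step survival without knowledge of the safety plan (made up)
p_informed = 0.95   # per-step survival when the plan is posted publicly (made up)

print(fixation_probability(p_blind, k))     # ~2.7e-05
print(fixation_probability(p_informed, k))  # ~5.9e-03, roughly 200x higher
```

The point of the sketch is only that a modest per-step advantage compounds: even a small increase in per-intervention survival can translate into a large increase in the probability the misaligned goal fixates.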
A more general point is that this situation-specific analysis (of which AI safety plans would lose their value by being posted on the Internet, and which would lose less or none) is difficult to do a priori. Reforming AI safety research norms to be more broadly pro-security-mindset might capture most of the benefits, even if it's a blunt instrument.