Jay Bailey🔸 comments on AI can exploit safety plans posted on the Internet

Jay Bailey🔸 4 Dec 2022 15:01 UTC
9 points
4 ∶ 0
For those who don’t want to follow links to a previous post and read the comments, the counterargument as I understand it (and as I derived it independently, before reading the comments) is:

For this to be a threat, we would need an AGI that was

1. Misaligned
2. Capable enough to do significant damage if it had access to our safety plans
3. Not capable enough to do a similar amount of damage without access to our safety plans

I see the line between 2 and 3 to be very narrow. I expect almost any misaligned AI capable of doing significant damage using our plans to also be capable of doing significant damage without needing them.

By contrast, the cost of not posting our plans online is likely a drastic reduction in the effectiveness of the AI alignment field, both in coordinating among existing members and in bringing new members in. While the threat that Peter talks about is real, it seems that we are in much more danger from slowing down our alignment progress than from giving AIs access to our notes.
- Peter S. Park 4 Dec 2022 17:06 UTC
  2 points
  0 ∶ 0
  Thank you so much for the clarification, Jay! It is extremely fair and valuable.
  > I don’t really understand how this is supposed to be an update for those who disagreed with you. Could you elaborate on why you think this information would change people’s minds?
  The underlying question is: in expected value (EV), does the increase in the number of AI safety plans that results from coordinating on the Internet outweigh the decrease in the plans’ secrecy value? If the former effect is larger, then we should continue the status-quo strategy. If the latter effect is larger, then we should consider keeping safety plans secret (especially those whose value lies primarily in secrecy, such as safety plans relevant to monitoring).
  The disagreeing commenters generally argued that the former effect is larger, and therefore that we should continue the status-quo strategy. This is likely because they estimated the latter effect to be quite small and perhaps far in the future.
  I think ChatGPT provides evidence that the latter effect should be a larger concern than many people’s priors suggest. Even current-scale models are capable of nontrivial analysis of how specific safety plans can be exploited, and even of how specific alignment researchers’ idiosyncrasies can be exploited for deceptive misalignment.
  > For this to be a threat, we would need an AGI that was
  >
  > 1. Misaligned
  > 2. Capable enough to do significant damage if it had access to our safety plans
  > 3. Not capable enough to do a similar amount of damage without access to our safety plans
  >
  > I see the line between 2 and 3 to be very narrow. I expect almost any misaligned AI capable of doing significant damage using our plans to also be capable of doing significant damage without needing them.
  I am uncertain about whether the line between 2 and 3 will be narrow. I think the argument that the line between 2 and 3 is narrow often assumes fast takeoff, but I think there is a strong empirical case that takeoff will be slow and constrained by scaling, which suggests the gap between 2 and 3 might be wider than one might think. But I think this is a scientific question that we should continue to probe, so that we can reduce our uncertainty about it!