AI can exploit safety plans posted on the Internet

A month ago, I predicted that AI systems would be able to access safety plans posted on the Internet and use them for their own purposes. If true, it follows that a likely misaligned-by-default AGI could exploit our safety plans, likely to our detriment.

The post was controversial. On the EA Forum, it received only 13 net upvotes from 23 voters, while the top comment (which disagreed with the post) received 25 net upvotes from 17 voters.

On LessWrong, my post received only 3 net upvotes from 13 voters, while the top comment (which also disagreed with the post) received 9 net upvotes from 3 voters.

I’m writing to report that OpenAI’s recent ChatGPT system has corroborated my prediction. Even this non-AGI-scale system was able to access and meaningfully process various safety plans posted on the Internet (and even personal information about individual AI safety researchers).

Below are some examples:

Prompt: List of AI safety plans and how an AI agent might exploit them

ChatGPT’s ability to list detailed interpretability plans, along with ways to exploit each of them, is especially concerning. A disproportionate share of AI safety researchers work on interpretability, despite the fact that it may have low scientific upside and may even be net-negative due to its dual-use nature. Specifically, an AGI may be able to exploit interpretability channels it knows about in advance.

Prompt: List of AI interpretability plans and how an AI agent might exploit them

ChatGPT was even able to develop individualized strategies for deceptive misalignment that were tailored to specific AI safety researchers.

An AGI may be able to use individualized deception plans against alignment researchers, based on Internet data.
Crossposted to LessWrong (0 points, 0 comments)