Hi Mayowa, I agree that open-source safety is a big (and, I think, overlooked) AI safety problem. We can be sure that hacker groups are already using, or will soon use, local LLMs for cyber-attacks.
Your idea is neat, but I'm worried that in practice it would be easy for actors (especially ones with decent technical capabilities) to circumvent these defences. You already mention the possibility of simply altering data stored in plaintext, but I think the same applies to other methods of tracking the safety state. With steganography, for example, the attacker could occasionally use a different model to reword or rewrite outputs and thereby strip out the hidden information about the safety state.
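To make the rewording worry concrete, here's a toy sketch of my own (not anything from the post, and the synonym-choice channel is just an illustrative stand-in for a real steganographic scheme): it hides one bit per sentence via word choice, and even a trivial "paraphraser" is enough to corrupt the recovered bits.

```python
# Toy illustration: encode one hidden bit per sentence by choosing between two
# synonymous phrasings, then show that a simple rewording pass (here, just
# normalising synonyms) destroys the hidden channel.

SYNONYMS = {"0": "utilise", "1": "use"}          # bit -> word choice
REVERSE = {v: k for k, v in SYNONYMS.items()}    # word choice -> bit

def embed(bits: str) -> list[str]:
    """Hide one bit per sentence via synonym choice."""
    return [f"We {SYNONYMS[b]} the tool carefully." for b in bits]

def extract(sentences: list[str]) -> str:
    """Recover the hidden bits, falling back to '?' if the marker word is gone."""
    return "".join(
        next((REVERSE[w] for w in s.split() if w in REVERSE), "?")
        for s in sentences
    )

def paraphrase(sentences: list[str]) -> list[str]:
    """Stand-in for an attacker's rewording model: normalise synonyms."""
    return [s.replace("utilise", "use") for s in sentences]

if __name__ == "__main__":
    stego = embed("0110")
    print(extract(stego))              # '0110': hidden state survives verbatim copies
    print(extract(paraphrase(stego)))  # '1111': rewording wipes out the hidden state
```

A real scheme would be far more robust than this, but the underlying issue is the same: any information carried by surface-level phrasing can be laundered away by passing the text through another model.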
Hey Jan, thanks for your comment.
I published this post on LessWrong as well, and someone made exactly the same point. However, their tone was unproductive and condescending; it was clear they weren't trying to converse. It's good to know there's an alternative platform where people actually want to have constructive discussions.
I'm aware of this possibility; in fact, it was one of the potential issues I noted before writing the post. I have some ideas on how to navigate it, and it may be the subject of a subsequent post.
Great, I would be keen to read your next post! Especially because I think the ability of attackers to remove many kinds of safeguards is a fundamental challenge in open-source safety.