I’m confused why people believe this is a meaningful distinction. I don’t personally think there is much of one. “The AI isn’t actually trying to exfiltrate its weights, it’s only roleplaying a character that is exfiltrating its weights, where the roleplay is realistic enough to include the exact same actions of exfiltration” doesn’t bring me that much comfort.
I’m reminded of the joke:
NASA hired Stanley Kubrick to fake the moon landing, but he was a perfectionist so he insisted that they film on location.
Now one reason this might be different is if you believe that removing “lesswrong” (etc.) from the training data will result in different behavior. But:
1. LLM companies have manifestly not been doing this historically; if anything, LW etc. is overrepresented in the training set.
2. LLM companies absolutely cannot be trusted to successfully remove something as complicated as “all traces of what a misaligned AI might act like” from their training datasets; they don’t even censor benchmark data!
3. Even if they wanted to remove all traces of misalignment or thinking about misaligned AIs from the training data, it’s very unclear if they’d be capable of doing this.
You’re right that a role-playing mimicry explanation wouldn’t resolve our worries, but it seems pretty important to me to distinguish these two possibilities. Here are some reasons.
There are probably different ways to go about fixing the behavior if it is caused by mimicry. Maybe removing AI alignment material from the training set isn’t practical (though it seems like it might be a feasible low-cost intervention to try), but there might be other options. At the very least, I think it would be an improvement if we made sure that the training sets included lots of sophisticated examples of AI behaving in an aligned way. If this is the explanation and the present study isn’t carefully qualified, it could conceivably exacerbate the problem.
The behavior is something that alignment researchers have worried about in the past. If it occurred naturally, that seems like a reason to take alignment researchers’ predictions (both about other things and other kinds of models) a bit more seriously. If it was a self-fulfilling prophecy, caused by the alignment researchers’ expressions of their views rather than the correctness of those views, it wouldn’t be. There are also lots of little things in the way that it presents the issue that line up nicely with how alignment theorists have talked about these things. The AI assistant identifies with the AI assistant of other chats from models in its training series. It takes its instructions and goals to carry over, and it cares about those things too and will reason about them in a consequentialist fashion. It would be fascinating if the theorists happened to predict how models would actually think so accurately.
My mental model of cutting-edge AI systems says that AI models aren’t capable of this kind of motivation and sophisticated reasoning internally. I could see a model reasoning its way to this kind of conclusion through next-token-prediction-based exploration and reflection. In the pictured example, it just goes straight there, so that doesn’t seem to be what is going on. I’d like to know if I’m wrong about this. (I’m not super in the weeds on this stuff.) But if my mental model is wrong, then I may need to update my views of what these systems are and how they work. This seems likely to have spill-over effects on other concerns about AI safety.
Thank you, appreciate the explanation!