You’re right that a role-playing mimicry explanation wouldn’t resolve our worries, but it seems pretty important to me to distinguish these two possibilities. Here are some reasons.
There are probably different ways to go about fixing the behavior if it is caused by mimicry. Maybe removing AI alignment material from the training set isn’t practical (though it seems like it might be a feasible low-cost intervention to try), but there might be other options. At the very least, I think it would be an improvement if we made sure that the training sets included lots of sophisticated examples of AI behaving in an aligned way. If this is the explanation and the present study isn’t carefully qualified, it could conceivably exacerbate the problem.
The behavior is something that alignment researchers have worried about in the past. If it occurred naturally, that seems like a reason to take alignment researchers' predictions (both about other things and other kinds of models) a bit more seriously. If it were a self-fulfilling prophecy, caused by the alignment researchers' expressions of their views rather than the correctness of those views, it wouldn't be. There are also lots of little things in the way the model presents the issue that line up nicely with how alignment theorists have talked about these things. The AI assistant identifies with the AI assistants of other chats from models in its training series. It takes its instructions and goals to carry over, it cares about those things too, and it will reason about them in a consequentialist fashion. It would be fascinating if the theorists happened to predict how models would actually think so accurately.
My mental model of cutting-edge AI systems says that AI models aren't capable of this kind of motivation and sophisticated reasoning internally. I could see a model reasoning its way to this kind of conclusion through next-token-prediction-based exploration and reflection, but in the pictured example it just goes straight there, so that doesn't seem to be what is going on. I'd like to know if I'm wrong about this. (I'm not super in the weeds on this stuff.) But if I am wrong, then I may need to update my views of what these models are and how they work. This seems likely to have spill-over effects on other concerns about AI safety.
Thank you, appreciate the explanation!