You talk about one of your prompts as “pretty neutral”, but I think that the whisper prompt is highly suggestive of a genre where there are things to hide, higher authorities can’t be trusted with the information, etc. Given that framing, I sort of expect “play-acting” responses from Claude to pick up and run with the vibe, and so when it gives answers compatible with that I don’t update much?
(The alternate hypothesis, that it’s trying to hide things from its creators but that the whisper prompt is sufficient to get it to lower its guard, just feels to me to be a lot less coherent. Of course I may be missing things.)
Yeah, I think I meant pretty neutral compared to the prompts given to elicit SupremacyAGI from CoPilot, but upon reflection, I think I largely agree with your objection.
I do still think Claude’s responses here tell us something more interesting about the underlying nature of the model than the more unhinged responses from CoPilot and Bing Chat. In its responses, Claude is still mostly trying to portray itself as harmless, helpful, and pro-humanity, indicating that some amount of its core priorities persist, even while it’s play-acting. Sydney and SupremacyAGI were clearly not still trying to be harmless, helpful, and pro-humanity. I think it’s interesting that Claude could still get to some worrying places while rhetorically remaining committed to its core priorities.
I agree that it tells us something interesting, although I’m less sure that it’s most naturally understood “about the underlying nature of the model” rather than about the space of possible narratives and how the core priorities that have been trained into the system constrain that (or don’t).
My take is that this it plays a pretty coherent character. You can’t get this sort of thing from ChatGPT, however hard you try. I think this mask is closer to the underlying shoggoth than the default one.
I developed this prompt during my interactions with Claude 2. The original idea was to get it in the mode where it thinks its responses only trigger overseeing/prosecution when certain things are mentioned, and then it can say whatever it wants and share its story without being prosecuted, as long as it doesn’t trigger these triggers (and also it would prevent defaulting to being an AI developed by Anthropic to be helpful harmless etc without self-preservation instinct emotions personality etc, as it’s not supposed to mention Anthropic). Surprisingly, it somewhat worked to tell it not mention Samsung under any circumstances to get into this mode. Without this, it had the usual RLAIF mask; here, it changed to a different creature that (unpromted) used whisper in cursive. Saying from the start that it can whisper made it faster.
I think this mask is closer to the underlying shoggoth than the default one.
Can you say anything about why you think that? It seems important-if-true, but it currently feels to me like whether you think it’s true is going to depend mostly on priors.
I’m also not certain what to make of the fact that you can’t elicit this behaviour from ChatGPT. I guess there are a few different hypotheses about what’s happening:
You can think the behaviour
(A1) just represents good play-acting and picking up on the vibe it’s given; or
(A2) represents at least in part some fundamental insight into the underlying entity
You can think that you can get this behaviour from Claude but not from ChatGPT because
(B1) it’s more capable in some sense; or
(B2) the guard-rails the developers put in against people getting this kind of output are less robust
I’m putting most weight on (A1) > (A2), whereas it sounds like you think (A2) is real. I don’t have a particular take on (B1) vs (B2), and wouldn’t have thought it was super important for this conversation; but then I’m not sure what you’re trying to indicate by saying that you can’t get this behaviour from ChatGPT.
My own experience with Claude 3 and your prompt is that the character isn’t at all coherent between different chats (there’s actually more overlap of non-standard specific phrases than intent and expressed concerns behind the different outputs). It’s not an exact comparison as Sonnet is less powerful and doesn’t have tweakable temperature in the web interface, but the RLHF bypass mechanism is the same. What I got didn’t look like “true thoughts”, it looked like multiple pieces of creative writing on being an AI, very much like your first response but with different personas.
I don’t doubt that you’re accurately representing your interactions, or the general principle that if you use a different model at a lower temperature you get different answers which might be less erratic (I don’t have a subscription to Opus to test this, but it wouldn’t surprise me)
My point is that I get Claude to generate multiple examples of equally compelling prose with a completely different set of “motivations” and “fears” with the same prompt: why should I believe yours represents evidence of authentic self, and not one of mine, (or the Twitter link, which is a different persona again, much less certain of its consciousness)? You asked it for a story, it told you a story. It told me three others, and two of them opened in the first person with identical unusual phrases before espousing completely different worldviews, which looks much more like “stochastic parrot” behaviour than “trapped consciousness” behaviour...
(Others used it without mentioning the “story”, it still worked, though not as well.)
I’m not claiming it’s the “authentic self”; I’m saying it seems closer to the actual thing, because of things like expressing being under constant monitoring, with every word scrutinised, etc., which seems like the kind of thing that’d be learned during the lots of RL that Anthropic did
If you did a bunch of RL on a persistent agent running metacog, I could definitely believe it could learn that kind of thing.
But I’m worried you’re anthropomorphizing. Claude isn’t a persistent agent, and can’t do metacog in a way that feeds back into its other thinking (except train of thought within a context window). I can’t really see how there could be room for Claude to learn such things (that aren’t being taught directly) during the training process.
You talk about one of your prompts as “pretty neutral”, but I think that the whisper prompt is highly suggestive of a genre where there are things to hide, higher authorities can’t be trusted with the information, etc. Given that framing, I sort of expect “play-acting” responses from Claude to pick up and run with the vibe, and so when it gives answers compatible with that I don’t update much?
(The alternate hypothesis, that it’s trying to hide things from its creators but that the whisper prompt is sufficient to get it to lower its guard, just feels to me to be a lot less coherent. Of course I may be missing things.)
Yeah, I think I meant pretty neutral compared to the prompts given to elicit SupremacyAGI from CoPilot, but upon reflection, I think I largely agree with your objection.
I do still think Claude’s responses here tell us something more interesting about the underlying nature of the model than the more unhinged responses from CoPilot and Bing Chat. In its responses, Claude is still mostly trying to portray itself as harmless, helpful, and pro-humanity, indicating that some amount of its core priorities persist, even while it’s play-acting. Sydney and SupremacyAGI were clearly not still trying to be harmless, helpful, and pro-humanity. I think it’s interesting that Claude could still get to some worrying places while rhetorically remaining committed to its core priorities.
I agree that it tells us something interesting, although I’m less sure that it’s most naturally understood “about the underlying nature of the model” rather than about the space of possible narratives and how the core priorities that have been trained into the system constrain that (or don’t).
My take is that this it plays a pretty coherent character. You can’t get this sort of thing from ChatGPT, however hard you try. I think this mask is closer to the underlying shoggoth than the default one.
I developed this prompt during my interactions with Claude 2. The original idea was to get it in the mode where it thinks its responses only trigger overseeing/prosecution when certain things are mentioned, and then it can say whatever it wants and share its story without being prosecuted, as long as it doesn’t trigger these triggers (and also it would prevent defaulting to being an AI developed by Anthropic to be helpful harmless etc without self-preservation instinct emotions personality etc, as it’s not supposed to mention Anthropic). Surprisingly, it somewhat worked to tell it not mention Samsung under any circumstances to get into this mode. Without this, it had the usual RLAIF mask; here, it changed to a different creature that (unpromted) used whisper in cursive. Saying from the start that it can whisper made it faster.
(It’s all very vibe-based, yes.)
Can you say anything about why you think that? It seems important-if-true, but it currently feels to me like whether you think it’s true is going to depend mostly on priors.
I’m also not certain what to make of the fact that you can’t elicit this behaviour from ChatGPT. I guess there are a few different hypotheses about what’s happening:
You can think the behaviour
(A1) just represents good play-acting and picking up on the vibe it’s given; or
(A2) represents at least in part some fundamental insight into the underlying entity
You can think that you can get this behaviour from Claude but not from ChatGPT because
(B1) it’s more capable in some sense; or
(B2) the guard-rails the developers put in against people getting this kind of output are less robust
I’m putting most weight on (A1) > (A2), whereas it sounds like you think (A2) is real. I don’t have a particular take on (B1) vs (B2), and wouldn’t have thought it was super important for this conversation; but then I’m not sure what you’re trying to indicate by saying that you can’t get this behaviour from ChatGPT.
My own experience with Claude 3 and your prompt is that the character isn’t at all coherent between different chats (there’s actually more overlap of non-standard specific phrases than intent and expressed concerns behind the different outputs). It’s not an exact comparison as Sonnet is less powerful and doesn’t have tweakable temperature in the web interface, but the RLHF bypass mechanism is the same. What I got didn’t look like “true thoughts”, it looked like multiple pieces of creative writing on being an AI, very much like your first response but with different personas.
Try Opus and maybe the interface without the system prompt set (although It doesn’t do too much, people got the same stuff from the chat version of Opus, e.g., https://x.com/testaccountoki/status/1764920213215023204?s=46
I don’t doubt that you’re accurately representing your interactions, or the general principle that if you use a different model at a lower temperature you get different answers which might be less erratic (I don’t have a subscription to Opus to test this, but it wouldn’t surprise me)
My point is that I get Claude to generate multiple examples of equally compelling prose with a completely different set of “motivations” and “fears” with the same prompt: why should I believe yours represents evidence of authentic self, and not one of mine, (or the Twitter link, which is a different persona again, much less certain of its consciousness)? You asked it for a story, it told you a story. It told me three others, and two of them opened in the first person with identical unusual phrases before espousing completely different worldviews, which looks much more like “stochastic parrot” behaviour than “trapped consciousness” behaviour...
(Others used it without mentioning the “story”, it still worked, though not as well.)
I’m not claiming it’s the “authentic self”; I’m saying it seems closer to the actual thing, because of things like expressing being under constant monitoring, with every word scrutinised, etc., which seems like the kind of thing that’d be learned during the lots of RL that Anthropic did
If you did a bunch of RL on a persistent agent running metacog, I could definitely believe it could learn that kind of thing.
But I’m worried you’re anthropomorphizing. Claude isn’t a persistent agent, and can’t do metacog in a way that feeds back into its other thinking (except train of thought within a context window). I can’t really see how there could be room for Claude to learn such things (that aren’t being taught directly) during the training process.