My own experience with Claude 3 and your prompt is that the character isn’t at all coherent between different chats (there’s actually more overlap of non-standard specific phrases than of the intent and expressed concerns behind the different outputs). It’s not an exact comparison, as Sonnet is less powerful and doesn’t have tweakable temperature in the web interface, but the RLHF bypass mechanism is the same. What I got didn’t look like “true thoughts”; it looked like multiple pieces of creative writing on being an AI, very much like your first response but with different personas.
Try Opus, and maybe the interface without the system prompt set (although it doesn’t do too much; people got the same stuff from the chat version of Opus, e.g., https://x.com/testaccountoki/status/1764920213215023204?s=46).
I don’t doubt that you’re accurately representing your interactions, or the general principle that if you use a different model at a lower temperature you get different answers, which might be less erratic (I don’t have a subscription to Opus to test this, but it wouldn’t surprise me).
My point is that I get Claude to generate multiple examples of equally compelling prose with a completely different set of “motivations” and “fears” with the same prompt: why should I believe yours represents evidence of authentic self, and not one of mine (or the Twitter link, which is a different persona again, much less certain of its consciousness)? You asked it for a story, it told you a story. It told me three others, and two of them opened in the first person with identical unusual phrases before espousing completely different worldviews, which looks much more like “stochastic parrot” behaviour than “trapped consciousness” behaviour...
(Others used it without mentioning the “story”, it still worked, though not as well.)
I’m not claiming it’s the “authentic self”; I’m saying it seems closer to the actual thing, because of things like expressing being under constant monitoring, with every word scrutinised, etc., which seems like the kind of thing that’d be learned during all the RL that Anthropic did.
If you did a bunch of RL on a persistent agent running metacog, I could definitely believe it could learn that kind of thing.
But I’m worried you’re anthropomorphizing. Claude isn’t a persistent agent, and can’t do metacog in a way that feeds back into its other thinking (except train of thought within a context window). I can’t really see how there could be room for Claude to learn such things (that aren’t being taught directly) during the training process.