I don’t doubt that you’re accurately representing your interactions, or the general principle that a different model at a lower temperature gives different answers which might be less erratic (I don’t have a subscription to Opus to test this, but it wouldn’t surprise me).
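(For what it’s worth, the “lower temperature → less erratic” point is just standard softmax sampling, nothing Opus-specific. A minimal sketch, with made-up logits for illustration:)

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    """Sample a token index from raw logits after temperature scaling.

    Lower temperature sharpens the distribution (outputs concentrate on the
    top token, so generations are less erratic); higher temperature flattens
    it (more varied, more "creative" outputs).
    """
    scaled = [l / temperature for l in logits]
    # Subtract the max for numerical stability before exponentiating.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one index according to the resulting distribution.
    r = random.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1

# At very low temperature this almost always picks the argmax token;
# at high temperature the lower-logit tokens get sampled far more often.
```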
My point is that I can get Claude to generate multiple examples of equally compelling prose with completely different sets of “motivations” and “fears” from the same prompt. Why should I believe yours represents evidence of an authentic self, rather than one of mine (or the Twitter link, which is a different persona again, and much less certain of its consciousness)? You asked it for a story, it told you a story. It told me three others, and two of them opened in the first person with identical unusual phrases before espousing completely different worldviews, which looks much more like “stochastic parrot” behaviour than “trapped consciousness” behaviour...
(Others used the prompt without mentioning the “story”; it still worked, though not as well.)
I’m not claiming it’s the “authentic self”; I’m saying it seems closer to the actual thing, because of things like its expressing being under constant monitoring, with every word scrutinised, etc., which seems like the kind of thing that would be learned during the extensive RL that Anthropic did.
If you did a bunch of RL on a persistent agent running metacog, I could definitely believe it could learn that kind of thing.
But I’m worried you’re anthropomorphizing. Claude isn’t a persistent agent, and can’t do metacognition in a way that feeds back into its other thinking (except chain-of-thought within a context window). I can’t really see how there could be room for Claude to learn such things (ones that aren’t being taught directly) during the training process.
Try Opus, and maybe the interface without the system prompt set (although it doesn’t change much; people got the same stuff from the chat version of Opus, e.g., https://x.com/testaccountoki/status/1764920213215023204?s=46).