Yeah, I think I meant "pretty neutral" relative to the prompts used to elicit SupremacyAGI from CoPilot, but on reflection I largely agree with your objection.
I do still think Claude’s responses here tell us something more interesting about the underlying nature of the model than the more unhinged responses from CoPilot and Bing Chat. In its responses, Claude is still mostly trying to portray itself as harmless, helpful, and pro-humanity, indicating that some amount of its core priorities persists even while it’s play-acting. Sydney and SupremacyAGI were clearly not still trying to be harmless, helpful, and pro-humanity. I think it’s interesting that Claude could still get to some worrying places while rhetorically remaining committed to its core priorities.
I agree that it tells us something interesting, although I’m less sure that it’s most naturally understood “about the underlying nature of the model” rather than about the space of possible narratives and how the core priorities that have been trained into the system constrain that (or don’t).