What do you think of efforts like Saffron Huang et al 2025? It’s from a year ago as of this week, so I’d guess Anthropic has developed this line of work further since and integrated it into other workstreams and such.
AI assistants can impart value judgments that shape people’s decisions and worldviews, yet little is known empirically about what values these systems rely on in practice. To address this, we develop a bottom-up, privacy-preserving method to extract the values (normative considerations stated or demonstrated in model responses) that Claude 3 and 3.5 models exhibit in hundreds of thousands of real-world interactions. We empirically discover and taxonomize 3,307 AI values and study how they vary by context. We find that Claude expresses many practical and epistemic values, and typically supports prosocial human values while resisting values like “moral nihilism”. While some values appear consistently across contexts (e.g. “transparency”), many are more specialized and context-dependent, reflecting the diversity of human interlocutors and their varied contexts. For example, “harm prevention” emerges when Claude resists users, “historical accuracy” when responding to queries about controversial events, “healthy boundaries” when asked for relationship advice, and “human agency” in technology ethics discussions. By providing the first large-scale empirical mapping of AI values in deployment, our work creates a foundation for more grounded evaluation and design of values in AI systems.
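For concreteness, here’s my rough sketch of the kind of pipeline the abstract seems to describe: an annotation step that labels the values a response exhibits, then embedding-based clustering to merge near-duplicate labels into a taxonomy. This is my reading, not their actual code; the `extract_values` stub is a toy stand-in for their privacy-preserving LLM annotation step, and the embedding model and clustering threshold are my assumptions.

```python
# Minimal, hypothetical sketch of bottom-up value extraction and
# taxonomization as I read the abstract (not the paper's real pipeline).
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering


def extract_values(response: str) -> list[str]:
    # Stand-in for the paper's privacy-preserving annotation step, which
    # labels the normative considerations a response states or
    # demonstrates. A toy keyword match keeps the sketch runnable.
    cues = {
        "transparen": "transparency",
        "harm": "harm prevention",
        "boundar": "healthy boundaries",
        "agency": "human agency",
    }
    return [value for cue, value in cues.items() if cue in response.lower()]


def taxonomize(responses: list[str], distance_threshold: float = 0.5):
    # Pool value labels across many model responses.
    labels = [v for r in responses for v in extract_values(r)]
    counts = Counter(labels)
    unique = sorted(counts)

    # Embed the labels and merge near-duplicates into clusters; repeating
    # this at coarser thresholds would yield a hierarchy like the paper's
    # multi-level taxonomy of 3,307 values.
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(unique)
    ids = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    ).fit_predict(embeddings)

    clusters: dict[int, list[str]] = {}
    for label, cluster_id in zip(unique, ids):
        clusters.setdefault(cluster_id, []).append(label)
    return clusters, counts
```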
An LLM’s expressed values are not the same thing as its actual values, insofar as it has any.
This paper doesn’t really tell us anything about how ASI values will work. It’s relevant to the immediate problem of making Claude commercially useful and non-destructive, but not to ASI.
What kind of empirical evidence would update you positively?
I’m not sure if there’s any. My concerns are more theoretical than empirical, so it would take theoretical work to significantly change my mind.
Empirical work can provide a small amount of information, e.g. the fact that Claude expresses concern for ethics is a slight positive update relative to the world where Claude doesn’t care about ethics, and I would feel slightly better about a Claude-based ASI than a ChatGPT-based ASI. But only slightly, because I don’t think empirically observable behavior is that relevant to determining whether an AI is aligned. At least not using any empirical methods that we’ve devised so far.
For more on this, see e.g. A central AI alignment problem: capabilities generalization, and the sharp left turn, especially the part starting from ‘How is the “capabilities generalize further than alignment” problem upstream of these problems?’
(ETA: On how various plans miss the hard bits of the alignment challenge is also kind of about this... I was looking around for writings on why current empirical work isn’t that relevant, but it’s hard to find anything that makes the argument directly.)
From my POV we are deeply confused about what it would even mean to align ASI. If I could even describe what sort of theoretical work would be good evidence of progress on alignment, we’d be in a better place than we are currently.
A significant reason for my high P(doom) is that most safety researchers at AI companies are ignoring theoretical issues and pretending that alignment is purely an engineering problem. I don’t think they are institutionally capable of solving alignment.
By empirical evidence I meant anything empirical at all, not just observable behavior: things like emergent misalignment, what might come out of Jacob Steinhardt’s interpretability program, what Ryan Greenblatt says here, whatever the right value-analogue of Anthropic’s functional emotions paper (below) turns out to be, and so on. Maybe I’m conflating things or overloading “empirical”, in which case my apologies.
Regarding the sharp left turn, Byrnes’ opinionated review is the best argument I’m aware of for worrying about this, but he isn’t talking about today’s LLMs and their descendants, which rules out your last paragraph’s pointer to current work. Roger Dearnaley’s intuition pump behind his take that the sharp left turn might not be as hopeless as it seems resonates with me, but his description seems vibes-based, so I can’t tell whether he’s misunderstanding the sharp left turn. I do think Dearnaley’s personal “full-stack” attempt at assessing alignment progress is the sort of answer I’d want to your question re: what sort of work would be good evidence, although my impression is that you disagree for high-level generator reasons that would be ~intractable to resolve within the margins of EA Forum comments…