What kind of empirical evidence would update you positively?
I’m not sure there is any. My concerns are more theoretical than empirical, so it would take theoretical work to significantly change my mind.
Empirical work can provide a small amount of information. For example, the fact that Claude expresses concern for ethics is a slight positive update relative to a world where Claude doesn’t care about ethics, and I would feel slightly better about a Claude-based ASI than a ChatGPT-based ASI. But only slightly, because I don’t think empirically observable behavior is that relevant to determining whether an AI is aligned, at least not via any empirical methods we’ve devised so far.
For more on this, see e.g. A central AI alignment problem: capabilities generalization, and the sharp left turn, especially the part starting from ‘How is the “capabilities generalize further than alignment” problem upstream of these problems?’
(ETA: On how various plans miss the hard bits of the alignment challenge is also kind of about this. I was looking around for writing on why current empirical work isn’t that relevant, but it’s hard to find anything that makes the argument directly.)
From my POV we are deeply confused about what it would even mean to align ASI. If I could even describe what sort of theoretical work would be good evidence of progress on alignment, we’d be in a better place than we are currently.
A significant reason for my high P(doom) is that most safety researchers at AI companies are ignoring theoretical issues and pretending that alignment is purely an engineering problem. I don’t think they are institutionally capable of solving alignment.
By empirical evidence I meant anything empirical at all, not just observable behavior: things like emergent misalignment, what might come out of Jacob Steinhardt’s interpretability program, what Ryan Greenblatt says here, whatever the right value-analogue of Anthropic’s functional emotions paper is (below), and so on. Maybe I’m conflating things or overloading “empirical”, in which case my apologies.
Regarding the sharp left turn, Byrnes’ opinionated review is the best argument for worrying about it that I’m aware of, but he isn’t talking about today’s LLMs and their descendants, which rules out your last paragraph’s pointer to current work. Roger Dearnaley’s intuition pump behind his take that the sharp left turn might not be as hopeless as it seems resonates with me, but his description seems vibes-based, so I can’t tell whether he’s misunderstanding the sharp left turn. I do think Dearnaley’s personal “full-stack” attempt at assessing alignment progress is the sort of answer I’d want to your question about what kind of work would be good evidence, although my impression is that you disagree for high-level-generator reasons that would be ~intractable to resolve within the margins of EA Forum comments…