This is still at the brainstorming stage, but I think there's probably a convincing line of argument for "AI alignment difficulty is high, at least on priors" that includes the following points:
- Many humans don't seem particularly aligned to "human values" themselves (not just thinking of dark triad traits, but also things like self-deception, cowardice, etc.).
- There's a loose analogy where AI is "more technological progress," and technological progress so far hasn't always been aligned with human flourishing: it has solved or improved many long-standing problems of civilization, like infant mortality, but has also created new ones, like political polarization, obesity, and unhappiness from constant bombardment with images of people who are richer and more successful than you. Based on this analogy, why expect things to somehow fall into place with AI training so that the new powers that be will, for once, be aligned?
- AI will accelerate everything, and if you accelerate something that isn't set up in a secure way, it goes off the rails ("small issues will be magnified").
I think a corollary of the first point is that we can learn a lot about alignment by looking at humans who seem unusually aligned to human values (or, more generally, to the interests of all conscious beings): e.g., highly attained meditators with high integrity, altruistic motivations, rationality skills, and a healthy balance of systematizer and empathizer mindsets. From phenomenological reports, their subagentic structures seem quite unlike anything most of us experience day to day. That, plus a few core philosophical assumptions, can get you a long way toward deducing, e.g., Anthropic's constitutional AI principles from first principles.