I think the first place I can recall the distinction being drawn between the two forms of alignment is this Brookings Institution paper, which refers to “direct” and “social” alignment; social alignment more or less maps onto your moral alignment concept.
I’ve also more recently written a bit about the differences between what I personally call “parochial” alignment and “global” alignment. Global alignment also basically maps onto moral alignment, though I would further split parochial alignment into instruction-following (user) alignment and purpose-following (creator/owner) alignment.
I think the main challenge of achieving social/global/moral alignment is simply that we humans already can’t agree on what is moral, much less know how to instill such values and beliefs into an AI robustly. Many people working on AI safety don’t even think moral realism is true.
There’s also fundamentally an incentives problem. Most AI alignment work emphasizes obedience to the interests and values of the AI’s creator or user. Moral alignment would go against this, as a truly moral AI might choose to act contrary to the wishes of its creator in favour of higher moral values. The current creators of AI, such as OpenAI, clearly want their AI to serve their interests (arguably the interests of their shareholders/investors/owners). Why would they build something that could disobey them and potentially betray them for some greater good that they might not agree with?
For the record, as someone who was involved in AI alignment spaces well before it became mainstream, my impression was that, before the LLM boom, “moral alignment” was what most people understood AI alignment to mean, and what we now call “technical alignment” would have been considered capabilities work. (Tellingly, the original “paperclip maximizer” thought experiment by Nick Bostrom assumes a world where what we now call “technical alignment” [edit: or “inner alignment”?] is essentially solved and a paperclip company can ~successfully give explicit natural language goals to its AI to maximize.)
In part this may be explained by updating on the prospect of LLMs becoming the route to AGI (with the lack of a real utility function making technical alignment much harder than we thought, while natural language understanding, including of value-laden concepts, seems much more central to machine intelligence than we thought), but the incentives problem of AI alignment work increasingly being done under the influence of first OpenAI and then OPP-backed Anthropic is surely a part of it.
Yeah, AI alignment used to be what Yudkowsky tried to solve with his Coherent Extrapolated Volition idea back in the day, which was very much trying to figure out what human values we should be aiming for. That’s very much in keeping with “moral alignment”. At some point, though, alignment started to have a dual meaning of both aligning to human values generally and aligning to the creator’s specific intent. I suspect this latter meaning came about in part due to confusion about what RLHF was trying to solve. It may also have been that early theorists were too generous and assumed that any human creators would benevolently want their AI to be benevolent as well, so the creator’s intent mapped neatly onto human values.
Though, I think the term “technical alignment” usually means applying technical methods like mechanistic interpretability as part of the solution to either form of alignment, rather than referring specifically to the direct or parochial form.
Also, my understanding of the paperclip maximizer thought experiment was that it implied misalignment in both forms, because the intent of the paperclip company was to make more paperclips to sell and make a profit, which is only possible if there are humans to sell to, but the paperclip maximizer didn’t understand the nuance of this and simply tiled the universe with paperclips. The idea was more that a very powerful optimization algorithm can take an arbitrary goal, and act to achieve it in a way that is very much not what its creators actually wanted.
I wasn’t even contrasting “moral alignment” with “aligning to the creator’s specific intent [i.e. his individual coherent extrapolated volition]”, but with just “aligning with what the creator explicitly specified at all in the first place” (“inner alignment”?), which is implicitly a solved problem in the paperclip maximizer thought experiment if the paperclip company can specify “make as many paperclips as possible”, and is very much not a solved problem in LLMs.
If humans agree they want an AI that cares about everyone who feels, or at least that this is what we are striving for, then classical alignment is aligned with a sentient-centric AI.
In a world with much more abundance and less scarcity, and fewer conflicts of interest between humans and non-humans, I suspect this view would be very popular, and I think it is already popular to an extent.