For the record, as someone who was involved in AI alignment spaces well before the field became mainstream, my impression was that, before the LLM boom, “moral alignment” was what most people understood AI alignment to mean, and what we now call “technical alignment” would have been considered capabilities work. (Tellingly, the original “paperclip maximizer” thought experiment by Nick Bostrom assumes a world where what we now call “technical alignment” [edit: or “inner alignment”?] is essentially solved and a paperclip company can ~successfully give explicit natural language goals to its AI to maximize.)
In part this may be explained by updating on the prospect of LLMs becoming the route to AGI (with the lack of a real utility function making technical alignment much harder than we thought, while natural language understanding, including of value-laden concepts, seems much more central to machine intelligence than we thought), but the incentives problem of AI alignment work increasingly being done under the influence of first OpenAI and then OPP-backed Anthropic is surely also a part of it.
Yeah, AI alignment used to be what Yudkowsky tried to solve with his Coherent Extrapolated Volition idea back in the day, which was very much about figuring out what human values we should be aiming for. That’s very much in keeping with “moral alignment”. At some point, though, “alignment” started to carry a dual meaning: aligning to human values generally, and aligning to the creator’s specific intent. I suspect the latter came about in part due to confusion about what RLHF was trying to solve. It may also have been that early theorists were too generous and assumed that any human creators would benevolently want their AI to be benevolent as well, so that the creator’s intent mapped neatly onto human values.
Though, I think the term “technical alignment” usually means applying technical methods like mechanistic interpretability as part of the solution to either form of alignment, rather than referring specifically to the direct or parochial form.
Also, my understanding of the paperclip maximizer thought experiment was that it implied misalignment in both forms: the intent of the paperclip company was to make more paperclips to sell at a profit, which is only possible if there are humans to sell them to, but the paperclip maximizer didn’t understand that nuance and simply tiled the universe with paperclips. The idea was more that a very powerful optimization algorithm can take an arbitrary goal and act to achieve it in a way that is very much not what its creators actually wanted.
I wasn’t even contrasting “moral alignment” with “aligning to the creator’s specific intent [i.e. their individual coherent extrapolated volition]”, but with just “aligning with what the creator explicitly specified in the first place” (“inner alignment”?), which is implicitly a solved problem in the paperclip maximizer thought experiment if the paperclip company can specify “make as many paperclips as possible”, and is very much not a solved problem in LLMs.