“AI Alignment” is a Dangerously Overloaded Term
Alignment as Aimability or as Goalcraft?
The Less Wrong and AI risk communities have obviously had a huge role in mainstreaming the concept of risks from artificial intelligence, but we have a serious terminology problem.
The term “AI Alignment” has become popular, but people cannot agree whether it means something like making “Good” AI or whether it means something like making “Aimable” AI. We can define the terms as follows:
AI Aimability = Create AI systems that will do what the creator/developer/owner/user intends them to do, whether or not that thing is good or bad
AI Goalcraft = Create goals for AI systems that we ultimately think lead to the best outcomes
Aimability is a relatively well-defined technical problem and in practice almost all of the technical work on AI Alignment is actually work on AI Aimability. Less Wrong has for a long time been concerned with Aimability failures (what Yudkowsky in the early days would have called “Technical Failures of Friendly AI”) rather than failures of Goalcraft (old-school MIRI terminology would be “Friendliness Content”).
The problem is that as the term “AI Alignment” has gained popularity, people have started to completely merge the definitions of Aimability and Goalcraft under the term “Alignment”. I recently ran some Twitter polls on this subject, and it seems that people are relatively evenly split between the two definitions.
This is a relatively bad state of affairs. We should not have the fate of the universe partially determined by how people interpret an ambiguous word.
In particular, the way we use the term AI Alignment right now makes it hard to make progress on the AI Goalcraft problem and comparatively easy to make progress on the Aimability problem, because the part of the AI problem that is distinct from Aimability has no word of its own in the current terminology.
Not having a word for what goals to give the most powerful AI system in the universe is certainly a problem, and it means that everyone will be attracted to the easier Aimability research where one can quickly get stuck in and show a concrete improvement on a metric and publish a paper.
Why doesn’t the Less Wrong / AI risk community have good terminology for the right-hand side of the diagram? Well, this (I think) goes back to a decision by Eliezer from the SL4 mailing list days that one should not discuss what the world would be like after the singularity, because a lot of time would be wasted arguing about politics instead of working on the then more urgent AI Aimability problem (which was then called the control problem). At the time this decision was probably correct, but times have changed. There are now quite a few people working on Aimability, and far more are sure to come. It also seems quite likely (though not certain) that Eliezer was wrong about how hard Aimability/Control actually is.
Words Have Consequences
This decision to not talk about AI goals or content might eventually result in some unscrupulous actors getting to define the actual content and goals of superintelligence, cutting the X-risk and LW community out of the only part of the AI saga that actually matters in the end. For example, the recent popularity of the e/acc movement has been associated with the Landian strain of AI goal content—acceleration towards a deliberate and final extermination of humanity, in order to appease the Thermodynamic God. And the field that calls itself AI Ethics has been tainted with extremist far-left ideology around DIE (Diversity, Inclusion and Equity) that is perhaps even more frightening than the Landian Accelerationist strain. By not having mainstream terminology for AI goals and content, we may cede the future of the universe to extremists.
I suggest the term “AI Goalcraft” for the study of which goals for AI systems we ultimately think lead to the best outcomes. The seminal work on AI Goalcraft is clearly Eliezer’s Coherent Extrapolated Volition, and I think we need to push that agenda further now that AI risk has been mainstreamed and there’s a lot of money going into the Aimability/Control problem.
Gud Car Studies
What should we do with the term “Alignment” though? I’m not sure. I think it unfortunately leads people into confusion. It doesn’t track the underlying reality, which I believe is that action naturally factors into Goalcraft followed by Aimability. You can work on Aimability without knowing much about Goalcraft and vice versa, because the mechanisms of Aimability don’t care much about what goal one is aiming at, and the structure of Goalcraft doesn’t care much about how you’re going to aim at the goal and stay on target. When people hear “Aligned” they just hear “Good”, but with a side order of sophistication. It would be as if we lumped mechanical engineers who develop car engines in with computer scientists working on GPS navigators and called their field Gud Car Studies. Gud Car Studies is obviously an abomination of a term: it doesn’t reflect the underlying reality that designing a good engine is mostly independent of deciding where to drive the car and how to navigate there. I think that “Alignment” has unfortunately become the “Gud Car Studies” of our time.
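The factoring claimed above can be made concrete with a toy sketch. This is only an illustration under strong simplifying assumptions: the "world" is a single number, and the names `aim`, `goal_a` and `goal_b` are hypothetical, invented here and not taken from any real system. The point it tries to show is that the aiming routine never inspects what the goal means; it only needs something it can score.

```python
from typing import Callable

# A goal scores a world-state; higher is better. (Toy assumption: the
# world-state is just one float.)
Goal = Callable[[float], float]


def aim(goal: Goal, start: float, step: float = 0.1, iters: int = 200) -> float:
    """Aimability stand-in: hill-climb toward whatever goal is handed in."""
    x = start
    for _ in range(iters):
        # Look one step in each direction and keep the best-scoring state.
        x = max((x - step, x, x + step), key=goal)
    return x


# Goalcraft stand-in: deciding which goal to hand over.
# Both goals are hypothetical and chosen only to be different.
def goal_a(x: float) -> float:
    return -(x - 3.0) ** 2  # prefers states near 3


def goal_b(x: float) -> float:
    return -(x + 5.0) ** 2  # prefers states near -5


# The same aiming code serves both goals without modification.
reached_a = aim(goal_a, start=0.0)
reached_b = aim(goal_b, start=0.0)
```

In this sketch, improving `aim` (a better search) and choosing between `goal_a` and `goal_b` are separate pieces of work, which is the engine-versus-navigator split in miniature. Whether real AI systems factor this cleanly is exactly the kind of question the essay leaves open.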
I’m at a loss as to what to do—I suspect that the term AI Alignment has already gotten away from us and we should stop using it and talk about Aimability and Goalcraft instead.
(Crossposted from the Less Wrong site)
I feel like a lot of what you’re describing is already encompassed by the concept of scalability, which would naturally include integration with existing social systems. However, you are right to question whether this is a “relatively well-defined technical problem.”
An alternative taxonomy might be “technical” and “game-theoretic” alignment. The latter recognizes that competing visions for social organization exist and will not be solved within the scope of AI regulation. That in turn leads to more meta-theoretical discussions about how ambitious the AI safety agenda should be in order not to stifle market competition, which would be the ultimate insurance against extremist goalcraft.
Otherwise, engaging in these debates at the object level creates an open invitation for manipulation and bad faith.
I wouldn’t limit AI goalcraft to integration with existing social systems. It may be better to use the capabilities of AI to build fundamentally better preference aggregation engines. That’s the idea of CEV and its ilk.