There is a distinction between “control” and “alignment.”
The control problem addresses our fundamental capacity to constrain AI systems, preventing undesired behaviors or capabilities from manifesting, regardless of the system’s goals. Control mechanisms encompass technical safeguards that maintain human authority over increasingly autonomous systems, such as containment protocols, capability limitations, and intervention mechanisms.
The alignment problem, conversely, focuses on ensuring AI systems pursue goals compatible with human values and intentions. This involves developing methods to specify, encode, and preserve human objectives within AI decision-making processes. Alignment asks whether an AI system “wants” the right things, while control asks whether we can prevent it from acting on its wants.
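To make the distinction concrete, here is a toy sketch in Python (purely illustrative; every name is hypothetical and not drawn from any real system): control constrains which actions get executed regardless of what objective produced them, while alignment tries to make the objective itself track human preferences.

```python
from typing import Callable, List

def control_wrapper(propose_action: Callable[[], str],
                    allowed_actions: List[str],
                    max_steps: int = 100) -> List[str]:
    """Control-style safeguard: containment, a capability limit, and an
    intervention point, applied no matter what the system 'wants'."""
    executed = []
    for _ in range(max_steps):               # capability/step limitation
        action = propose_action()
        if action not in allowed_actions:     # containment / intervention point
            break                             # halt rather than execute
        executed.append(action)
    return executed

def aligned_choice(human_preference_score: Callable[[str], float],
                   candidate_actions: List[str]) -> str:
    """Alignment-style approach: the objective being optimized is meant to
    track what humans actually value, so the 'wants' are right to begin with."""
    return max(candidate_actions, key=human_preference_score)
```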
I believe AI will soon have wants, and it’s critical that we align those wants with ours as AIs become increasingly capable.
As far as I’m concerned, humanity will eventually create superintelligence, so it should be the main focus of EA and other groups concerned with AI. As I mentioned in another comment, I don’t have many ideas for how the average EA can contribute here beyond making a career change into AI policy or something similar.
I don’t claim you can align human groups with individual humans. If I’m reading you correctly, I think you’re committing a category error in assigning alignment properties to groups of people like nation-states or companies. Alignment, as I’m using the term, is the alignment of an AI’s goals or values with those of a person or group of people. We expect this, I think, in part because we’re accustomed to telling computers what to do and having them do exactly what we say (though not always exactly what we mean).
Alignment is extremely tricky for the unenhanced human, but theoretically possible. My best guess at solving it would be to automate alignment research and development with AI itself. We’ll soon reach a sufficiently advanced AI capable of reasoning beyond anything anyone on Earth can come up with; we just have to ensure that this AI is aligned, that the one that trained it is also aligned, and so on down the chain. My second-best guess would be through BCIs, and my third would be interpretability via whole-brain emulation.
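As a rough sketch of that “each generation vets the next” idea (all names here are hypothetical placeholders; audit_alignment and train_successor stand in for research processes nobody yet knows how to build):

```python
def audit_alignment(model: dict) -> bool:
    # Placeholder: in reality this check is the hard, unsolved part.
    return model.get("aligned", False)

def train_successor(model: dict) -> dict:
    # Placeholder: the current model automates the next round of R&D.
    return {"generation": model["generation"] + 1, "aligned": True}

def bootstrap_aligned_ai(base_model: dict, generations: int = 3) -> list:
    """Inductive chain of trust: model N may train model N+1 only if
    model N has itself passed an alignment audit."""
    lineage = [base_model]
    current = base_model
    for _ in range(generations):
        if not audit_alignment(current):
            raise RuntimeError("Audit failed; do not extend the chain.")
        current = train_successor(current)
        lineage.append(current)
    return lineage

# Example: bootstrap_aligned_ai({"generation": 0, "aligned": True})
```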
Assuming we even do develop alignment techniques, I’d argue that exclusive alignment (that is, alignment with one person or a small group of people) is more difficult than alignment with humanity at large, for the following reasons (I realize some of these go both ways, but I include them because I see them as more serious for exclusive alignment, value drift for example):
Value drift.
Impossible specification (e.g., in exploring the inherent contradictions in expressed human values, the AGI expands moral consideration beyond initial human constraints, discovering some form of moral universalism or a morality beyond all human reasoning).
Emergent properties appear, producing unexpected behavior, and we cannot align systems to exhibit properties we cannot anticipate.
Exclusive alignment’s instrumental goals may broaden AGI’s moral scope to include more humans (i.e., it may be that broader alignment makes for a more robust AI system).
Competing AGIs designed to align with all of humanity may be successfully created, and would then act as a check on an exclusively aligned system.
Exclusively aligned AGI may still satisfy many, if not all, of the preferences that the rest of humanity possesses.
Exclusive alignment requires perfect internal coordination of values within organizations, but divergent interests inevitably emerge as those organizations scale; these coordination failures multiply when AGI systems interpret instructions literally and optimize against the specified metrics (see the toy sketch after this list).
Alignment requires resolving disagreements over value prioritization, a meta-preference problem. Yet resolving these conflicts necessitates assumptions about how they should be resolved, creating an infinite regress that defies a technical solution.
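On the point above about literal interpretation and optimizing against specified metrics, here is a toy, purely hypothetical illustration of the gap between a written-down proxy and the intended value (the functions and numbers are made up for illustration only):

```python
def specified_metric(report_length: int) -> float:
    # The organization asked for "thorough reports" and measured length.
    return float(report_length)

def intended_value(report_length: int) -> float:
    # What they actually wanted: useful detail up to a point, then padding hurts.
    return report_length - 0.02 * report_length ** 2

candidates = range(0, 200, 10)
literal_optimum = max(candidates, key=specified_metric)   # 190: maximally padded
intended_optimum = max(candidates, key=intended_value)    # 20: actually useful

print(literal_optimum, intended_optimum)
```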