[Question] Alignment & Capabilities: What’s the difference?

In the AI safety literature, AI alignment is often presented as conceptually distinct from capabilities. However, (1) the distinction seems somewhat fuzzy and (2) many techniques that are supposed to improve alignment also improve capabilities.

(1) The distinction is fuzzy because one common way of defining alignment is getting an AI system to do what the programmer or user intends. However, programmers intend for their systems to be capable: e.g., we want chess systems to win at chess. So a system that wins more often is more intent-aligned, and is also more capable.

(2) E.g., Irving et al. (2018), a paper by a team at OpenAI, proposes debate as a way to improve safety and alignment, where alignment is defined as aligning with human goals. However, debate also improved the accuracy of image classification in the paper, and therefore also improved capabilities.

Similarly, reinforcement learning from human feedback (RLHF) was initially presented as an alignment strategy, but my loose impression is that it also produced significant capabilities improvements. There are many other examples in the literature of alignment strategies also improving capabilities.


This makes me wonder whether alignment is actually more neglected than capabilities work. AI companies want to make aligned systems because they are more useful.

How do people see the difference between alignment and capabilities?
