In the AI safety literature, AI alignment is often presented as conceptually distinct from capabilities. However, (1) the distinction seems somewhat fuzzy and (2) many techniques that are supposed to improve alignment also improve capabilities.
(1) The distinction is fuzzy because one common way of defining alignment is getting an AI system to do what the programmer or user intends. But programmers intend for systems to be capable: we want chess systems to win at chess, for example. So a system that wins more is more intent-aligned, and it is also more capable.
(2) For example, this Irving et al. (2018) paper by a team at OpenAI proposes debate as a way to improve safety and alignment, where alignment is defined as aligning with human goals. However, debate also improved the accuracy of image classification in the paper’s experiments, and therefore also improved capabilities.
Similarly, reinforcement learning from human feedback (RLHF) was initially presented as an alignment strategy, but my loose impression is that it also yielded significant capabilities improvements. There are many other examples in the literature of alignment strategies also improving capabilities.
This makes me wonder whether alignment is actually more neglected than capabilities work. AI companies want to make aligned systems because they are more useful.
How do people see the difference between alignment and capabilities?
I define “alignment” as “the AI is trying to do things that the AI designer intended for the AI to be trying to do”; see here for discussion.
If you define “capabilities” as “anything that would make an AI more useful / desirable to a person or company”, then alignment research would be by definition a subset of capabilities research.
But it’s a very small subset!
Examples of things that constitute capabilities progress but not alignment progress include: faster, better, more plentiful, and cheaper chips (and other related hardware like interconnects); the development of CUDA, PyTorch, etc.; and the invention of BatchNorm, Xavier initialization, the Adam optimizer, Transformers, and so on.
Here’s a concrete example. Suppose future AGI winds up working somewhat like within-lifetime learning in the human brain (which I claim is in the broad category of model-based reinforcement learning (RL)). A key ingredient in model-based RL is the reward function, which in the human-brain case loosely corresponds to “innate drives”, like pain being bad (other things equal), eating when hungry being good, and hundreds more things like that. If future AI works like that, then future AI programmers can put whatever innate drives they want into their future AIs.

So there’s a technical problem of “what innate drives / reward function (if any) would lead to AIs that are honest, cooperative, kind, etc.?” And this problem is not only currently unsolved, but almost nobody is working on it. Is solving this problem necessary to create economically useful, powerful AGIs? Unfortunately, it is not! Just look at high-functioning human sociopaths. If we made AGIs like that, we could get extremely competent agents—agents that can make and execute complicated plans, figure things out, do science, invent tools to solve their problems, etc.—with none of the machinery that gives humans an intrinsic tendency toward compassion and morality.

Such AGIs would nevertheless be very profitable to use … for exactly as long as they can be successfully prevented from breaking free and pursuing their own interests; as soon as they can’t be, we’re in big trouble. (By analogy, human slaves are likewise not “aligned” with their masters but are still economically useful.) I have much more discussion and elaboration of all this here.
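To make the “pluggable reward function” point concrete, here is a toy Python sketch. It is purely my own illustration with made-up names (not code from any actual model-based RL project): the planning and world-model-learning machinery is one component, and the designer-supplied reward function (the “innate drives”) is a separate, swappable ingredient.

```python
# Toy sketch only: hypothetical names, not code from any real model-based RL system.

def innate_drives_reward(observation: dict) -> float:
    """Designer-specified 'innate drives': the designer decides what counts as good or bad.

    Loosely analogous to pain being bad and eating-when-hungry being good.
    Nothing in the learning/planning machinery below requires these drives to
    include honesty, cooperation, or kindness -- that is the open problem.
    """
    reward = 0.0
    reward -= observation.get("damage", 0.0)                      # "pain is bad"
    if observation.get("hungry") and observation.get("just_ate"):
        reward += 1.0                                             # "eating when hungry is good"
    # ...hundreds more hand-specified drives could go here
    return reward


class ModelBasedAgent:
    """Learns a world model and plans against whatever reward function it is handed."""

    def __init__(self, reward_fn):
        self.reward_fn = reward_fn   # the designer plugs the drives in here
        self.world_model = {}        # stand-in for a learned predictive model: action -> predicted outcome

    def act(self, observation: dict, candidate_actions: list):
        # Plan by predicting each action's outcome and scoring it with the reward function.
        def predicted_outcome(action):
            return self.world_model.get(action, observation)
        return max(candidate_actions, key=lambda a: self.reward_fn(predicted_outcome(a)))

    def update_world_model(self, action, outcome: dict):
        # Learning improves the *world model* (a capability); it never touches the drives.
        self.world_model[action] = outcome


# The same planning machinery can be pointed at very different drives:
profit_maximizer = ModelBasedAgent(lambda obs: obs.get("profit", 0.0))
drive_based_agent = ModelBasedAgent(innate_drives_reward)
```

The point of the sketch is just that the reward function is a free parameter from the designer’s perspective: improving the planning and world-model learning makes the agent more capable regardless of what gets plugged in there.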
Thanks a lot for this!
To take one of your examples—faster and better chips (or more compute generally). It seems like this does actually improve alignment on perhaps the most popular definition of alignment, namely intent alignment. In terms of answering questions from prompts, GPT-4 is more in line with the intentions of the user than GPT-3, and this is mainly due to more compute. I mean this in the sense that it produces answers that are better / more in line with what users want.
I’m not sure I agree that nobody is working on the problem of which reward functions make AIs that are honest and cooperative. For instance, leading AI companies seem to me to be trying to make LLMs that are honest, and cooperative with their users (e.g. not threatening them). In fact, this seems to be a major focus of these companies. Do you think I am missing something?
I think the attitude most people (including me) have is: “If we want to do technical work to reduce AI x-risk, then we should NOT be working on any technical problems that will almost definitely get solved “by default”, e.g. because they’re straightforward and lots of people are already working on them and mostly succeeding, or because there’s no way to make powerful AGI except via first solving those problems, etc.”.
Then I would rephrase your original question as: “OK, if we shouldn’t be working on those types of technical problems above … then are there any technical problems left that we should be working on?”
And my answer is: Yes! There are lots!
GPT-4 is not all that capable—and in particular, not capable enough to constitute an x-risk. For example, I can NOT take 1000 copies of GPT-4, ask each one to start a company, and give each of them some seed money, such that each will brainstorm company ideas, start talking to potential customers, research the competitive landscape, hire people, file paperwork, iterate on product ideas, and so on. That’s way beyond GPT-4.
But there will eventually be some future AI that can do that kind of thing.
And when there is, then I’m very interested in exactly what that AI will be “trying” / “motivated” to do. (Hopefully not “self-replicate around the internet, gradually win allies and consolidate power, and eventually launch a coup against humanity”!)
Personally, I happen to think that this kind of future AI will NOT look very much like LLM+RLHF—see my post where I come out as an “LLM plateau-ist”. So I can’t really speak from my own inside view here. But among the people who think that future LLM+RLHF+AutoGPT version N could do that kind of thing, I think most of them are not very optimistic that we can trust such AIs to not launch a coup against humanity, solely on the basis of RLHF seeming to make AIs more helpful and docile right now.
In principle, there seem to be numerous ways that RLHF can go wrong, and there are some reasons to think that future, more capable models will have alignment-related failure modes that current models don’t: failure modes inherent to the way RLHF works, which therefore can’t be fixed by just doing more RLHF with a more capable base model. For example, you can tell a story like Ajeya’s “training game” in the context of RLHF.
So we need to figure out whether RLHF is or isn’t a solution that will continue to work all the way up to the point when we have the kind of agentic, situationally aware AI that poses an x-risk. And if it isn’t, then we need to figure out what else to do instead. I think we should be starting that work right now, because there are reasons to think it’s a very hard technical problem, and it will remain very hard even in the future when we have misaligned systems right in front of us to run tests on.
Oh, I was talking about model-based RL. You’re talking about LLM+RLHF, which is a different AI architecture. These days, LLM+RLHF is so much in the news that people sometimes forget that other types of AI exist at all. But really, model-based RL remains a reasonably active field, and was more so in the recent past and might be again in the future. Famous examples of model-based RL include MuZero, AlphaStar, OpenAI Five, etc. All of those projects were laser-focused on making agents that were effective at winning games. They sure weren’t trying to make agents that viewed kindness as an end in itself.
As for figuring out how model-based RL works in the brain, I’m intimately familiar with that literature, and I can vouch that there is dramatically more work and interest tackling the question of “how does the brain’s reward signal update the trained model?” than the question of “how is the brain’s reward signal calculated in the first place?” This is especially true among AI-adjacent neuroscientists with a knack for algorithms.
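To illustrate that asymmetry in code (again a toy sketch with hypothetical names, not a claim about how any actual brain model is implemented): the “how does reward update the model” side has standard, well-studied answers such as temporal-difference learning, while the “how is the reward computed” side is essentially a stub.

```python
# Toy sketch: textbook TD update with hypothetical names, not actual neuroscience code.

def td_update(value: dict, state, next_state, reward: float, alpha=0.1, gamma=0.99) -> float:
    """The well-studied side: how a reward signal updates learned value estimates.

    Standard temporal-difference learning: delta = r + gamma * V(s') - V(s).
    """
    td_error = reward + gamma * value.get(next_state, 0.0) - value.get(state, 0.0)
    value[state] = value.get(state, 0.0) + alpha * td_error
    return td_error


def compute_reward(state) -> float:
    """The comparatively neglected side: how the reward signal is calculated in the first place.

    In the brain this corresponds to the innate drives; for future AGI it is the
    mostly-unsolved design question discussed above.
    """
    raise NotImplementedError("which drives lead to honest, cooperative, kind agents?")
```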
I mostly think of alignment as being about avoiding deception or catastrophic misgeneralization outside of testing settings.
In general I also believe that AI companies have a massive incentive to align their systems with user intent. You can’t profit if you are dead.
Dan Hendrycks’ lecture on “Safety-Capabilities Balance” might be helpful here.
Thanks, this is really useful. (I will try to go through this course as well.)
I’m not sure the talk has it quite right, though. My take is that, on the most popular definitions of alignment and capabilities, they are partly conceptually the same, depending on which intentions we are meant to be aligning with. So it’s not the case that there is an ‘alignment externality’ of a capabilities improvement, but rather that some alignment improvements are capabilities improvements, by definition.
When people distinguish between alignment and capabilities, I think they’re often interested in the question of what research is good vs. bad for humanity. Alignment vs. capabilities seems insufficient to answer that more important question. Here’s my attempt at a better distinction:
There are many different risks from AI. Research can reduce some risks while exacerbating others. “Safety” and “capabilities” are therefore incorrectly reductive. Research should be assessed by its distinct impacts on many different risks and benefits. If a research direction is better for humanity than most other research directions, then perhaps we should award it the high-status title of “safety research.”
Scalable oversight is a great example. It provides more accurate feedback to AI systems, reducing the risk that AIs will pursue objectives that conflict with human goals because their feedback has been inaccurate. But it also makes AI systems more commercially viable, shortening timelines and perhaps hastening the onset of other risks, such as misuse, arms races, or deceptive alignment. The cost-benefit calculation is quite complicated.
“Alignment” can be a red herring in these discussions, as misalignment is far from the only way that AI can lead to catastrophe or extinction.
Related: https://www.lesswrong.com/posts/zswuToWK6zpYSwmCn/some-background-for-reasoning-about-dual-use-alignment
I think you missed one:
Take OpenAI’s Superalignment approach. It involves “building a roughly human-level automated alignment researcher” then “using vast amounts of compute to scale efforts, and iteratively align superintelligence”.
AI capabilities are central to this alignment approach because we humans are far too limited to achieve alignment ourselves.
I agree with Jaime’s answer about how alignment should avoid deception. (Catastrophic misgeneralization seems like it could fall under your alignment-as-capabilities argument.)
I sometimes think of alignment as something like “aligned with universal human values” more than “aligned with the specific goal of the human who programmed this model”. One might argue there aren’t a ton of universal human values. Which is correct! I’m thinking very basic stuff like, “I value there being enough breathable oxygen to support human life”.