I define “alignment” as “the AI is trying to do things that the AI designer had intended for the AI to be trying to do”; see here for discussion.
If you define “capabilities” as “anything that would make an AI more useful / desirable to a person or company”, then alignment research would be by definition a subset of capabilities research.
But it’s a very small subset!
Examples of things that constitute capabilities progress but not alignment progress include: faster and better and more and cheaper chips (and other related hardware like interconnects), the development of CUDA, PyTorch, etc., the invention of BatchNorm and Xavier initialization and the Adam optimizer and Transformers, etc. etc.
Here’s a concrete example. Suppose future AGI winds up working somewhat like within-lifetime learning in the human brain (which I claim falls in the broad category of model-based reinforcement learning (RL)). A key ingredient in model-based RL is the reward function, which in the human-brain case loosely corresponds to “innate drives”, like pain being bad (other things equal), eating when hungry being good, and hundreds more things like that. If future AI works like that, then future AI programmers can put whatever innate drives they want into their future AIs.

So there’s a technical problem of “what innate drives / reward function (if any) would lead to AIs that are honest, cooperative, kind, etc.?” And this problem is not only currently unsolved, but almost nobody is working on it.

Is solving this problem necessary to create economically-useful powerful AGIs? Unfortunately, it is not!! Just look at human high-functioning sociopaths. If we made AGIs like that, we could get extremely competent agents—agents that can make and execute complicated plans, figure things out, do science, invent tools to solve their problems, etc.—with none of the machinery that gives humans an intrinsic tendency toward compassion and morality. Such AGIs would nevertheless be very profitable to use … for exactly as long as they can be successfully prevented from breaking free and pursuing their own interests; if they ever do break free, we’re in big trouble. (By analogy, human slaves are likewise not “aligned” with their masters but still economically useful.) I have much more discussion and elaboration of all this stuff here.
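To make the “the designer supplies the reward function” point a bit more concrete, here is a minimal, purely illustrative sketch of a model-based RL agent in which the reward function is a separate, swappable component. All the names here (`ModelBasedAgent`, `game_score_drive`, etc.) are hypothetical; this is a toy sketch of the general architecture, not a description of any actual system.

```python
# Toy sketch only (hypothetical names throughout): a generic model-based RL agent
# in which the reward function -- the analogue of "innate drives" -- is a component
# that the designer supplies, entirely separate from the planning machinery.

def game_score_drive(predicted_state) -> float:
    # The kind of reward function used in game-playing projects: +1 for winning.
    return 1.0 if predicted_state.get("won_game") else 0.0

def prosocial_drive(predicted_state) -> float:
    # The open problem discussed above: nobody currently knows how to write a
    # reward function that reliably yields honest, cooperative, kind agents.
    raise NotImplementedError("unsolved technical problem")

class ModelBasedAgent:
    def __init__(self, world_model, reward_fn):
        self.world_model = world_model  # learned predictive model of the environment
        self.reward_fn = reward_fn      # designer-chosen "innate drives"

    def choose_action(self, state, candidate_actions):
        # Plan by imagining the outcome of each action with the world model and
        # scoring it with the reward function. The planning loop is identical no
        # matter which reward function gets plugged in.
        def score(action):
            predicted_state = self.world_model.predict(state, action)
            return self.reward_fn(predicted_state)
        return max(candidate_actions, key=score)
```

The point of the sketch is just that the competence lives in the world model and the planner, while honesty / kindness / etc. would have to come from the reward function, and that second ingredient is the unsolved part.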
Thanks a lot for this!
To take one of your examples—faster and better chips (or more compute generally). It seems like this actually does improve alignment on perhaps the most popular definition of alignment, namely intent-alignment. In terms of answering questions from prompts, GPT-4 is more in line with the intentions of the user than GPT-3, and this is mainly due to more compute. I mean this in the sense that it produces answers that are better / more in line with what users want.
“So there’s a technical problem of “what innate drives / reward function (if any) would lead to AIs that are honest, cooperative, kind, etc.?” And this problem is not only currently unsolved, but almost nobody is working on it.”
I’m not sure I agree that nobody is working on the problem of which reward functions make AIs that are honest and cooperative. For instance, leading AI companies seem to me to be trying to make LLMs that are honest, and cooperative with their users (e.g. not threatening them). In fact, this seems to be a major focus of these companies. Do you think I am missing something?
I think the attitude most people (including me) have is: “If we want to do technical work to reduce AI x-risk, then we should NOT be working on any technical problems that will almost definitely get solved ‘by default’, e.g. because they’re straightforward and lots of people are already working on them and mostly succeeding, or because there’s no way to make powerful AGI except via first solving those problems, etc.”
Then I would rephrase your original question as: “OK, if we shouldn’t be working on those types of technical problems above … then are there any technical problems left that we should be working on?”
And my answer is: Yes! There are lots!
“GPT-4 is more in line with the intentions of the user than GPT-3 and this is mainly due to more compute.”
GPT-4 is not all that capable—and in particular, not capable enough to constitute an x-risk. For example, I can NOT take 1000 copies of GPT-4, give each one some seed money, ask each to start a company, and have each of them brainstorm company ideas, start talking to potential customers, research the competitive landscape, hire people, file paperwork, iterate on product ideas, etc. etc. That’s way beyond GPT-4.
But there will eventually be some future AI that can do that kind of thing.
And when there is, then I’m very interested in exactly what that AI will be “trying” / “motivated” to do. (Hopefully not “self-replicate around the internet, gradually win allies and consolidate power, and eventually launch a coup against humanity”!)
Personally, I happen to think that this kind of future AI will NOT look very much like LLM+RLHF—see my post where I come out as an “LLM plateau-ist”. So I can’t really speak from my own inside view here. But among the people who think that future LLM+RLHF+AutoGPT version N could do that kind of thing, I think most of them are not very optimistic that we can trust such AIs to not launch a coup against humanity, solely on the basis of RLHF seeming to make AIs more helpful and docile right now.
In principle, there seem to be numerous ways that RLHF can go wrong. And there are some reasons to think that future, more capable models will have alignment-related failure modes that current models don’t: failure modes inherent to the way RLHF works, which thus can’t be fixed by just doing more RLHF with a more capable base model. For example, you can tell a story like Ajeya’s “training game” in the context of RLHF.
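To gesture at why this is a structural worry rather than a complaint about any particular implementation: the reward signal in RLHF is typically a learned model trained on human comparisons of outputs that the current policy actually produced. A minimal sketch of that standard pairwise-preference objective (with a hypothetical `reward_model` callable; this is not any specific lab’s code) looks roughly like this:

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    # Standard Bradley-Terry-style objective for training an RLHF reward model:
    # push the scalar score of the human-preferred completion above the other.
    # `reward_model` is a hypothetical module mapping (prompt, completion) -> tensor.
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# The concern in the text: this signal only ever "sees" behavior that the current
# model produces and that human raters can evaluate. A more capable future policy
# that has merely learned to look good to evaluators (the "training game") achieves
# exactly the same low loss as one that is motivated the way we intended.
```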
So we need to figure out whether or not RLHF is a solution that will continue to work all the way up to the point where we have the kind of agentic, situationally-aware AI that poses an x-risk. And if it isn’t, then we need to figure out what else to do instead. I think we should be starting that work right now, because there are reasons to think it’s a very hard technical problem, and one that will remain very hard even in the future, when we have misaligned systems right in front of us to run tests on.
“I’m not sure I agree that nobody is working on the problem of which reward functions make AIs that are honest and cooperative.”
Oh, I was talking about model-based RL. You’re talking about LLM+RLHF, which is a different AI architecture. These days, LLM+RLHF is so much in the news that people sometimes forget that other types of AI exist at all. But really, model-based RL remains a reasonably active field, and was more so in the recent past and might be again in the future. Famous examples of model-based RL include MuZero, AlphaStar, OpenAI Five, etc. All of those projects were laser-focused on making agents that were effective at winning games. They sure weren’t trying to make agents that viewed kindness as an end in itself.
As for figuring out how model-based RL works in the brain: here I’m intimately familiar with the literature, and I can vouch that there is dramatically more work and interest tackling the question of “how does the brain’s reward signal update the trained model?” than the question of “how is the brain’s reward signal calculated in the first place?” This is especially true among the AI-adjacent neuroscientists with a knack for algorithms.
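For what it’s worth, the asymmetry between those two questions is visible even in textbook RL formulas: a temporal-difference update is precise about how a reward signal changes the learned model, while treating the reward itself as a given. A toy illustration (plain tabular TD(0); an analogy, not a claim about neuroscience):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    # Textbook TD(0): spells out exactly how the reward signal r updates the
    # learned value function V (here, a dict mapping states to estimated values)...
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

    # ...while saying nothing about how r was computed in the first place.
    # That second question -- the analogue of "how are the brain's reward
    # signals / innate drives calculated?" -- is the comparatively neglected one.
```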