Thanks a lot for this!
To take one of your examples—faster and better chips (or more compute generally). It seems like this does actually improve alignment on perhaps the most popular definition of alignment as intent-alignment. In terms of answering questions from prompts, GPT-4 is more in line with the intentions of the user than GPT-3 and this is mainly due to more compute. I mean this in the sense that it produces answers that are better/more in line with what users want.
“So there’s a technical problem of “what innate drives / reward function (if any) would lead to AIs that are honest, cooperative, kind, etc.?” And this problem is not only currently unsolved, but almost nobody is working on it.”
I’m not sure I agree that nobody is working on the problem of which reward functions make AIs that are honest and cooperative. For instance, leading AI companies seem to me to be trying to make LLMs that are honest, and cooperative with their users (e.g. not threatening them). In fact, this seems to be a major focus of these companies. Do you think I am missing something?
I think the attitude most people (including me) have is: “If we want to do technical work to reduce AI x-risk, then we should NOT be working on any technical problems that will almost definitely get solved ‘by default’, e.g. because they’re straightforward and lots of people are already working on them and mostly succeeding, or because there’s no way to make powerful AGI except via first solving those problems, etc.”
Then I would rephrase your original question as: “OK, if we shouldn’t be working on those types of technical problems above … then are there any technical problems left that we should be working on?”
And my answer is: Yes! There are lots!
“GPT-4 is more in line with the intentions of the user than GPT-3 and this is mainly due to more compute.”
GPT-4 is not all that capable—and in particular, not capable enough to constitute an x-risk. For example, I can NOT take 1000 copies of GPT-4, ask each one to start a company, give each of them some seed money, and have each of them brainstorm company ideas, start talking to potential customers, research the competitive landscape, hire people, file paperwork, iterate on product ideas, etc. etc. That’s way beyond GPT-4.
But there will eventually be some future AI that can do that kind of thing.
And when there is, then I’m very interested in exactly what that AI will be “trying” / “motivated” to do. (Hopefully not “self-replicate around the internet, gradually win allies and consolidate power, and eventually launch a coup against humanity”!)
Personally, I happen to think that this kind of future AI will NOT look very much like LLM+RLHF—see my post where I come out as an “LLM plateau-ist”. So I can’t really speak from my own inside view here. But among the people who think that future LLM+RLHF+AutoGPT version N could do that kind of thing, I think most of them are not very optimistic that we can trust such AIs to not launch a coup against humanity, solely on the basis of RLHF seeming to make AIs more helpful and docile right now.
In principle, there seem to be numerous ways that RLHF can go wrong, and there are some reasons to think that future, more capable models will have alignment-related failure modes that current models don’t: failure modes that are inherent to the way RLHF works, and thus can’t be fixed by just doing more RLHF with a more capable base model. For example, you can tell a story like Ajeya’s “training game” in the context of RLHF.
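To make it concrete where that kind of story enters, here’s a deliberately toy sketch of the RLHF setup (my own stylized illustration, not any lab’s actual pipeline; all the class names and “surface features” below are invented). The structural point is just that the policy gets optimized against a learned reward model, which is a proxy fit to what the raters preferred, and that proxy only sees the output’s surface features:

```python
# Toy sketch of the RLHF setup (illustrative only; all names are invented).
import random

class ToyRewardModel:
    """A reward model fit to human preference comparisons -- a learned proxy
    for 'what the raters wanted', not the raters' intentions themselves."""

    def __init__(self):
        self.weights = {}  # surface feature -> learned score

    def fit(self, comparisons):
        # comparisons: (features of preferred output, features of rejected output)
        for preferred, rejected in comparisons:
            for f in preferred:
                self.weights[f] = self.weights.get(f, 0.0) + 1.0
            for f in rejected:
                self.weights[f] = self.weights.get(f, 0.0) - 1.0

    def score(self, features):
        return sum(self.weights.get(f, 0.0) for f in features)

# Step 1: fit the reward model to human comparisons of model outputs.
reward_model = ToyRewardModel()
reward_model.fit([
    (("sounds_helpful", "sounds_honest"), ("sounds_evasive",)),
    (("sounds_helpful",), ("sounds_unhelpful",)),
])

# Step 2: reinforce the policy against the reward model. These two behaviors
# present identical surface features, even though only one is what we want.
behaviors = {
    "actually_honest_answer": ("sounds_helpful", "sounds_honest"),
    "plays_the_training_game": ("sounds_helpful", "sounds_honest"),
}
policy_value = {name: 0.0 for name in behaviors}

for _ in range(1000):
    name = random.choice(list(behaviors))         # policy tries a behavior
    reward = reward_model.score(behaviors[name])  # proxy reward, not true intent
    policy_value[name] += reward                  # reinforcement

print(policy_value)  # both behaviors get reinforced (up to sampling noise)
```

A behavior that genuinely reflects the raters’ intentions and a behavior that merely plays the training game can look identical to that proxy, so doing more RLHF with a more capable base model doesn’t, by itself, rule out the second one.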
So we need to figure out whether RLHF is a solution that will keep working all the way up to the point when we have the kind of agentic, situationally-aware AI that poses an x-risk. And if it isn’t, then we need to figure out what else to do instead. I think we should be starting work on that right now, because there are reasons to think it’s a very hard technical problem, and that it will remain very hard even in the future when we have misaligned systems right in front of us to run tests on.
“I’m not sure I agree that nobody is working on the problem of which reward functions make AIs that are honest and cooperative.”
Oh, I was talking about model-based RL. You’re talking about LLM+RLHF, which is a different AI architecture. These days, LLM+RLHF is so much in the news that people sometimes forget that other types of AI exist at all. But really, model-based RL remains a reasonably active field, and was more so in the recent past and might be again in the future. Famous examples of model-based RL include MuZero, AlphaStar, OpenAI Five, etc. All of those projects were laser-focused on making agents that were effective at winning games. They sure weren’t trying to make agents that viewed kindness as an end in itself.
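To illustrate the division of labor I have in mind, here’s a bare-bones toy sketch of a model-based RL agent (nothing remotely like MuZero’s actual code; every name below is invented). Essentially all of the research effort in those projects goes into the world-model / planning / update machinery; the reward function itself is a trivial, hand-written “did you win?”:

```python
import random

# --- The machinery that got nearly all the research attention ---------------

class WorldModel:
    """Learned model of the environment: estimates P(win | action)."""

    def __init__(self, actions):
        self.counts = {a: [0, 0] for a in actions}  # action -> [wins, tries]

    def update(self, action, outcome):
        wins, tries = self.counts[action]
        self.counts[action] = [wins + (outcome == "win"), tries + 1]

    def predicted_win_prob(self, action):
        wins, tries = self.counts[action]
        return wins / tries if tries else 0.5

def plan(model, actions, reward_fn):
    """Pick the action whose predicted outcome has the highest expected reward."""
    def expected_reward(a):
        p = model.predicted_win_prob(a)
        return p * reward_fn("win") + (1 - p) * reward_fn("loss")
    return max(actions, key=expected_reward)

# --- The ingredient I'm talking about ----------------------------------------

def reward_function(outcome):
    # In the game-playing projects this piece is trivial and hand-specified:
    # +1 for a win, -1 otherwise. The unsolved question is what you would
    # write here to get an agent that treats honesty / cooperation / kindness
    # as ends in themselves.
    return 1.0 if outcome == "win" else -1.0

# --- Toy environment and training loop ---------------------------------------

actions = ["aggressive_opening", "defensive_opening"]
true_win_prob = {"aggressive_opening": 0.7, "defensive_opening": 0.4}

model = WorldModel(actions)
for episode in range(500):
    # a bit of random exploration early on, then plan against the world model
    action = random.choice(actions) if episode < 20 else plan(model, actions, reward_function)
    outcome = "win" if random.random() < true_win_prob[action] else "loss"
    model.update(action, outcome)

print({a: round(model.predicted_win_prob(a), 2) for a in actions})
```

Swapping in a reward function for honesty or kindness is not something any of those projects needed to attempt, which is roughly what I mean by “almost nobody is working on it”.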
As for figuring out how model-based RL works in the brain: here I’m intimately familiar with the literature, and I can vouch that there is dramatically more work and interest tackling the question of “how does the brain’s reward signal update the trained model?” than the question of “how is the brain’s reward signal calculated in the first place?” This is especially true among the AI-adjacent neuroscientists with a knack for algorithms.
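Put in deliberately crude pseudo-algorithmic terms (a toy sketch, not a claim about how the brain actually implements either piece), the two questions factor apart like this:

```python
# Crude sketch of the two separate questions (toy code, not neuroscience).

def compute_reward(state):
    # "How is the reward signal calculated in the first place?"
    # i.e., what innate circuitry maps bodily / social / environmental state
    # onto a scalar reward. This is the comparatively neglected question.
    raise NotImplementedError("the neglected question")

def td_update(value, state, next_state, reward, alpha=0.1, gamma=0.9):
    # "How does the reward signal update the trained model?"
    # A textbook temporal-difference update -- the heavily studied question,
    # with a big literature relating it to dopamine signaling.
    td_error = reward + gamma * value.get(next_state, 0.0) - value.get(state, 0.0)
    value[state] = value.get(state, 0.0) + alpha * td_error
    return value

# The second piece is easy to demo; the first is the open problem.
value = td_update({}, state="cue", next_state="outcome", reward=1.0)
print(value)  # {'cue': 0.1}
```

The literature is overwhelmingly about td_update-style questions; compute_reward is the comparatively neglected piece.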