AGI will be able to model human language and psychology very accurately. Given that, isn’t alignment easy if you train the AGI to interpret linguistic prompts in the way that the “average” human would? (I know language doesn’t encode an exact meaning, but for any chunk of text, there does exist a distribution of ways that humans interpret it.)
Thus, on its face, inner alignment seems fairly doable. But apparently, according to Rob Bensinger, “We don’t know how to get an AI system’s goals to robustly ‘point at’ objects like ‘the American people’ … [or even] simpler physical systems.” Why is this so difficult? Does there exist an argument that it is impossible?
Outer alignment doesn’t seem very difficult to me, either. Here’s a prompt I thought of: “Do not do an action if anyone in a specified list of philosophers, intellectuals, members of the public, etc. would prefer you not do it, if they had all relevant knowledge of the action and its effects beforehand, consistent with the human legal standard of informed consent.” Wouldn’t this prompt (in its ideal form, not exactly as I wrote it) guard against many bad actions, including power-seeking behavior?
Thank you for the help!
Well, I hope it’s not impossible! If it is, we’re in a pretty bad spot. But it’s definitely true that we don’t know how to do it, despite lots of hard work over the last 30+ years. To really get why this should be, you have to understand how AI training works in a somewhat low-level way.
Suppose we want an image classifier—something that’ll tell us whether a picture has a sheep in it, let’s say. Schematically, here’s how we build one:
1. Start with a list of a few million random numbers. These are our parameters.
2. Find a bunch of images, some with sheep in them and some without (we know which is which because humans labeled them manually). This is our training data.
3. Pick some images from the training data, multiply them with the parameters in various ways, and interpret the result as a confidence, between 0 and 1, of whether each image has a sheep.
4. Probably it did terribly! Random numbers don’t know anything about sheep.
5. So, we make some small random-ish changes to the parameters and see if that helps. For example, we might say “we changed this parameter from 0.5 to 0.6 and overall accuracy went from 51.2% to 51.22%, next time we’ll go to 0.7 and see if that keeps helping or if it’s too high or what.”
6. Repeat step 5, a lot.
7. Eventually you get to where the parameters do a good job predicting whether the images have a sheep or not, so you stop.
I’m leaving out some mathematical details, but nothing that changes the overall picture.
All modern AI training works basically this way: start with random numbers, use them to do some task, evaluate their performance, and then tweak the numbers in a way that seems to point toward better performance.
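To make that loop concrete, here's a minimal toy sketch in Python (my own illustration, not real training code: it uses random perturbation in place of the gradient-based updates real systems use, and ten numbers per "image" instead of millions of parameters, but the shape of the loop is the same).

```python
# Toy version of the recipe above. The step numbers refer to the list earlier.
import numpy as np

rng = np.random.default_rng(0)

# Step 2: fake "training data" -- 500 images reduced to 10 numbers each,
# with a 0/1 "has a sheep" label produced by a hidden rule.
X = rng.normal(size=(500, 10))
hidden_rule = rng.normal(size=10)
y = (X @ hidden_rule > 0).astype(float)

def accuracy(params):
    # Step 3: multiply images with the parameters, squash the result into a
    # confidence between 0 and 1, and compare against the labels.
    confidence = 1 / (1 + np.exp(-(X @ params)))
    return np.mean((confidence > 0.5) == y)

params = rng.normal(size=10)        # Step 1: start with random parameters
best = accuracy(params)             # Step 4: probably terrible at first

for _ in range(5000):               # Step 6: repeat, a lot
    candidate = params + rng.normal(scale=0.05, size=10)  # Step 5: small random change
    score = accuracy(candidate)
    if score >= best:                                      # ...keep it only if it helps
        params, best = candidate, score

print(f"final accuracy: {best:.2%}")  # Step 7: stop once it does a good job
```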
Crucially, we never know why a change to a particular parameter is good, just that it is. Similarly, we never know what the AI is “really” trying to do, just that whatever it’s doing helps it do the task—to classify the images in our training set, for example. But that doesn’t mean that it’s doing what we want. For example, maybe all the pictures of sheep are in big grassy fields, while the non-sheep pictures tend to have more trees, and so what we actually trained was an “are there a lot of trees?” classifier. This kind of thing happens all the time in machine learning applications. When people talk about “generalizing out of distribution”, this is what they mean: the AI was trained on some data, but will it still perform the way we’d want on other, different data? Often the answer is no.
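Here's a similarly toy sketch of that failure mode, with made-up "wool" and "grassy" features standing in for whatever the classifier actually picks up on: the spurious feature tracks the label almost perfectly in training, so the model leans on it and falls apart on data where the correlation is broken.

```python
# Toy illustration of shortcut learning (invented features, not from any real dataset).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Training data: sheep pictures are almost always grassy; "wool" is the
# genuine but noisier signal the classifier *should* rely on.
sheep = rng.integers(0, 2, n)
wool = sheep + rng.normal(0, 0.8, n)
grassy = sheep + rng.normal(0, 0.1, n)
clf = LogisticRegression().fit(np.column_stack([wool, grassy]), sheep)

# Different data: sheep photographed among trees, so grassiness now has
# nothing to do with whether a sheep is present.
sheep_new = rng.integers(0, 2, n)
wool_new = sheep_new + rng.normal(0, 0.8, n)
grassy_new = rng.normal(0.5, 0.5, n)

print("training accuracy:", clf.score(np.column_stack([wool, grassy]), sheep))
print("new-data accuracy:", clf.score(np.column_stack([wool_new, grassy_new]), sheep_new))
```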
So that’s the first big difficulty with setting terminal goals: we can’t define the AI’s goals directly, we just show it a bunch of examples of the thing we want and hope it learns what they all have in common. Even after we’re done, we have no way to find out what patterns it really found except by experiment, which with superhuman AIs is very dangerous. There are other difficulties but this post is already rather long.
Excellent explanation. It seems to me that this problem might be mitigated if we reworked AI’s structure/growth so that it mimicked a human brain as closely as possible.
There is actually an impossibility argument. Even if you could robustly specify goals in AGI, there is another convergent phenomenon that would cause misaligned effects and eventually remove the goal structures.
You can find an intuitive summary here: https://www.lesswrong.com/posts/jFkEhqpsCRbKgLZrd/what-if-alignment-is-not-enough
Part of your question here seems to be, “If we can design a system that understands goals written in natural language, won’t it be very unlikely to deviate from what we really wanted when we wrote the goal?” Regarding that point, I’m not an expert, but I’ll point to some discussion by experts.
There are, as you may have seen, lists of examples where real AI systems have done things completely different from what their designers were intending. For example, this talk, in the section on Goodhart’s law, has a link to such a list. But from what I can tell, those examples never involve the designers specifying goals in natural language. (I’m guessing that specifying goals that way hasn’t seemed even faintly possible until recently, so nobody’s really tried it?)
Here’s a recent paper by academic philosophers that seems supportive of your question. The authors argue that AGI systems that involve large language models would be safer than alternative systems precisely because they could receive goals written in natural language. (See especially the two sections titled “reward misspecification”—though note also the last paragraph, where they suggest it might be a better idea to avoid goal-directed AI altogether.) If you want more details on whether that suggestion is correct, you might keep an eye on reactions to this paper. There are some comments on the LessWrong post, and I see the paper was submitted for a contest.
I think you’re on to something and some related thoughts are a significant part of my research agenda. Here are some references you might find useful (heavily biased towards my own thinking on the subject), numbered by paragraph in your post:
There’s a lot of accumulated evidence of significant overlap between LM and human linguistic representations; the scaling laws of this phenomenon seem favorable, and LM embeddings have also been used as a model of shared linguistic space for transmitting thoughts during communication. I interpret this as suggesting outer alignment will likely be solved by default for LMs.
I think I disagree quite strongly with the claim that “We don’t know how to get an AI system’s goals to robustly ‘point at’ objects like ‘the American people’ … [or even] simpler physical systems.” For example, I suspect many alignment-relevant concepts (like ‘Helpful, Harmless, Honest’) are abstract and groundable in language; see e.g. Language is more abstract than you think, or, why aren’t languages more iconic?. Also, the previous point (brain-LM comparisons), as well as LM performance, suggests the linguistic grounding is probably already happening to a significant degree.
Robustness here seems hard; see e.g. these references on shortcuts in in-context learning (ICL) / prompting: https://arxiv.org/abs/2303.03846 https://arxiv.org/abs/2305.17256 https://arxiv.org/abs/2305.13299 https://arxiv.org/abs/2305.14950 https://arxiv.org/abs/2305.19148. An easier / more robust target might be something like ‘be helpful’. Though I agree that, in general, ICL as Bayesian inference (see e.g. http://ai.stanford.edu/blog/understanding-incontext/ and follow the citation trail; there are a lot of recent related works) suggests that the longer the prompt, the more likely it is to ‘locate the task’.
I’ll also note that the role of the Constitution in Constitutional AI (https://www.anthropic.com/index/claudes-constitution) seems quite related to your 3rd paragraph.
This comment will focus on the specific approaches you set out, rather than the high level question, although I’m also interested in seeing comments from others on how difficult it is to solve alignment, and why.
The approach you’ve set out resembles Coherent Extrapolated Volition (CEV), originally proposed by Yudkowsky and discussed by Bostrom in Superintelligence. I’m not sure what the consensus is on CEV, but here are a few thoughts which I have in my head from when I thought about CEV (several years ago now).
How do we choose the correct philosophers and intellectuals—e.g. would we want Nietzsche or Wagner to be on the list of intellectuals, given the (arguable) links to the Nazis?
How do we extrapolate? (i.e. how do you determine whether the list of intellectuals would want the action to happen?)
For example, Plato was arguably in favour of dictatorships and preferred them over democracies, but recent history seems to suggest that democracies have fared better than dictatorships—should we extrapolate that Plato would prefer democracies if he lived today? How do we know?
Another example, perhaps a bit closer to home: some philosophers might argue that under some forms of utilitarianism, the ends justify the means, and it is appropriate to steal resources in order to fund activities which are in the best long-term interests of humanity. Even if those philosophers say they don’t believe that, they might just be pandering to expectations from society, and the AI might extrapolate that they would say that if unfettered.
In other words, I don’t think this clearly guards against power-seeking behaviour.
“How do we choose the correct philosophers?” Choose nearly all of them; don’t be selective. Because the AI must get approval from every philosopher, this will be a severe constraint, but it ensures that the AI’s actions will be unambiguously good. Even if the AI has to make contentious extrapolations about some of the philosophers, I don’t think it would be free to do anything awful.
Under that constraint, I wonder if the AI would be free to do anything at all.
Ok, maybe don’t include every philosopher. But I think it would be good to include people with a diverse range of views: utilitarians, deontologists, animal rights activists, human rights activists, etc. I’m uncomfortable with the thought of AI unilaterally imposing a contentious moral philosophy (like extreme utilitarianism) on the world.
Even with my constraints, I think AI would be free to solve many huge problems, e.g. climate change, pandemics, natural disasters, and extreme poverty.
Assuming it could be implemented, I definitely think your approach would help prevent the imposition of serious harms.
I still intuitively think the AI could just get stuck though, given the range of contradictory views even in fairly mainstream moral and political philosophy. It would need to have a process for making decisions under moral uncertainty, which might entail putting additional weight on the views of certain philosophers. But because this is (as far as I know) a very recent area of ethics, what existing work there is could be quite badly flawed.
I think a superintelligent AI will be able to find solutions with no moral uncertainty. For example, I can’t imagine what philosopher would object to bioengineering a cure to a disease.
I don’t think you need to commit yourself to including everyone. If it is true for any subset of people, then the point you gesture at in your post goes through. I have had similar thoughts to those you suggest in the post. If we gave the AI the goal of ‘do what Barack Obama would do if properly informed and at his most lucid’, I don’t really get why we would have high confidence in a treacherous turn or in the AI misbehaving in a catastrophic way. The main response to this seems to be to point to examples, from limited computer games, of AI not doing what we intend. I agree something similar might happen with advanced AI, but I don’t get why it is guaranteed to do so, or why any of the arguments I have seen lend weight to any particular probability estimate of catastrophe.
It also seems like increased capabilities would in a sense increase alignment (with Obama), because the more advanced AIs would have a better idea of what Obama would do.
The goal you specify in the prompt is not the goal that the AI is acting on when it responds. Consider: if someone tells you, “Your goal is now [x]”, does that change your (terminal) goals? No, because those don’t come from other people telling you things (or other environmental inputs)[1].
Understanding a goal that’s been put into writing, and having that goal, are two very different things.
[1] This is a bit of an exaggeration, because humans don’t generally have very coherent goals, and will “discover” new goals or refine existing ones as they learn new things. But I think it’s basically correct to say that there’s no straightforward relationship between telling a human to have a goal, and them having it, especially for adults (i.e. a trained model).
Sorry, I’m still a little confused. If we establish an AI’s terminal goal from the get-go, why wouldn’t we have total control over it?
We don’t know how to do that. It’s something that falls out of its training, but we currently don’t know how to even predict what goal any particular training setup will result in, let alone aim for a specific one.
I’d suggest this thread (and the linked LW post) as a good overview of the arguments. You could also take a look at the relevant section of the Intro to EA handbook or this post.
In general, I think you’ll probably find that you’ll get a better response on the forum if you spend some time engaging with the intro materials and come back with specific questions or arguments.
I’ve tried to engage with the intro materials, but I still have several questions:
a. Why doesn’t my proposed prompt solve outer alignment?
b. Why would AI ever pursue a proxy goal at the expense of its assigned goal? The human evolution analogy doesn’t quite make sense to me because evolution isn’t an algorithm with an assigned goal. Besides, even when friendship doesn’t increase the odds of reproduction, it doesn’t decrease the odds either; so this doesn’t seem like an example where the proxy goal is being pursued at the expense of the assigned goal.
c. I’ve read that it’s very difficult to get AI to point at any specific thing in the environment. But wouldn’t that problem be resolved if AI deeply understood the distribution of ways that humans use language to refer to things?
Thanks for thoughtfully engaging with this topic! I’ve spent a lot of time exploring arguments that alignment is hard, and am also unconvinced. I’m particularly skeptical about deceptive alignment, which is closely related to your point b. I’m clearly not the right person to explain why people think the problem is hard, but I think it’s good to share alternative perspectives.
If you’re interested in more skeptical arguments, there’s a forum tag and a lesswrong tag. I particularly like Quintin Pope’s posts on the topic.
Maybe frame it more as if you’re talking to a child. Yes, you can tell the child to do something, but how can you be certain that it will do it?
Similarly, how can we trust the AI to actually follow the prompt? To trust it we would fundamentally have to understand the AI or safeguard against problems if we don’t understand it. The question then becomes how your prompt is represented in machine language, which is very hard to answer.
To reiterate, ask yourself, how do you know that the AI will do what you say?