This is a post about my own confusions. It seems likely that other people have discussed these issues at length somewhere, and that I am not up to date with current thinking on them, because I don’t keep good track of even everything great that everyone writes. I welcome anyone kindly directing me to the most relevant things, or if such things are sufficiently well thought through that people can at this point just correct me in a small number of sentences, I’d appreciate that even more.
~
The traditional argument for AI alignment being hard is that human value is ‘complex’ and ‘fragile’. That is, it is hard to write down what kind of future we want, and if we get it even a little bit wrong, most futures that fit our description will be worthless.
The illustrations I have seen of this involve a person trying to write down a description of value in the style of conceptual analysis, failing to include things like ‘boredom’ or ‘consciousness’, and so getting a universe that is highly repetitive, or unconscious.
I’m not yet convinced that this is world-destroyingly hard.
Firstly, it seems like you could do better than imagined in these hypotheticals:
(These thoughts are from a while ago.) If instead you used ML to learn what ‘human flourishing’ looks like across a bunch of scenarios, I expect you would get something much closer than if you tried to specify it manually. Compare manually specifying what a face looks like and then generating examples from your description, versus using modern ML to learn what faces look like and generate them.
Even in the manual description case, if you had, say, a hundred people spend a hundred years writing a very detailed description of what we value, instead of a writer spending an hour imagining ways that a more ignorant person might mess up if they spent no time on it, I could imagine it actually being pretty close. I don’t have a good sense of how far off it would be.
I agree that neither of these would likely get you to exactly human values.
But secondly, I’m not sure about the fragility argument: that if there is basically any distance between your description and what is truly good, you will lose everything.
This seems to a) be based on a few examples of discrepancies between written-down values and real values in which the written-down values entirely exclude something, and b) assume that there is a fast takeoff, so that the relevant AI has its values forever and takes over the world.
My guess is that values learned using ML, even if still somewhat off from human values, are much closer, in the sense of not destroying all the value in the universe, than values that a person tries to write down. Like, the kinds of errors people have used to illustrate this problem (forgetting to put in ‘consciousness is good’) are like forgetting to say that faces have nostrils when trying to specify what a face is like, whereas a modern ML system’s imperfect impression of a face seems more likely to meet my standards for ‘very facelike’ (most of the time).
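To make the comparison a little more concrete, here is a minimal sketch of the ‘learn it from examples rather than write it down’ idea, entirely my own toy illustration rather than anything from the post: a small model is fitted to labeled scenario descriptions and then scores new ones. The dataset, labels, and choice of model are all invented for the example.

```python
# Purely illustrative: instead of hand-coding rules for what counts as a good
# scenario, fit a model to labeled examples and let it score new ones.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in dataset: scenario descriptions labeled 1 (flourishing) or 0 (not).
scenarios = [
    "people pursue projects they care about with friends and free time",
    "everyone repeats the same pleasant sentence forever in identical rooms",
    "communities argue, create art, raise children, and explore new ideas",
    "unconscious machines tile the planet with smiling statues",
]
labels = [1, 0, 1, 0]

# Learned stand-in for a hand-written description of value.
value_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
value_model.fit(scenarios, labels)

# Score a new scenario: the model generalizes from examples rather than relying on
# whichever features a rule-writer happened to remember to include.
print(value_model.predict_proba(
    ["people learn new skills and spend evenings with people they love"]
)[0, 1])
```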
Perhaps a bigger thing for me, though, is the issue of whether an AI takes over the world suddenly. I agree that if that happens, lack of perfect alignment is a big problem, though not obviously an all-value-nullifying one (see above). But if it doesn’t abruptly take over the world, and merely becomes a large part of the world’s systems, with ongoing ability for us to modify it, modify its roles in things, and make new AI systems, then the question seems to be how forcefully the non-alignment is pushing us away from good futures relative to how forcefully we can correct this. And in the longer run, how well we can correct it in a deep way before AI does come to be in control of most decisions. So something like the speed of correction vs. the speed of AI influence growing.
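As a deliberately crude toy model of that framing, entirely my own construction with made-up rates rather than anything from the post, one can ask how much misalignment is left by the time AI influence over decisions is near total, given a correction rate and an influence growth rate:

```python
# Toy model only: misalignment shrinks at some correction rate while AI influence
# over decisions grows; what matters is how much misalignment remains once
# influence is near total. All numbers here are invented for illustration.
def remaining_misalignment(correction_rate, influence_growth, initial_error=1.0,
                           influence=0.01, years=100):
    error = initial_error
    for _ in range(years):
        # Correction only bites on the parts of the system we can still adjust.
        error *= 1 - correction_rate * (1 - influence)
        influence = min(1.0, influence * (1 + influence_growth))
        if influence >= 1.0:
            break
    return error, influence

# Fast correction relative to influence growth: most of the error gets fixed.
print(remaining_misalignment(correction_rate=0.2, influence_growth=0.1))
# Slow correction, fast takeover: values get locked in while still quite wrong.
print(remaining_misalignment(correction_rate=0.02, influence_growth=0.5))
```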
These are empirical questions about the scales of different effects, rather than questions about whether a thing is analytically perfect. And I haven’t seen much analysis of them. In my own quick judgment, it’s not obvious to me that they look bad.
For one thing, these dynamics are already in place: the world is full of agents and more basic optimizing processes that are not aligned with broad human values (most individuals to a small degree, some strange individuals to a large degree, corporations, competitions, the dynamics of political processes). It is also full of forces for aligning them individually and stopping the whole show from running off the rails: law, social pressures, adjustment processes for the implicit rules of both of these, individual crusades. The adjustment processes themselves are not necessarily perfectly aligned; they are just overall forces for redirecting things toward alignment. And in fairness, this is already pretty alarming. It’s not obvious to me that imperfectly aligned AI is likely to be worse than the currently misaligned processes, or even that it won’t be a net boon for the side of alignment.
So then the largest remaining worry is that it will still gain power fast, and correction processes will be slow enough that its somewhat misaligned values will be set in forever. But it isn’t obvious to me that by that point it won’t be sufficiently well aligned that we would recognize its future as a wondrous utopia, just not the very best wondrous utopia we would have imagined if we had really carefully sat down and imagined utopias for thousands of years. This again seems like an empirical question of the scale of different effects, unless there is an argument that some effect will be totally overwhelming.
Some thoughts:
Not really knowledgeable here, but wasn’t the project of coding values into AI attempted in some way by machine ethicists? That could serve as a starting point for guessing how much time it would take to specify human values.
I find it interesting that you are alarmed by current non-AI agents/optimization processes. I think that if you take Drexler’s CAIS seriously, that might make that sort of analysis more important.
I think that Friendship is Optimal’s depiction of a Utopia is relevant here.
Not much of a spoiler, but beware: it seems like the possibility of a future civilization living a life that is practically very similar to ours (autonomy, the possibility of doing something important, community, food, … 😇) but just better in almost every aspect is incredible. There is some weird stuff there, some of which is horrible, so I’m not that certain about that.
Regarding the intuition from ML learning faces, I am not sure that this is a great analogy, because the module that tries to understand human morality might get totally misinterpreted by other modules. Reward hacking, overfitting, and adversarial examples are some things that pop to mind here as ways this can go wrong. My intuition here is that any maximizer would find “bugs” in its model of human morality to exploit (because it is complex and fragile).
That may be a fundamental problem.
It seems like your intuition is mostly based on the possibility of self-correction, and I feel like that is indeed where a major crux for this question lies.
Machine learning works fine on non-adversarial inputs. If you train a network to distinguish cats from dogs, and put in a normal picture of a cat, it works. However, there are all sorts of weird inputs that look nothing like cats or dogs that will also get classified as cats. If you give the network a bunch of bad situations and a bunch of good ones (say you crack open a history textbook and ask a bunch of people how nice various periods and regimes were), then you will get a network that can distinguish bad from good within the normal flow of human history. This doesn’t stop there being some weird state that counts as extremely good. Deciding what is and isn’t a good future depends on the answers to moral questions that haven’t come up yet, and so we don’t have any training data for questions involving tech we don’t yet have. This can make a big difference. If we decided that uploaded minds do count morally, we are probably going for an entirely virtual civilization, one an anti-uploader would consider worthless. If we decide that mind uploads don’t count morally, we might simulate loads of them in horrible situations for violent video games. Someone who did think that uploaded minds mattered would consider that an S-risk, potentially worse than nothing.
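A rough sketch of that failure mode, with everything in it (the two made-up features, the labels, the tiny network) invented purely for illustration: a model trained to separate good from bad on ordinary examples can still assign a very high ‘good’ score to a strange input far outside anything it was trained on, once something is optimizing against it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "history" data: two made-up features in [0, 1], labeled good if feature 0
# (say, "freedom") outweighs feature 1 (say, "violence").
x = torch.rand(500, 2)
y = (x[:, 0] > x[:, 1]).float()

# Small classifier trained on the ordinary examples.
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(300):
    opt.zero_grad()
    loss = loss_fn(model(x).squeeze(), y)
    loss.backward()
    opt.step()

# Now optimize the *input* to maximize the model's "good" score, with nothing
# keeping it near the training distribution.
adv = torch.zeros(1, 2, requires_grad=True)
adv_opt = torch.optim.Adam([adv], lr=0.1)
for _ in range(200):
    adv_opt.zero_grad()
    (-model(adv)).mean().backward()
    adv_opt.step()

print("optimized input:", adv.detach().numpy())      # usually far outside [0, 1]^2
print("model's 'good' score:", torch.sigmoid(model(adv)).item())
```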
Human-level goals are moderately complicated in terms of human-level concepts. In the outcome pump example, “get my mother out of the building” is a human-level concept. I agree that you could probably get useful and safe-ish behavior from such a device given a few philosopher-years. Much of the problem is that concepts like “mother” and “building” are really difficult to specify in terms of quantum operators on quark positions or whatever. The more you break human concepts down, the more edge cases you find. Getting a system that wouldn’t explode the building is most of the job.
The examples of obviously stupid utility functions having obviously bad results are toy problems; when we have a better understanding of symbol grounding, we will know how much these problems keep reappearing. Manually specifying a utility function might be feasible.
Presumably, many actors will be investing a lot of resources into building the most capable and competitive ML models in many domains (e.g. models for predicting stock prices). It seems to me that the purpose of the field of AI alignment is to make it easier for actors to build such models in a way that is both safe and competitive. AI alignment seems hard to me because using arbitrarily-scaled-up versions of contemporary ML methods—in a safe and competitive way—seems hard.