If no more AI safety work is necessary, that means that there is nothing we can do to significantly increase the chance of FAI over UFAI.
I could be almost certain that FAI would win because I had already built one. Even then, I suspect there would be double-checking to do: the new FAI will need to be told what friendly behavior is, someone should keep an eye out for any UFAI, etc. So FAI work will be needed until the point where no human labor is needed and we are all living in a utopia.
I could be almost certain that UFAI would win. I could see lots of people working on really scary systems and still not have the slightest idea of how to make anything friendly. But there would still be a chance that those systems didn't scale to superintelligence, that the people running them could be persuaded to turn them off, and that someone might come up with a brilliant alignment scheme tomorrow. Circumstances where you can see that you are utterly screwed, yet are still alive, seem unlikely. Keep working until the nanites turn you into paperclips.
Alternatively, it might be clear that we aren’t getting any AI any time soon. The most likely cause of this would be a pretty serious disaster, one that destroyed most of humanity’s technical ability and stopped us rebuilding it. If AI alignment is something we will need to do in a few hundred years, once we rebuild society enough to make silicon chips, it’s still probably worth having someone making sure that the progress isn’t forgotten, and that the problem will be solved in time.
We gain some philosophical insight that says that AI is inherently good, always evil, impossible, etc. It’s hard to imagine what a philosophical insight you don’t have would be like.
Machine learning works fine on non-adversarial inputs. If you train a network to distinguish cats from dogs and put in a normal picture of a cat, it works. However, there are all sorts of weird inputs that look nothing like cats or dogs that will also get classified as cats. If you give a network a bunch of bad situations and a bunch of good ones (say you crack open a history textbook and ask a bunch of people how nice various periods and regimes were), then you will get a network that can distinguish bad from good within the normal flow of human history. This doesn’t stop there being some weird state that counts as extremely good. Deciding what is and isn’t a good future depends on the answers to moral questions that haven’t come up yet, and so we don’t have any training data for questions involving tech we don’t yet have. This can make a big difference. If we decide that uploaded minds do count morally, we are probably going for an entirely virtual civilization, one an anti-uploader would consider worthless. If we decide that uploaded minds don’t count morally, we might simulate loads of them in horrible situations for violent video games. Someone who did think that uploaded minds mattered would consider that an S-risk, potentially worse than nothing.
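To make the "weird inputs" point concrete, here is a minimal sketch, assuming PyTorch and a stock torchvision ResNet; the class index, step count, and learning rate are arbitrary choices of mine, not anything from the discussion above. Starting from random noise, gradient ascent on the pixels pushes up the "tabby cat" logit; the result looks nothing like a cat to a human, but the network scores it as one.

```python
import torch
import torchvision.models as models

# Any pretrained image classifier will do; resnet18 is just small and standard.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

CAT_CLASS = 281  # ImageNet class 281 is "tabby cat"; any class would work

# Start from pure noise and optimize the *pixels* to maximize the cat logit.
# (ImageNet normalization is omitted for brevity; it doesn't change the point.)
x = torch.rand(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    loss = -model(x)[0, CAT_CLASS]   # gradient ascent on the "cat" score
    loss.backward()
    opt.step()
    with torch.no_grad():
        x.clamp_(0, 1)               # keep pixels in a valid image range

confidence = torch.softmax(model(x), dim=1)[0, CAT_CLASS].item()
print(f"'tabby cat' confidence on optimized noise: {confidence:.3f}")
```

The same trick works against essentially any differentiable classifier, which is why "trained on human judgments of good and bad" is not the same thing as "robust under optimization pressure".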
Human-level goals are moderately complicated in terms of human-level concepts. In the outcome pump, “get my mother out of the building” is a human-level concept. I agree that you could probably get useful and safe-ish behavior from such a device given a few philosopher-years. Much of the problem is that concepts like “mother” and “building” are really difficult to specify in terms of quantum operators on quark positions or whatever. The more you break human concepts down, the more edge cases you find. Getting a system that would even explode the building is most of the job.
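As a toy illustration of how a human-level instruction loses its meaning when written down naively (the world model and action list below are entirely made up for the example): if “get my mother out of the building” is encoded only as her distance from the building, a brute-force search over actions picks the explosion, because nothing in the objective mentions her surviving.

```python
ACTIONS = {
    # action: (mother's distance from the building afterwards, is she alive?)
    "do_nothing":        (0.0,  True),
    "carry_mother_out":  (20.0, True),
    "detonate_gas_main": (80.0, False),
}

def naive_objective(outcome):
    distance, _alive = outcome
    return distance   # "get my mother out of the building", taken literally

best_action = max(ACTIONS, key=lambda a: naive_objective(ACTIONS[a]))
print(best_action)    # -> detonate_gas_main
```

Every extra clause you bolt on (alive, unharmed, not traumatized, ...) is another human concept that itself needs grounding, which is the regress the paragraph above is pointing at.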
The examples of obviously stupid utility functions having obviously bad results are toy problems; once we have a better understanding of symbol grounding, we will know how much these problems keep reappearing. Manually specifying a utility function might be feasible.