Machine learning works fine on non-adversarial inputs. If you train a network to distinguish cats from dogs and put in a normal picture of a cat, it works. However, there are all sorts of weird inputs that look nothing like cats or dogs that will also get classified as cats. If you give the network a bunch of bad situations and a bunch of good ones (say you crack open a history textbook and ask a bunch of people how nice various periods and regimes were), then you will get a network that can distinguish bad from good within the normal flow of human history. This doesn’t stop there being some weird state, far outside that history, that the network scores as extremely good. Deciding what is and isn’t a good future depends on the answers to moral questions that haven’t come up yet, so we don’t have any training data for questions involving tech we don’t yet have. This can make a big difference. If we decide that uploaded minds do count morally, we are probably going for an entirely virtual civilization, one an anti-uploader would consider worthless. If we decide that mind uploads don’t count morally, we might simulate loads of them in horrible situations for violent video games. Someone who did think that uploaded minds mattered would consider that an S-risk, potentially worse than nothing.
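The first point above, that a classifier trained on ordinary examples still hands out confident labels to inputs unlike anything it saw, can be sketched in a few lines. This is only a toy under stated assumptions: two made-up 2D features stand in for images, a hand-written logistic regression stands in for the network, and every name here (`train_logistic`, `p_cat`, the cluster centres, the "weird" point) is hypothetical and chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(points, labels, steps=500, lr=0.1):
    """Fit a 2D logistic regression by plain batch gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(points)
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (x, y), target in zip(points, labels):
            err = sigmoid(w[0] * x + w[1] * y + b) - target
            grad_w[0] += err * x
            grad_w[1] += err * y
            grad_b += err
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b


def p_cat(model, point):
    """Probability the model assigns to the label 'cat' for one point."""
    (w, b), (x, y) = model, point
    return sigmoid(w[0] * x + w[1] * y + b)


# Assumed toy data: "cats" cluster near (2, 2), "dogs" near (-2, -2).
rng = random.Random(0)
cats = [(rng.gauss(2, 0.5), rng.gauss(2, 0.5)) for _ in range(50)]
dogs = [(rng.gauss(-2, 0.5), rng.gauss(-2, 0.5)) for _ in range(50)]
model = train_logistic(cats + dogs, [1] * 50 + [0] * 50)

p_normal = p_cat(model, (2.0, 2.0))       # an ordinary-looking cat
p_weird = p_cat(model, (1000.0, 1000.0))  # nothing like any training point

print("normal cat:", p_normal)
print("weird input:", p_weird)
```

The weird point is hundreds of times further from the origin than anything in the training set, yet the model is at least as sure it is a cat as it is about a real one. Nothing in training ever told it what to do out there; it just extrapolates.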
Human-level goals are moderately complicated in terms of human-level concepts. In the outcome pump, “get my mother out of the building” is a human-level concept. I agree that you could probably get useful and safe-ish behavior from such a device given a few philosopher-years. Much of the problem is that concepts like “mother” and “building” are really difficult to specify in terms of quantum operators on quark positions or whatever. The more you break human concepts down, the more edge cases you find. Getting a system that would even explode the building, one that grasps “mother” and “building” well enough to act on them at all, is most of the job.
The examples of obviously stupid utility functions having obviously bad results are toy problems. When we have a better understanding of symbol grounding, we will know how much the problems keep reappearing. Manually specifying a utility function might be feasible.