Machine learning works fine on non-adversarial inputs. If you train a network to distinguish cats from dogs and put in a normal picture of a cat, it works. However, there are all sorts of weird inputs that look nothing like cats or dogs that will also get classified as cats. If you give the network a bunch of bad situations and a bunch of good ones (say you crack open a history textbook and ask a bunch of people how nice various periods and regimes were), then you will get a network that can distinguish bad from good within the normal flow of human history. This doesn’t stop there being some weird state that counts as extremely good.

Deciding what is and isn’t a good future depends on the answers to moral questions that haven’t come up yet, so we don’t have any training data for questions involving tech we don’t yet have. This can make a big difference. If we decide that uploaded minds do count morally, we are probably going for an entirely virtual civilization, one an anti-uploader would consider worthless. If we decide that mind uploads don’t count morally, we might simulate loads of them in horrible situations for violent video games. Someone who did think that uploaded minds mattered would consider that an S-risk, potentially worse than nothing.
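To make the cat/dog point concrete, here is a minimal sketch (the toy 2-D data, net size, and starting point are arbitrary choices of mine, and it assumes PyTorch is installed): train a small network on two well-separated clusters, then push a point that looks nothing like either cluster toward whatever the network scores as “cat”.

```python
# Minimal sketch, assuming PyTorch; clusters, net size, and the "weird"
# starting point are all invented for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
CAT, DOG = 0, 1

# Ordinary training data: two well-separated 2-D clusters.
n = 200
cats = torch.randn(n, 2) + torch.tensor([2.0, 2.0])
dogs = torch.randn(n, 2) + torch.tensor([-2.0, -2.0])
x = torch.cat([cats, dogs])
y = torch.cat([torch.full((n,), CAT, dtype=torch.long),
               torch.full((n,), DOG, dtype=torch.long)])

net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(300):                         # fit the normal cat/dog data
    opt.zero_grad()
    F.cross_entropy(net(x), y).backward()
    opt.step()

# Start from a point that looks nothing like either cluster and push it
# toward whatever the network already scores as "cat".
weird = torch.tensor([[40.0, -35.0]], requires_grad=True)
search = torch.optim.Adam([weird], lr=0.5)
for _ in range(200):
    search.zero_grad()
    F.cross_entropy(net(weird), torch.tensor([CAT])).backward()
    search.step()

p_cat = torch.softmax(net(weird), dim=1)[0, CAT].item()
print(f"weird input {weird.detach().tolist()} -> P(cat) = {p_cat:.3f}")
```

The print at the end typically reports a near-certain “cat” label for a point nowhere near the training data, which is the classifier analogue of the weird state that scores as extremely good.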
Human-level goals are moderately complicated in terms of human-level concepts. In the outcome pump example, “get my mother out of the building” is a human-level concept. I agree that you could probably get useful and safe-ish behavior from such a device given a few philosopher-years. Much of the problem is that concepts like “mother” and “building” are really difficult to specify in terms of quantum operators on quark positions or whatever. The more you break human concepts down, the more edge cases you find. Getting a system that wouldn’t explode the building is most of the job.
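A toy sketch of the same point (the actions, numbers, and function names are invented for illustration, not taken from the outcome pump essay): score a handful of candidate actions with the naive utility “distance of mother from the building”, and the explosion wins; patch that one edge case and the next unspecified concept is still waiting.

```python
# Toy sketch, illustrative assumptions only: why the naive objective
# "maximize mother's distance from the building" is not the wish we meant.

ACTIONS = {
    # action: (distance mother ends up from the building in metres, unharmed?)
    "do nothing":              (0.0,   True),
    "firefighter carries her": (30.0,  True),
    "gas main explodes":       (300.0, False),
}

def naive_utility(outcome):
    distance, _unharmed = outcome
    return distance                     # "out of the building" as pure distance

def patched_utility(outcome):
    distance, unharmed = outcome
    return distance if unharmed else float("-inf")   # one edge case patched

for utility in (naive_utility, patched_utility):
    best = max(ACTIONS, key=lambda a: utility(ACTIONS[a]))
    print(f"{utility.__name__}: picks '{best}'")
# naive_utility picks the explosion; patched_utility fixes that one failure,
# but "unharmed", "mother", and "building" each hide more edge cases.
```

In the real problem the action space is every physical outcome, and “mother”, “building”, and “unharmed” each have to be cashed out in low-level terms, which is where the edge cases multiply.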
The examples of obviously stupid utility functions having obviously bad results are toy problems; when we have a better understanding of symbol grounding, we will know how much these problems keep reappearing. Manually specifying a utility function might be feasible.
Rot13
Gung vf pyrneyl n centzngvp qrpvfvba onfrq ba gur fbpvrgl ur jnf va. Svefgyl, fynirel nf cenpgvfrq va napvrag Terrpr jnf bsgra zhpu yrff pehry guna pbybavny fynirel. Tvira gung nyybjvat gur fynir gb znxr gurve bja jnl va gur jbeyq, rneavat zbarl ubjrire gurl fnj svg, naq gura chggvat n cevpr ba gur fynir'f serrqbz jnf pbzzba cenpgvpr, fbzr cenpgvprf gung jrer pnyyrq fynirel ng gur gvzr ybbx abg gung qvssrerag sebz qrog.
Frpbaqyl, ur whfgvsvrf vg ol bgure crbcyr univat avpr guvatf. Vs fubja gur cbjre bs zbqrea cebqhpgvba yvarf sbe znxvat avpr guvatf jvgubhg fynirel, ur jbhyq cebonoyl unir nterrq gung gung jnf n orggre fbyhgvba. Rira zber fb vs fubja fbzr NV anabgrpu gung pbhyq zntvp hc nal avpr guvat.
Guveqyl, zbfg fpvragvfgf hagvy gur ynfg srj uhaqerq lrnef jrer snveyl ryvgr fbpvnyyl. Zrzoref bs fbzr hccre pynff jub pbhyq nssbeq gb rkcrevzrag engure guna jbex. Tvira gur ynetr orarsvg gurl unq, guvf ybbxf yvxr n zhpu ynetre fbhepr bs hgvyvgl guna gur qverpg avprarff bs univat avpr guvatf.
V qba’g guvax gur qrpvfvba ur znqr jnf haernfbanoyr, tvira gur fbpvny pbagrkg naq vasbezngvba ninvynoyr gb uvz ng gur gvzr.