We don’t have a working definition of “what has intrinsic value.” My basic view on these hairy problems (“but what should I value?”) is that we really don’t want to be coding in the answer by hand. I’m more optimistic about building something that has a few layers of indirection, e.g., something that figures out how to act as intended, rather than trying to transmit your object-level intentions by hand.
In the paper you linked, I think Max is raising a slightly different issue. He’s talking about what we would call the ontology identification problem. Roughly, imagine building an AI system that you want to produce lots of diamond. Maybe it starts out with an atomic model of the universe, and you (looking at its model) give it a utility function that scores one point per second for every carbon atom covalently bound to four other carbon atoms (and then time-discounts or something). Later, the system develops a nuclear model of the universe. You do want it to somehow deduce that carbon atoms in the old model map onto six-proton atoms in the new model, and maybe query the user about how to value carbon isotopes in its diamond lattice. You don’t want it to conclude that none of these six-proton nuclei pattern-match to “true carbon”, and then turn the universe upside down looking for some hidden cache of “true carbon.”
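To make the failure mode a bit more concrete, here’s a toy sketch in Python (my own illustration, not code from any of our papers): the utility function is written against an ontology where “carbon” is a primitive label, and re-grounding it in a nuclear model requires some bridge that says which new-model objects count as the old model’s carbon. The world representation and the `good_bridge`/`bad_bridge` names are all hypothetical, just to show where the problem bites.

```python
# Toy sketch of the ontology identification problem (hypothetical; just for
# illustration). A "world" is a graph: node_id -> (node contents, neighbor ids).

# Old ontology: atoms are primitive, labeled by element name.
# world = {node_id: ("carbon", [neighbor ids]), ...}
def diamond_utility_atomic(world):
    """One point per carbon atom covalently bound to four other carbons."""
    score = 0
    for label, neighbors in world.values():
        if label == "carbon":
            carbon_neighbors = sum(1 for n in neighbors if world[n][0] == "carbon")
            if carbon_neighbors == 4:
                score += 1
    return score

# New ontology: atoms are (protons, neutrons) pairs; "carbon" is not primitive.
# world = {node_id: ((6, 6), [neighbor ids]), ...}
def diamond_utility_nuclear(world, counts_as_carbon):
    """The same utility, re-grounded through an ontology bridge
    `counts_as_carbon` that says which nuclei count as old-model carbon."""
    score = 0
    for nucleus, neighbors in world.values():
        if counts_as_carbon(nucleus):
            carbon_neighbors = sum(
                1 for n in neighbors if counts_as_carbon(world[n][0])
            )
            if carbon_neighbors == 4:
                score += 1
    return score

# Desired bridge: old-model "carbon" maps onto six-proton nuclei
# (with a follow-up question to the user about isotopes).
good_bridge = lambda nucleus: nucleus[0] == 6

# Failure mode: nothing in the new model pattern-matches "true carbon",
# so every reachable world scores zero and the goal stops constraining behavior.
bad_bridge = lambda nucleus: False

# Example: one carbon bound to four carbons (a minimal diamond fragment).
old_world = {
    0: ("carbon", [1, 2, 3, 4]),
    1: ("carbon", [0]), 2: ("carbon", [0]),
    3: ("carbon", [0]), 4: ("carbon", [0]),
}
new_world = {
    0: ((6, 6), [1, 2, 3, 4]),
    1: ((6, 6), [0]), 2: ((6, 7), [0]),   # node 2 is carbon-13: the isotope question
    3: ((6, 6), [0]), 4: ((6, 6), [0]),
}

assert diamond_utility_atomic(old_world) == 1
assert diamond_utility_nuclear(new_world, good_bridge) == 1
assert diamond_utility_nuclear(new_world, bad_bridge) == 0
```

The open problem, of course, is how the system itself constructs something like `good_bridge` as its world model changes, rather than having the programmers hand-code a new bridge every time the ontology shifts.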
We have a few papers that touch on this problem, albeit shallowly: Ontological Crises in Artificial Agents’ Value Systems, The Value Learning Problem, and Formalizing Two Problems of Realistic World-Models. There’s a lot more work to be done here, and it’s definitely on our radar, though note that progress is at least a little blocked on attaining a better understanding of how to build multi-level maps of the world.
That diamond/carbon scenario is an excellent concrete example of the ontology problem.