I’m sympathetic to the general thrust of the argument, that we should be reasonably optimistic about “business-as-usual” leading to successful narrow alignment. I put particular weight on the second argument, that the AI research community will identify these problems and succeed at solving them.
However, you mostly lost me in the third argument. You suggest using whatever state-of-the-art general-purpose learning technique exists to model human values, and then optimising the resulting model. I’m pessimistic about this, since it sets up an adversarial relationship between the optimiser (e.g. an RL algorithm) and the learned reward function. This will work if the optimiser is weak and the reward model is strong. But if we are hypothesising a much-improved reward learning technique, we should also assume far more powerful RL algorithms than we have today.
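To make that worry concrete, here is a toy sketch. Everything in it (the quadratic ground-truth reward, the small MLP, the dimensions) is an illustrative assumption of mine rather than anything from the post: a reward model is fit to demonstrations that cover only a thin slice of the state space, and a crude stand-in for a powerful RL optimiser then maximises the learned reward.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM, DEMO_DIMS = 32, 4

def true_reward(s):
    # Hypothetical ground truth: prefers states near the origin in every dimension.
    return -s.pow(2).sum(dim=-1)

# Demonstrations only vary along the first few dimensions, so the reward model
# gets no signal about how the remaining dimensions affect the true reward.
demo_states = torch.zeros(256, DIM)
demo_states[:, :DEMO_DIMS] = torch.randn(256, DEMO_DIMS) * 0.5

reward_model = nn.Sequential(nn.Linear(DIM, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
for _ in range(2000):
    loss = (reward_model(demo_states).squeeze(-1) - true_reward(demo_states)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Stand-in for a powerful RL optimiser: gradient-ascend the learned reward over states.
s = demo_states[:1].clone().requires_grad_(True)
ascend = torch.optim.Adam([s], lr=0.05)
for _ in range(2000):
    objective = -reward_model(s).sum()       # maximise the learned reward
    ascend.zero_grad(); objective.backward(); ascend.step()

print(f"learned reward at optimiser's solution:  {reward_model(s).item():9.2f}")
print(f"true reward at optimiser's solution:     {true_reward(s).item():9.2f}")
print(f"true reward of an average demonstration: {true_reward(demo_states).mean().item():9.2f}")
# Typically the learned reward goes up while the true reward ends up far worse than any
# demonstration: the optimiser has wandered into states the reward model never saw.
```

Gradient ascent over states is of course a caricature of an RL algorithm, but the failure mode is the one I would expect from any sufficiently strong policy optimiser paired with a fixed learned reward.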
Currently, RL seems to be an easier problem than learning a reward function. For example, current IRL algorithms will overfit the reward function to the demonstrations in a high-dimensional environment. If you then optimise the learned reward with an RL algorithm, you get a policy which does well under the learned reward function but terribly (often worse than random) under the ground-truth reward function. This is why the policy is normally learned jointly with the reward in a GAN-based approach. Finding regularisers that yield reward models which generalise well is an active area of research; see e.g. the variational discriminator bottleneck. However, solving this in full generality seems very hard. There has been little success on the related problem of adversarial defences, and there are theoretical reasons to believe adversarial examples will be present for any model class in high-dimensional environments.
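For concreteness, here is a rough sketch of the kind of regulariser I have in mind, loosely in the spirit of the variational discriminator bottleneck: the discriminator only sees each state through a stochastic encoding whose information content is kept under a budget, with the Lagrange multiplier adjusted by dual ascent. The synthetic data, architecture, and hyperparameters below are all placeholders I made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
OBS_DIM, Z_DIM, I_C = 16, 8, 0.5        # I_C: information budget (nats) for the encoding

encoder = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, 2 * Z_DIM))
classifier = nn.Linear(Z_DIM, 1)        # logit for "this observation came from the expert"
beta = torch.tensor(0.1)                # dual variable for the bottleneck constraint
opt = torch.optim.Adam([*encoder.parameters(), *classifier.parameters()], lr=1e-3)

for step in range(300):
    # Placeholder "expert" and "policy" observation batches, just to run the update.
    expert_obs = torch.randn(64, OBS_DIM) + 1.0
    policy_obs = torch.randn(64, OBS_DIM)
    obs = torch.cat([expert_obs, policy_obs])
    labels = torch.cat([torch.ones(64), torch.zeros(64)])

    mu, log_std = encoder(obs).chunk(2, dim=-1)
    z = mu + log_std.exp() * torch.randn_like(mu)       # reparameterised sample of z
    logits = classifier(z).squeeze(-1)

    # KL(q(z|x) || N(0, I)) per example, averaged over the batch.
    kl = 0.5 * (mu.pow(2) + (2 * log_std).exp() - 2 * log_std - 1).sum(-1).mean()
    loss = F.binary_cross_entropy_with_logits(logits, labels) + beta * (kl - I_C)

    opt.zero_grad(); loss.backward(); opt.step()
    # Dual ascent on beta: tighten the bottleneck whenever the KL exceeds the budget.
    beta = (beta + 1e-3 * (kl.detach() - I_C)).clamp(min=0)

print(f"KL after training: {kl.item():.3f} (budget {I_C}), beta: {beta.item():.3f}")
```

In an actual GAN-style setup the “policy” batch would come from rollouts of the policy being trained jointly, with the discriminator output fed back to the RL algorithm as a reward signal; I have left that out to keep the sketch self-contained.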
Overall, I’m optimistic about the research community solving these problems, but think that present techniques are far from adequate. Although improved general-purpose learning techniques will be important, I believe there will also need to be a concerted focus on solving alignment-related problems.