Ben gives a great example of how the “alignment problem” might look different than we expect:
The case of the house-cleaning robot
Problem: We don’t know how to build a simulated robot that cleans houses well
Available techniques aren’t suitable:
- Simple hand-coded reward functions (e.g. dust minimization) won’t produce the desired behavior (see the sketch after this list)
- We don’t have enough data (or sufficiently relevant data) for imitation learning
- Existing reward modeling approaches are probably insufficient (sketched further below)
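To make the first bullet concrete, here is a minimal Python sketch of a hand-coded dust reward and the specification gaming it invites. The environment and numbers are hypothetical, purely for illustration; the point is that the reward-maximizing policy manufactures mess and re-vacuums it rather than cleaning the house.

```python
# Hypothetical toy sketch (not from any real codebase) of a hand-coded
# "dust collected" reward and how it can be gamed.

def dust_collected_reward(prev_dust: float, curr_dust: float) -> float:
    """Pay the robot for dust it picks up this step; making mess is free."""
    return max(0.0, prev_dust - curr_dust)

# An honest policy that cleans the initial 5 units of dust earns 5 in total.
# A gaming policy that alternates spilling and vacuuming earns more:
dust_levels = [5.0, 8.0, 3.0, 7.0, 2.0]  # spill, clean, spill, clean ...
total = sum(dust_collected_reward(a, b)
            for a, b in zip(dust_levels, dust_levels[1:]))
print(total)  # 10.0 -- double the honest reward, unbounded if the loop continues
```

A signed variant (`prev_dust - curr_dust`) closes this particular loop, but it still says nothing about curtains shredded or vases knocked over along the way, which is the deeper reason simple hand-coded rewards fall short.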
This is sort of an “AI alignment problem,” insofar as techniques currently classified as “alignment techniques” will probably be needed to solve it. But it also seems very different from the AI alignment problem as classically conceived.
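For context on the last bullet in the list above: “reward modeling” usually means learning a reward function from human feedback, e.g. pairwise preference comparisons in the style of Christiano et al.’s “Deep RL from Human Preferences.” A minimal sketch of the idea, with a Bradley-Terry-style loss; the features, names, and numbers here are hypothetical:

```python
import math

def reward_model(trajectory, w):
    """Linear reward over hand-picked features: (dust removed, objects broken)."""
    dust_removed, objects_broken = trajectory
    return w[0] * dust_removed + w[1] * objects_broken

def preference_loss(traj_a, traj_b, w):
    """Cross-entropy loss for a human judgment 'traj_a preferred over traj_b'."""
    ra, rb = reward_model(traj_a, w), reward_model(traj_b, w)
    p_a = 1.0 / (1.0 + math.exp(rb - ra))  # P(a preferred) under Bradley-Terry
    return -math.log(p_a)

# Example: the human prefers the trajectory that broke nothing.
tidy = (3.0, 0.0)      # (dust removed, objects broken)
careless = (5.0, 2.0)
w = [1.0, -4.0]
print(preference_loss(tidy, careless, w))  # ~0.0025: model agrees with the human
```

Even with perfect optimization, the learned reward is only as good as the features it sees and the comparisons the human can reliably make, which is one reason the slide hedges that existing approaches are “probably insufficient.”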
...
One possible interpretation: If we can’t develop “alignment” techniques soon enough, we will instead build powerful and destructive dust-minimizers
A more natural interpretation: We won’t have highly capable house-cleaning robots until we make progress on “alignment” techniques
I’ve concluded that the process orthogonality thesis is less likely to apply to real AI systems than I had assumed (i.e. I’ve updated downward), and that the “alignment problem” as originally conceived is therefore less likely to affect AI systems deployed in the real world. However, I don’t feel ready to dismiss all potential global catastrophic risks from imperfectly designed AI (e.g. multi-multi failures); I’d rather be safe than sorry.