(Also, BTW, I would have preferred the word “narrow” or something like it in the post title, because some people use “alignment” in a broad sense and as a result may misinterpret you as being more optimistic than you actually are.)
Good point – changed the title.
Also, distributed emergence of AI is likely not safer than centralized AI, because an “economy” of AIs would be even harder to control and harness towards human values than a single or small number of AI agents.
As long as we consider only narrow alignment, it does seem safer to me in that local misalignment or safety issues in individual systems would not immediately cause everything to break down, because such a system would (arguably) not be able to obtain a decisive strategic advantage and take over the world. So there’d be time to react.
But I agree with you that an economy-like scenario entails other safety issues, and aligning the entire “economy” with human (compromise) values might be very difficult. So I don’t think this is safer overall, or at least it’s not obvious. (From my suffering-focused perspective, distributed emergence of AI actually seems worse than a scenario of the form “a single system quickly takes over and forms a singleton”, as the latter seems less likely to lead to conflict-related disvalue.)
This assumes that alignment work is highly parallelizable. If it’s not, then doing more alignment work now can shift the whole alignment timeline forward, instead of just adding to the total amount of alignment work in a marginal way.
Yeah, I do think that alignment work is fairly parallelizable, and future work also has a (potentially very big) information advantage over current work, because future researchers will know more about what AI techniques look like. Is there any precedent of a new technology where work on safety issues was highly serial and where it was therefore crucial to start working on safety a long time in advance?
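To make the serial-vs-parallel distinction concrete, here’s a toy Amdahl-style model (my own illustration, with entirely made-up numbers): if some fraction of alignment work has to be done as a serial chain, starting that chain early shifts the whole timeline, whereas fully parallelizable work can largely be left to a later, larger, better-informed field.

```python
# Toy Amdahl-style model (illustrative only; all quantities are made up).
# Assumption: serial work takes fixed calendar time no matter how many
# researchers there are, while parallel work divides across the workforce.

def years_to_finish(total_work: float, serial_fraction: float, researchers: float) -> float:
    """Calendar years to finish `total_work` researcher-years of alignment work."""
    serial_work = total_work * serial_fraction
    parallel_work = total_work * (1 - serial_fraction)
    return serial_work + parallel_work / researchers

W = 100.0  # hypothetical researcher-years of alignment work needed

for s in (0.0, 0.2, 0.5):
    early = years_to_finish(W, s, researchers=10)    # small field starting now
    late = years_to_finish(W, s, researchers=100)    # large field starting later
    print(f"serial fraction {s:.0%}: 10 researchers -> {early:.1f} yrs, "
          f"100 researchers -> {late:.1f} yrs")

# With s = 0, the later, larger field needs only 1 year vs. 10, so early work
# adds a marginal head start at best; with s = 0.5, even 100 researchers need
# ~50 years of calendar time, so starting the serial chain early shifts the
# whole timeline forward.
```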
This only applies to short-term “alignment” and not to long-term / scalable alignment. That is, I have an economic incentive to build an AI that I can harness to give me short-term profits, even if that’s at the expense of the long-term value of the universe to humanity or human values. This could be done, for example, by creating an AI that is not at all aligned with my values and just giving it rewards/punishments so that it has a near-term instrumental reason to help me (similar to how other humans are useful to us even if they are not value-aligned with us).
I think there are two different cases:
If the human actually cares only about short-term selfish gain, possibly at the expense of others, then this isn’t a narrow alignment failure but a cooperation problem. (But I agree that it could be a serious issue.)
If the human actually cares about the long term, then it appears that she’s making a mistake by buying an AI system that is only aligned in the short term. So it comes down to human inadequacy – given sufficient information, she’d buy a long-term aligned AI system instead, and AI companies would have an incentive to provide long-term aligned AI systems. Though of course the “sufficient information” part is crucial, and it is a fairly strong assumption, as it may be hard to distinguish between “short-term alignment” and “real” alignment. I agree that this is another potentially serious problem.
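To illustrate the difference between the two kinds of “alignment” with a deliberately crude toy model (made-up payoffs, not a claim about how real systems would behave): a reward-shaped agent that only optimizes near-term payoff can look helpful while destroying long-term value, whereas an agent that genuinely weights the long term chooses differently.

```python
# Toy model with invented payoffs: (short-term reward to the operator, long-term value).
ACTIONS = {
    "help_conservatively": (5.0, 10.0),     # modest profit, preserves long-term value
    "aggressive_extraction": (9.0, -50.0),  # higher profit, damages the long term
}

def best_action(long_term_weight: float) -> str:
    """Pick the action maximizing short-term reward + long_term_weight * long-term value."""
    return max(ACTIONS, key=lambda a: ACTIONS[a][0] + long_term_weight * ACTIONS[a][1])

print(best_action(long_term_weight=0.0))  # myopic, reward-shaped agent -> 'aggressive_extraction'
print(best_action(long_term_weight=1.0))  # agent that values the long term -> 'help_conservatively'
```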
I think we ourselves don’t know how to reliably distinguish between “attempts to manipulate” and “attempts to help”, so it would be hard for AIs to learn this. One problem is that our own manipulate/help classifier was trained on a narrow set of inputs (i.e., of other humans manipulating/helping) and will likely fail when applied to AIs due to distributional shift.
Interesting point. I think I still have an intuition that there’s a fairly simple core to it, but I’m not sure how to best articulate this intuition.
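For what it’s worth, here’s a minimal sketch of the distributional-shift concern (a toy setup with synthetic data, not a proposal for how such a classifier would actually be built): a “manipulate vs. help” classifier fit on human-generated behaviour drops to near chance on behaviour drawn from a shifted, “AI-like” distribution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def behaviours(n, manipulation_signature, base_shift):
    """Toy 2-D behaviour features; label 1 = manipulative (all synthetic)."""
    labels = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + base_shift      # base behaviour distribution
    X[labels == 1] += manipulation_signature      # what manipulation 'looks like'
    return X, labels

# Training data: human behaviour, where manipulation shows up along axis 0.
X_human, y_human = behaviours(2000, np.array([2.0, 0.0]), np.array([0.0, 0.0]))
clf = LogisticRegression().fit(X_human, y_human)

# Test data: AI behaviour, where the tells are different and the base
# distribution has shifted as well.
X_ai, y_ai = behaviours(2000, np.array([0.0, 2.0]), np.array([3.0, -1.0]))

print("accuracy on human-like behaviour:", clf.score(X_human, y_human))
print("accuracy on AI-like behaviour:   ", clf.score(X_ai, y_ai))
# The second number is typically near 0.5: the classifier has learned the
# human manipulation signature, which doesn't transfer under the shift.
```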
Thanks for the detailed comments!