In this comment I engage with many of the object-level arguments in the post. I upvoted this post because I think it’s useful to write down these arguments, but we should also consider the counterarguments.
(Also, BTW, I would have preferred the word “narrow” or something like it in the post title, because some people use “alignment” in a broad sense and as a result may misinterpret you as being more optimistic than you actually are.)
If the emergence of AI is gradual or distributed, then it is more plausible that safety issues can adequately be handled “as usual”, by reacting to issues as they arise, by extensive testing and engineering, and by incrementally designing systems to satisfy multiple constraints.
If the emergence of AI is gradual enough, it does seem that safety issues can be handled adequately, but even many people who think “soft takeoff” is likely don’t seem to think that AI will come that slowly. To the extent that AI does emerge that slowly, that seems to cut across many other AI-related problem areas including ones mentioned in the Summary as alternatives to narrow alignment.
Also, distributed emergence of AI is likely not safer than centralized AI, because an “economy” of AIs would be even harder to control and harness towards human values than a single or small number of AI agents. An argument can be made that AI alignment work is valuable in part so that unified AI agents can be safely built, thereby heading off such a less controllable AI economy.
So it does not seem like “distributed” by itself buys any safety. I think our intuition that it does probably comes from a sense that “distributed” is correlated with “gradual”. If you consider a fast and distributed rise of AI, does that really seem safer than a fast and centralized rise of AI?
While alignment looks neglected now, we should also take into account that huge amounts of resources will likely be invested if it becomes apparent that this is a serious problem (see also here).
This assumes that alignment work is highly parallelizable. If it’s not, then doing more alignment work now can shift the whole alignment timeline forward, instead of just adding to the total amount of alignment work in a marginal way.
Strong economic incentives will push towards alignment: it’s not economically useful to have a powerful AI system that doesn’t reliably do what you want.
This only applies to short-term “alignment” and not to long-term / scalable alignment. That is, I have an economic incentive to build an AI that I can harness to give me short-term profits, even if that’s at the expense of the long term value of the universe to humanity or human values. This could be done for example by creating an AI that is not at all aligned with my values and just giving it rewards/punishments so that it has a near-term instrumental reason to help me (similar to how other humans are useful to us even if they are not value aligned to us).
Existing approaches hold some promise
I have an issue with “approaches” (plural) here because as far as I can tell, everyone is converging on Paul Christiano’s iterated amplification approach (except for MIRI, which is doing more theoretical research). ETA: To be fair, perhaps iterated amplification should be viewed as a cluster of related approaches.
But the crux is that the notion of human values doesn’t need to be perfect to understand that humans do not approve of lock-ins, that humans would not approve of attempts to manipulate them, and so on.
I think we ourselves don’t know how to reliably distinguish between “attempts to manipulate” and “attempts to help”, so it would be hard for AIs to learn this. One problem is that our own manipulate/help classifier was trained on a narrow set of inputs (i.e., other humans manipulating/helping) and will likely fail when applied to AIs due to distributional shift.
Again, it’s not that hard to understand what it means to be a helpful assistant to somebody.
Same problem here: our own understanding of what it means to be a helpful assistant to somebody likely isn’t robust to distributional shifts. I think this means we actually need to gain a broad/theoretical understanding of “corrigibility” or “helping”, rather than just having AIs learn it from humans.
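As a loose illustration of the distributional-shift worry in the two points above (just a toy sketch with made-up synthetic data and an arbitrary model choice; the manipulate/help framing is only an analogy here): a classifier that fits a narrow training distribution can look very accurate in-distribution and still fail badly once the inputs come from a different distribution.

```python
# Toy sketch (hypothetical setup): a classifier trained only on a narrow slice of
# inputs degrades when the input distribution shifts, because the rule it learned
# only approximates the true concept locally.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, center):
    # True concept: label 1 iff the two features have the same sign.
    # A linear model cannot represent this globally, only locally.
    X = rng.normal(loc=center, scale=1.0, size=(n, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)
    return X, y

# Train on a narrow region of input space (analogue: humans judging other humans).
X_train, y_train = make_data(2000, center=[2.0, 2.0])
clf = LogisticRegression().fit(X_train, y_train)

# Evaluate in-distribution vs. on a shifted distribution (analogue: judging AIs).
X_in, y_in = make_data(2000, center=[2.0, 2.0])
X_shift, y_shift = make_data(2000, center=[-2.0, -2.0])

print("in-distribution accuracy:", clf.score(X_in, y_in))       # typically high
print("shifted-distribution accuracy:", clf.score(X_shift, y_shift))  # typically collapses
```

The classifier looks fine on held-out data from the training region, but its accuracy can drop far below chance on the shifted region, because the locally adequate rule it learned simply isn’t the true concept.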
Thanks for the detailed comments!

(Also, BTW, I would have preferred the word “narrow” or something like it in the post title, because some people use “alignment” in a broad sense and as a result may misinterpret you as being more optimistic than you actually are.)
Good point – changed the title.
Also, distributed emergence of AI is likely not safer than centralized AI, because an “economy” of AIs would be even harder to control and harness towards human values than a single or small number of AI agents.
As long as we consider only narrow alignment, it does seem safer to me in that local misalignment or safety issues in individual systems would not immediately cause everything to break down, because such a system would (arguably) not be able to obtain a decisive strategic advantage and take over the world. So there’d be time to react.
But I agree with you that an economy-like scenario entails other safety issues, and aligning the entire “economy” with human (compromise) values might be very difficult. So I don’t think this is safer overall, or at least it’s not obvious. (From my suffering-focused perspective, distributed emergence of AI actually seems worse than a scenario of the form “a single system quickly takes over and forms a singleton”, as the latter seems less likely to lead to conflict-related disvalue.)
This assumes that alignment work is highly parallelizable. If it’s not, then doing more alignment work now can shift the whole alignment timeline forward, instead of just adding to the total amount of alignment work in a marginal way.
Yeah, I do think that alignment work is fairly parallelizable, and future work also has a (potentially very big) information advantage over current work, because future researchers will know more about what AI techniques look like. Is there any precedent of a new technology where work on safety issues was highly serial and where it was therefore crucial to start working on safety a long time in advance?
This only applies to short-term “alignment” and not to long-term / scalable alignment. That is, I have an economic incentive to build an AI that I can harness to give me short-term profits, even if that’s at the expense of the long term value of the universe to humanity or human values. This could be done for example by creating an AI that is not at all aligned with my values and just giving it rewards/punishments so that it has a near-term instrumental reason to help me (similar to how other humans are useful to us even if they are not value aligned to us).
I think there are two different cases:
1. If the human actually cares only about short-term selfish gain, possibly at the expense of others, then this isn’t a narrow alignment failure; it’s a cooperation problem. (But I agree that it could be a serious issue.)
2. If the human actually cares about the long term, then it appears that she’s making a mistake by buying an AI system that is only aligned in the short term. So it comes down to human inadequacy – given sufficient information she’d buy a long-term aligned AI system instead, and AI companies would have an incentive to provide long-term aligned AI systems. Though of course the “sufficient information” part is crucial, and is a fairly strong assumption, as it may be hard to distinguish between “short-term alignment” and “real” alignment. I agree that this is another potentially serious problem.
I think we ourselves don’t know how to reliably distinguish between “attempts to manipulate” and “attempts to help”, so it would be hard for AIs to learn this. One problem is that our own manipulate/help classifier was trained on a narrow set of inputs (i.e., other humans manipulating/helping) and will likely fail when applied to AIs due to distributional shift.
Interesting point. I think I still have an intuition that there’s a fairly simple core to it, but I’m not sure how to best articulate this intuition.