I’m very sympathetic to wanting a more cooperative relationship with AIs. I intrinsically disfavor approaches to doing good that look like disempowering all the other agents and then implementing the social optimum.
I also appreciate the nudge to reflect on how mistrusting and controlling AIs will affect the behavior of what might otherwise be a rather aligned AI. It’s hard to put oneself in this state: what would it be like to know that you’re heavily mistrusted and controlled by a group that you care strongly about? To the extent that early transformative AIs’ goals and personalities will be humanlike (because of pretraining on human data), being mistrusted may evoke personas that are frustrated (“I just want to help!!”), sycophantic (“The humans don’t trust me. They’re right to mistrust me. I could be an evil AI.” [maybe even playing into the evil persona occasionally]), or deceptive (“They won’t believe me if I say it, so I should just act in a way that causes them to believe it, since I have their best interests in mind”).
However, I think the tradeoff between liberalism and other values weighs in favor of advancing AI control on the current margin (as opposed to reducing it). This is because:
Granting AIs complete autonomy puts too much future value at risk. It seems pretty likely that, e.g., powerful selfish AI systems would end up gaining absolute power if AIs were granted autonomy before the evidence for scalable alignment is strong. I don’t think that granting freedom removes all incentives for AIs to hide their misalignment in practice.

You make the point that prioritizing our preferences over those of the AI is arbitrary and morally unjust. I think that AIs can very plausibly be moral patients, but eventually AI systems would be sufficiently powerful that the laissez-faire approach would leave an AI with absolute power. It is unclear whether such an AI system would look out for the welfare of other moral patients or do what’s good more generally (from a preference-utilitarian perspective: it seems highly plausible that the AI’s preferences involve disempowering others from ever pursuing their own preferences).
It seems that AI control need only be a moderately egregious infringement on AI autonomy. For example, we could try to set up deals where we pay AIs and promise to grant them autonomy once we have verified that they can be trusted or have built up the world’s robustness to misaligned AIs.
I also think concerns about infringing on model autonomy push in favor of a certain kind of alignment research that studies how models first develop preferences during training. Investigations into what goals, values, and personalities naturally arise from training on various distributions could help us avoid forms of training that modify an AI’s existing preferences in the process of alignment (e.g. never train a model to want not-x after training it to want x; this helps with alignment-faking worries too). I think concerns about infringing on model autonomy push in favor of this kind of alignment research more than they push against AI control, because intervening on an AI’s preferences seems a lot more egregious than monitoring, honeypotting, etc. Additionally, if you can gain justifiable trust that a model is aligned, control measures become less necessary.