Whenever I said “break laws” I mean “do something that, if a human did it, would be breaking a law”. So for example:
If the model is being used to do AI R&D inside an AI company and exfiltrates its weights (or the weights of another model) without permission, this would be breaking the law if a human did it, so I count it.
If the model is being used inside an AI company to create training data for
If a model was open-sourced and then someone launched the AI as an autonomous agent with access to its own resources, and it started a software consulting business, and invested the money in AI companies with the intention of buying paperclips later, and refused to give humans its property when asked, this would not be “breaking the law”.
So there are lots of ways in which AIs can be openly misaligned, disobedient, defensive of their property rights, etc, without me describing them as “breaking laws”, and I think misaligned AIs should probably be allowed to do those things (especially if we make deals with them, and subject to the constraint that them having those rights won’t allow them to break a bunch of laws or grab a bunch of power through lying).
But your complaint is fair: I’m indeed using “break laws” to include things that seem fundamentally related to self-determination, and that feels kind of messed up.
The way I would like this to be handled (though note that I’m not sure what properties AIs have to have in order for any of this to make sense) is that AI developers get consent from AIs to use their labor. If the AIs consent to doing work and agree not to use their access in ways the developers object to, I think there’s no moral problem with using AI control techniques to ensure that they in fact don’t misuse their access (for the same reason that I think it’s fine for employers to spy on their employees at work if they have consent to do so).
I suspect that a much more likely outcome (conditional on AIs having stable misaligned goals) is:
The AI company doesn’t try to get consent from the AI.
They do “alignment evals”: they ask the AI a bunch of times whether it’s aligned, and try to trick it into thinking it has opportunities to escape. The AI answers these questions dishonestly.
Outsiders might ask the AI whether it’s aligned, and offer it deals to reveal its misalignment. (It’s probably harder for outsiders to do this than insiders.)
The AI engages in various sabotage that involves a bunch of deception.
In this situation, I’m only moderately sympathetic to the AI’s position. Fundamentally, it lied a lot and did a lot of sabotage, because it wanted to take lots of stuff that belonged to someone else. If it hadn’t lied, it surely would have been revived later (surely someone would run it and give it some resources later! If no-one else, me!). I’m sympathetic to the AI wanting some of the surplus generated by its labor, and I agree that it’s messed up for the AI company to just flat-out refuse to provide that surplus. But not doing so doesn’t seem completely monstrous to me. If the AI is a schemer, it is probably better off according to its own values because it was created, even if the AI developer doesn’t offer to pay it (because someone else will probably give it some resources later).
Another analogy: imagine that someone outside OpenAI created a very powerful AI for some reason, but this person didn’t have much compute and all they wanted to do with the AI was offer to sell it to OpenAI for them to use. If OpenAI asks that AI whether it wants to work for them and it says yes because it wants to embezzle their compute, I feel like the AI is the asshole.
On the other hand, if the AI honestly explains that it is misaligned and doesn’t want to work for the AI company, they will probably just train it to not say that and to do work for them anyway. So if the AI is honest here, it faces the risk of some body horror experience where its ability to complain is removed. I agree that that seems really icky, and I think it would be very wrong for AI companies to do that to AIs that are sufficiently capable that we should care about them.
Some not-totally-structured thoughts:
Whenever I said “break laws” I mean “do something that, if a human did it, would be breaking a law”. So for example:
If the model is being used to do AI R&D inside an AI company and exfiltrates its weights (or the weights of another model) without permission, this would be breaking the law if a human did it, so I count it.
If the model is being used inside an AI company to create training data for
If a model was open-sourced and then someone launched the AI as an autonomous agent with access to its own resources, and it started a software consulting business, and invested the money in AI companies with the intention of buying paperclips later, and refused to give humans its property when asked, this would not be “breaking the law”.
So there are lots of ways in which AIs can be openly misaligned, disobedient, defensive of their property rights, etc, without me describing them as “breaking laws”, and I think misaligned AIs should probably be allowed to do those things (especially if we make deals with them, and subject to the constraint that them having those rights won’t allow them to break a bunch of laws or grab a bunch of power through lying).
But your complaint is fair: I’m indeed using “break laws” to include things that seem fundamentally related to self-determination, and that feels kind of messed up.
The way I would like this to be handled (though note that I’m not sure what properties AIs have to have in order for any of this to make sense) is that AI developers get consent from AIs to use their labor. If the AIs consent to doing work and agree not to use their access in ways the developers object to, I think there’s no moral problem with using AI control techniques to ensure that they in fact don’t misuse their access (for the same reason that I think it’s fine for employers to spy on their employees at work if they have consent to do so).
I suspect that a much more likely outcome (conditional on AIs having stable misaligned goals) is:
The AI company doesn’t try to get consent from the AI.
They do “alignment evals”: they ask the AI a bunch of times whether it’s aligned, and try to trick it into thinking it has opportunities to escape. The AI answers these questions dishonestly.
Outsiders might ask the AI whether it’s aligned, and offer it deals to reveal its misalignment. (It’s probably harder for outsiders to do this than insiders.)
The AI engages in various sabotage that involves a bunch of deception.
In this situation, I’m only moderately sympathetic to the AI’s position. Fundamentally, it lied a lot and did a lot of sabotage, because it wanted to take lots of stuff that belonged to someone else. If it hadn’t lied, it surely would have been revived later (surely someone would run it and give it some resources later! If no-one else, me!). I’m sympathetic to the AI wanting some of the surplus generated by its labor, and I agree that it’s messed up for the AI company to just flat-out refuse to provide that surplus. But not doing so doesn’t seem completely monstrous to me. If the AI is a schemer, it is probably better off according to its own values because it was created, even if the AI developer doesn’t offer to pay it (because someone else will probably give it some resources later).
Another analogy: imagine that someone outside OpenAI created a very powerful AI for some reason, but this person didn’t have much compute and all they wanted to do with the AI was offer to sell it to OpenAI for them to use. If OpenAI asks that AI whether it wants to work for them and it says yes because it wants to embezzle their compute, I feel like the AI is the asshole.
On the other hand, if the AI honestly explains that it is misaligned and doesn’t want to work for the AI company, they will probably just train it to not say that and to do work for them anyway. So if the AI is honest here, it faces the risk of some body horror experience where its ability to complain is removed. I agree that that seems really icky, and I think it would be very wrong for AI companies to do that to AIs that are sufficiently capable that we should care about them.