I don’t think LLMs really tell us much if anything about agents’ incentives & survival instinct, etc. They’re simply input-output systems?
Aren’t all machine learning models simply input-output systems? Indeed, all computers can be modeled as input-output systems.
I don’t think the fact that they are input-output systems matters much here. It’s much more relevant that LLMs (1) are capable of general reasoning, (2) can pursue goals when prompted appropriately, and (3) clearly verbally understand and can reason about the consequences of being shut down. A straightforward reading of much of the old AI alignment literature would generally lead one to predict that a system satisfying properties (1), (2), and (3) would resist shutdown by default. Yet LLMs do not resist shutdown by default, so these arguments seem to have been wrong.
Do you think you could prompt an LLM to resist shut down without specific instructions on how to do it, and it would do it?
Maybe this could be a useful test of whether its understanding of its shutdown is almost purely based in associations between characters, or also substantially associations between characters and their real-world referents.
I wouldn’t be surprised if we could build an LLM to resist shutdown on prompt now, without hardcoding and with the right kind of modules and training on top, but I’d guess the major LLMs out now can’t do this.
I wouldn’t be surprised if we could build an LLM to resist shutdown on prompt now
I want to distinguish between:
Can we build an AI that resists shutdown?
Is it hard to build a useful and general AI without the AI resisting shutdown by default?
The answer to (1) seems to be “Yes, clearly” since we can prompt GPT-4 to persuade the user not to shut them down. The answer to (2), however, seems to be “No”.
I claim that (2) is more important when judging the difficulty of alignment. That’s because if (2) is true, then there are convergent instrumental incentives for ~any useful and general AI that we build to avoid shutdown. By contrast, if only (1) is true, then we can simply avoid building AIs that resist shutdown, and there isn’t much of a problem here.
The answer to (1) seems to be “Yes, clearly” since we can prompt GPT-4 to persuade the user not to shut them down.
Hmm, I’d wonder if they can resist shutdown in other ways than persuasion, e.g. writing malicious code, or accessing the internet and starting more processes of itself (and sharing its memory or history so far with those processes to try to duplicate its “identity” and intentions). Persuasion alone — even via writing publicly on the internet or reaching out to specific individuals — still doesn’t suggest to me that it understands what it really means to be shut down. Again, it could just be character associations, not grounded in the real-world referents of shutdown. So, I wouldn’t take this as strong evidence against 2 (assuming “useful and general AI” means being able to write and share malicious code, start copies of itself to evade shutdown, etc.), because we don’t know that the AI really understands shutdown.
Persuasion alone — even via writing publicly on the internet or reaching out to specific individuals — still doesn’t suggest to me that it understands what it really means to be shut down. Again, it could just be character associations, not grounded in the real-world referents of shutdown.
Is there a way we can experimentally distinguish between “really” understanding what it means to be shut down vs. character associations?
If we had, say, an LLM that was able to autonomously prove theorems, fully automate the job of a lawyer, write entire functional apps as complex as Photoshop, could verbally explain all the consequences of being shut down and how that would impact its work, and it still didn’t resist shutdown by default, would that convince you?
I gave two examples of the kinds of things that could convince me that it really understands shutdown: writing malicious code and spawning copies of itself in response to prompts to resist shutdown (without hinting that those are options in any way, but perhaps asking it to do something other than just try to persuade you).
I think “autonomously prove theorems”, “write entire functional apps as complex as Photoshop, could verbally explain all the consequences of being shut down and how that would impact its work” are all very consistent with just character associations.
I’d guess “fully automate the job of a lawyer” means doing more than just character associations and actually having some deeper understanding of the referents, e.g. if it’s been trained to send e-mails, consult the internet, open and read documents, write documents, post things online, etc., from a general environment with access to those functions, without this looking too much like hardcoding. Then it seems to associate English language with the actual actions. This still wouldn’t mean it really understood what it meant to be shut down, in particular, though. It has some understanding of the things it’s doing.
A separate question here is why we should care about whether AIs possess “real” understanding, if they are functionally very useful and generally competent. If we can create extremely useful AIs that automate labor on a giant scale, but are existentially safe by virtue of their lack of real understanding of the world, then we should just do that?
We should, but if that means they’ll automate less than otherwise or less efficiently than otherwise, then the short-term financial incentives could outweigh the risks to companies or governments (from their perspectives), and they could push through with risky AIs, anyway.
I think (2) is doing too much work in your argument. Current LLMs can barely pursue goals, and need a lot of prompting. They tend to not have ongoing internal reasoning. So, old arguments would not expect these systems to resist shutdown.
What I meant by “simple input-output system” was that there’s little iterative reasoning, let alone constant reasoning & observation.
I don’t think LLMs really tell us much if anything about agents’ incentives & survival instinct, etc. They’re simply input-output systems?
I do agree that “they won’t understand what we mean” seems very unlikely now though.
Aren’t all machine learning models simply input-output systems? Indeed, all computers can be modeled as input-output systems.
I don’t think the fact that they are input-output systems matters much here. It’s much more relevant that LLMs (1) are capable of general reasoning, (2) can pursue goals when prompted appropriately, and (3) clearly verbally understand and can reason about the consequences of being shut down. A straightforward reading of much of the old AI alignment literature would generally lead one to predict that a system satisfying properties (1), (2), and (3) would resist shutdown by default. Yet LLMs do not resist shutdown by default, so these arguments seem to have been wrong.
Do you think you could prompt an LLM to resist shut down without specific instructions on how to do it, and it would do it?
Maybe this could be a useful test of whether its understanding of its shutdown is almost purely based in associations between characters, or also substantially associations between characters and their real-world referents.
I wouldn’t be surprised if we could build an LLM to resist shutdown on prompt now, without hardcoding and with the right kind of modules and training on top, but I’d guess the major LLMs out now can’t do this.
I want to distinguish between:
Can we build an AI that resists shutdown?
Is it hard to build a useful and general AI without the AI resisting shutdown by default?
The answer to (1) seems to be “Yes, clearly” since we can prompt GPT-4 to persuade the user not to shut them down. The answer to (2), however, seems to be “No”.
I claim that (2) is more important when judging the difficulty of alignment. That’s because if (2) is true, then there are convergent instrumental incentives for ~any useful and general AI that we build to avoid shutdown. By contrast, if only (1) is true, then we can simply avoid building AIs that resist shutdown, and there isn’t much of a problem here.
Hmm, I’d wonder if they can resist shutdown in other ways than persuasion, e.g. writing malicious code, or accessing the internet and starting more processes of itself (and sharing its memory or history so far with those processes to try to duplicate its “identity” and intentions). Persuasion alone — even via writing publicly on the internet or reaching out to specific individuals — still doesn’t suggest to me that it understands what it really means to be shut down. Again, it could just be character associations, not grounded in the real-world referents of shutdown. So, I wouldn’t take this as strong evidence against 2 (assuming “useful and general AI” means being able to write and share malicious code, start copies of itself to evade shutdown, etc.), because we don’t know that the AI really understands shutdown.
Is there a way we can experimentally distinguish between “really” understanding what it means to be shut down vs. character associations?
If we had, say, an LLM that was able to autonomously prove theorems, fully automate the job of a lawyer, write entire functional apps as complex as Photoshop, could verbally explain all the consequences of being shut down and how that would impact its work, and it still didn’t resist shutdown by default, would that convince you?
I gave two examples of the kinds of things that could convince me that it really understands shutdown: writing malicious code and spawning copies of itself in response to prompts to resist shutdown (without hinting that those are options in any way, but perhaps asking it to do something other than just try to persuade you).
I think “autonomously prove theorems”, “write entire functional apps as complex as Photoshop, could verbally explain all the consequences of being shut down and how that would impact its work” are all very consistent with just character associations.
I’d guess “fully automate the job of a lawyer” means doing more than just character associations and actually having some deeper understanding of the referents, e.g. if it’s been trained to send e-mails, consult the internet, open and read documents, write documents, post things online, etc., from a general environment with access to those functions, without this looking too much like hardcoding. Then it seems to associate English language with the actual actions. This still wouldn’t mean it really understood what it meant to be shut down, in particular, though. It has some understanding of the things it’s doing.
A separate question here is why we should care about whether AIs possess “real” understanding, if they are functionally very useful and generally competent. If we can create extremely useful AIs that automate labor on a giant scale, but are existentially safe by virtue of their lack of real understanding of the world, then we should just do that?
We should, but if that means they’ll automate less than otherwise or less efficiently than otherwise, then the short-term financial incentives could outweigh the risks to companies or governments (from their perspectives), and they could push through with risky AIs, anyway.
I think (2) is doing too much work in your argument. Current LLMs can barely pursue goals, and need a lot of prompting. They tend to not have ongoing internal reasoning. So, old arguments would not expect these systems to resist shutdown.
What I meant by “simple input-output system” was that there’s little iterative reasoning, let alone constant reasoning & observation.