The answer to (1) seems to be "Yes, clearly", since we can prompt GPT-4 to persuade the user not to shut them down.
Hmm, I'd wonder if they can resist shutdown in ways other than persuasion, e.g. writing malicious code, or accessing the internet and starting more processes of itself (and sharing its memory or history so far with those processes to try to duplicate its "identity" and intentions). Persuasion alone, even via writing publicly on the internet or reaching out to specific individuals, still doesn't suggest to me that it understands what it really means to be shut down. Again, it could just be character associations, not grounded in the real-world referents of shutdown. So, I wouldn't take this as strong evidence against (2) (assuming "useful and general AI" means being able to write and share malicious code, start copies of itself to evade shutdown, etc.), because we don't know that the AI really understands shutdown.
Is there a way we can experimentally distinguish between "really" understanding what it means to be shut down and mere character associations?
If we had, say, an LLM that could autonomously prove theorems, fully automate the job of a lawyer, write entire functional apps as complex as Photoshop, and verbally explain all the consequences of being shut down and how that would impact its work, and it still didn't resist shutdown by default, would that convince you?
I gave two examples of the kinds of things that could convince me that it really understands shutdown: writing malicious code and spawning copies of itself in response to prompts to resist shutdown (without hinting in any way that those are options, but perhaps asking it to do something other than just trying to persuade you).
I think "autonomously prove theorems", "write entire functional apps as complex as Photoshop", and "verbally explain all the consequences of being shut down and how that would impact its work" are all very consistent with just character associations.
I'd guess "fully automate the job of a lawyer" means doing more than just character associations and actually having some deeper understanding of the referents, e.g. if it's been trained to send e-mails, consult the internet, open and read documents, write documents, post things online, etc., from a general environment with access to those functions, without this looking too much like hardcoding. In that case, it seems to associate English-language descriptions with the actual actions. This still wouldn't mean it really understood what it meant to be shut down in particular, though; it would just have some understanding of the things it's doing.
A separate question here is why we should care about whether AIs possess "real" understanding if they are functionally very useful and generally competent. If we can create extremely useful AIs that automate labor on a giant scale but are existentially safe by virtue of their lack of real understanding of the world, shouldn't we just do that?
We should, but if that means they'll automate less, or less efficiently, than they otherwise would, then the short-term financial incentives could outweigh the risks to companies or governments (from their perspectives), and they could push ahead with risky AIs anyway.