It’s something like, “you’ll keep pursuing the goal in new situations.” In other words, goal-internalization is a generalization problem.
I think internalizing X means “pursuing X as a terminal goal”, whereas RLHF arguably only makes the model pursue X as an instrumental goal (in which case the model would be deceptively aligned). I’m not saying that GPT-4 has a distinction between instrumental and terminal goals, but a future AGI, whether an LLM or not, could have terminal goals that are different from its instrumental goals.
You might argue that deceptive alignment is also an obsolete paradigm, but I would again respond that we don’t know this, or at any rate, that the essay doesn’t make the argument.
I don’t think the terminal vs. instrumental goal dichotomy is very helpful, because it shifts the focus away from behavioral stuff we can actually measure (at least in principle). I also don’t think humans exhibit this distinction particularly strongly. I would prefer to talk about generalization, which is much more empirically testable and has a practical meaning.
What if it just is the case that AI will be dangerous for reasons that current systems don’t exhibit, and hence we don’t have empirical data on? If that’s the case, then limiting our concerns to only concepts that can be empirically tested seems like it means setting ourselves up for failure.
I’m not sure what one is supposed to do with a claim that can’t be empirically tested—do we just believe it/act as if it’s true forever? Wouldn’t this simply mean an unlimited pause in AI development (and why does this only apply to AI)?
In principle, we do the same thing as with any claim (whether explicitly or otherwise):
- Estimate the expected value of (directly) testing the claim.
- Test it if and only if (directly) testing it has positive EV.
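The two-step rule above can be written down as a toy calculation. This is only a minimal sketch: the `DirectTest` fields, the simple "benefit minus expected harm" model, and every number below are assumptions made for illustration, not estimates of any real system.

```python
from dataclasses import dataclass


@dataclass
class DirectTest:
    """Hypothetical description of one way to test a claim directly."""
    info_benefit: float   # value of what we would learn (arbitrary units, assumed)
    p_catastrophe: float  # probability the test itself causes serious harm (assumed)
    harm_cost: float      # cost if that harm occurs (same units, assumed)


def expected_value(test: DirectTest) -> float:
    """EV of running the test: benefit of the information minus expected harm."""
    return test.info_benefit - test.p_catastrophe * test.harm_cost


def should_test_directly(test: DirectTest) -> bool:
    """Run the direct test if and only if its EV is positive."""
    return expected_value(test) > 0


# Illustrative numbers only; they are not estimates of anything real.
cheap_probe = DirectTest(info_benefit=5.0, p_catastrophe=0.01, harm_cost=100.0)
risky_probe = DirectTest(info_benefit=5.0, p_catastrophe=0.01, harm_cost=10_000.0)

print(should_test_directly(cheap_probe))
print(should_test_directly(risky_probe))
```

With the same assumed probability of harm, raising the assumed cost of that harm flips the decision from "test" to "don't test". The rule itself is simple; what is contested is the inputs.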
The point here isn’t that the claim is special, or that AI is special—just that the EV calculation consistently comes out negative (unless someone else is about to do something even more dangerous—hence the need for coordination).
This is unusual and inconvenient. It appears to be the hand we’ve been dealt.
I think you’re asking the right question: what is one supposed to do with a claim that can’t be empirically tested?
So just to summarize:
No deceptive or dangerous AI has ever been built or empirically tested. (1)
Historically AI capabilities have consistently been “underwhelming”, far below the hype. (2)
If we discuss “OK, we build a large AGI, give it persistent memory and online learning, isolate it in an air-gapped data center, and hand-carry data to the machine via hardware-locked media: what is the danger?”, you are going to respond with either:
“I don’t know how the model escapes but it’s so smart it will find a way” or (3)
“I am confident humanity will exist very far into the future so a small risk now is unacceptable (say 1-10 percent pDoom)”.
And if I point out that this large ASI model needs thousands of H100 accelerator cards, megawatts of power, and a specialized network topology to exist, and that there is nowhere to escape to, you will argue “it will optimize itself to fit on consumer PCs and escape to a botnet”. (4)
Have I summarized the arguments?
Like, we’re supposed to coordinate an international pause, and I see 4 unproven assertions above that have zero direct evidence. As for the one about humanity existing far into the future, I don’t know; I don’t want to argue that one because it’s not falsifiable.
Shouldn’t we wait for evidence?
Thanks. I mean more in terms of “how can we productively resolve our disagreements about this?”, which the EV calculations are downstream of. To be clear, it doesn’t seem to me that this is necessarily the hand we’ve been dealt, but I’m not sure how to reduce the uncertainty.
At the risk of sidestepping the question, the obvious move seems to be “try harder to make the claim empirically testable”! For example, in the case of deception, which I think is a central example, we could (not claiming these ideas are novel):
1. Test directly for deception behaviourally and/or mechanistically (I’m aware that people are doing this, think it’s good and wish the results were more broadly shared).
2. Think about what aspects of deception make it particularly hard, and try to study those in isolation and test those. The most important example seems to me to be precursors: finding more testable analogues to the question of “before we get good, undetectable deception do we get kind of crappy detectable deception?”
Obviously these all run some (imo substantially lower) risks but seem well worth doing. Before we declare the question empirically inaccessible we should at least do these and synthesise the results (for instance, what does grokking say about (2)?).
(I’m spinning this comment out because it’s pretty different in style and seems worth being able to reply to separately. Please let me know if this kind of chain-posting is frowned upon here.)
Another downside to declaring things empirically out of reach and relying on priors for your EV calculations and subsequent actions is that it more-or-less inevitably converts epistemic disagreements into conflict.
If it seems likely to you that this is the way things are (and so we should pause indefinitely) but it seems highly unlikely to me (and so we should not) then we have no choice but to just advocate for different things. There’s not even the prospect of having recourse to better evidence to win over third parties, so the conflict becomes no-holds-barred. I see this right now on Twitter and it makes me very sad. I think we can do better.
(apologies for slowness; I’m not here much)
I’d say it’s more about being willing to update on less direct evidence when the risk of getting more direct evidence is high.
Clearly we should aim to get more evidence. The question is how to best do that safely. At present we seem to be taking the default path—of gathering evidence in about the easiest way, rather than going for something harder, slower and safer. (e.g. all the “we need to work with frontier models” stuff; I do expect that’s most efficient on the empirical side; I don’t expect it’s a safe approach)
I think a lot depends on whether we’re:
- Aiming to demonstrate that deception can happen.
- Aiming to robustly avoid deception.
For demonstration, we can certainly do useful empirical stuff: ARC Evals already did the lying-to-a-TaskRabbit-worker demonstration (clearly this isn’t anything like deceptive alignment, but it’s deception [given suitable scaffolding]).
I think that other demonstrations of this kind will be useful in the short term.
For avoiding all forms of deception, I’m much more pessimistic, since this requires us to have no blind spots and to address the problem in a fundamentally general way. (Personally I doubt there’s a [general solution to all kinds of deception] without some pretty general alignment solution, though I may be wrong.)
I’m sure we’ll come up with solutions to particular types of / definitions of deception in particular contexts. This doesn’t necessarily tell us much about other types of deception in other contexts. (for example, this kind of thing—but not only this kind of thing)
I’d also note that “reducing the uncertainty” is only progress when we’re correct. The problem that kills us isn’t uncertainty, but overconfidence. (though granted it might be someone else’s overconfidence)
You need to have some motivation for thinking that a fundamentally new kind of danger will emerge in future systems, in such a way that we won’t be able to handle it as it arises. Otherwise anyone can come up with any nonsense they like.
If you’re talking about e.g. Evan Hubinger’s arguments for deceptive alignment, I think those arguments are very bad, in light of 1) the white box argument I give in this post, 2) the incoherence of Evan’s notion of “mechanistic optimization,” and 3) his reliance on “counting arguments” where you’re supposed to assume that the “inner goals” of the AI are sampled “uniformly at random” from some uninformative prior over goals (I don’t think the LLM / deep learning prior is uninformative in this sense at all).
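> You need to have some motivation for thinking that a fundamentally new kind of danger will emerge in future systems, in such a way that we won’t be able to handle it as it arises.

That was what everyone in AI safety was discussing for a decade or more, until around 2018. You seem to ignore these arguments about why AI will be dangerous, as well as all of the arguments that alignment will be hard. Are you familiar with all of that work?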
> I also don’t think humans exhibit this distinction [terminal vs. instrumental goal dichotomy] particularly strongly.

The ideological Turing test seems like a case where the distinction can be seen clearly in humans: the instrumental goal is to persuade the other person that you sincerely hold certain beliefs/values (which imply goals), while your terminal goal is to advance advocacy of your different, actual beliefs.