Thanks I mean more in terms of “how can we productively resolve our disagreements about this?”, which the EV calculations are downstream of. To be clear, it doesn’t seem to me that this is necessarily the hand we’ve been dealt but I’m not sure how to reduce the uncertainty.
At the risk of sidestepping the question, the obvious move seems to be “try harder to make the claim empirically testable”! For example, in the case of deception, which I think is a central example we could (not claiming these ideas are novel):
Test directly for deception behaviourally and/or mechanistically (I’m aware that people are doing this, think it’s good and wish the results were more broadly shared).
Think about what aspects of deception make it particularly hard, and try to study those in isolation and test those. The most important example seems to me to be precursors: finding more testable analogues to the question of “before we get good, undetectable deception do we get kind of crappy detectable deception?”
Obviously these all run some (imo substantially lower) risks but seem well worth doing. Before we declare the question empirically inaccessible we should at least do these and synthesise the results (for instance, what does grokking say about (2)?).
(I’m spinning this comment out because it’s pretty different in style and seems worth being able to reply to separately. Please let me know if this kind of chain-posting is frowned upon here.)
Another downside to declaring things empirically out of reach and relying on priors for your EV calculations and subsequent actions is that it more-or-less inevitably converts epistemic disagreements into conflict.
If it seems likely to you that this is the way things are (and so we should pause indefinitely) but it seems highly unlikely to me (and so we should not) then we have no choice but to just advocate for different things. There’s not even the prospect of having recourse to better evidence to win over third parties, so the conflict becomes no-holds-barred. I see this right now on Twitter and it makes me very sad. I think we can do better.
(apologies for slowness; I’m not here much) I’d say it’s more about being willing to update on less direct evidence when the risk of getting more direct evidence is high.
Clearly we should aim to get more evidence. The question is how to best do that safely. At present we seem to be taking the default path—of gathering evidence in about the easiest way, rather than going for something harder, slower and safer. (e.g. all the “we need to work with frontier models” stuff; I do expect that’s most efficient on the empirical side; I don’t expect it’s a safe approach)
For demonstration, we can certainly do useful empirical stuff—ARC Evals already did the lying-to-a-taskrabbit worker demonstration (clearly this isn’t anything like deceptive alignment, but it’s deception [given suitable scaffolding]).
I think that other demonstrations of this kind will be useful in the short term.
For avoiding all forms of deception, I’m much more pessimistic—since this requires us to have no blind-spots, and to address the problem in a fundamentally general way. (personally I doubt there’s a [general solution to all kinds of deception] without some pretty general alignment solution—though I may be wrong)
I’m sure we’ll come up with solutions to particular types of / definitions of deception in particular contexts. This doesn’t necessarily tell us much about other types of deception in other contexts. (for example, this kind of thing—but not only this kind of thing)
I’d also note that “reducing the uncertainty” is only progress when we’re correct. The problem that kills us isn’t uncertainty, but overconfidence. (though granted it might be someone else’s overconfidence)
Thanks I mean more in terms of “how can we productively resolve our disagreements about this?”, which the EV calculations are downstream of. To be clear, it doesn’t seem to me that this is necessarily the hand we’ve been dealt but I’m not sure how to reduce the uncertainty.
At the risk of sidestepping the question, the obvious move seems to be “try harder to make the claim empirically testable”! For example, in the case of deception, which I think is a central example we could (not claiming these ideas are novel):
Test directly for deception behaviourally and/or mechanistically (I’m aware that people are doing this, think it’s good and wish the results were more broadly shared).
Think about what aspects of deception make it particularly hard, and try to study those in isolation and test those. The most important example seems to me to be precursors: finding more testable analogues to the question of “before we get good, undetectable deception do we get kind of crappy detectable deception?”
Obviously these all run some (imo substantially lower) risks but seem well worth doing. Before we declare the question empirically inaccessible we should at least do these and synthesise the results (for instance, what does grokking say about (2)?).
(I’m spinning this comment out because it’s pretty different in style and seems worth being able to reply to separately. Please let me know if this kind of chain-posting is frowned upon here.)
Another downside to declaring things empirically out of reach and relying on priors for your EV calculations and subsequent actions is that it more-or-less inevitably converts epistemic disagreements into conflict.
If it seems likely to you that this is the way things are (and so we should pause indefinitely) but it seems highly unlikely to me (and so we should not) then we have no choice but to just advocate for different things. There’s not even the prospect of having recourse to better evidence to win over third parties, so the conflict becomes no-holds-barred. I see this right now on Twitter and it makes me very sad. I think we can do better.
(apologies for slowness; I’m not here much)
I’d say it’s more about being willing to update on less direct evidence when the risk of getting more direct evidence is high.
Clearly we should aim to get more evidence. The question is how to best do that safely. At present we seem to be taking the default path—of gathering evidence in about the easiest way, rather than going for something harder, slower and safer. (e.g. all the “we need to work with frontier models” stuff; I do expect that’s most efficient on the empirical side; I don’t expect it’s a safe approach)
I think a lot depends on whether we’re:
Aiming to demonstrate that deception can happen.
Aiming to robustly avoid deception.
For demonstration, we can certainly do useful empirical stuff—ARC Evals already did the lying-to-a-taskrabbit worker demonstration (clearly this isn’t anything like deceptive alignment, but it’s deception [given suitable scaffolding]).
I think that other demonstrations of this kind will be useful in the short term.
For avoiding all forms of deception, I’m much more pessimistic—since this requires us to have no blind-spots, and to address the problem in a fundamentally general way. (personally I doubt there’s a [general solution to all kinds of deception] without some pretty general alignment solution—though I may be wrong)
I’m sure we’ll come up with solutions to particular types of / definitions of deception in particular contexts. This doesn’t necessarily tell us much about other types of deception in other contexts. (for example, this kind of thing—but not only this kind of thing)
I’d also note that “reducing the uncertainty” is only progress when we’re correct. The problem that kills us isn’t uncertainty, but overconfidence. (though granted it might be someone else’s overconfidence)