I think a lot depends on whether we’re:
Aiming to demonstrate that deception can happen.
Aiming to robustly avoid deception.
For demonstration, we can certainly do useful empirical work: ARC Evals already did the lying-to-a-TaskRabbit-worker demonstration (clearly this isn’t anything like deceptive alignment, but it’s deception [given suitable scaffolding]).
I think that other demonstrations of this kind will be useful in the short term.
For avoiding all forms of deception, I’m much more pessimistic, since this requires us to have no blind spots and to address the problem in a fundamentally general way. (Personally, I doubt there’s a [general solution to all kinds of deception] without some pretty general alignment solution, though I may be wrong.)
I’m sure we’ll come up with solutions to particular types of / definitions of deception in particular contexts. This doesn’t necessarily tell us much about other types of deception in other contexts. (for example, this kind of thing—but not only this kind of thing)
I’d also note that “reducing the uncertainty” is only progress when we’re correct. The problem that kills us isn’t uncertainty, but overconfidence. (though granted it might be someone else’s overconfidence)