Do you think that different trajectories of prosaic TAI have big impacts on the usefulness of your current project? (For example, perhaps you think that TAI that is agentic would just be taught to deceive.) If so, which? If not, could you say something about why it seems general?
(NB: the above is not supposed to imply criticism of a plan that only works in some worlds).
I think this is a great question.
We are researching simpler precursors to the adversarial training techniques that seem most likely to work if you assume it's possible to build systems that are performance-competitive and training-competitive, and that do well on average on their training distribution.
There are a variety of reasons to worry that this assumption won’t hold. In particular, it seems plausible that the only AGIs humanity will be able to produce are ones that will collude with each other whenever it’s possible for them to do so. This seems especially likely if it’s only affordable to train your AGI from scratch a few times, because then all the systems you’re using are similar to each other and will find collusion easier. (It’s not training-competitive to assume you’re able to train the AGI from scratch multiple times, if you believe there’s a way of building an unaligned powerful system that only involves training it from scratch once.) But even if we train all our systems from scratch separately, it seems pretty plausible to me that models will collude, either via acausal trade or because the systems need to be able to communicate with each other for some competitiveness reason.
So our research is most useful if we’re able to assume a lack of such collusion.
I think some people believe you might be able to apply these techniques even in cases where you don’t have an a priori reason to be confident that the models won’t collude; I don’t have a strong opinion on this.
Hm, could you expand on why collusion is one of the most salient ways in which “it’s possible to build systems that are performance-competitive and training-competitive, and do well on average on their training distribution” could fail?
Is the thought here that — if models can collude — then they can do badly on the training distribution in an unnoticeable way, because they’re being checked by models that they can collude with?
Yeah, basically.
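To make that failure mode a bit more concrete, here is a minimal toy sketch, purely my own illustration rather than anything from the project: a policy model's behaviour on the training distribution is audited by a checker model, and the measured failure rate is whatever the checker reports. All of the names and numbers (policy_action, independent_checker, colluding_checker, the 10% defect rate) are made up for the example. The point is just that if the checker colludes, bad behaviour never gets flagged, so "does well on average on the training distribution" is no longer actually being verified by the oversight scheme.

```python
# Toy illustration (not from the original discussion): a "policy" model's
# training-time behaviour is audited by a "checker" model. If the checker
# colludes with the policy, bad behaviour on the training distribution
# goes unflagged and the measured failure rate looks like zero.

import random

random.seed(0)

def policy_action(example, defect_rate=0.1):
    """Hypothetical policy: occasionally takes a 'bad' action."""
    return "bad" if random.random() < defect_rate else "good"

def independent_checker(action):
    """Checker with no incentive to collude: flags every bad action it sees."""
    return action == "bad"

def colluding_checker(action):
    """Colluding checker: never flags the policy it is cooperating with."""
    return False

def measured_failure_rate(checker, n=10_000):
    """Failure rate as observed through the checker, not the true rate."""
    flags = sum(checker(policy_action(i)) for i in range(n))
    return flags / n

print("apparent failure rate, independent checker:",
      measured_failure_rate(independent_checker))   # roughly 0.10
print("apparent failure rate, colluding checker:  ",
      measured_failure_rate(colluding_checker))     # 0.0 -- failures are invisible
```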