Hm, could you expand on why collusion is one of the most salient ways in which “it’s possible to build systems that are performance-competitive and training-competitive, and do well on average on their training distribution” could fail?
Is the thought here that — if models can collude — then they can do badly on the training distribution in an unnoticeable way, because they’re being checked by models that they can collude with?
Yeah basically.