I think this is a very good point, and it definitely gives me some pause; my original statement there was probably too strong. Certainly I agree that you need to do evaluations using the best scaffolding available to you, but overall my sense is that this problem is not that bad. Some reasons to think that:
At least currently, scaffolding-related performance improvements generally don't seem to be that large (e.g. chain-of-thought just isn't that helpful on most tasks), especially relative to the gains from scaling.
You can evaluate pretty directly for the sorts of capabilities that would make scaffolding much more powerful, like the model being able to correct its own errors, so you don't have to evaluate the whole model + scaffolding system end-to-end.
This is mostly just a problem for large-scale model deployments. If you instead keep your largest model mostly in-house for alignment research, or give it only to a small number of external partners whose scaffolding you can directly evaluate, the problem becomes way less bad.
That last point is probably the most important one here, since it demonstrates that you can (and should) readily absorb this sort of concern into an RSP. For example, you could set a capabilities threshold for models' ability to do self-correction, and once your models pass that threshold, restrict deployment to contexts where you can directly evaluate, in advance, the relevant scaffolding that will be used.
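To make the shape of that gating rule concrete, here's a minimal sketch of what such a check could look like. This is purely illustrative, not any lab's actual RSP logic; the names (run_self_correction_eval, SELF_CORRECTION_THRESHOLD, approved_scaffolds) and the threshold value are invented for the example.

```python
# Hypothetical threshold-gated deployment check, for illustration only.

SELF_CORRECTION_THRESHOLD = 0.6  # invented pass rate on a self-correction eval suite


def run_self_correction_eval(model) -> float:
    """Placeholder: score the model's ability to notice and fix its own
    errors on a fixed task suite, returning a pass rate in [0, 1]."""
    ...


def deployment_allowed(model, scaffold_id: str, approved_scaffolds: set[str]) -> bool:
    """Gate deployment on the self-correction threshold described above.

    Below the threshold: broad deployment is permitted. At or above it:
    deploy only in contexts whose scaffolding has been directly evaluated
    in advance (i.e. appears in approved_scaffolds).
    """
    score = run_self_correction_eval(model)
    if score < SELF_CORRECTION_THRESHOLD:
        return True  # capability below threshold: no scaffolding restriction
    return scaffold_id in approved_scaffolds  # above threshold: restrict to vetted scaffolds
```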