“With the use of fine-tuning, and a bunch of careful engineering work, capabilities evaluations can be done reliably and robustly.”
I strongly disagree with this (and the title of the piece). I’ve been having these arguments a lot recently, and I think these sorts of claims are emblematic of a dangerously narrow view of the problem of AI x-safety, which I am disappointed to see seems quite popular.
A few reasons why this statement is misleading:
* New capability elicitation techniques arrive frequently and unpredictably (chain of thought, for example).
* The capabilities of a system could be much greater than those of any particular LLM involved in that system (think tool use and coding). On the current trajectory, LLMs will increasingly be heavily integrated into complex socio-technical systems. The outcomes are unpredictable, but such systems will likely exhibit capabilities significantly beyond what can be predicted from evaluations.
You can try to account for the fact that you’re competing against the entire world’s ingenuity by leveraging your privileged access (e.g. for fine-tuning or white-box capability elicitation methods), but this is unlikely to provide sufficient coverage.
ETA: Understanding whether, and to what extent, the original claim is true would likely require years of research at a minimum.
I think this is a very good point, and it definitely gives me some pause—and probably my original statement there was too strong. Certainly I agree that you need to do evaluations using the best possible scaffolding that you have, but overall my sense is that this problem is not that bad. Some reasons to think that:
* At least currently, scaffolding-related performance improvements don’t generally seem to be that large (e.g. chain-of-thought just isn’t that helpful on most tasks), especially relative to the gains from scaling.
* You can evaluate pretty directly for the sorts of capabilities that would help make scaffolding much better, such as the model being able to correct its own errors, so you don’t have to evaluate the whole system + scaffolding end-to-end (see the sketch after this list).
* This is mostly just a problem for large-scale model deployments. If you instead keep your largest model mostly in-house for alignment research, or only give it to a small number of external partners whose scaffolding you can directly evaluate, this problem becomes much less severe.
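To make the second point concrete, here is a minimal sketch of what evaluating self-correction directly (rather than evaluating a whole scaffolded system) could look like. The `generate` and `is_correct` helpers and the `task.prompt` field are purely illustrative assumptions, not part of any actual eval harness:

```python
# Hypothetical sketch: measure how often a model fixes its own mistakes when
# asked to double-check, as a proxy for how much scaffolding could help.

def self_correction_rate(model, tasks, generate, is_correct):
    """Fraction of initially-failed tasks the model fixes on a second pass.

    `generate(model, prompt)` and `is_correct(task, answer)` are assumed
    helpers supplied by whatever eval harness is in use.
    """
    failed, fixed = 0, 0
    for task in tasks:
        first_answer = generate(model, task.prompt)
        if is_correct(task, first_answer):
            continue  # only initially-failed tasks tell us about self-correction
        failed += 1
        retry_prompt = (
            f"{task.prompt}\n\nYour previous answer was:\n{first_answer}\n"
            "Check it carefully for mistakes and give a corrected answer."
        )
        second_answer = generate(model, retry_prompt)
        if is_correct(task, second_answer):
            fixed += 1
    return fixed / failed if failed else 0.0
```

The point of isolating a capability like this is that it measures the model itself rather than any particular scaffold built around it.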
That last point is probably the most important here, since it demonstrates that you can easily (and should) absorb this sort of concern into an RSP. For example, you could set a capabilities threshold for models’ ability to self-correct; once your models pass that threshold, you restrict deployment to contexts where you can directly evaluate, in advance, the relevant scaffolding that will be used.
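As a hedged illustration of how such a threshold might feed into an RSP-style deployment decision (the threshold value and names below are invented for illustration, not anything proposed above):

```python
# Hypothetical sketch: gate broad deployment on a self-correction threshold.

SELF_CORRECTION_THRESHOLD = 0.5  # made-up value, purely illustrative

def deployment_allowed(self_correction_score, scaffolding_evaluated):
    """Below the threshold, broad deployment is fine; above it, deploy only
    where the specific scaffolding to be used has been evaluated in advance."""
    if self_correction_score < SELF_CORRECTION_THRESHOLD:
        return True
    return scaffolding_evaluated
```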