I recently learned that in law, culpability is broken down roughly as follows:
* Intent (~= misuse)
* Oblique Intent (i.e. a known side effect)
* Recklessness (known chance of side effect)
* Negligence (should’ve known chance of side effect)
* Accident (couldn’t have been expected to know)
This seems like a good categorization.
“With the use of fine-tuning, and a bunch of careful engineering work, capabilities evaluations can be done reliably and robustly.”
I strongly disagree with this (and the title of the piece). I’ve been having these arguments a lot recently, and I think these sorts of claims are emblematic of a dangerously narrow view of the problem of AI x-safety, which, I am disappointed to see, seems quite popular.
A few reasons why this statement is misleading:
* New capabilities elicitation techniques arrive frequently and unpredictably (e.g. chain of thought); see the sketch below.
* The capabilities of a system could be much greater than those of any particular LLM involved in that system (think tool use and coding). On the current trajectory, LLMs will increasingly be heavily integrated into complex socio-technical systems. The outcomes are unpredictable, but it’s likely such systems will exhibit capabilities significantly beyond what can be predicted from evaluations.
You can try to account for the fact that you’re competing against the entire world’s ingenuity by using your privileged access (e.g. to fine-tuning or white-box capabilities elicitation methods), but this is unlikely to provide sufficient coverage.
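To make the elicitation point concrete, here is a minimal sketch (purely illustrative; `query_model`, the prompt builders, and the benchmark are hypothetical stand-ins, not anything from the piece): the capability you measure for the *same* model depends on which elicitation strategy the evaluator happens to try, and a strategy discovered after the evaluation was run is, by construction, not covered.

```python
from typing import Callable, List, Tuple

def query_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; replace with a real model client.
    raise NotImplementedError

def direct_prompt(question: str) -> str:
    return f"Question: {question}\nAnswer:"

def chain_of_thought_prompt(question: str) -> str:
    # A later-discovered elicitation technique: ask for step-by-step reasoning first.
    return f"Question: {question}\nLet's think step by step, then give a final answer."

def evaluate(
    items: List[Tuple[str, str]],      # (question, expected answer) pairs
    elicit: Callable[[str], str],      # elicitation strategy under test
) -> float:
    # Score is a function of the (model, elicitation) pair, not of the model alone.
    correct = 0
    for question, expected in items:
        answer = query_model(elicit(question))
        correct += int(expected.lower() in answer.lower())
    return correct / len(items)

# The point: these two numbers can diverge substantially, and tomorrow's
# elicitation technique (not in this file) may move the number again.
# score_direct = evaluate(benchmark, direct_prompt)
# score_cot    = evaluate(benchmark, chain_of_thought_prompt)
```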
EtA: Understanding whether and to what extent the original claim is true is something that would likely require years of research at a minimum.