There’s an important difference between pausing and evals: evals get you loads of additional information. We can look at the results of the evals, discuss them, and determine in what ways a model might have misuse potential (and thus try to mitigate it) or whether the model is simply undeployable. If we’re still unsure, we can gather more data and refine our ability to perform and interpret evals.
If we (i.e. the ML community) do this repeatedly, we build up a better picture of where our current capabilities lie, how evals relate to real-world impact, and so on. I think this makes evals much better, and the effect will compound over time. Evals also produce concrete data that can convince skeptics (such as me: I am currently pretty skeptical of much regulation, but I can easily imagine eval results that would convince me). To stick with your analogy, each time we do evals we thin out the fog a bit, with the intention of clearing it before we reach the edge, as well as improving our ability to stop.
How does doing evals improve your ability to stop? What concrete actions will you take when an eval shows a dangerous result? Do none of them overlap with pausing?
Evals showing dangerous capabilities (such as a model knowing how to build a nuclear weapon) can be used to convince lawmakers that this stuff is real and imminent.
Of course, you don’t need that if lawmakers already agree with you – in that case, it’s strictly best to not tinker with anything dangerous.
But assuming that many lawmakers will remain skeptical, one function of evals could be “drawing out an AI warning shot, making it happen in a contained and controlled environment where there’s no damage.”
Of course, we wouldn’t want evals teams to come up with AI capability improvements, so evals shouldn’t become dangerous AI gain-of-function research. Still, it’s a spectrum because even just clever prompting or small tricks can sometimes unearth hidden capabilities that the model had to begin with, and that’s the sort of thing that evals should warn us about.