To stick with your analogy, each time we do evals we thin out the fog a bit, with the intention of clearing it before we reach the edge, as well as improving our ability to stop.
How does doing evals improve your ability to stop? What concrete actions will you take when an eval shows a dangerous result? Do none of them overlap with pausing?
Evals showing dangerous capabilities (such as how to build a nuclear weapon) can be used to convince lawmakers that this stuff is real and imminent.
Of course, you don’t need that if lawmakers already agree with you – in that case, it’s strictly best to not tinker with anything dangerous.
But assuming that many lawmakers will remain skeptical, one function of evals could be “drawing out an AI warning shot, making it happen in a contained and controlled environment where there’s no damage.”
Of course, we wouldn’t want evals teams to come up with AI capability improvements, so evals shouldn’t become dangerous AI gain-of-function research. Still, it’s a spectrum: even clever prompting or small tricks can sometimes unearth hidden capabilities that the model had all along, and that’s exactly the sort of thing evals should warn us about.