Thanks Aaron, that’s a good article, I appreciate it. It still wasn’t clear to me that they were arguing that increasing capabilities could be net positive; it read more as saying that safety people should be working with whatever the current most powerful model is.
“But we also cannot let excessive caution make it so that the most safety-conscious research efforts only ever engage with systems that are far behind the frontier.”
This makes sense to me: the best safety researchers should have full access to the current most advanced models, preferably, in my eyes, before they have been (fully) trained.
But then I don’t understand their next sentence: “Navigating these tradeoffs responsibly is a balancing act, and these concerns are central to how we make strategic decisions as an organization.”
I’m clearly missing something. What’s the tradeoff? Is working on safety with the most advanced current model, while generally slowing everything down, not the best approach? This doesn’t seem like a tradeoff to me.
How is there any net safety advantage in increasing AI capacity?
Anthropic[1] have a massive conflict of interest (making money), so their statements are in some sense safetywashing. There are at least a few years’ worth of safety work that could be done on current models if we had the time (i.e. via a pause): interpretability is still stuck on trying to decipher GPT-2-sized models and smaller, and jailbreaks are still very far from being solved. There is plenty to be getting on with without pushing the frontier of capabilities yet further.
The assumption is that more powerful models won’t just be weaker models made more accurate; they will show emergent abilities. GPT-4 can solve many things that GPT-3 cannot, and those models share a similar lineage.
Safety issues show up only once a model is powerful enough to exhibit them, and they may not be anything you predicted from theory. The Waluigi effect and hallucinations were not predicted by any theory from AI safety research groups, yet they seem to be the majority of the issues with models at the current level of capabilities.
[1] And the other big AI companies that supposedly care about x-safety (OpenAI, Google DeepMind).