Thanks Aaron, that’s a good article, appreciate it. It still wasn’t clear to me that they were arguing increasing capabilities could be net positive; it read more as saying that safety people should be working with whatever the current most powerful model is.
“But we also cannot let excessive caution make it so that the most safety-conscious research efforts only ever engage with systems that are far behind the frontier.”
This makes sense to me: the best safety researchers should have full access to the current most advanced models, preferably, in my eyes, before they have been (fully) trained.
But then I don’t understand their next sentence: “Navigating these tradeoffs responsibly is a balancing act, and these concerns are central to how we make strategic decisions as an organization.”
I’m clearly missing something: what’s the tradeoff? Isn’t working on safety with the most advanced current model, while generally slowing everything down, the best approach? This doesn’t seem like a tradeoff to me.
How is there any net safety advantage to increasing AI capabilities?
Anthropic[1] have a massive conflict of interest (making money), so their statements are in some sense safetywashing. There are at least a few years’ worth of safety work that could be done on current models if we had the time (e.g. via a pause): interpretability is still stuck on trying to decipher GPT-2-sized models and smaller, and jailbreaks are still very far from being solved. Plenty to be getting on with without pushing the frontier of capabilities yet further.
[1] And the other big AI companies that supposedly care about x-safety (OpenAI, Google DeepMind).
The assumption is that more powerful models won’t just be like weaker models but more accurate: they will show emergent abilities. Many things that GPT-4 can solve, GPT-3 cannot, and those models share a similar lineage.
Safety issues only show up once you have a model powerful enough to exhibit them, and they may not be anything you predicted from theory. The Waluigi effect and hallucinations were not predicted by any theory from AI safety research groups, yet they seem to make up the majority of the issues with models at the current level of capabilities.
I think one of the better write-ups about this perspective is Anthropic’s Core Views on AI Safety.
From its main text, under the heading The Role of Frontier Models in Empirical Safety, a couple relevant arguments are:
- Many safety concerns only arise with powerful systems, so we need powerful systems to experiment with
- Many safety methods require large/powerful models
- We need to understand how both the problems and our fixes change with model scale (if the model gets bigger, does the safety technique still appear to work?)
- To get evidence of powerful models being dangerous (which is important for many reasons), you need the powerful models.