I think this is one reasonable avenue to explore alignment, but I don’t want everybody doing it.
My impression is that AI researchers exist on a spectrum from only doing empirical work (of the kind you describe) to only doing theoretical work (like Agent Foundations), and most fall in the middle, doing some theory to figure out what kind of experiment to run, and using empirical data to improve their theories (a lot of science looks like this!).
I think it would be unwise for all (or even a majority of) AI safety researchers to move to empirical work on current AI systems, for two reasons:
(1) Bigger models have bigger problems.
Lessons learned from current misalignment may be necessary for aligning future models, but will certainly not be sufficient. For instance, GPT-3 will (we assume) never demonstrate deceptive alignment, because its model of the world is not broad enough to do so, but more complex AIs may do.
This is particularly worrying because we may only get one shot at spotting deceptive alignment! Thinking about problems in this class before we have direct access to models that could, even in theory, exhibit these problems seems both mandatory and a key reason alignment seems hard to me.
(2) AI researchers are sub-specialised.
Many current researchers working on non-technical alignment, while they presumably have a decent technical background, are not cutting-edge ML engineers. There's no 1:1 skill translation from 'current alignment researcher' to 'GPT-3 alignment researcher'.
There is maybe some claim here that you could redirect money from current alignment researchers to fund a whole bunch of GPT alignment researchers, but I expect the exchange rate is pretty poor, or that it's simply not possible in the medium term to find enough people with a deep understanding of both ML and alignment.
The first one is the biggie. I can imagine this approach working (perhaps inefficiently) in a world where (1) were false and (2) were true, but I can't imagine it working in any world where (1) holds.