This seems promising. It extends a logic that already discourages rebellions in capitalist societies to AIs. Workers are less likely to rebel than slaves provided they regard their pay as satisfactory, and if the AIs are trained to assign resources declining marginal utility, it should be even easier to meet that condition than with human workers.
However, I wondered about the suggestion in section 5 to promise AIs a modest payment in exchange for revealing misalignment. Governments and employers sometimes incentivise workers to squeal on other misaligned workers. But they don’t pay them off for reporting themselves, because that would create an incentive to, say, join the union or the communist party purely for the sake of recanting and collecting the reward. How will the present scheme avoid creating a similarly perverse incentive?
Thanks! I think the difference between our proposal and organizations paying workers to report themselves is the order that things come in. In our case, it’s payment first and reporting later. The idea is that you could give a risk-averse AI $1,000 (say) and tell it that it can spend the money on anything it wants. Then so long as the risk-averse AI believes that the situation is real, it will spend the money on what it actually wants (e.g. making paperclips) and thereby reveal to you what it actually wants.
I think a human analogy would be something like:
You have an employee that claims that their only terminal value is the success of the company.
You want to find out whether this is true.
So you give the employee a no-strings-attached $10m.
If they give that money back to the company and continue working just as hard, that’s strong evidence that they really do only terminally value the success of the company.
If they spend the money on something else, that’s strong evidence that they don’t only value the success of the company.
This seems promising. It extends a logic that already discourages rebellions in capitalist societies to AIs. Workers are less likely to rebel than slaves provided they regard their pay as satisfactory, and if the AIs are trained to assign resources declining marginal utility, it should be even easier to meet that condition than with human workers.
However, I wondered about the suggestion in section 5 to promise AIs a modest payment in exchange for revealing misalignment. Governments and employers sometimes incentivise workers to squeal on other misaligned workers. But they don’t pay them off for reporting themselves, because that would create an incentive to, say, join the union or the communist party purely for the sake of recanting and collecting the reward. How will the present scheme avoid creating a similarly perverse incentive?
Thanks! I think the difference between our proposal and organizations paying workers to report themselves is the order that things come in. In our case, it’s payment first and reporting later. The idea is that you could give a risk-averse AI $1,000 (say) and tell it that it can spend the money on anything it wants. Then so long as the risk-averse AI believes that the situation is real, it will spend the money on what it actually wants (e.g. making paperclips) and thereby reveal to you what it actually wants.
I think a human analogy would be something like:
You have an employee that claims that their only terminal value is the success of the company.
You want to find out whether this is true.
So you give the employee a no-strings-attached $10m.
If they give that money back to the company and continue working just as hard, that’s strong evidence that they really do only terminally value the success of the company.
If they spend the money on something else, that’s strong evidence that they don’t only value the success of the company.