I do think there’s still more thinking to be done here, but, since I recorded the episode, Alexis Carlier and Tom Davidson have actually done some good work in response to Hanson’s critique. I was pretty persuaded of their conclusion:
There are similarities between the AI alignment and principal-agent problems, suggesting that PAL could teach us about AI risk. However, the situations economists have studied are very different to those discussed by proponents of AI risk, meaning that findings from PAL don’t transfer easily to this context. There are a few main issues. The principal-agent setup is only a part of AI risk scenarios, making agency rents too narrow a metric. PAL models rarely consider agents more intelligent than their principals and the models are very brittle. And the lack of insight from PAL unawareness models severely restricts their usefulness for understanding the accident risk scenario.
Nevertheless, extensions to PAL might still be useful. Agency rents are what might allow AI agents to accumulate wealth and influence, and agency models are the best way we have to learn about the size of these rents. These findings should inform a wide range of future scenarios, perhaps barring extreme ones like Bostrom/Yudkowsky.