See Rohin Shah’s (I think correct) objection to the use of “coherence arguments” to support AI risk concerns.
Fwiw I’d say this somewhat differently.
I object to a specific way in which one could use coherence arguments to support AI risk: namely, “AI is intelligent --> AI satisfies coherence arguments better than we do --> AI looks as though it is maximizing a utility function from our perspective --> Convergent instrumental subgoals --> Doom”.
As far as I know, anyone who has spent ~an hour reading my post and thinking about it basically agrees with that particular narrow point.
This doesn’t rule out other ways that one could use coherence arguments to support AI risk, such as “coherence arguments show that achieving stuff can typically be factored into beliefs about the world and goals that you want to achieve; since we’ll be building AIs to achieve stuff, it seems likely they’ll work by having separated beliefs and goals; if they have bad goals, then we die because of convergent instrumental subgoals”. I’m more sympathetic to this argument (though not nearly as much as Eliezer appears to be).
I agree that the intro talk that you link to would likely cause people to think of the first pathway (which I object to) rather than the second pathway. Similar rhetoric caused me to believe the first pathway for a while.
But it also looks like the sort of talk you might give if you were thinking about the second pathway, and then compressed it, losing a bunch of nuance, and didn’t notice that people might then instead think of the first pathway.
(It’s not clear whether any of this changes the upshot of your post. I am mostly trying to preserve nuance so that I get fewer people saying “I thought you thought utility functions are fake”, which is definitely not what I said or believed.)