Nice post. One thought on this—you wrote:

“I’d be especially excited for people to spread messages that help others understand—at a mechanistic level—how and why AI systems could end up with dangerous goals of their own, deceptive behavior, etc. I worry that by default, the concern sounds like lazy anthropomorphism (thinking of AIs just like humans).”
I agree that this seems good for avoiding anthropomorphism (in perception and in one’s own thinking!), but I think it will be important to emphasise, when doing this, that these are conceivable mechanisms and illustrative examples rather than the whole risk case. Why?
People might otherwise think they have solved the problem once they have ruled out or fixed that particular problematic mechanism, when really they haven’t. Or, since the more specific mechanistic descriptions will probably turn out to be wrong in some way, the whole case might be dismissed when they do, even though the argument for risk didn’t ultimately depend on those particulars.
(This only applies if you are pretty unconfident about which particular mechanisms will end up being risky vs. safe.)
[written in my personal capacity]
Agreed!