I think the idea is that if we can get the A.I. to have the right values, then it won’t matter if it could theoretically take over and overpower us, because it won’t want to. A more sinister variant of this, which I suspect a lot of people at MIRI believe in, and perhaps Bostrom also (though I have no direct evidence of this, other than a vague sense from things I’ve seen said over the years), is that if we can get an A.I. with the right values, it would be great if it took over and optimized everything towards those values (sure, power would corrupt humans, but that’s not a fact about all possible minds, and genuinely having the right values would prevent this). I am not terribly worried in itself about MIRI people believing the latter, because I don’t think they’ll build AGI, but I am a little worried about people at DeepMind (who I think take MIRI people, or at least Yudkowsky, more seriously than you’d intuitively guess) taking up these ideas. (Though I am much less confident than most EAs that world-changing A.I. is imminent.)
Thank you, that is helpful. I still don’t see, I think, why we think an AGI would be incapable of assessing its own values and potentially altering them, if it’s intelligent enough to be an existential risk to humanity—but we’re hoping that the result of any such assessment would be “the values humans instilled in me seem optimal”? Is that it? Because then my question is which values exactly we’re attempting to instill. At the risk of being downvoted to hell I will share that the thought of a superpowerful AI that shares the value system of e.g. LessWrong is slightly terrifying to me. Relatedly(?) I studied a humanities subject :)
I think the idea is that it will only change its values in a particular direction if that helps it realise its current values. So it won’t change its values if doing so would mean that it would do horrible things according to its current values. A philosophical thing lurking in the background is that you can’t work out the correct values just by good thinking; rather, basic starting values are thinking-independent, as long as you’re consistent: no amount of intelligence and reasoning will make you arrive at the correct ones. (They call this the “orthogonality thesis”, but a similar idea is known in academic philosophy as Humeanism about moral motivation. It’s quite mainstream but not without its critics.)
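The value-stability argument can be sketched as a toy model (my own illustration, not anyone’s actual proposal; the outcomes and utility numbers are made up): a goal-directed agent evaluates any proposed change to its values using its *current* values, so a change that would lead it to do things its current values rate as horrible gets rejected.

```python
# Toy sketch: an agent appraises a proposed self-modification with its
# CURRENT utility function, so value changes that look bad now are refused.

def best_outcome(utility, outcomes):
    """The outcome this agent would pursue, given a utility function."""
    return max(outcomes, key=utility)

def accepts_modification(current_utility, new_utility, outcomes):
    """Accept a value change only if the agent it would become still ends up
    pursuing an outcome the current values rate at least as highly."""
    chosen_now = best_outcome(current_utility, outcomes)
    chosen_later = best_outcome(new_utility, outcomes)
    return current_utility(chosen_later) >= current_utility(chosen_now)

outcomes = ["make paperclips", "protect humans", "do horrible things"]
current = {"make paperclips": 1, "protect humans": 10, "do horrible things": -100}.get
flipped = {"make paperclips": 1, "protect humans": -100, "do horrible things": 10}.get

print(accepts_modification(current, flipped, outcomes))  # False: the flip looks awful to current values
print(accepts_modification(current, current, outcomes))  # True: keeping the same values is always fine
```

The point of the sketch is just that no extra intelligence is needed for the refusal: it falls straight out of evaluating the future by today’s values.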
‘...the thought of a superpowerful AI that shares the value system of e.g. LessWrong is slightly terrifying to me.’
Old post, but I’ve meant to say this for several months: whilst I am not a fan of Yudkowsky, I do think that his stuff about this showed a fair amount of sensitivity to the idea that it would be unfair if a particular group of people just programmed their values into the AI, taking no heed of the fact that humans disagree. (Not that that means there is no reason to worry about the proposal to build a “good” AI that runs everything.)
His original (since abandoned, I think) proposal was that we would get the AI to have a goal like ‘maximize things all or pretty much all fully informed humans would agree are good; minimize things all or almost all fully informed humans would agree are bad; and where humans would disagree on whether something is good or bad even after being fully informed of all relevant facts, try to minimize your impact on that thing, and leave it up to humans to sort out amongst themselves.’ (Not an exact rendition, but close enough for present purposes.) Of course, there’s a sense in which that still embodies liberal democratic values about what is fair, but I’m guessing that if you’re a contemporary person with a humanities degree, you probably share those very broad and abstract values.
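The shape of that proposal can be made concrete with a small sketch (again my own illustration, not Yudkowsky’s actual formulation; the votes and threshold are invented): actions get sorted by how nearly unanimous informed human judgment would be, and anything contested goes into a hands-off bucket.

```python
# Toy sketch of the quoted goal: act on near-unanimous judgments,
# and minimize impact where fully informed humans would still disagree.

def classify(votes, threshold=0.9):
    """Classify an action from hypothetical informed-human judgments,
    where each vote is +1 (good) or -1 (bad)."""
    agreement = sum(votes) / len(votes)  # +1.0 = unanimous good, -1.0 = unanimous bad
    if agreement >= threshold:
        return "promote"      # nearly everyone calls it good: maximize it
    if agreement <= -threshold:
        return "prevent"      # nearly everyone calls it bad: minimize it
    return "hands-off"        # disputed even when informed: leave it to humans

print(classify([1, 1, 1, 1]))      # promote
print(classify([-1, -1, -1, -1]))  # prevent
print(classify([1, -1, 1, -1]))    # hands-off
```

The hard part the sketch hides, of course, is everything inside ‘fully informed humans would agree’, which is where the real proposal did its work.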
Thank you again!