Yeah—is your sense that “enslave everyone” (in the context of what humans to humans) feels like an especially good handle on either of those scenarios? (That’s all I initially meant to nitpick—not whether such scenarios are plausible.)
Another nitpick: actually, I haven’t heard about (a) as described here—anything you’d suggest I look at? (I’m initially skeptical, since having such a mistaken conception for a long time doesn’t seem all that superintelligent to me. Is what you had in mind scenarios in which torture is motivated by strategic extortion or maybe sadism, since these don’t seem to require a mistaken conception that it’s helping?)
Summary: Slavery is only used as a rough analogy for either of these scenarios because there aren’t real precedents for these kinds of scenarios in human history. To understand how a machine superintelligence could do something like torturing everyone until the end of time while still being superintelligent, check out:
“Enslavement” is a rough analogy for the first scenario only because there isn’t a simple, singular concept that characterizes such a course of events without precedent in human history. The second scenario is closer to enslavement but the context is different than human slavery (or even the human ‘enslavement’ of non-human animals, such as in industrial farming). It’s more similar to the MSI being like an ant queen, but as an exponentially more rational agent, and the sub-agents are drones.
Another nitpick: actually, I haven’t heard about (a) as described here—anything you’d suggest I look at?
A classic example from the rationality community is of an AGI programmed to maximize human happiness and trained to recognize such on a dataset of smiling human faces. In theory, a failure mode therein could be the AGI producing endless copies of humans whose muscles in their faces it stimulates to always have them smiling their entire lives.
That’s an example so reductive as to be maybe too absurd for anyone to expect something like that would happen. Yet it was meant to establish proof of concept. In terms of whose making a “mistake,” it’s hard to describe without someone more about the theory of AI alignment. To clarify, what I should have said is that while such an outcome could appear to be an error on the part of the AGI, it would really be a human error for having programmed it wrong, and the AGI would be properly executing on its goal as it was programmed to do.
Complexity of value is a concept that gets at part of that kind of problem. Eliezer Yudkowsky of the Machine Intelligence Research Institute (MIRI) expanded on it in a paper he authored called “Artificial Intelligence as a Positive and Negative Factor in Global Risk” for the Global Catastrophic Risks handbook, curated by Nick Bostrom of the Future of Humanity Institute (FHI), and originally published by Oxford University Press in 2008. Bostrom’s own book from 2014, Superintelligence, comprehensively reviewed potential, anticipated failure modes for AI alignment. Bostrom would also have extensively covered this kind of failure mode but I forget in what part of the book that was.
I’m guessing there have been updates to these concepts in the several years since those works were published but I haven’t kept up to date with that research literature in the last few years. Reading one or more of those works should give you the basics/fundamentals for understanding the subject. You could use those as a jumping-off point to ask further questions on the EA Forum, LessWrong or the Alignment Forum if you want to learn more after.
Thanks for the detailed response / sharing the resources! I’m familiar with them (I had been wondering if there was a version of (a) that didn’t involve the following modification, although it seems like we’re on a similar page)
To clarify, what I should have said is that while such an outcome could appear to be an error on the part of the AGI, it would really be a human error
Yeah—is your sense that “enslave everyone” (in the context of what humans to humans) feels like an especially good handle on either of those scenarios? (That’s all I initially meant to nitpick—not whether such scenarios are plausible.)
Another nitpick: actually, I haven’t heard about (a) as described here—anything you’d suggest I look at? (I’m initially skeptical, since having such a mistaken conception for a long time doesn’t seem all that superintelligent to me. Is what you had in mind scenarios in which torture is motivated by strategic extortion or maybe sadism, since these don’t seem to require a mistaken conception that it’s helping?)
Summary: Slavery is only used as a rough analogy for either of these scenarios because there aren’t real precedents for these kinds of scenarios in human history. To understand how a machine superintelligence could do something like torturing everyone until the end of time while still being superintelligent, check out:
Complexity of value
Artificial Intelligence as a Positive and Negative Factor in Global Risk
“Enslavement” is a rough analogy for the first scenario only because there isn’t a simple, singular concept that characterizes such a course of events without precedent in human history. The second scenario is closer to enslavement but the context is different than human slavery (or even the human ‘enslavement’ of non-human animals, such as in industrial farming). It’s more similar to the MSI being like an ant queen, but as an exponentially more rational agent, and the sub-agents are drones.
A classic example from the rationality community is of an AGI programmed to maximize human happiness and trained to recognize such on a dataset of smiling human faces. In theory, a failure mode therein could be the AGI producing endless copies of humans whose muscles in their faces it stimulates to always have them smiling their entire lives.
That’s an example so reductive as to be maybe too absurd for anyone to expect something like that would happen. Yet it was meant to establish proof of concept. In terms of whose making a “mistake,” it’s hard to describe without someone more about the theory of AI alignment. To clarify, what I should have said is that while such an outcome could appear to be an error on the part of the AGI, it would really be a human error for having programmed it wrong, and the AGI would be properly executing on its goal as it was programmed to do.
Complexity of value is a concept that gets at part of that kind of problem. Eliezer Yudkowsky of the Machine Intelligence Research Institute (MIRI) expanded on it in a paper he authored called “Artificial Intelligence as a Positive and Negative Factor in Global Risk” for the Global Catastrophic Risks handbook, curated by Nick Bostrom of the Future of Humanity Institute (FHI), and originally published by Oxford University Press in 2008. Bostrom’s own book from 2014, Superintelligence, comprehensively reviewed potential, anticipated failure modes for AI alignment. Bostrom would also have extensively covered this kind of failure mode but I forget in what part of the book that was.
I’m guessing there have been updates to these concepts in the several years since those works were published but I haven’t kept up to date with that research literature in the last few years. Reading one or more of those works should give you the basics/fundamentals for understanding the subject. You could use those as a jumping-off point to ask further questions on the EA Forum, LessWrong or the Alignment Forum if you want to learn more after.
Thanks for the detailed response / sharing the resources! I’m familiar with them (I had been wondering if there was a version of (a) that didn’t involve the following modification, although it seems like we’re on a similar page)
You’re welcome :)