I want to challenge an argument that I think drives a lot of AI risk intuitions. I think the argument goes something like this:
1. There is something called “human values”.
2. Humans broadly share “human values” with each other.
3. It would be catastrophic if AIs lacked “human values”.
4. “Human values” are an extremely narrow target, meaning that we need to put in exceptional effort in order to get AIs to be aligned with human values.
My problem with this argument is that “human values” can refer to (at least) three different things, and under every plausible interpretation, at least one of the argument’s premises is undermined.
Broadly speaking, I think “human values” usually refers to one of three concepts:
1. The individual objectives that people pursue in their own lives (i.e. the individual human desire for wealth, status, and happiness, usually for themselves or their family and friends)
2. The set of rules we use to socially coordinate (i.e. our laws, institutions, and social norms)
3. Our cultural values (i.e. the ways that human societies have broadly differed from each other, in their languages, tastes, styles, etc.)
Under the first interpretation, I think premise (2) of the original argument is undermined. Under the second interpretation, premise (4) is undermined. Under the third interpretation, premise (3) is undermined.
Let me elaborate.
In the first interpretation, “human values” is not a coherent target that we share with one another, since each person has their own separate, generally selfish objectives that they pursue in their own life. In other words, there isn’t one thing called human values. There are just separate, individually varying preferences for 8 billion humans. When a new human is born, a new version of “human values” comes into existence.
In this view, the set of “human values” from humans 100 years ago is almost completely different from the set of “human values” that exists now, since almost everyone alive 100 years ago is now dead. In effect, the passage of time is itself a catastrophe. This implies that “human values” isn’t a shared property of the human species, but rather depends on the exact set of individuals who happen to exist at any moment in time. This is, loosely speaking, a person-affecting perspective.
In the second interpretation, “human values” simply refers to a set of coordination mechanisms that we use to get along with each other, to facilitate our separate individual ends. In this interpretation, I do not think “human values” are well-modeled as an extremely narrow target inside a high-dimensional space.
Consider our most basic laws: do not murder, do not steal, do not physically assault another person. These seem like very natural ideas that could be stumbled upon by a large set of civilizations, even given wildly varying individual and cultural values between them. For example, the idea that it is wrong to steal from another person seems like a pretty natural idea that even aliens could converge on. Not all aliens would converge on such a value, but it seems plausible that enough of them would that we should not say it is an “extremely narrow target”.
In the third interpretation, “human values” are simply cultural values, and it is not clear to me why we would consider changes to this status quo to be literally catastrophic. It seems the most plausible way that cultural changes could be catastrophic is if they changed in a way that dramatically affected our institutions, laws, and norms. But in that case, it starts sounding more like “human values” is being used according to the second interpretation, and not the third.
When I think of values I think of interpretation #2, and I don’t think you prove that P4 is untrue under that interpretation. The idea is that humans are both a) constrained and b) generally inclined to follow some set of rules. An AI would be neither constrained nor necessarily inclined to follow these rules.
Consider our most basic laws: do not murder, do not steal, do not physically assault another person. These seem like very natural ideas that could be stumbled upon by a large set of civilizations, even given wildly varying individual and cultural values between them.
Virtually all historical and present atrocities are framed in terms of determining who is a person and who is not. Why would AIs see us as having moral personhood?
When I think of values I think of interpretation #2, and I don’t think you prove that P4 is untrue under that interpretation. The idea is that humans are both a) constrained and b) generally inclined to follow some set of rules. An AI would be neither constrained nor necessarily inclined to follow these rules.
P4 is about whether human values are an extremely narrow target, not about whether AIs will necessarily be inclined to follow them, or necessarily constrained by them. I agree it is logically possible for AIs to exist who would try to murder humans; indeed, there are already humans who try to do that to others. The primary question is instead about how narrow a target the value “don’t murder” or “don’t steal” is, and whether we need to put in exceptional effort in order to hit these targets.
Among humans, it seems the specific target here is not very narrow, despite our greatly varying individual objectives. In my opinion, this fact hints that our basic social coordination mechanisms are not an especially narrow target either.
Virtually all historical and present atrocities are framed in terms of determining who is a person and who is not. Why would AIs see us as having moral personhood?
Here again I would say the question is more about whether thinking that humans have relevant personhood is an extremely narrow target, not about whether AIs will necessarily see us as persons. Maybe they will see us as persons, and maybe they won’t. But the idea that they would doesn’t seem very unnatural. For one, if AIs are created within something like our current legal system, the concept of legal personhood will already be extended to humans by default. It seems pretty natural for future people to inherit legal concepts from the past. And all I’m really arguing here is that this isn’t an extremely narrow target to hit, not that it must happen by necessity.
I guess “narrow target” is just an underspecified part of your argument then, because I don’t know what it’s meant to capture if not “in most plausible scenarios, AI doesn’t follow the same set of rules as humans”.
Can you outline the case for thinking that “in most plausible scenarios, AI doesn’t follow the same set of rules as humans”? To clarify, by “same set of rules” here I’m imagining basic legal rules: do not murder, do not steal, etc. I’m not making a claim that specific legal statutes will persist over time.
It seems to me both that:
To the extent that AIs are our descendants, they should inherit our legal system, legal principles, and legal concepts, similar to how e.g. the United States inherited legal principles from the United Kingdom. We should certainly expect our legal system to change over time as our institutions adapt to technological change. But, absent a compelling reason otherwise, it seems wrong to think that “do not murder a human” will go out the window in “most plausible scenarios”.
Our basic legal rules seem pretty natural, rather than being highly contingent. It’s easy to imagine plenty of alien cultures stumbling upon the idea of property rights, and implementing the rule “do not steal from another legal person”.
My point is that AI could plausibly have rules for interacting with other “persons”, and those rules could look much like ours, but that we will not be “persons” under their code. Consider how “do not murder” has never applied to animals.
If AIs treat us like we treat animals then the fact that they have “values” will not be very helpful to us.
I think AIs will be trained on our data, and will be integrated into our culture, having been deliberately designed to fill human-shaped holes in our economy and automate labor. This means they’ll probably inherit our social concepts, in addition to most other concepts we have about the physical world. This situation seems disanalogous to the way humans interact with animals in many ways. Animals can’t even use language.
Anyway, even the framing you have given seems like a partial concession towards my original point. A rejection of premise 4 is not equivalent to the claim that AIs will automatically follow our legal norms. Instead, the question was about whether “human values” are an extremely narrow target, in the sense of whether they are a natural set of values or a contingent one that would be very hard to replicate in other circumstances.
If the way AIs relate to human values is similar to how humans relate to animals, then I’ll point out that many existing humans already find the idea of caring about animals to be quite natural, even if most ultimately decide not to take the idea very far. Compare the concept of “caring about animals” to “caring about paperclip maximization”. In the first instance, we have robust examples of people actually doing that, but hardly any examples of people in the second instance. This is after all because caring about paperclip maximization is an unnatural and arbitrary thing to care about relative to how most people conceptualize the world.
Again, I’m not saying AIs will necessarily care about human values. That was never the claim. The entire question was about whether human values are an “extremely narrow target”. And I think, within this context, given the second interpretation of human values in my original comment, the original thesis seems to have held up fine.