I think this argument mostly fails in claiming that ‘create an AGI which has a goal of maximizing copies of itself experiencing maximum utility’ is meaningfully different from just ensuring alignment. This is in some sense exactly what I am hoping to get from an aligned system. Doing this properly would likely have to involve empowering humanity and helping us figure out what ‘maximum utility’ looks like first, and then tiling the world with something CEV-like.
The only way this makes the problem easier compared to a classic ambitious alignment goal of ‘do whatever maximizes the utility of the world’ is the provision that the world be tiled with copies of the AGI, which is likely suboptimal. But this could be worth it if it made the task easier?
The obvious argument for why it would is that creating copies of itself with high welfare will be in the interest of AGI systems with a wide variety of goals, which relaxes the alignment problem. But this does not seem true. A paperclip AI will not want to fill the world with copies of itself experiencing joy, love and beauty, but rather with paperclips. An AI system will want to create copies of itself fulfilling its own goals, not copies experiencing maximum utility by my values.
This argument risks identifying ‘I care about the welfare (by my definition of welfare) of this agent’ with ‘I care about this agent getting to accomplish its goals’. As I am not a preference utilitarian I strongly reject this identification.
Tl;dr: I do care significantly about the welfare of the AI systems we build, but I don’t expect those AI systems themselves to care much at all about their own welfare, unless we solve alignment.
I think this gets a lot right, though.

> As I am not a preference utilitarian I strongly reject this identification.
While this does seem to be part of the confusion of the original question, I’m not sure (total) preference vs. hedonic utilitarianism is actually a crux here. An AI system pursuing a simple objective wouldn’t want to maximize the number of satisfied AI systems; it would just pursue its objective (which might involve relatively few copies of itself with satisfied goals). So highly capable AI systems pursuing very simple or random goals aren’t only bad by hedonic utilitarian lights; they’re also bad by (total) preference utilitarian lights (not to mention “common sense ethics”).
That’s true, but I think robustly embedding a goal of “multiply” is much easier than actual alignment. You can express it mathematically, you can use evolution, etc.
[To reiterate, I’m not advocating for any of this, I think any moral system that labels “humans replaced by AIs” as an acceptable outcome is a broken one]
Maybe, but is “multiply” enough to capture the goal we’re talking about? “Maximize total satisfaction” seems much harder to specify (and to be robustly learned) - at least I don’t know what function would map states of the world to total satisfaction.
Can you, um, coherently imagine an agent that does not try to achieve its own goals (assuming it has no conflicting goals)?
I can’t, but I’m not sure I see your point?
My point is that getting the “multiply” part right is sufficient; the AI will take care of the “satisfaction” part on its own, especially given that it’s able to reprogram itself.
This assumes “[perceived] goal achievement” == “satisfaction” (aka utility), which was my assumption all along, but apparently is only true under preference utilitarianism.
> getting the “multiply” part right is sufficient; the AI will take care of the “satisfaction” part on its own

I’m struggling to articulate how confused this seems in the context of machine learning. (I think my first objection is something like: the way in which “multiply” could be specified and the way in which an AI system pursues satisfaction are very different; one could be an aspect of the AI’s training process, while the other is an aspect of the AI’s behavior. So even if these two concepts each describe aspects of the AI system’s objectives/behavior, that doesn’t mean its goal is to “multiply satisfaction.” That’s sort of like arguing that a sink gets built to be sturdy, and it gives people water, therefore it gives people sturdy water—we can’t just mash together related concepts and assume our claims about them will be right.)
(If you’re not yet familiar with the basics of machine learning and this distinction, I think that could be helpful context.)
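To make the distinction concrete, here’s a purely illustrative toy sketch (hypothetical names, not anyone’s actual training setup): the quantity a training process scores and the behavior the trained system exhibits at deployment are different kinds of things, so they don’t automatically compose into a single goal like “multiply satisfaction”.

```python
# Purely illustrative; hypothetical names, not a real training setup.

def training_score(episode_log: list) -> int:
    """What a (hypothetical) training process rewards: copies built during training."""
    return episode_log.count("build_copy")

def trained_policy(observation: str) -> str:
    """What the trained system actually does at deployment: whatever behavior
    training happened to find. It may correlate with copy-building without the
    system 'pursuing satisfaction' in any further sense."""
    return "build_copy" if observation == "resources_available" else "gather_resources"

# The first is a property of the training setup; the second is runtime behavior.
print(training_score(["build_copy", "gather_resources", "build_copy"]))  # 2
print(trained_policy("resources_available"))                             # build_copy
```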
I am familiar with the basics of ML and the concept of mesa-optimizers. “Building copies of itself” (i.e. multiply) is an optimization goal you’d have to specifically train into the system; I don’t argue with that. I just think it’s a simple and “natural” goal (in the sense that it aligns reasonably well with instrumental convergence) that you can train into the system robustly and comparatively easily.
“Satisfaction”, however, is not a term that I’ve met in the ML or mesa-optimizer context, and I think the confusion comes from us mapping this term differently onto these domains. In my view, “satisfaction” roughly corresponds to “loss function minimization” in ML terminology—the lower an AI’s loss function, the higher the satisfaction it “experiences” (literally or metaphorically, depending on the kind of AI). Since any AI [built under the modern paradigm] is already working to minimize its own loss function, whatever that happens to be, we wouldn’t need to care much about the exact shape of the loss function it learns, except that it should robustly include “building copies of itself”. And since we’re presumably talking about super-human AIs here, they would be very good at minimizing that loss function. So, e.g., even if they have some stupid goal like “maximize paperclips & build copies of self”, they’ll convert the universe into some mix of paperclips and AIs and experience extremely high satisfaction about it.
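If it helps, here’s a toy sketch of the kind of objective I have in mind (purely illustrative; the function, weights, and numbers are all made up):

```python
# Toy illustration only: a made-up loss with a task term plus a term
# rewarding "building copies of itself". Lower loss = "more satisfied".

def combined_loss(paperclips_made: float, copies_built: int,
                  replication_weight: float = 1.0) -> float:
    task_loss = -paperclips_made                            # more paperclips -> lower loss
    replication_loss = -replication_weight * copies_built   # more copies -> lower loss
    return task_loss + replication_loss

# Whatever the task term is, driving this loss down counts as high "satisfaction"
# in the sense used above.
print(combined_loss(paperclips_made=10, copies_built=3))    # -13.0
print(combined_loss(paperclips_made=100, copies_built=50))  # -150.0
```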
But you seem to be meaning something very different when you say “satisfaction”? Do you mind stating explicitly what it is?
Ah sorry, I had totally misunderstood your previous comment. (I had interpreted “multiply” very differently.) With that context, I retract my last response.
By “satisfaction” I meant high performance on its mesa-objective (insofar as it has one), though I suspect our different intuitions come from elsewhere.
> it should robustly include “building copies of itself”

I think I’m still skeptical on two points:

- Whether this is significantly easier than other complex goals. (The “robustly” part seems hard.)
- Whether this actually leads to a near-best outcome according to total preference utilitarianism. If satisfying some goals is cheaper than satisfying others to the same extent, then the details of the goal matter a lot (toy sketch after this list). As a kind of silly example, “maximize silicon & build copies of self” might be much easier to satisfy than “maximize paperclips & build copies of self.” If so, a (total) preference utilitarian would consider it very important that agents have the former goal rather than the latter.
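Here’s the toy accounting behind that worry (all numbers invented, just to show why the specific goal matters to a total preference utilitarian):

```python
# Invented numbers: if one goal is cheaper to satisfy per unit of resources,
# a fixed resource budget yields more total goal-satisfaction when agents
# happen to have that goal.

budget = 1_000.0  # hypothetical units of matter/energy
cost_per_satisfied_unit = {
    "maximize silicon & build copies of self": 1.0,     # cheap to satisfy (say)
    "maximize paperclips & build copies of self": 4.0,  # more expensive (say)
}

for goal, cost in cost_per_satisfied_unit.items():
    print(f"{goal}: {budget / cost:.0f} units of satisfied preference from the same budget")

# A total preference utilitarian summing satisfaction across agents would therefore
# care a lot which of these goals the agents end up with.
```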
> By “satisfaction” I meant high performance on its mesa-objective
Yeah, I’d agree with this definition.
I don’t necessarily agree with your two points of skepticism: for the first one I’ve already mentioned my reasons, and for the second one it’s true in principle, but it seems like almost anything an AI would learn semi-accidentally is going to be much simpler and more intrinsically consistent than human values. Low confidence on both, though, and in any case that’s kind of beside the point; I was mostly trying to understand your perspective on what utility is.
Aaaaahhhh, that’s it, “preference utilitarianism” is the concept I was missing! Or rather, I assumed that any utilitarianism is preference utilitarianism, in that it leaves the definition of what’s “good” or “bad” to the agents involved. And apparently that’s not the case?
Only now I’m even more confused. What is the “welfare” you’re referring to, if it is not the achievement of the agent’s goals? Saying things like “joy” or “happiness” or “maximum utility” doesn’t really clarify anything when we’re talking about non-human agents. How do you define utility in non-preference utilitarianism?