It seems to me that a big crux about the value of AI alignment work is what target you think AIs will ultimately be aligned to in the future in the optimistic scenario where we solve all the “core” AI risk problems to the extent they can be feasibly solved, e.g. technical AI safety problems, coordination problems, the problem of having “good” AI developers in charge etc.
There are a few targets that I’ve seen people predict AIs will be aligned to if we solve these problems: (1) “human values”, (2) benevolent moral values, (3) the values of AI developers, (4) the CEV of humanity, (5) the government’s values. My guess is that a significant source of disagreement that I have with EAs about AI risk is that I think none of these answers are actually very plausible. I’ve written a few posts explaining my views on this question already (1, 2), but I think I probably didn’t make some of my points clear enough in these posts. So let me try again.
In my view, in the most likely case, it seems that if the “core” AI risk problems are solved, AIs will be aligned to the primarily selfish individual revealed preferences of existing humans at the time of alignment. This essentially refers to the implicit value system that would emerge if, when advanced AI is eventually created, you gave the then-currently existing set of humans a lot of wealth. Call these values PSIRPEHTA (I’m working on a better acronym).
(Read my post if you want to understand my reasons for thinking that AIs will likely be aligned to PSIRPEHTA if we solve AI safety problems.)
I think it is not obvious at all that maximizing PSIRPEHTA is good from a total utilitarian perspective compared to most plausible “unaligned” alternatives. In fact, I think the main reason why you might care about maximizing PSIRPEHTA is if you think we’re close to AI and you personally think that current humans (such as yourself) should be very rich. But if you thought that, I think the arguments about the overwhelming value of reducing existential risk in e.g. Bostrom’s paper Astronomical Waste largely do not apply. Let me try to explain.
PSIRPEHTA is not the same thing as “human values” because, unlike human values, PSIRPEHTA is not consistent over time or shared between members of our species. Indeed, PSIRPEHTA changes during each generation as old people die off, and new young people are born. Most importantly, PSIRPEHTA is not our non-selfish “moral” values, except to the extent that people are regularly moved by moral arguments in the real world to change their economic consumption habits, which I claim is not actually very common (or, to the extent that it is common, I don’t think these moral values usually look much like the ideal moral values that most EAs express).
PSIRPEHTA refers to the aggregate ordinary revealed preferences of the individual actors whom the AIs will be aligned to in order to make those humans richer, i.e. their preferences as revealed by their actions, such as what they spend their income on, NOT what they think is “morally correct”. For example, according to “human values” it might be wrong to eat meat, because maybe if humans reflected long enough they’d express the conclusion that it’s wrong to hurt animals. But from the perspective of PSIRPEHTA, eating meat is generally acceptable, and empirically there’s little pressure for people to “reflect” on their values and change them.
From this perspective, the view in which it makes most sense to push for AI alignment work seems to be an obscure form of person-affecting utilitarianism in which you care mainly about the revealed preferences of humans at the time when AI is created (not the human species, but rather, the generation of humans that happens to be living when advanced AIs are created). This perspective is plausible if you really care about making currently existing humans better off materially and you think we are close to advanced AI. But I think this type of moral view is generally quite far apart from total utilitarianism, or really any other form of utilitarianism that EAs have traditionally adopted.
In a plausible “unaligned” alternative, the values of AIs would diverge from PSIRPEHTA, but this mainly has the effect of making particular collections of individual humans less rich, and making other agents in the world — particularly unaligned AI agents — more rich. That could be bad if you think that these AI agents are less morally worthy than existing humans at the time of alignment (e.g. for some reason you think AI agents won’t be conscious), but I think it’s critically important to evaluate this question carefully by measuring the “unaligned” outcome against the alternative. Most arguments I’ve seen about this topic have emphasized how bad it would be if unaligned AIs have influence in the future. But I’ve rarely seen the flipside of this argument explicitly defended: why PSIRPEHTA would be any better.
In my view, PSIRPEHTA seems like a mediocre value system, and one that I do not particularly care to maximize relative to a variety of alternatives. I definitely like PSIRPEHTA to the extent that I, my friends, family, and community are members of the set of “existing humans at the time of alignment”, but I don’t see any particularly strong utilitarian arguments for caring about PSIRPEHTA.
In other words, instead of arguing that unaligned AIs would be bad, I’d prefer to hear more arguments about why PSIRPEHTA would be better, since PSIRPEHTA just seems to me like the value system that will actually be favored if we feasibly solve all the technical and coordination AI problems that EAs normally talk about regarding AI risk.
PSIRPEHTA refers to the aggregate ordinary revealed preferences of the individual actors whom the AIs will be aligned to in order to make those humans richer, i.e. their preferences as revealed by their actions, such as what they spend their income on, NOT what they think is “morally correct”. For example, according to “human values” it might be wrong to eat meat, because maybe if humans reflected long enough they’d express the conclusion that it’s wrong to hurt animals. But from the perspective of PSIRPEHTA, eating meat is generally acceptable, and empirically there’s little pressure for people to “reflect” on their values and change them.
EDIT: I guess I’d think of human values as what people would actually just sincerely and directly endorse without further influencing them first (although maybe just asking them makes them take a position if they didn’t have one before, e.g. if they’ve never thought much about the ethics of eating meat).
I think you’re overstating the differences between revealed and endorsed preferences, including moral/human values, here. Probably only a small share of the population thinks eating meat is wrong or bad, and most probably think it’s okay. Even if people generally would find it wrong or bad after reflecting long enough (I’m not sure they actually would), that doesn’t reflect their actual values now. Actual human values do not generally find eating meat wrong.
To be clear, you can still complain that humans’ actual/endorsed values are also far from ideal and maybe not worth aligning with, e.g. because people don’t care enough about nonhuman animals or helping others. Do people care more about animals and helping others than an unaligned AI would, in expectation, though? Honestly, I’m not entirely sure. Humans may care about animal welfare somewhat, but they also specifically want to exploit animals, in large part because of their values, namely food-related taste, culture, traditions and habit. Maybe people will also want to specifically exploit artificial moral patients for their own entertainment, curiosity or scientific research on them, not just because the artificial moral patients are generically useful, e.g. for acquiring resources and power and enacting preferences (which an unaligned AI could be prone to).
I give some other examples here of the influence of human moral values on companies. This is all of course revealed preferences, but my point is that revealed preferences can importantly reflect endorsed moral values.
People influence companies in part on the basis of what they think is right through demand, boycotts, law, regulation and other political pressure.
Companies, for the most part, can’t just go around directly murdering people (companies can still harm people, e.g. through misinformation on the health risks of their products, or because people don’t care enough about the harms). (Maybe this is largely for selfish reasons; people don’t want to be killed themselves, and there’s a slippery slope if you allow exceptions.)
GPT has content policies that reflect people’s political/moral views. Social media companies have use and content policies and have kicked off various users for harassment, racism, or other things that are politically unpopular, at least among a large share of users or advertisers (which also reflect consumers). This seems pretty standard.
Many companies have boycotted Russia since the invasion of Ukraine. Many companies have also committed to sourcing only cage-free eggs after corporate outreach and campaigns, despite cage-free egg consumption being low.
X (Twitter)’s policies on hate speech have changed under Musk, presumably primarily because of his views. That seems to have cost X users and advertisers, but X is still around and popular, so it also shows that some potentially important decisions about how a technology is used are largely in the hands of the company and its leadership, not just driven by profit.
I’d likewise guess it actually makes a difference that the biggest AI labs are (I would assume) led and staffed primarily by liberals. They can push their own views onto their AI even at the cost of some profit and market share. And some things may have minimal near term consequences for demand or profit, but could be important for the far future. If the company decides to make their AI object more to various forms of mistreatment of animals or artificial consciousness, will this really cost them tons of profit and market share? And it could depend on the markets it’s primarily used in, e.g. this would matter even less for an AI that brings in profit primarily through trading stocks.
It’s also often hard to say how much something affects a company’s profits.
This essentially refers to the implicit value system that would emerge if, when advanced AI is eventually created, you gave the then-currently existing set of humans a lot of wealth. Call these values PSIRPEHTA (I’m working on a better acronym).
I basically buy that the values we get will be similar to just giving existing humans massive amounts of wealth, but I’m less sold that this will result in outcomes which are well described as “primarily selfish”.
I feel like your comment is equivocating between “the situation is similar to making existing humans massively wealthy” and “of course this will result in primarily selfish usage similar to how the median person behaves with marginal money now”.
I basically buy that the values we get will be similar to just giving existing humans massive amounts of wealth, but I’m less sold that this will result in outcomes which are well described as “primarily selfish”.
Current humans definitely seem primarily selfish (although I think they also care about their family and friends; I’m including that). Can you explain why you think giving humans a lot of wealth would turn them into something that isn’t primarily selfish? What’s the empirical evidence for that idea?
The behavior of billionaires, which maybe indicates more like 10% of income spent on altruism.
ETA: This is still literally majority selfish, but it’s also plausible that 10% altruism is pretty great and looks pretty different than “current median person behavior with marginal money”.
(See my other comment about the percent of cosmic resources.)
The idea that billionaires have 90% selfish values seems consistent with a claim of having “primarily selfish” values in my opinion. Can you clarify what you’re objecting to here?
The literal words of “primarily selfish” don’t seem that bad, but I would maybe prefer “majority selfish”?
And your top level comment seems like it’s not talking about/emphasizing the main reason to like human control, which is that maybe 10-20% of resources are spent well.
It just seemed odd to me to not mention that “primarily selfish” still involves a pretty big fraction of altruism.
I agree it’s important to talk about and analyze the (relatively small) component of human values that are altruistic. I mostly just think this component is already over-emphasized.
Here’s one guess at what I think you might be missing about my argument: 90% selfish values + 10% altruistic values isn’t the same thing as, e.g., 90% valueless stuff + 10% utopia. The 90% selfish component can have negative effects on welfare from a total utilitarian perspective that aren’t necessarily outweighed by the 10%.
90% selfish values is the type of thing that produces massive factory farming infrastructure, with a small amount of GDP spent mitigating suffering in factory farms. Does the small amount of spending mitigating suffering outweigh the large amount of spending directly causing suffering? This isn’t clear to me.
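As a toy illustration of this point (with entirely made-up numbers, just to show the structure of the claim), the sign of the total depends on whether the selfish 90% is roughly welfare-neutral or actively welfare-negative:

```python
# Toy model (hypothetical numbers): "90% selfish + 10% altruistic" is not
# the same as "90% valueless + 10% utopia", because the selfish share can
# itself produce disvalue (e.g. factory-farming-style externalities).

def total_welfare(selfish_share, altruistic_share,
                  welfare_per_selfish_unit, welfare_per_altruistic_unit):
    """Net welfare from one unit of resources, split between selfish and
    altruistic spending, given assumed welfare produced per unit spent."""
    return (selfish_share * welfare_per_selfish_unit
            + altruistic_share * welfare_per_altruistic_unit)

# If selfish spending is merely neutral, the altruistic 10% determines the sign:
print(total_welfare(0.9, 0.1, welfare_per_selfish_unit=0.0,
                    welfare_per_altruistic_unit=10.0))   # +1.0

# If selfish spending carries even modest negative externalities, the total flips:
print(total_welfare(0.9, 0.1, welfare_per_selfish_unit=-2.0,
                    welfare_per_altruistic_unit=10.0))   # -0.8
```

Whether the second case is the realistic one is exactly the factory-farming question above.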
(Alternatively, you could think that unaligned AIs will be 100% selfish, and this is clearly worse. But I’d want to understand how you could come to that conclusion, carefully. “Altruism” also encompasses a broad range of activities, and not all of it is utopian or idealistic from a total utilitarian perspective. For example, human spending on environmental conservation might be categorized as “altruism” in this framework, although personally I would say that form of spending is not very “moral” due to wild animal suffering.)
The 90% selfish component can have negative effects on welfare from a total utilitarian perspective that aren’t necessarily outweighed by the 10%.
Yep, this can be true, but I’m skeptical this will matter much in practice.
I typically think things which aren’t directly optimizing for value or disvalue won’t have intended effects which are very important, and that in the future unintended effects (externalities) won’t account for much of total value/disvalue.
When we see the selfish consumption of current very rich people, it doesn’t seem like the intentional effects are that morally good/bad relative to the best/worst uses of resources. (E.g. owning a large boat and having people think you’re high status aren’t that morally important relative to altruistic spending of similar amounts of money.) So for current very rich people the main issue would be that the economic process for producing the goods has bad externalities.
And, I expect that as technology advances, externalities reduce in moral importance relative to intended effects. Partially this is based on crazy transhumanist takes, but I feel like there is some broader perspective in which you’d expect this.
E.g. for factory farming, the ultimately cheapest way to make meat in the limit of technological maturity would very likely not involve any animal suffering.
Separately, I think externalities will probably look pretty similar for selfish resource usage for unaligned AIs and humans because most serious economic activities will be pretty similar.
Alternatively, you could think that unaligned AIs will be 100% selfish, and this is clearly worse.
I’d like to explicitly note that I don’t think this is true in expectation for a reasonable notion of “selfish”. Though maybe I think something which is sort of in this direction if we use a relatively narrow notion of altruism.
How are we defining “selfish” here? It seems like a pretty strong position to take on the topic of psychological egoism? Especially including family/friends under “selfish”?
In your original post, you say:
All that extra wealth did not make us extreme moral saints; instead, we still mostly care about ourselves, our family, and our friends.
But I don’t know, it seems that as countries and individuals get wealthier, we are on the whole getting better? Maybe factory farming acts against this, but the idea that factory farming is immoral and should be abolished exists and I think is only going to grow. I don’t think humans are just slaves to our base wants/desires, and I think that is a remarkably impoverished view of both individual human psychology and social morality.
As such, I don’t really agree with much of this post. An AGI, when built, will be able to generate new ideas and hypotheses about the world, including moral ones. A strong-but-narrow AI could be worse (e.g. optimal-factory-farm-PT), but then the right response here isn’t really technical alignment, it’s AI governance and moral persuasion in general.
This seems to underrate the arguments for Malthusian competition in the long run.
If we develop the technical capability to align AI systems with any conceivable goal, we’ll start by aligning them with our own preferences. Some people are saints, and they’ll make omnibenevolent AIs. Other people might have more sinister plans for their AIs. The world will remain full of human values, with all the good and bad that entails.
But current human values do not maximize our reproductive fitness. Maybe one human will start a cult devoted to sending self-replicating AI probes to the stars at almost light speed. That person’s values will influence far-reaching corners of the universe that later humans will struggle to reach. Another human might use their AI to persuade others to join together and fight a war of conquest against a smaller, weaker group of enemies. If they win, their prize will be hardware, software, energy, and more power that they can use to continue to spread their values.
Even if most humans are not interested in maximizing the number and power of their descendants, those who are will have the most numerous and most powerful descendants. This selection pressure exists even if the humans involved are ignorant of it; even if they actively try to avoid it.
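A minimal sketch of that selection pressure (a toy model with assumed growth rates, not a forecast): even a tiny expansionist minority ends up controlling almost everything once its holdings compound faster than everyone else’s.

```python
# Toy model (assumed numbers): a 1% expansionist minority whose resources
# compound slightly faster than everyone else's eventually dominates.

expansionist = 0.01          # initial share of resources held by expansionists
other = 0.99                 # everyone else
expansionist_growth = 1.10   # expansionists reinvest in replication/expansion
other_growth = 1.02          # others mostly consume

for generation in range(200):
    expansionist *= expansionist_growth
    other *= other_growth

print(expansionist / (expansionist + other))  # ~0.99997: expansionists hold nearly everything
```

The exact numbers are irrelevant; the point is only that any persistent growth-rate gap compounds, which is the selection pressure described above.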
I think it’s worth splitting the alignment problem into two quite distinct problems:
The technical problem of intent alignment. Solving this does not solve coordination problems: there will still be private information and coordination problems after intent alignment is solved, so fitter strategies will proliferate, and the world will be governed by values that maximize fitness.
“Civilizational alignment”? Much harder problem to solve. The traditional answer is a Leviathan, or Singleton as the cool kids have been saying. It solves coordination problems, allowing society to coherently pursue a long-run objective such as flourishing rather than fitness maximization. Unfortunately, there are coordination problems and competitive pressures within Leviathans. The person who ends up in charge is usually quite ruthless and focused on preserving their power, rather than on the stated long-run goal of the organization. And if you solve all the coordination problems, you have another problem in choosing a good long-run objective. Nothing here looks particularly promising to me, and I expect competition to continue.
Better explanations: 1, 2, 3.
This seems to underrate the arguments for Malthusian competition in the long run.
I’m mostly talking about what I expect to happen in the short-run in this thread. But I appreciate these arguments (and agree with most of them).
Plausibly my main disagreement with the concerns you raised is that I think coordination is maybe not very hard. Coordination seems to have gotten stronger over time, in the long-run. AI could also potentially make coordination much easier. As Bostrom has pointed out, historical trends point towards the creation of a Singleton.
I’m currently uncertain about whether to be more worried about a future world government becoming stagnant and inflexible. There’s a real risk that our institutions will at some point entrench an anti-innovation doctrine that prevents meaningful changes over very long time horizons out of a fear that any evolution would be too risky. As of right now I’m more worried about this potential failure mode than about the failure mode of unrestrained evolution, but it’s a close competition between the two concerns.
What percent of cosmic resources do you expect to be spent thoughtfully and altruistically? 0%? 10%?
I would guess the thoughtful and altruistic subset of resources dominates in most scenarios where humans retain control.
Then, my main argument for why human control would be good is that the fraction isn’t that small (more like 20% in expectation than 0%) and that unaligned AI takeover seems probably worse than this.
Also, as an aside, I agree that little good public argumentation has been made about the relative value of unaligned AI control vs human control. I’m sympathetic to various discussions from Paul Christiano and Joe Carlsmith, but the public scope and detail are pretty limited thus far.