In this “quick take”, I want to summarize some of my idiosyncratic views on AI risk.
My goal here is to list just a few ideas that cause me to approach the subject differently from how I perceive most other EAs view the topic. These ideas largely push me toward being more optimistic about AI, and less likely to support heavy regulations on it.
(Note that I won’t spend a lot of time justifying each of these views here. I’m mostly stating these points without lengthy justifications, in case anyone is curious. These ideas can perhaps inform why I spend significant amounts of my time pushing back against AI risk arguments. Not all of these ideas are rare, and some of them may indeed be popular among EAs.)
Skepticism of the treacherous turn: The treacherous turn is the idea that (1) at some point there will be a very smart unaligned AI, (2) when weak, this AI will pretend to be nice, but (3) when sufficiently strong, this AI will turn on humanity by taking over the world by surprise, and then (4) optimize the universe without constraint, which would be very bad for humans.
By contrast, I find it more likely that no individual AI will ever be strong enough to take over the world, in the sense of overthrowing the world’s existing institutions and governments by surprise. Instead, I broadly expect unaligned AIs will integrate into society and try to accomplish their goals by advocating for legal rights, rather than trying to overthrow our institutions by force. Upon attaining legal personhood, unaligned AIs can use their legal rights to achieve their objectives, for example by getting a job and trading their labor for property, within already-existing institutions. Because the world is not zero-sum, and there are economic benefits to scale and specialization, this argument implies that unaligned AIs may well have a net-positive effect on humans, as they could trade with us, producing value in exchange for our own property and services.
Note that my claim here is not that AIs will never become smarter than humans. One way of seeing how these two claims are distinguished is to compare my scenario to the case of genetically engineered humans. If we genetically engineered humans, they would presumably eventually surpass ordinary humans in intelligence (along with social persuasion ability, ability to deceive, etc.). However, by itself, the fact that genetically engineered humans would become smarter than non-engineered humans does not imply that they would try to overthrow the government. Instead, as in the case of AIs, I expect genetically engineered humans would largely try to work within existing institutions, rather than violently overthrow them.
AI alignment will probably be somewhat easy: The most direct and strongest current empirical evidence we have about the difficulty of AI alignment, in my view, comes from existing frontier LLMs, such as GPT-4. Having spent dozens of hours testing GPT-4’s abilities and moral reasoning, I think the system is already substantially more law-abiding, thoughtful and ethical than a large fraction of humans. Most importantly, this ethical reasoning extends (in my experience) to highly unusual thought experiments that almost certainly did not appear in its training data, demonstrating a fair degree of ethical generalization, beyond mere memorization.
It is conceivable that GPT-4’s apparently ethical nature is fake. Perhaps GPT-4 is lying about its motives to me and in fact desires something completely different from what it professes to care about. Maybe GPT-4 merely “understands” or “predicts” human morality without actually “caring” about human morality. But while these scenarios are logically possible, they seem less plausible to me than the simple alternative explanation that alignment—like many other properties of ML models—generalizes well, in the natural way that you might similarly expect from a human.
Of course, the fact that GPT-4 is easily alignable does not immediately imply that smarter-than-human AIs will be easy to align. However, I think this current evidence is still significant, and aligns well with prior theoretical arguments that alignment would be easy. In particular, I am persuaded by the argument that, because evaluation is usually easier than generation, it should be feasible to accurately evaluate whether a slightly-smarter-than-human AI is taking bad actions, allowing us to shape its rewards during training accordingly. After we’ve aligned a model that’s merely slightly smarter than humans, we can use it to help us align even smarter AIs, and so on, plausibly implying that alignment will scale to indefinitely higher levels of intelligence, without necessarily breaking down at any physically realistic point.
The default social response to AI will likely be strong: One reason to support heavy regulations on AI right now is if you think the natural “default” social response to AI will lean more heavily toward laissez faire than is optimal, i.e., by default, we will have too little regulation rather than too much. In this case, you could believe that, by advocating for regulations now, you’re making it more likely that we regulate AI a bit more than we otherwise would have, pushing us closer to the optimal level of regulation.
I’m quite skeptical of this argument because I think that the default response to AI (in the absence of intervention from the EA community) will already be quite strong. My view here is informed by the base rate of technologies being overregulated, which I think is quite high. In fact, it is difficult for me to name even a single technology that I think is currently clearly underregulated by society. By pushing for more regulation on AI, I think it’s likely that we will overshoot and over-constrain AI relative to the optimal level.
In other words, my personal bias is towards thinking that society will regulate technologies too heavily, rather than too loosely. And I don’t see a strong reason to think that AI will be any different from this general historical pattern. This makes me hesitant to push for more regulation on AI, since on my view, the marginal impact of my advocacy would likely be to push us even further in the direction of “too much regulation”, overshooting the optimal level by even more than what I’d expect in the absence of my advocacy.
I view unaligned AIs as having comparable moral value to humans: The basic idea behind this point is that, under various physicalist views of consciousness, you should expect AIs to be conscious, even if they do not share human preferences. Moreover, it seems likely that AIs — even ones that don’t share human preferences — will be pretrained on human data, and therefore largely share our social and moral concepts.
Since unaligned AIs will likely be both conscious and share human social and moral concepts, I don’t see much reason to think of them as less “deserving” of life and liberty, from a cosmopolitan moral perspective. They will likely think similarly to the way we do across a variety of relevant axes, even if their neural structures are quite different from our own. As a consequence, I am pretty happy to incorporate unaligned AIs into the legal system and grant them some control of the future, just as I’d be happy to grant some control of the future to human children, even if they don’t share my exact values.
Put another way, I view (what I perceive as) the EA attempt to privilege “human values” over “AI values” as being largely arbitrary and baseless, from an impartial moral perspective. There are many humans whose values I vehemently disagree with, but I nonetheless respect their autonomy, and do not wish to deny these humans their legal rights. Likewise, even if I strongly disagreed with the values of an advanced AI, I would still see value in their preferences being satisfied for their own sake, and I would try to respect the AI’s autonomy and legal rights. I don’t have a lot of faith in the inherent kindness of human nature relative to a “default unaligned” AI alternative.
I’m not fully committed to longtermism: I think AI has an enormous potential to benefit the lives of people who currently exist. I predict that AIs can eventually substitute for human researchers, and thereby accelerate technological progress, including in medicine. In combination with my other beliefs (such as my belief that AI alignment will probably be somewhat easy), this view leads me to think that AI development will likely be net-positive for people who exist at the time of its development. In other words, if we allow AI development, it is likely that we can use AI to reduce human mortality, and dramatically raise human well-being for the people who already exist.
I think these benefits are large and important, and commensurate with the downside potential of existential risks. A fully committed strong longtermist might scoff at the idea that curing aging is important, since it would largely have only short-term effects rather than long-term effects that reverberate for billions of years; by contrast, I think it’s really important to try to improve the lives of people who currently exist. Many people view this perspective as a form of moral partiality that we should discard for being arbitrary. However, I think morality is itself arbitrary: it can be anything we want it to be. And I choose to value currently existing humans, to a substantial (though not overwhelming) degree.
This doesn’t mean I’m a fully committed near-termist. I sympathize with many of the intuitions behind longtermism. For example, if curing aging required raising the probability of human extinction by 40 percentage points, or something like that, I don’t think I’d do it. But in more realistic scenarios that we are likely to actually encounter, I think it’s plausibly a lot better to accelerate AI, rather than delay AI, on current margins. This view simply makes sense to me given the enormously positive effects I expect AI will likely have on the people I currently know and love, if we allow development to continue.
I want to say thank you for holding the pole of these perspectives and keeping them in the dialogue. I think that they are important and it’s underappreciated in EA circles how plausible they are.
(I definitely don’t agree with everything you have here, but typically my view is somewhere between what you’ve expressed and what is commonly expressed in x-risk focused spaces. Often also I’m drawn to say “yeah, but …”—e.g. I agree that a treacherous turn is not so likely at global scale, but I don’t think it’s completely out of the question, and given that I think it’s worth serious attention safeguarding against.)
In fact, it is difficult for me to name even a single technology that I think is currently underregulated by society.
The obvious example would be synthetic biology, gain-of-function research, and similar.
I also think AI itself is currently massively underregulated even entirely ignoring alignment difficulties. I think the probability of the creation of AI capable of accelerating AI R&D by 10x this year is around 3%. It would be extremely bad for US national interests if such an AI was stolen by foreign actors. This suffices for regulation ensuring very high levels of security IMO. And this is setting aside ongoing IP theft and similar issues.
In particular, I am persuaded by the argument that, because evaluation is usually easier than generation, it should be feasible to accurately evaluate whether a slightly-smarter-than-human AI is taking unethical actions, allowing us to shape its rewards during training accordingly. After we’ve aligned a model that’s merely slightly smarter than humans, we can use it to help us align even smarter AIs, and so on, plausibly implying that alignment will scale to indefinitely higher levels of intelligence, without necessarily breaking down at any physically realistic point.
This reasoning seems to imply that you could use GPT-2 to oversee GPT-4 by bootstrapping from a chain of models of scales between GPT-2 and GPT-4. However, this isn’t true: the weak-to-strong generalization paper finds that this doesn’t work, and indeed that bootstrapping like this doesn’t help at all for ChatGPT reward modeling (it helps on chess puzzles but, I believe, on nothing else they investigate).
I think this sort of bootstrapping argument might work if we could ensure that each model in the chain was sufficiently aligned and capable of reasoning that it would carefully reason about what humans would want if they were more knowledgeable and then rate outputs based on this. However, I don’t think GPT-4 is either aligned enough or capable enough that we see this behavior. And I still think it’s unlikely it works even under these generous assumptions (though I won’t argue for this here).
I’m curious why there hasn’t been more work exploring a pro-AI or pro-AI-acceleration position from an effective altruist perspective. Some points:
Unlike existential risk from other sources (e.g. an asteroid), AI x-risk is unique because humans would be replaced by other beings, rather than completely dying out. This means you can’t simply apply a naive argument that AI threatens total extinction of value to make the case that AI safety is astronomically important, in the sense that you can for other x-risks. You generally need additional assumptions.
Total utilitarianism is generally seen as non-speciesist, and therefore has no intrinsic preference for human values over unaligned AI values. If AIs are conscious, there don’t appear to be strong prima facie reasons for preferring humans to AIs under hedonistic utilitarianism. Under preference utilitarianism, it doesn’t necessarily matter whether AIs are conscious.
Total utilitarianism generally recommends large population sizes. Accelerating AI can be modeled as a kind of “population accelerationism”. Extremely large AI populations could be preferable under utilitarianism compared to small human populations, even those with high per-capita incomes (see the sketch after this list). Indeed, human populations have recently stagnated due to low growth rates, and AI promises to lift this bottleneck.
Therefore, AI accelerationism seems straightforwardly recommended by total utilitarianism under some plausible theories.
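As a minimal sketch of the population point above (the symbols N and \bar{w}, for population size and average welfare, are my own illustrative ones and do not appear in the original comment), total utilitarianism compares outcomes by total welfare:

```latex
\[
U_{\text{total}} = N \cdot \bar{w}
\qquad\Longrightarrow\qquad
N_{\text{AI}}\,\bar{w}_{\text{AI}} > N_{\text{human}}\,\bar{w}_{\text{human}}
\quad \text{whenever} \quad
\frac{N_{\text{AI}}}{N_{\text{human}}} > \frac{\bar{w}_{\text{human}}}{\bar{w}_{\text{AI}}}.
\]
```

That is, on total utilitarian grounds a sufficiently large AI population outweighs a smaller, richer human population, provided average AI welfare is not proportionally lower.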
Here’s a non-exhaustive list of guesses for why I think EAs haven’t historically been sympathetic to arguments like the one above, and have instead generally advocated AI safety over AI acceleration (at least when these two values conflict):
A belief that AIs won’t be conscious, and therefore won’t have much moral value compared to humans.
But why would we assume AIs won’t be conscious? For example, if Brian Tomasik is right, consciousness is somewhat universal, rather than being restricted to humans or members of the animal kingdom.
I also haven’t actually seen much EA literature defend this assumption explicitly, which would be odd if this belief is the primary reason EAs have for focusing on AI safety over AI acceleration.
A presumption in favor of human values over unaligned AI values for some reasons that aren’t based on strict impartial utilitarian arguments. These could include the beliefs that: (1) Humans are more likely to have “interesting” values compared to AIs, and (2) Humans are more likely to be motivated by moral arguments than AIs, and are more likely to reach a deliberative equilibrium of something like “ideal moral values” compared to AIs.
Why would humans be more likely to have “interesting” values than AIs? It seems very plausible that AIs will have interesting values even if their motives seem alien to us. AIs might have even more “interesting” values than humans.
It seems to me like wishful thinking to assume that humans are strongly motivated by moral arguments and would settle upon something like “ideal moral values”.
A belief that population growth is inevitable, so it is better to focus on AI safety.
But a central question here is why pushing for AI safety—in the sense of AI research that enhances human interests—is better than the alternative on the margin. What reason is there to think AI safety now is better than pushing for greater AI population growth now? (Potential responses to this question are outlined in other bullet points above and below.)
AI safety has lasting effects due to a future value lock-in event, whereas accelerationism would have, at best, temporary effects.
Are you sure there will ever actually be a “value lock-in event”?
Even if there is at some point a value lock-in event, wouldn’t pushing for accelerationism also plausibly affect the values that are locked in? For example, the value of “population growth is good” seems more likely to be locked in, if you advocate for that now.
A belief that humans would be kinder and more benevolent than unaligned AIs.
Humans seem pretty bad already. For example, humans are responsible for factory farming. It’s plausible that AIs could be even more callous and morally indifferent than humans, but the bar already seems low.
I’m also not convinced that moral values will be a major force shaping “what happens to the cosmic endowment”. It seems to me that the forces shaping economic consumption matter more than moral values.
A bedrock heuristic that it would be extraordinarily bad if “we all died from AI”, and therefore we should pursue AI safety over AI accelerationism.
But it would also be bad if we all died from old age while waiting for AI, and missed out on all the benefits that AI offers to humans, which is a point in favor of acceleration. Why would this heuristic be weaker?
An adherence to person-affecting views in which the values of currently-existing humans are what matter most; and a belief that AI threatens to kill existing humans.
But in this view, AI accelerationism could easily be favored since AIs could greatly benefit existing humans by extending our lifespans and enriching our lives with advanced technology.
An implicit acceptance of human supremacism, i.e. the idea that what matters is propagating the interests of the human species, or preserving the human species, even at the expense of individual interests (either within humanity or outside humanity) or the interests of other species.
But isn’t EA known for being unusually anti-speciesist compared to other communities? Peter Singer is often seen as a “founding father” of the movement, and a huge part of his ethical philosophy was about how we shouldn’t be human supremacists.
More generally, it seems wrong to care about preserving the “human species” in an abstract sense relative to preserving the current generation of actually living humans.
A belief that most humans are biased towards acceleration over safety, and therefore it is better for EAs to focus on safety as a useful correction mechanism for society.
But was an anti-safety bias common for previous technologies? I think something closer to the opposite is probably true: most humans seem, if anything, biased towards being overly cautious about new technologies rather than overly optimistic.
A belief that society is massively underrating the potential for AI, which favors extra work on AI safety, since it’s so neglected.
But if society is massively underrating AI, then shouldn’t this also favor accelerating AI? There doesn’t seem to be an obvious asymmetry between these two values.
An adherence to negative utilitarianism, which would favor obstructing AI, along with any other technology that could enable the population of conscious minds to expand.
This seems like a plausible moral argument to me, but it doesn’t seem like a very popular position among EAs.
A heuristic that “change is generally bad” and AI represents a gigantic change.
I don’t think many EAs would defend this heuristic explicitly.
Added: AI represents a large change to the world. Delaying AI therefore preserves option value.
This heuristic seems like it would have favored advocating delaying the industrial revolution, and all sorts of moral, social, and technological changes to the world in the past. Is that a position that EAs would be willing to bite the bullet on?
My understanding is that relatively few EAs are actual hardcore classic hedonist utilitarians. I think this is ~sufficient to explain why more haven’t become accelerationists.
Have you cornered a classic hedonist utilitarian EA and asked them? Have you cornered three? What did they say?
Don’t know why this is being disagree-voted. I think point 1 is basically correct: one doesn’t have to diverge far from being a “hardcore classic hedonist utilitarian” to not support the case Matthew makes in the OP.
I think a more important reason is the additional value of the information and the option value. It’s very likely that the change resulting from AI development will be irreversible. Since we’re still able to learn about AI as we study it, taking additional time to think and plan before training the most powerful AI systems seems to reduce the likelihood of being locked into suboptimal outcomes. Increasing the likelihood of achieving “utopia” rather than landing in “mediocrity” by 2 percent seems far more important than speeding up utopia by 10 years.
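As a rough back-of-the-envelope version of that comparison (the symbols V_u, V_m, and T are my own illustrative ones, not the commenter’s): let V_u and V_m be the value per year of “utopia” and “mediocrity”, let T be the number of years the outcome persists, and assume the pre-utopia baseline contributes roughly zero value. Then the two interventions compare roughly as:

```latex
\[
\underbrace{0.02\,(V_u - V_m)\,T}_{\text{a 2\% higher chance of utopia instead of mediocrity}}
\quad \text{vs.} \quad
\underbrace{10\,V_u}_{\text{utopia arrives 10 years sooner}}
\]
```

With V_m well below V_u, the left-hand side dominates whenever 0.02·T is much greater than 10 years, i.e. whenever T is much longer than roughly 500 years, which holds easily on the timescales this comment has in mind.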
It’s very likely that whatever change that comes from AI development will be irreversible.
I think all actions are in a sense irreversible, but large changes tend to be less reversible than small changes. In this sense, the argument you gave seems reducible to “we should generally delay large changes to the world, to preserve option value”. Is that a reasonable summary?
In this case I think it’s just not obvious that delaying large changes is good. Would it have been good to delay the industrial revolution to preserve option value? I think this heuristic, if used in the past, would have generally demanded that we “pause” all sorts of social, material, and moral progress, which seems wrong.
I don’t think we would have been able to use the additional information we would have gained from delaying the industrial revolution, but if we could have, I think the answer might be “yes”. It’s easy to see in hindsight that it went well overall, but that doesn’t mean that the correct ex ante attitude shouldn’t have been caution!
AI x-risk is unique because humans would be replaced by other beings, rather than completely dying out. This means you can’t simply apply a naive argument that AI threatens total extinction of value
Paul Christiano wrote a piece a few years ago about ensuring that misaligned ASI is a “good successor” (in the moral value sense),[1] as a plan B to alignment (Medium version; LW version). I agree it’s odd that there hasn’t been more discussion since.[2]
Here’s a non-exhaustive list of guesses for why I think EAs haven’t historically been sympathetic [...]: A belief that AIs won’t be conscious, and therefore won’t have much moral value compared to humans.
accelerationism would have, at best, temporary effects
I’m confused by this point, and for me this is the overriding crux between my view and yours. Do you really not think accelerationism could have permanent effects, through making AI takeover, or some other irredeemable outcome, more likely?
Are you sure there will ever actually be a “value lock-in event”?
Although, Paul’s argument routes through acausal cooperation—see the piece for details—rather than through the ASI being morally valuable in itself. (And perhaps OP means to focus on the latter issue.) In Paul’s words:
Clarification: Being good vs. wanting good
We should distinguish two properties an AI might have:
- Having preferences whose satisfaction we regard as morally desirable.
- Being a moral patient, e.g. being able to suffer in a morally relevant way.
These are not the same. They may be related, but they are related in an extremely complex and subtle way. From the perspective of the long-run future, we mostly care about the first property.
Under purely longtermist views, accelerating AI by 1 year increases available cosmic resources by 1 part in 10 billion. This is tiny. So the first order effects of acceleration are tiny from a longtermist perspective.
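For a back-of-the-envelope reading of that figure (my own framing, in the spirit of Bostrom-style “astronomical waste” arguments): if reachable cosmic resources slip away at a roughly constant rate over a horizon on the order of ten billion years, then

```latex
\[
\frac{\text{extra resources from 1 year of acceleration}}{\text{total reachable resources}}
\;\approx\; \frac{1\ \text{year}}{10^{10}\ \text{years}}
\;=\; 1 \text{ part in 10 billion}.
\]
```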
Thus, a purely longtermist perspective doesn’t care about the direct effects of delay/acceleration and the question would come down to indirect effects.
I can see indirect effects going either way, but delay seems better on current margins (this might depend on how much optimism you have on current AI safety progress, governance/policy progress, and whether you think humanity retaining control relative to AIs is good or bad). All of these topics have been explored and discussed to some extent.
I expect there hasn’t been much investigation of accelerating AI to advance the preferences of currently existing people because this exists at a point on the crazy train that very few people are at. See also the curse of cryonics:
the “curse of cryonics” is when a problem is both weird and very important, but it’s sitting right next to other weird problems that are even more important, so everyone who’s able to notice weird problems works on something else instead.
Under purely longtermist views, accelerating AI by 1 year increases available cosmic resources by 1 part in 10 billion. This is tiny.
Tiny compared to what? Are you assuming we can take some other action whose consequences don’t wash out over the long-term, e.g. because of a value lock-in? In general, these assumptions just seem quite weak and underspecified to me.
What exactly is the alternative action that has vastly greater value in expectation, and why does it have greater value? If what you mean is that we can try to reduce the risk of extinction instead, keep in mind that my first bullet point preempted pretty much this exact argument:
Unlike existential risk from other sources (e.g. an asteroid), AI x-risk is unique because humans would be replaced by other beings, rather than completely dying out. This means you can’t simply apply a naive argument that AI threatens total extinction of value to make the case that AI safety is astronomically important, in the sense that you can for other x-risks. You generally need additional assumptions.
What exactly is the alternative action that has vastly greater value in expectation, and why does it have greater value?
Ensuring human control throughout the singularity rather than having AIs get control very obviously has relatively massive effects. Of course, we can debate the sign here, I’m just making a claim about the magnitude.
I’m not talking about extinction of all smart beings on earth (AIs and humans), which seems like a small fraction of existential risk.
(Separately, the badness of such extinction seems maybe somewhat overrated because pretty likely intelligent life will just re-evolve in the next 300 million years. Intelligent life doesn’t seem that contingent. Also aliens.)
I think it remains the case that the value of accelerating AI progress is tiny relative to other apparently available interventions, such as ensuring that AIs are sentient or improving their expected well-being conditional on their being sentient. The case for focusing on how a transformative technology unfolds, rather than on when it unfolds,[1] seems robust to a relatively wide range of technologies and assumptions. Still, this seems worth further investigation.
Indeed, it seems that when the transformation unfolds is primarily important because of how it unfolds, insofar as the quality of a transformation is partly determined by its timing.
I’m claiming that it is not actually clear that we can take actions that don’t merely wash out over the long-term. In this case, you cannot simply assume that we can meaningfully and predictably affect how valuable the long-term future will be in, for example, billions of years. I agree that, yes, if you assume we can meaningfully affect the very long-run, then all actions that merely have short-term effects will have “tiny” impacts by comparison. But the assumption that we can meaningfully and predictably affect the long-run is precisely the thing that needs to be argued. I think it’s important for EAs to try to be more rigorous about their empirical claims here.
Moreover, actions that have short-term effects can generally be assumed to have longer term effects if our actions propagate. For example, support for larger population sizes now would presumably increase the probability that larger population sizes exist in the very long run, compared to the alternative of smaller population sizes with high per capita incomes. It seems arbitrary to assume this effect will be negligible but then also assume other competing effects won’t be negligible. I don’t see any strong arguments for this position.
I was trying to hint at prima facie plausible ways in which the present generation can increase the value of the long-term future by more than one part in billions, rather than “assume” that this is the case, though of course I never gave anything resembling a rigorous argument.
I do agree that the “washing out” hypothesis is a reasonable default and that one needs a positive reason for expecting our present actions to persist into the long-term. One seemingly plausible mechanism is influencing how a transformative technology unfolds: it seems that the first generation that creates AGI has significantly more influence on how much artificial sentience there is in the universe a trillion years from now than, say, the millionth generation. Do you disagree with this claim?
I’m not sure I understand the point you make in the second paragraph. What would be the predictable long-term effects of hastening the arrival of AGI in the short-term?
I was trying to hint at prima facie plausible ways in which the present generation can increase the value of the long-term future by more than one part in billions, rather than “assume” that this is the case, though of course I never gave anything resembling a rigorous argument.
As I understand, the argument originally given was that there was a tiny effect of pushing for AI acceleration, which seems outweighed by unnamed and gigantic “indirect” effects in the long-run from alternative strategies of improving the long-run future. I responded by trying to get more clarity on what these gigantic indirect effects actually are, how we can predictably bring them about, and why we would think it’s plausible that we could bring them about in the first place. From my perspective, the shape of this argument looks something like:
Your action X has this tiny positive near-term effect (ETA: or a tiny direct effect)
My action Y has this large positive long-term effect (ETA: or a large indirect effect)
Therefore, Y is better than X.
Do you see the flaw here? Well, both X and Y could have long-term effects! So, it’s not sufficient to compare the short-term effect of X to the long-term effect of Y. You need to compare both effects, on both time horizons. As far as I can tell, I haven’t seen any argument in this thread that analyzed and compared the long-term effects in any detail, except perhaps in Ryan Greenblatt’s original comment, in which he linked to some other comments about a similar topic in a different thread (but I still don’t see what the exact argument is).
More generally, I think you’re probably trying to point to some concept you think is obvious and clear here, and I’m not seeing it, which is why I’m asking you to be more precise and rigorous about what you’re actually claiming.
I’m not sure I understand the point you make in the second paragraph. What would be the predictable long-term effect of hastening the arrival of AGI in the short-term?
In my original comment I pointed towards a mechanism. Here’s a more precise characterization of the argument:
Total utilitarianism generally supports, all else being equal, larger population sizes with low per capita incomes over small population sizes with high per capita incomes.
To the extent that our actions do not “wash out”, it seems reasonable to assume that pushing for large population sizes now would make it more likely in the long-run that we get large population sizes with low per-capita incomes compared to a small population size with high per capita incomes. (Keep in mind here that I’m not making any claim about the total level of resources.)
To respond to this argument you could say that in fact our actions do “wash out” here, so as to make the effect of pushing for larger population sizes rather small in the long run. But in response to that argument, I claim that this objection can be reversed and applied to almost any alternative strategy for improving the future that you might think is actually better. (In other words, I actually need to see your reasons for why there’s an asymmetry here; and I don’t currently see these reasons.)
Alternatively, you could just say that total utilitarianism is unreasonable and a bad ethical theory, but my original comment was about analyzing the claim about accelerating AI from the perspective of total utilitarianism, which, as a theory, seems to be relatively popular among EAs. So I’d prefer to keep this discussion grounded within that context.
Yes, I agree that we should consider the long-term effects of each intervention when comparing them. I focused on the short-term effects of hastening AI progress because it is those effects that are normally cited as the relevant justification in EA/utilitarian discussions of that intervention. For instance, those are the effects that Bostrom considers in ‘Astronomical waste’. Conceivably, there is a separate argument that appeals to the beneficial long-term effects of AI capability acceleration. I haven’t considered this argument because I haven’t seen many people make it, so I assume that accelerationist types tend to believe that the short-term effects dominate.
I think Bostrom’s argument merely compares a pure x-risk (such as a huge asteroid hurtling towards Earth) relative to technological acceleration, and then concludes that reducing the probability of a pure x-risk is more important because the x-risk threatens the eventual colonization of the universe. I agree with this argument in the case of a pure x-risk, but as I noted in my original comment, I don’t think that AI risk is a pure x-risk.
If, by contrast, all we’re doing by doing AI safety research is influencing something like “the values of the agents in society in the future” (and not actually influencing the probability of eventual colonization), then this action seems to plausibly just wash out in the long-term. In this case, it seems very appropriate to compare the short-term effects of AI safety to the short-term effects of acceleration.
Let me put it another way. We can think about two (potentially competing) strategies for making the future better, along with their relevant short and possible long-term effects:
Doing AI safety research
Short-term effects: makes it more likely that AIs are kind to current or near-future humans
Possible long-term effect: makes it more likely that AIs in the very long-run will share the values of the human species, relative to some unaligned alternative
Accelerating AI
Short-term effect: helps current humans by hastening the arrival of advanced technology
Possible long-term effect: makes it more likely that we have a large population size at low per capita incomes, relative to a low population size with high per capita income
My opinion is that both of these long-term effects are very speculative, so it’s generally better to focus on a heuristic of doing what’s better in the short-term, while keeping the long-term consequences in mind. And when I do that, I do not come to a strong conclusion that AI safety research “beats” AI acceleration, from a total utilitarian perspective.
Your action X has this tiny positive near-term effect.
My action Y has this large positive long-term effect.
Therefore, Y is better than X.
To be clear, this wasn’t the structure of my original argument (though it might be Pablo’s). My argument was more like “you seem to be implying that action X is good because of its direct effect (literal first order acceleration), but actually the direct effect is small when considered from a particular perspective (longtermism), so from that perspective we need to consider indirect effects, and the analysis for that looks pretty different”.
Note that I wasn’t really trying to argue much about the sign of the indirect effect, though people have indeed discussed this in some detail in various contexts.
I agree your original argument was slightly different than the form I stated. I was speaking too loosely, and conflated what I thought Pablo might be thinking with what you stated originally.
I think the important claim from my comment is “As far as I can tell, I haven’t seen any argument in this thread that analyzed and compared the long-term effects in any detail, except perhaps in Ryan Greenblatt’s original comment, in which he linked to some other comments about a similar topic in a different thread (but I still don’t see what the exact argument is).”
I think the important claim from my comment is “As far as I can tell, I haven’t seen any argument in this thread that analyzed and compared the long-term effects in any detail, except perhaps in Ryan Greenblatt’s original comment, in which he linked to some other comments about a similar topic in a different thread (but I still don’t see what the exact argument is).”
Explicitly confirming that this seems right to me.
Moreover, actions that have short-term effects can generally be assumed to have longer term effects if our actions propagate.
I don’t disagree with this. I was just claiming that the “indirect” effects dominate (by indirect, I just mean effects other than shifting the future closer in time).
There is still the question of indirect/direct effects.
I was just claiming that the “indirect” effects dominate (by indirect, I just mean effects other than shifting the future closer in time).
I understand that. I wanted to know why you thought that. I’m asking for clarity. I don’t currently understand your reasons. See this recent comment of mine for more info.
I generally agree that we should be more concerned about this. In particular, I find the reasoning of people who will happily endorse “Shut Up and Multiply” sentiment but reject this consideration suspect.
A more extreme version of this is that, given the massively greater efficiency with which a digital consciousness could convert matter and energy to utilons (IIRC naively about 3 orders of magnitude according to Bostrom, before any increase from greater coordination), on strict expected value reasoning you have to be extremely confident that this won’t happen—or at least have a much stronger rebuttal than ‘AI won’t necessarily be conscious’.
Separately, I think there might be a case for accelerationism even if you think it increases the risk of AI takeover and that AI takeover is bad, on the grounds that in many scenarios advancing faster might still increase the probability of human descendants getting through the time of perils before some other threat destroys us (every year we remain in our current state is another year in which we run the risk of, for example, a global nuclear war or civilisation-ending pandemic).
A more extreme version of this is that, given the massively greater efficiency with which a digital consciousness could convert matter and energy to utilons
I have a post where I conclude the above may well apply not only to digital consciousness, but also to animals:
I calculated the welfare ranges per calorie consumption for a few species.
They vary a lot. The values for bees and pigs are 4,880 and 0.473 times as high as that for humans, respectively.
They are higher for non-human animals:
5 of the 6 species I analysed have values higher than that of humans.
The lower the calorie consumption, the higher the median welfare range per calorie consumption.
A lot of these points seem like arguments that it’s possible that unaligned AI takeover will go well, e.g. there’s no reason not to think that AIs are conscious, or will have interesting moral values, or etc.
My stance is that we (more-or-less) know humans are conscious and have moral values that, while they have failed to prevent large amounts of harm, seem to have the potential to be good. AIs may be conscious and may have welfare-promoting values, but we don’t know that yet. We should try to better understand whether AIs are worthy successors before transitioning power to them.
Probably a core point of disagreement here is whether, presented with a “random” intelligent actor, we should expect it to promote welfare or prevent suffering “by default”. My understanding is that some accelerationists believe that we should. I believe that we shouldn’t. Moreover I believe that it’s enough to be substantially uncertain about whether this is or isn’t the default to want to take a slower and more careful approach.
My stance is that we (more-or-less) know humans are conscious and have moral values that, while they have failed to prevent large amounts of harm, seem to have the potential to be good.
I claim there’s a weird asymmetry here where you’re happy to put trust into humans because they have the “potential” to do good, but you’re not willing to say the same for AIs, even though they seem to have the same type of “potential”.
Whatever your expectations about AIs, we already know that humans are not blank slates that may or may not be altruistic in the future: we actually have a ton of evidence about the quality and character of human nature, and it doesn’t make humans look great. Humans are not mainly described as altruistic creatures. I mentioned factory farming in my original comment, but one can examine the way people spend their money (i.e. not mainly on charitable causes), or the history of genocides, war, slavery, and oppression for additional evidence.
Probably a core point of disagreement here is whether, presented with a “random” intelligent actor, we should expect it to promote welfare or prevent suffering “by default”.
I don’t expect humans to “promote welfare or prevent suffering” by default either. Look at the current world. Have humans, on net, reduced or increased suffering? Even if you think humans have been good for the world, it’s not obvious. Sure, it’s easy to dismiss the value of unaligned AIs if you compare against some idealistic baseline; but I’m asking you to compare against a realistic baseline, i.e. actual human nature.
It seems like you’re just substantially more pessimistic than I am about humans. I think factory farming will be ended, and though it seems like humans have caused more suffering than happiness so far, I think their default trajectory will be to eventually stop doing that, and to ultimately do enough good to outweigh their ignoble past. I don’t think this is certain by any means, but I think it’s a reasonable extrapolation. (I maybe don’t expect you to find it a reasonable extrapolation.)
Meanwhile I expect the typical unaligned AI may seize power for some purpose that seems to us entirely trivial, and may be uninterested in doing any kind of moral philosophy, and/or may not place any terminal (rather than instrumental) value in paying attention to other sentient experiences in any capacity. I do think humans, even with their kind of terrible track record, are more promising than that baseline, though I can see why other people might think differently.
Sure, it’s easy to dismiss the value of unaligned AIs if you compare against some idealistic baseline; but I’m asking you to compare against a realistic baseline, i.e. actual human nature.
I haven’t read your entire post about this, but I understand you believe that if we created aligned AI, it would get essentially “current” human values, rather than e.g. some improved / more enlightened iteration of human values. If instead you believed the latter, that would set a significantly higher bar for unaligned AI, right?
If instead you believed the latter, that would set a significantly higher bar for unaligned AI, right?
That’s right, if I thought human values would improve greatly in the face of enormous wealth and advanced technology, I’d definitely be open to seeing humans as special and extra valuable from a total utilitarian perspective. Note that many routes through which values could improve in the future could apply to unaligned AIs too. So, for example, I’d need to believe that humans would be more likely to reflect, and be more likely to do the right type of reflection, relative to the unaligned baseline. In other words it’s not sufficient to argue that humans would reflect a little bit; that wouldn’t really persuade me at all.
I think there is very likely at some point going to be some sort of transition to a world where AIs are effectively in control. It seems worth it to slow down on the margin to try to shape this transition as best we can, especially slowing it down as we get closer to AGI and ASI. It would be surprising to me if making the transfer of power more voluntary/careful led to worse outcomes (or only led to slightly better outcomes such that the downsides of slowing down a bit made things worse).
Delaying the arrival of AGI by a few years as we get close to it seems good regardless of parameters like the value of involuntary-AI-disempowerment futures. But delaying the arrival by 100s of years seems more likely bad due to the tradeoff with other risks.
It would be surprising to me if making the transfer of power more voluntary/careful led to worse outcomes (or only led to slightly better outcomes such that the downsides of slowing down a bit made things worse).
Two questions here:
Why would accelerating AI make the transition less voluntary? (In my own mind, I’d be inclined to reverse this sentiment a bit: delaying AI by regulation generally involves forcibly stopping people from adopting AI. Force might be justified if it brings about a greater good, but that’s not the argument here.)
I can understand being “careful”. Being careful does seem like a good thing. But “being careful” generally trades off against other values in almost every domain I can think of, and there is such a thing as too much of a good thing. What reason is there to think that pushing for “more caution” is better on the margin compared to acceleration, especially considering society’s default response to AI in the absence of intervention?
So in the multi-agent slowly-replacing case, I’d argue that individual decisions don’t necessarily represent a voluntary decision on behalf of society (I’m imagining something like this scenario). In the misaligned power-seeking case, it seems obvious to me that this is involuntary. I agree that it technically could be a collective voluntary decision to hand over power more quickly, though (and in that case I’d be somewhat less against it).
I think emre’s comment lays out the intuitive case for being careful / taking your time, as does Ryan’s. I think the empirics are a bit messy once you take into account benefits of preventing other risks but I’d guess they come out in favor of delaying by at least a few years.
A presumption in favor of human values over unaligned AI values for some reasons that aren’t based on strict impartial utilitarian arguments. These could include the beliefs that: (1) Humans are more likely to have “interesting” values compared to AIs, and (2) Humans are more likely to be motivated by moral arguments than AIs, and are more likely to reach a deliberative equilibrium of something like “ideal moral values” compared to AIs.
I don’t think this is a crux. Even if you prefer unaligned AI values over likely human values (weighted by power), you’d probably prefer doing research on further improving AI values over speeding things up.
I think misaligned AI values should be expected to be worse than human values, because it’s not clear that misaligned AI systems would care about, e.g., their own welfare.
Inasmuch as we expect misaligned AI systems to be conscious (or whatever we need to care about them) and also to be good at looking after their own interests, I agree that it’s not clear from a total utilitarian perspective that the outcome would be bad.
But the “values” of a misaligned AI system could be pretty arbitrary, so I don’t think we should expect that.
So I think it’s likely you have some very different beliefs from most people/EAs/myself, particularly:
Thinking that humans/humanity is bad, and AI is likely to be better
Thinking that humanity isn’t driven by ideational/moral concerns[1]
Thinking that AI is very likely to be conscious and moral (as in, making better moral judgements than humans), and that the current/default trend in the industry is very likely to make AIs conscious moral agents in a way humans aren’t
I don’t know if the total utilitarian/accelerationist position in the OP is yours or not. I think Daniel is right that most EAs don’t have this position. I think maybe Peter Singer gets closest to this in his interview with Tyler, on the “would you side with the Aliens or not” question, here. But the answer to your descriptive question is simply that most EAs don’t have the combination of moral and empirical views about the world to make the argument you present valid and sound, so that’s why there isn’t much talk in EA about naïve accelerationism.
Going off the vibe I get from this view though, I think it’s a good heuristic that if your moral view sounds like a movie villain’s monologue it might be worth reflecting, and a lot of this post reminded me of the Earth-Trisolaris Organisation from Cixin Liu’s Three Body Problem. If someone’s honest moral view is “Eliminate human tyranny! The world belongs to Trisolaris AIs!” then I don’t know what else there is to do except quote Zvi’s phrase “please speak directly into this microphone”.
Another big issue I have with this post is that some of the counter-arguments just seem a bit like ‘nu-uh’, see:
But why would we assume AIs won’t be conscious?
Why would humans be more likely to have “interesting” values than AIs?
But it would also be bad if we all died from old age while waiting for AI, and missed out on all the benefits that AI offers to humans, which is a point in favor of acceleration. Why would this heuristic be weaker?
These (and other examples) are considerations for sure, but they need to be argued for. I don’t think they can just be stated and then say “therefore, ACCELERATE!”. I agree that AI Safety research needs to be more robust and the philosophical assumptions and views made more explicit, but one could already think of some counters to the questions that you raise, and I’m sure you already have them. For example, you might take the view (à la Peter Godfrey-Smith) that a certain biological substrate is necessary for consciousness.
Similarly, on total utilitarianism’s emphasis on larger population sizes: agreed, to the extent that the greater population increases total utility, but this is the repugnant conclusion again. Even in that scenario there is a stopping point beyond which an ever-larger population decreases total utility, which is why in Parfit’s scenario it’s full of potatoes and muzak rather than humans crammed into battery cages like factory-farmed animals. Empirically, naïve accelerationism may tend toward the latter case in practice, even if there’s a theoretical case to be made for it.
There’s more I could say, but I don’t want to make this reply too long, and I think as Nathan said it’s a point worth discussing. Nevertheless it seems our different positions on this are built on some wide, fundamental divisions about reality and morality itself, and I’m not sure how those can be bridged, unless I’ve wildly misunderstood your position.
I don’t think humanity is bad. I just think people are selfish, and generally driven by motives that look very different from impartial total utilitarianism. AIs (even potentially “random” ones) seem about as good in expectation, from an impartial standpoint. In my opinion, this view becomes even stronger if you recognize that AIs will be selected on the basis of how helpful, kind, and useful they are to users. (Perhaps notice how different this selection criteria is from the evolutionary criteria used to evolve humans.)
I understand that most people are partial to humanity, which is why they generally find my view repugnant. But my response to this perspective is to point out that if we’re going to be partial to a group on the basis of something other than utilitarian equal consideration of interests, it makes little sense to choose to be partial to the human species as opposed to the current generation of humans or even myself. And if we take this route, accelerationism seems even more strongly supported than before, since developing AI and accelerating technological progress seems to be the best chance we have of preserving the current generation against aging and death. If we all died, and a new generation of humans replaced us, that would certainly be pretty bad for us.
Which sounds more like a movie villain’s monologue?
The idea that everyone currently living needs to sacrificed, and die, in order to preserve the human species
The idea that we should try to preserve currently living people, even if that means taking on a greater risk of not preserving the values of the human species
To be clear, I also just totally disagree with the heuristic that “if your moral view sounds like a movie villain’s monologue it might be worth reflecting”. I don’t think that fiction is generally a great place for learning moral philosophy, albeit with some notable exceptions.
Anyway, the answer to these moral questions may seem obvious to you, but I don’t think they’re as obvious as you’re making them seem.
I think the fact that people are partial to humanity explains a large fraction of the disagreement people have with me. But, fair enough, I exaggerated a bit. My true belief is a more moderate version of that claim.
When discussing why EAs in particular disagree with me, to overgeneralize by a fair bit, I’ve noticed that EAs are happy to concede that AIs could be moral patients, but are generally reluctant to admit AIs as moral agents, in the way they’d be happy to accept humans as independent moral agents (e.g. newborns) into our society. I’d call this “being partial to humanity”, or at least, “being partial to the values of the human species”.
(In my opinion, this partiality seems so prevalent and deep in most people that to deny it seems a bit like a fish denying the existence of water. But I digress.)
To test this hypothesis, I recently asked three questions on Twitter about whether people would be willing to accept immigration through a portal to another universe from three sources:
“a society of humans who are very similar to us”
“a society of people who look & act like humans, but each of them only cares about their family”
“a society of people who look & act like humans, but they only care about maximizing paperclips”
I emphasized that in each case, the people are human-level in their intelligence, and also biological.
The results are preliminary (and I’m not linking here to avoid biasing the results, as voting has not yet finished), but so far my followers, who are mostly EAs, are much happier to let the humans immigrate to our world, compared to the last two options. I claim there just aren’t really any defensible reasons to maintain this choice other than by implicitly appealing to a partiality towards humanity.
My guess is that if people are asked to defend their choice explicitly, they’d largely talk about some inherent altruism or hope they place in the human species, relative to the other options; and this still looks like “being partial to humanity”, as far as I can tell, from almost any reasonable perspective.
I think the fact that people are partial to humanity explains a large fraction of the disagreement people have with me.
Maybe, it’s hard for me to know. But I predict most of the pushback you’re getting from relatively thoughtful longtermists isn’t due to this.
I’ve noticed that EAs are happy to concede that AIs could be moral patients, but are generally reluctant to admit AIs as moral agents, in the way they’d be happy to accept humans as independent moral agents (e.g. newborns) into our society.
I agree with this.
I’d call this “being partial to humanity”, or at least, “being partial to the values of the human species”.
I think “being partial to humanity” is a bad description of what’s going on because (e.g.) these same people would be considerably more on board with aliens. I think the main thing going on is that people have some (probably mistaken) levels of pessimism about how AIs would act as moral agents which they don’t have about (e.g.) aliens.
To test this hypothesis, I recently asked three questions on Twitter about whether people would be willing to accept immigration through a portal to another universe from three sources:
“a society of humans who are very similar to us”
“a society of people who look & act like humans, but each of them only cares about their family”
“a society of people who look & act like humans, but they only care about maximizing paperclips”
...
I claim there just aren’t really any defensible reasons to maintain this choice other than by implicitly appealing to a partiality towards humanity.
This comparison seems to me to be missing the point. Minimally I think what’s going on is not well described as “being partial to humanity”.
Here’s a comparison I prefer:
A society of humans who are very similar to us.
A society of humans who are very similar to us in basically every way, except that they have a genetically-caused and strong terminal preference for maximizing the total expected number of paper clips (over the entire arc of history) and only care about other things instrumentally. They are sufficiently committed to paper clip maximization that this will persist on arbitrary reflection (e.g. they’d lock in this view immediately when given this option), and let’s also suppose that this view is transmitted genetically and in a gene-drive-y way such that all of their descendants will also only care about paper clips. (You can change paper clips to basically anything else which is broadly recognized to have no moral value on its own, e.g. gold twisted into circles.)
A society of beings (e.g. aliens) who are extremely different from humans in basically every way, except that they also have something pretty similar to the concepts of “morality”, “pain”, “pleasure”, “moral patienthood”, “happiness”, “preferences”, “altruism”, and “careful reasoning about morality (moral thoughtfulness)”. And the society overall also has a roughly similar relationship with these concepts (e.g. the level of “altruism” is similar). (Note that having the same relationship as humans to these concepts is a pretty low bar! Humans aren’t that morally thoughtful!)
I think I’m almost equally happy with (1) and (3) on this list and quite unhappy with (2).
If you changed (3) to instead be “considerably more altruistic”, I would prefer (3) over (1).
I think it seems weird to call my views on the comparison I just outlined as “being partial to humanity”: I actually prefer (3) over (2) even though (2) are literally humans!
(Also, I’m not that committed to having concepts of “pain” and “pleasure”, but I’m relatively committed to having concepts which are something like “moral patienthood”, “preferences”, and “altruism”.)
Below is a mild spoiler for a story by Eliezer Yudkowsky:
To make the above comparison about different beings more concrete, in the case of Three Worlds Collide, I would basically be fine giving the universe over to the super-happies rather than humans (I think mildly better than humans?), and I think it seems only mildly worse than humans to hand it over to the baby-eaters. In both cases, I’m pricing in some amount of reflection and uplifting which doesn’t happen in the actual story of Three Worlds Collide, but would likely happen in practice. That is, I’m imagining seeing these societies prior to their singularity and then, based on just observations of their societies at this point, deciding how good they are (pricing in the fact that the society might change over time).
To be clear, it seems totally reasonable to call this “being partial to some notion of moral thoughtfulness about pain, pleasure, and preferences”, but these concepts don’t seem that “human” to me. (I predict these occur pretty frequently in evolved life that reaches a singularity for instance. And they might occur in AIs, but I expect misaligned AIs which seize control of the world are worse from my perspective than if humans retain control.)
When I say that people are partial to humanity, I’m including an irrational bias towards thinking that humans, or evolved beings, are unusually thoughtful or ethical compared to the alternatives (I believe this is in fact an irrational bias, since the arguments I’ve seen for thinking that unaligned AIs will be less thoughtful or ethical than aliens seem very weak to me).
In other cases, when people irrationally hold a certain group X to a higher standard than a group Y, it is routinely described as “being partial to group Y over group X”. I think this is just what “being partial” means, in an ordinary sense, across a wide range of cases.
For example, if I proposed aligning AI to my local friend group, with the explicit justification that I thought my friends are unusually thoughtful, I think this would be well-described as me being “partial” to my friend group.
To the extent you’re seeing me as saying something else about how longtermists view the argument, I suspect you’re reading me as saying something stronger than what I originally intended.
In that case, my main disagreement is thinking that your twitter poll is evidence for your claims.
More specifically:
I claim there just aren’t really any defensible reasons to maintain this choice other than by implicitly appealing to a partiality towards humanity.
Like you claim there aren’t any defensible reasons to think that what humans will do is better than literally maximizing paper clips? This seems totally wild to me.
Like you claim there aren’t any defensible reasons to think that what humans will do is better than literally maximizing paper clips?
I’m not exactly sure what you mean by this. There were three options, and human paperclippers were only one of these options. I was mainly discussing the choice between (1) and (2) in the comment, not between (1) and (3).
Here’s my best guess at what you’re saying: it sounds like you’re repeating that you expect humans to be unusually altruistic or thoughtful compared to an unaligned alternative. But the point of my previous comment was to state my view that this bias counted as “being partial towards humanity”, since I view the bias as irrational. In light of that, what part of my comment are you objecting to?
To be clear, you can think the bias I’m talking about is actually rational; that’s fine. But I just disagree with you for pretty mundane reasons.
[Incorporating what you said in the other comment]
Also, to be clear, I agree that the question of “how much worse/better is it for AIs to get vast amounts of resources without human society intending to grant those resources to the AIs from a longtermist perspective” is underinvestigated, but I think there are pretty good reasons to systematically expect human control to be a decent amount better.
Then I think it’s worth concretely explaining what these reasons are to believe that human control will be a decent amount better in expectation. You don’t need to write this up yourself, of course. I think the EA community should write these reasons up, because I currently view the proposition as non-obvious: despite being a critical belief in AI risk discussions, it’s usually asserted without argument. When I’ve pressed people in the past, they typically give very weak reasons.
I don’t know how to respond to an argument whose details are omitted.
Then I think it’s worth concretely explaining what these reasons are to believe that human control will be a decent amount better in expectation. You don’t need to write this up yourself, of course.
+1, but I don’t generally think it’s worth counting on “the EA community” to do something like this. I’ve been vaguely trying to pitch Joe on doing something like this (though there are probably better uses of his time), and his recent blog posts are touching on similar topics.
Here’s my best guess at what you’re saying: it sounds like you’re repeating that you expect humans to be unusually altruistic or thoughtful compared to an unaligned alternative.
There, I’m just saying that human control is better than literal paperclip maximization.
This response still seems underspecified to me. Is the default unaligned alternative paperclip maximization in your view? I understand that Eliezer Yudkowsky has given arguments for this position, but it seems like you diverge significantly from Eliezer’s general worldview, so I’d still prefer to hear this take spelled out in more detail from your own point of view.
“a society of people who look & act like humans, but they only care about maximizing paperclips”
And then you say:
so far my followers, who are mostly EAs, are much happier to let the humans immigrate to our world, compared to the last two options. I claim there just aren’t really any defensible reasons to maintain this choice other than by implicitly appealing to a partiality towards humanity.
So, I think more human control is better than more literal paperclip maximization, the option given in your poll.
My overall position isn’t that the AIs will certainly be paperclippers, I’m just arguing in isolation about why I think the choice given in the poll is defensible.
I have the feeling we’re talking past each other a bit. I suspect talking about this poll was kind of a distraction. I personally have the sense of trying to convey a central point, and instead of getting the point across, I feel the conversation keeps slipping into talking about how to interpret minor things I said, which I don’t see as very relevant.
I will probably take a break from replying for now, for these reasons, although I’d be happy to catch up some time and maybe have a call to discuss these questions in more depth. I definitely see you as trying a lot harder than most other EAs in trying to make progress on these questions collaboratively with me.
I’d be very happy to have some discussion on these topics with you Matthew. For what it’s worth, I really have found much of your work insightful, thought-provoking, and valuable. I think I just have some strong, core disagreements on multiple empirical/epistemological/moral levels with your latest series of posts.
That doesn’t mean I don’t want you to share your views, or that they’re not worth discussion, and I apologise if I came off as too hostile. An open invitation to have some kind of deeper discussion stands.[1]
Also, to be clear, I agree that the question of “how much worse/better is it for AIs to get vast amounts of resources without human society intending to grant those resources to the AIs from a longtermist perspective” is underinvestigated, but I think there are pretty good reasons to systematically expect human control to be a decent amount better.
Under preference utilitarianism, it doesn’t necessarily matter whether AIs are conscious.
I’m guessing preference utilitarians would typically say that only the preferences of conscious entities matter. I doubt any of them would care about satisfying an electron’s “preference” to be near protons rather than ionized.
Strongly agree that there should be more explicit defences of this argument.
One way of doing this in a co-operative way might be working on co-operative AI stuff, since it seems to increase the likelihood that misaligned AI goes well, or at least less badly.
My personal reason for not digging into this is my naive model of how good the AI future will be: quality_of_future * amount_of_the_stuff.
And there is a distinction I haven’t seen you acknowledge: while high “quality” doesn’t require humans to be around, I ultimately judge quality by my values. (Things being conscious is an example. But this also includes things like not copy-pasting the same thing all over, not wiping out aliens, and presumably many other things I am not aware of. IIRC Yudkowsky talks about cosmopolitanism being a human value.)
Because of this, my impression is that if we hand over the future to a random AI, the “quality” will be very low. And so we can currently have a much larger impact by focusing on increasing the quality. Which we can do by delaying “handing over the future to AI” and picking a good AI to hand over to. IE, alignment.
(Still, I agree it would be nice if there was a better analysis of this, which exposed the assumptions.)
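To make the assumptions a bit more explicit, here is a minimal way to write this model down (the symbols $V$, $Q$ and $S$ are just my own shorthand for the sentence above, nothing more rigorous):

$V = Q \times S$

where $Q$ is the quality of the future per unit of resources (judged by my values) and $S$ is the amount of resources. On this model, if handing the future to a random AI pushes $Q$ close to zero, then interventions that raise $Q$ (delaying the handover and picking a good AI to hand over to, i.e. alignment) dominate interventions that merely raise $S$.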
And there is a distinction I haven’t seen you acknowledge: while high “quality” doesn’t require humans to be around, I ultimately judge quality by my values.
Is there any particular reason why you are partial towards humans generically controlling the future, relative to this particular current generation of humans? To me, it seems like being partial to one’s own values, one’s community, and especially one’s own life, generally leads to an even stronger argument for accelerationism, since the best way to advance your own values is generally to actually “be there” when AI happens.
In my opinion, the main relevant alternative to this view is to be partial to the human species, as opposed to being partial to either one’s current generation, or oneself. And I think the human species is kind of a weird category to be partial to, relative to those other things. Do you disagree?
In my opinion, the main relevant alternative to this view is to be partial to the human species, as opposed to being partial to either one’s current generation, or oneself. And I think the human species is kind of a weird category to be partial to, relative to those other things. Do you disagree?
I agree with this.
the best way to advance your own values is generally to actually “be there” when AI happens.
I (strongly) disagree with this. Me being alive is a relatively small part of my values. And since I am not the director of the world, me personally being around to influence things is unlikely to have a decisive impact on things I value.
In more detail: Sure, all else being equal, me being there when AI happens is mildly helpful. But the outcome of building AI seems to be a function of, among other things, (i) values of the people building it + (ii) how much reflection they can do on those values + (iii) the environment dynamics these people are subject to (e.g., the current race dynamics between AI companies). And over time, I expect the potential decrease in (i) to be far outweighed by gains in (ii) and (iii).
The first issue is with (i): it is not actually me building the AGI, either now or in the future. But I am willing to grant that (all else being equal) the current generation is more likely to have values closer to mine.
However, I expect that factors (ii) and (iii) are just as influential. Regarding (ii), it seems we keep making progress at philosophy, ethics, etc., and to me, this currently far outweighs the value drift in (i).
Regarding (iii), my impression is that the current situation is so bad that it can’t get much worse, and we might as well wait. This of course depends on how likely you think we are to get a bad outcome if we either (a) get superintelligence without additional progress on alignment or (b) get widespread human-level AI with no progress on alignment, institution design, etc.
Me being alive is a relatively small part of my values.
I agree some people (such as yourself) might be extremely altruistic, and therefore might not care much about their own life relative to other values they hold, but this position is fairly uncommon. Most people care a lot about their own lives (and especially the lives of their family and friends) relative to other things they care about. We can empirically test this hypothesis by looking at how people choose to spend their time and money; and the results are generally that people spend their money on themselves, their family and their friends.
since I am not the director of the world, me personally being around to influence things is unlikely to have a decisive impact on things I value.
You don’t need to be the director of the world to have influence over things. You can be just a small part of the world and still influence the things you care about. This is essentially what you’re already doing by living and using your income to make decisions that satisfy your own preferences. I’m claiming this situation could and probably will persist into the indefinite future, for the agents that exist in the future.
I’m very skeptical that there will ever be a moment in time during which there will be a “director of the world”, in a strong sense. And I doubt the developer of the first AGI will become the director of the world, even remotely (including versions of them that reflect on moral philosophy etc.). You might want to read my post about this.
One potential argument against accelerating AI is that it will increase the chance of catastrophes which will then lead to overregulating AI (e.g. in the same way that nuclear power arguably was overregulated).
(Clarification about my views in the context of the AI pause debate)
I’m finding it hard to communicate my views on AI risk. I feel like some people are responding to the general vibe they think I’m giving off rather than the actual content. Other times, it seems like people will focus on a narrow snippet of my comments/post and respond to it without recognizing the context. For example, one person interpreted me as saying that I’m against literally any AI safety regulation. I’m not.
For full disclosure, my views on AI risk can be loosely summarized as follows:
I think AI will probably be very beneficial for humanity.
Nonetheless, I think that there are credible, foreseeable risks from AI that could do vast harm, and we should invest heavily to ensure these outcomes don’t happen.
I also don’t think technology is uniformly harmless. Plenty of technologies have caused net harm. Factory farming is a giant net harm that might have even made our entire industrial civilization a mistake!
I’m not blindly against regulation. I think all laws can and should be viewed as forms of regulations, and I don’t think it’s feasible for society to exist without laws.
That said, I’m also not blindly in favor of regulation, even for AI risk. You have to show me that the benefits outweigh the harms.
I am generally in favor of thoughtful, targeted AI regulations that align incentives well, and reduce downside risks without completely stifling innovation.
I’m open to extreme regulations and policies if or when an AI catastrophe seems imminent, but I don’t think we’re in such a world right now. I’m not persuaded by the arguments that people have given for this thesis, such as Eliezer Yudkowsky’s AGI ruin post.
I might elaborate on this at some point, but I thought I’d write down some general reasons why I’m more optimistic than many EAs on the risk of human extinction from AI. I’m not defending these reasons here; I’m mostly just stating them.
Skepticism of foom: I think it’s unlikely that a single AI will take over the whole world and impose its will on everyone else. I think it’s more likely that millions of AIs will be competing for control over the world, in a similar way that millions of humans are currently competing for control over the world. Power or wealth might be very unequally distributed in the future, but I find it unlikely that it will be distributed so unequally that there will be only one relevant entity with power. In a non-foomy world, AIs will be constrained by norms and laws. Absent severe misalignment among almost all the AIs, I think these norms and laws will likely include a general prohibition on murdering humans, and there won’t be a particularly strong motive for AIs to murder every human either.
Skepticism that value alignment is super-hard: I haven’t seen any strong arguments that value alignment is very hard, in contrast to the straightforward empirical evidence that e.g. GPT-4 seems to be honest, kind, and helpful after relatively little effort. Most conceptual arguments I’ve seen for why we should expect value alignment to be super-hard rely on strong theoretical assumptions that I am highly skeptical of. I have yet to see significant empirical successes from these arguments. I feel like many of these conceptual arguments would, in theory, apply to humans, and yet human children are generally value aligned by the time they reach young adulthood (at least, value aligned enough to avoid killing all the old people). Unlike humans, AIs will be explicitly trained to be benevolent, and we will have essentially full control over their training process. This provides much reason for optimism.
Belief in a strong endogenous response to AI: I think most people will generally be quite fearful of AI and will demand that we are very cautious while deploying the systems widely. I don’t see a strong reason to expect companies to remain unregulated and rush to cut corners on safety, absent something like a world war that presses people to develop AI as quickly as possible at all costs.
Not being a perfectionist: I don’t think we need our AIs to be perfectly aligned with human values, or perfectly honest, similar to how we don’t need humans to be perfectly aligned and honest. Individual humans are usually quite selfish, frequently lie to each other, and are often cruel, and yet the world mostly gets along despite this. This is true even when there are vast differences in power and wealth between humans. For example some groups in the world have almost no power relative to the United States, and residents in the US don’t particularly care about them either, and yet they survive anyway.
Skepticism of the analogy to other species: it’s generally agreed that humans dominate the world at the expense of other species. But that’s not surprising, since humans evolved independently of other animal species. And we can’t really communicate with other animal species, since they lack language. I don’t think AI is analogous to this situation. AIs will mostly be born into our society, rather than being created outside of it. (Moreover, even in this very pessimistic analogy, humans still spend >0.01% of our GDP on preserving wild animal species, and the vast majority of animal species have not gone extinct despite our giant influence on the natural world.)
ETA: feel free to ignore the below, given your caveat, though if you choose to write an expanded form of any of these arguments later, you may find it helpful to have some early objections.
Correct me if I’m wrong, but it seems like most of these reasons boil down to not expecting AI to be superhuman in any relevant sense (since if it is, effectively all of them break down as reasons for optimism)? To wit:
Resource allocation is relatively equal (and relatively free of violence) among humans because even humans that don’t very much value the well-being of others don’t have the power to actually expropriate everyone else’s resources by force. (We have evidence of what happens when those conditions break down to any meaningful degree; it isn’t super pretty.)
I do not think GPT-4 is meaningful evidence about the difficulty of value alignment. In particular, the claim that “GPT-4 seems to be honest, kind, and helpful after relatively little effort” seems to be treating GPT-4′s behavior as meaningfully reflecting its internal preferences or motivations, which I think is “not even wrong”. I think it’s extremely unlikely that GPT-4 has preferences over world states in a way that most humans would consider meaningful, and in the very unlikely event that it does, those preferences almost certainly aren’t centrally pointed at being honest, kind, and helpful.
re: endogenous response to AI—I don’t see how this is relevant once you have ASI. To the extent that it might be relevant, it’s basically conceding the argument: that the reason we’ll be safe is that we’ll manage to avoid killing ourselves by moving too quickly. (Note that we are currently moving at pretty close to max speed, so this is a prediction that the future will be different from the past. One that some people are actively optimising for, but also one that other people are optimizing against.)
re: perfectionism—I would not be surprised if many current humans, given superhuman intelligence and power, created a pretty terrible future. Current power differentials do not meaningfully let individual players flip every single other player the bird at the same time. Assuming that this will continue to be true is again assuming the conclusion (that AI will not be superhuman in any relevant sense). I also feel like there’s an implicit argument here about how value isn’t fragile that I disagree with, but I might be reading into it.
I’m not totally sure what analogy you’re trying to rebut, but I think that human treatment of animal species, as a piece of evidence for how we might be treated by future AI systems that are analogously more powerful than we are, is extremely negative, not positive. Human efforts to preserve animal species are a drop in the bucket compared to the casual disregard with which we optimize over them and their environments for our benefit. I’m sure animals sometimes attempt to defend their territory against human encroachment. Has the human response to this been to shrug and back off? Of course, there are some humans who do care about animals having fulfilled lives by their own values. But even most of those humans do not spend their lives tirelessly optimizing for their best understanding of the values of animals.
Correct me if I’m wrong, but it seems like most of these reasons boil down to not expecting AI to be superhuman in any relevant sense
No, I certainly expect AIs will eventually be superhuman in virtually all relevant respects.
Resource allocation is relatively equal (and relatively free of violence) among humans because even humans that don’t very much value the well-being of others don’t have the power to actually expropriate everyone else’s resources by force.
Can you clarify what you are saying here? If I understand you correctly, you’re saying that humans have relatively little wealth inequality because there’s relatively little inequality in power between humans. What does that imply about AI?
I think there will probably be big inequalities in power among AIs, but I am skeptical of the view that there will be only one (or even a few) AIs that dominate over everything else.
I do not think GPT-4 is meaningful evidence about the difficulty of value alignment.
I’m curious: does that mean you also think that alignment research performed on GPT-4 is essentially worthless? If not, why?
I think it’s extremely unlikely that GPT-4 has preferences over world states in a way that most humans would consider meaningful, and in the very unlikely event that it does, those preferences almost certainly aren’t centrally pointed at being honest, kind, and helpful.
I agree that GPT-4 probably doesn’t have preferences in the same way humans do, but it sure appears to be a limited form of general intelligence, and I think future AGI systems will likely share many underlying features with GPT-4, including, to some extent, cognitive representations inside the system.
I think our best guess of future AI systems should be that they’ll be similar to current systems, but scaled up dramatically, trained on more modalities, with some tweaks and post-training enhancements, at least if AGI arrives soon. Are you simply skeptical of short timelines?
re: endogenous response to AI—I don’t see how this is relevant once you have ASI.
To be clear, I expect we’ll get AI regulations before we get to ASI. I predict that regulations will increase in intensity as AI systems get more capable and start having a greater impact on the world.
Note that we are currently moving at pretty close to max speed, so this is a prediction that the future will be different from the past.
Every industry in history initially experienced little to no regulation. However, after people became more acquainted with the industry, regulations on the industry increased. I expect AI will follow a similar trajectory. I think this is in line with historical evidence, rather than contradicting it.
re: perfectionism—I would not be surprised if many current humans, given superhuman intelligence and power, created a pretty terrible future. Current power differentials do not meaningfully let individual players flip every single other player the bird at the same time.
I agree. If you turned a random human into a god, or a random small group of humans into gods, then I would be pretty worried. However, in my scenario, there aren’t going to be single AIs that suddenly become gods. Instead, in my scenario, there will be millions of different AIs, and the AIs will smoothly increase in power over time. During this time, we will be able to experiment and do alignment research to see what works and what doesn’t at making the AIs safe. I expect AI takeoff will be fairly diffuse, and AIs will probably be respectful of norms and laws because no single AI can take over the world by itself. Of course, the way I think about the future could be wrong on a lot of specific details, but I don’t see a strong reason to doubt the basic picture I’m presenting, as of now.
My guess is that your main objection here is that you think foom will happen, i.e. there will be a single AI that takes over the world and imposes its will on everyone else. Can you elaborate more on why you think that will happen? I don’t think it’s a straightforward consequence of AIs being smarter than humans.
I’m not totally sure what analogy you’re trying to rebut, but I think that human treatment of animal species, as a piece of evidence for how we might be treated by future AI systems that are analogously more powerful than we are, is extremely negative, not positive.
My main argument is that we should reject the analogy itself. I’m not really arguing that the analogy provides evidence for optimism, except in a very weak sense. I’m just saying: AIs will be born into and shaped by our culture; that’s quite different than what happened between animals and humans.
Individual humans are usually quite selfish, frequently lie to each other, and are often cruel, and yet the world mostly gets along despite this. This is true even when there are vast differences in power and wealth between humans. For example some groups in the world have almost no power relative to the United States, and residents in the US don’t particularly care about them either, and yet they survive anyway.
Okay so these are two analogies: individual humans & groups/countries.
First off, “surviving” doesn’t seem like the right thing to evaluate; something more like “significant harm” or “being exploited” seems more relevant.
Can you give some examples where individual humans have a clear strategic decisive advantage (i.e. very low risk of punishment), where the low-power individual isn’t at a high risk of serious harm? Because the examples I can think of are all pretty bad: dictators, slaveholders, husbands in highly patriarchal societies... Sexual violence is extremely prevalent and pretty much always occurs in contexts with a large power difference.
I find the US example unconvincing, because I find it hard to imagine the US benefiting more from aggressive use of force than from trade and soft economic exploitation. The US doesn’t have the power to successfully occupy countries anymore. When there were bigger power differences due to technology, we had the age of colonialism.
Can you give some examples where individual humans have a clear strategic decisive advantage (i.e. very low risk of punishment), where the low-power individual isn’t at a high risk of serious harm?
Why are we assuming a low risk of punishment? Risk of punishment depends largely on social norms and laws, and I’m saying that AIs will likely adhere to a set of social norms.
I think the central question is whether these social norms will include the norm “don’t murder humans”. I think such a norm will probably exist, unless almost all AIs are severely misaligned. I think severe misalignment is possible; one can certainly imagine it happening. But I don’t find it likely, since people will care a lot about making AIs ethical, and I’m not yet aware of any strong reasons to think alignment will be super-hard.
It seems to me that a big crux about the value of AI alignment work is what target you think AIs will ultimately be aligned to in the future in the optimistic scenario where we solve all the “core” AI risk problems to the extent they can be feasibly solved, e.g. technical AI safety problems, coordination problems, the problem of having “good” AI developers in charge etc.
There are a few targets that I’ve seen people predict AIs will be aligned to if we solve these problems: (1) “human values”, (2) benevolent moral values, (3) the values of AI developers, (4) the CEV of humanity, (5) the government’s values. My guess is that a significant source of disagreement that I have with EAs about AI risk is that I think none of these answers are actually very plausible. I’ve written a few posts explaining my views on this question already (1, 2), but I think I probably didn’t make some of my points clear enough in these posts. So let me try again.
In my view, in the most likely case, it seems that if the “core” AI risk problems are solved, AIs will be aligned to the primarily selfish individual revealed preferences of existing humans at the time of alignment. This essentially refers to the implicit value system that would emerge if, when advanced AI is eventually created, you gave the then-currently existing set of humans a lot of wealth. Call these values PSIRPEHTA (I’m working on a better acronym).
(Read my post if you want to understand my reasons for thinking that AIs will likely be aligned to PSIRPEHTA if we solve AI safety problems.)
I think it is not obvious at all that maximizing PSIRPEHTA is good from a total utilitarian perspective compared to most plausible “unaligned” alternatives. In fact, I think the main reason why you might care about maximizing PSIRPEHTA is if you think we’re close to AI and you personally think that current humans (such as yourself) should be very rich. But if you thought that, I think the arguments about the overwhelming value of reducing existential risk in e.g. Bostrom’s paper Astronomical Waste largely do not apply. Let me try to explain.
PSIRPEHTA is not the same thing as “human values” because, unlike human values, PSIRPEHTA is not consistent over time or shared between members of our species. Indeed, PSIRPEHTA changes during each generation as old people die off, and new young people are born. Most importantly, PSIRPEHTA is not our non-selfish “moral” values, except to the extent that people are regularly moved by moral arguments in the real world to change their economic consumption habits, which I claim is not actually very common (or, to the extent that it is common, I don’t think these moral values usually look much like the ideal moral values that most EAs express).
PSIRPEHTA refers to the aggregate ordinary revealed preferences of individual actors, whom the AIs will be aligned to in order to make those humans richer, i.e. their preferences as revealed by their actions, such as what they spend their income on, NOT what they think is “morally correct”. For example, according to “human values” it might be wrong to eat meat, because maybe if humans reflected long enough they’d express the conclusion that it’s wrong to hurt animals. But from the perspective of PSIRPEHTA, eating meat is generally acceptable, and empirically there’s little pressure for people to “reflect” on their values and change them.
From this perspective, the view in which it makes most sense to push for AI alignment work seems to be an obscure form of person-affecting utilitarianism in which you care mainly about the revealed preferences of humans at the time when AI is created (not the human species, but rather, the generation of humans that happens to be living when advanced AIs are created). This perspective is plausible if you really care about making currently existing humans better off materially and you think we are close to advanced AI. But I think this type of moral view is generally quite far apart from total utilitarianism, or really any other form of utilitarianism that EAs have traditionally adopted.
In a plausible “unaligned” alternative, the values of AIs would diverge from PSIRPEHTA, but this mainly has the effect of making particular collections of individual humans less rich, and making other agents in the world — particularly unaligned AI agents — more rich. That could be bad if you think that these AI agents are less morally worthy than existing humans at the time of alignment (e.g. for some reason you think AI agents won’t be conscious), but I think it’s critically important to evaluate this question carefully by measuring the “unaligned” outcome against the alternative. Most arguments I’ve seen about this topic have emphasized how bad it would be if unaligned AIs have influence in the future. But I’ve rarely seen the flipside of this argument explicitly defended: why PSIRPEHTA would be any better.
In my view, PSIRPEHTA seems like a mediocre value system, and one that I do not particularly care to maximize relative to a variety of alternatives. I definitely like PSIRPEHTA to the extent that I, my friends, family, and community are members of the set of “existing humans at the time of alignment”, but I don’t see any particularly strong utilitarian arguments for caring about PSIRPEHTA.
In other words, instead of arguing that unaligned AIs would be bad, I’d prefer to hear more arguments about why PSIRPEHTA would be better, since PSIRPEHTA just seems to me like the value system that will actually be favored if we feasibly solve all the technical and coordination AI problems that EAs normally talk about regarding AI risk.
PSIRPEHTA refers to the aggregate ordinary revealed preferences of individual actors, whom the AIs will be aligned to in order to make those humans richer, i.e. their preferences as revealed by their actions, such as what they spend their income on, NOT what they think is “morally correct”. For example, according to “human values” it might be wrong to eat meat, because maybe if humans reflected long enough they’d express the conclusion that it’s wrong to hurt animals. But from the perspective of PSIRPEHTA, eating meat is generally acceptable, and empirically there’s little pressure for people to “reflect” on their values and change them.
EDIT: I guess I’d think of human values as what people would actually just sincerely and directly endorse without further influencing them first (although maybe just asking them makes them take a position if they didn’t have one before, e.g. if they’ve never thought much about the ethics of eating meat).
I think you’re overstating the differences between revealed and endorsed preferences, including moral/human values, here. Probably only a small share of the population thinks eating meat is wrong or bad, and most probably think it’s okay. Even if people generally would find it wrong or bad after reflecting long enough (I’m not sure they actually would), that doesn’t reflect their actual values now. Actual human values do not generally find eating meat wrong.
To be clear, you can still complain that humans’ actual/endorsed values are also far from ideal and maybe not worth aligning with, e.g. because people don’t care enough about nonhuman animals or helping others. Do people care more about animals and helping others than an unaligned AI would, in expectation, though? Honestly, I’m not entirely sure. Humans may care about animal welfare somewhat, but they also specifically want to exploit animals in large part because of their values, specifically food-related taste, culture, traditions and habit. Maybe people will also want to specifically exploit artificial moral patients for their own entertainment, curiosity or scientific research on them, not just because the artificial moral patients are generically useful, e.g. for acquiring resources and power and enacting preferences (which an unaligned AI could be prone to).
I illustrate some other examples here on the influence of human moral values on companies. This is all of course revealed preferences, but my point is that revealed preferences can importantly reflect endorsed moral values.
People influence companies in part on the basis of what they think is right through demand, boycotts, law, regulation and other political pressure.
Companies, for the most part, can’t just go around directly murdering people (companies can still harm people, e.g. through misinformation on the health risks of their products, or because people don’t care enough about the harms). (Maybe this is largely for selfish reasons; people don’t want to be killed themselves, and there’s a slippery slope if you allow exceptions.)
GPT has content policies that reflect people’s political/moral views. Social media companies have use and content policies and have kicked off various users for harassment, racism, or other things that are politically unpopular, at least among a large share of users or advertisers (which also reflect consumers). This seems pretty standard.
Many companies have boycotted Russia since the invasion of Ukraine. Many companies have also committed to sourcing only cage-free eggs after corporate outreach and campaigns, despite cage-free egg consumption being low.
X (Twitter)’s policies on hate speech have changed under Musk, presumably primarily because of his views. That seems to have cost X users and advertisers, but X is still around and popular, so it also shows that some potentially important decisions about how a technology is used are largely in the hands of the company and its leadership, not just driven by profit.
I’d likewise guess it actually makes a difference that the biggest AI labs are (I would assume) led and staffed primarily by liberals. They can push their own views onto their AI even at the cost of some profit and market share. And some things may have minimal near term consequences for demand or profit, but could be important for the far future. If the company decides to make their AI object more to various forms of mistreatment of animals or artificial consciousness, will this really cost them tons of profit and market share? And it could depend on the markets it’s primarily used in, e.g. this would matter even less for an AI that brings in profit primarily through trading stocks.
It’s also often hard to say how much something affects a company’s profits.
This essentially refers to the implicit value system that would emerge if, when advanced AI is eventually created, you gave the then-currently existing set of humans a lot of wealth. Call these values PSIRPEHTA (I’m working on a better acronym).
I basically buy that the values we get will be similar to just giving existing humans massive amounts of wealth, but I’m less sold that this will result in outcomes which are well described as “primarily selfish”.
I feel like your comment is equivocating between “the situation is similar to making existing humans massively wealthy” and “of course this will result in primarily selfish usage similar to how the median person behaves with marginal money now”.
I basically buy that the values we get will be similar to just giving existing humans massive amounts of wealth, but I’m less sold that this will result in outcomes which are well described as “primarily selfish”.
Current humans definitely seem primarily selfish (although I think they also care about their family and friends too; I’m including that). Can you explain why you think giving humans a lot of wealth would turn them into something that isn’t primarily selfish? What’s the empirical evidence for that idea?
The behavior of billionaires, which maybe indicates more like 10% of income spent on altruism.
ETA: This is still literally majority selfish, but it’s also plausible that 10% altruism is pretty great and looks pretty different than “current median person behavior with marginal money”.
(See my other comment about the percent of cosmic resources.)
The idea that billionaires have 90% selfish values seems consistent with a claim of having “primarily selfish” values in my opinion. Can you clarify what you’re objecting to here?
The literal words of “primarily selfish” don’t seem that bad, but I would maybe prefer majority selfish?
And your top level comment seems like it’s not talking about/emphasizing the main reason to like human control which is that maybe 10-20% of resources are spent well.
It just seemed odd to me to not mention that “primarily selfish” still involves a pretty big fraction of altruism.
I agree it’s important to talk about and analyze the (relatively small) component of human values that are altruistic. I mostly just think this component is already over-emphasized.
Here’s one guess at what I think you might be missing about my argument: 90% selfish values + 10% altruistic values isn’t the same thing as, e.g., 90% valueless stuff + 10% utopia. The 90% selfish component can have negative effects on welfare from a total utilitarian perspective, that aren’t necessarily outweighed by the 10%.
90% selfish values is the type of thing that produces massive factory farming infrastructure, with a small amount of GDP spent mitigating suffering in factory farms. Does the small amount of spending mitigating suffering outweigh the large amount of spending directly causing suffering? This isn’t clear to me.
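As a toy illustration of the arithmetic here (the numbers are invented purely to show the point, not estimates): write welfare per unit of resources, from a total utilitarian perspective, as

$W = 0.9\,\bar{w}_{\text{selfish}} + 0.1\,\bar{w}_{\text{altruistic}}$

If the selfish component is net negative, say $\bar{w}_{\text{selfish}} = -1$ (e.g. because it funds factory-farming-like infrastructure), and the altruistic component is positive but not vastly larger, say $\bar{w}_{\text{altruistic}} = +5$, then $W = -0.9 + 0.5 = -0.4 < 0$. The 10% does not automatically outweigh the 90%.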
(Alternatively, you could think that unaligned AIs will be 100% selfish, and this is clearly worse. But I’d want to understand how you could come to that conclusion, carefully. “Altruism” also encompasses a broad range of activities, and not all of it is utopian or idealistic from a total utilitarian perspective. For example, human spending on environmental conservation might be categorized as “altruism” in this framework, although personally I would say that form of spending is not very “moral” due to wild animal suffering.)
The 90% selfish component can have negative effects on welfare from a total utilitarian perspective, that aren’t necessarily outweighed by the 10%.
Yep, this can be true, but I’m skeptical this will matter much in practice.
I typically think that things which aren’t directly optimizing for value or disvalue won’t have intended effects which are very important, and that in the future unintended effects (externalities) won’t make up that much of total value/disvalue.
When we see the selfish consumption of current very rich people, it doesn’t seem like the intentional effects are that morally good/bad relative to the best/worst uses of resources. (E.g. owning a large boat and having people think you’re high status aren’t that morally important relative to altruistic spending of similar amounts of money.) So for current very rich people the main issue would be that the economic process for producing the goods has bad externalities.
And, I expect that as technology advances, externalities reduce in moral importance relative to intended effects. Partially this is based on crazy transhumanist takes, but I feel like there is some broader perspective in which you’d expect this.
E.g. for factory farming, the ultimately cheapest way to make meat in the limit of technological maturity would very likely not involve any animal suffering.
Separately, I think externalities will probably look pretty similar for selfish resource usage for unaligned AIs and humans because most serious economic activities will be pretty similar.
Alternatively, you could think that unaligned AIs will be 100% selfish, and this is clearly worse.
I’d like to explicitly note that I don’t think this is true in expectation for a reasonable notion of “selfish”. Though maybe I do think something which is sort of in this direction, if we use a relatively narrow notion of altruism.
How are we defining selfish here? It seems like a pretty strong position to take on the topic of psychological egoism? Especially if caring about family/friends counts as selfish?
In your original post, you say:
All that extra wealth did not make us extreme moral saints; instead, we still mostly care about ourselves, our family, and our friends.
But I don’t know; it seems that as countries and individuals get wealthier, we are on the whole getting better? Maybe factory farming acts against this, but the idea that factory farming is immoral and should be abolished exists and I think is only going to grow. I don’t think humans are just slaves to our base wants/desires, and I think that is a remarkably impoverished view of both individual human psychology and social morality.
As such, I don’t really agree with much of this post. An AGI, when built, will be able to generate new ideas and hypotheses about the world, including moral ones. A strong-but-narrow AI could be worse (e.g. optimal-factory-farm-PT), but then the right response here isn’t really technical alignment, it’s AI governance and moral persuasion in general.
This seems to underrate the arguments for Malthusian competition in the long run.
If we develop the technical capability to align AI systems with any conceivable goal, we’ll start by aligning them with our own preferences. Some people are saints, and they’ll make omnibenevolent AIs. Other people might have more sinister plans for their AIs. The world will remain full of human values, with all the good and bad that entails.
But current human values do not maximize our reproductive fitness. Maybe one human will start a cult devoted to sending self-replicating AI probes to the stars at almost light speed. That person’s values will influence far-reaching corners of the universe that later humans will struggle to reach. Another human might use their AI to persuade others to join together and fight a war of conquest against a smaller, weaker group of enemies. If they win, their prize will be hardware, software, energy, and more power that they can use to continue to spread their values.
Even if most humans are not interested in maximizing the number and power of their descendants, those who are will have the most numerous and most powerful descendants. This selection pressure exists even if the humans involved are ignorant of it; even if they actively try to avoid it.
I think it’s worth splitting the alignment problem into two quite distinct problems:
The technical problem of intent alignment. Solving this does not solve coordination problems. There will still be private information and coordination problems after intent alignment is solved; therefore, fitter strategies will proliferate, and the world will be governed by values that maximize fitness.
“Civilizational alignment”? Much harder problem to solve. The traditional answer is a Leviathan, or Singleton as the cool kids have been saying. It solves coordination problems, allowing society to coherently pursue a long-run objective such as flourishing rather than fitness maximization. Unfortunately, there are coordination problems and competitive pressures within Leviathans. The person who ends up in charge is usually quite ruthless and focused on preserving their power, rather than the stated long-run goal of the organization. And if you solve all the coordination problems, you have another problem in choosing a good long-run objective. Nothing here looks particularly promising to me, and I expect competition to continue.
This seems to underrate the arguments for Malthusian competition in the long run.
I’m mostly talking about what I expect to happen in the short-run in this thread. But I appreciate these arguments (and agree with most of them).
Plausibly my main disagreement with the concerns you raised is that I think coordination is maybe not very hard. Coordination seems to have gotten stronger over time, in the long-run. AI could also potentially make coordination much easier. As Bostrom has pointed out, historical trends point towards the creation of a Singleton.
I’m currently uncertain about whether to be more worried about a future world government becoming stagnant and inflexible. There’s a real risk that our institutions will at some point entrench an anti-innovation doctrine that prevents meaningful changes over very long time horizons out of a fear that any evolution would be too risky. As of right now I’m more worried about this potential failure mode versus the failure mode of unrestrained evolution, but it’s a close competition between the two concerns.
What percent of cosmic resources do you expect to be spent thoughtfully and altruistically? 0%? 10%?
I would guess the thoughtful and altruistic subset of resources dominate in most scenarios where humans retain control.
Then, my main argument for why human control would be good is that the fraction isn’t that small (more like 20% in expectation than 0%) and that unaligned AI takeover seems probably worse than this.
Also, as an aside, I agree that little good public argumentation has been made about the relative value of unaligned AI control vs human control. I’m sympathetic to various discussions from Paul Christiano and Joe Carlsmith, but the public scope and detail are pretty limited thus far.
In some circles that I frequent, I’ve gotten the impression that a decent fraction of existing rhetoric around AI has gotten pretty emotionally charged. And I’m worried about the presence of what I perceive as demagoguery regarding the merits of AI capabilities and AI safety. Out of a desire to avoid calling out specific people or statements, I’ll just discuss a hypothetical example for now.
Suppose an EA says, “I’m against OpenAI’s strategy for straightforward reasons: OpenAI is selfishly gambling everyone’s life in a dark gamble to make themselves immortal.” Would this be a true, non-misleading statement? Would this statement likely convey the speaker’s genuine beliefs about why they think OpenAI’s strategy is bad for the world?
To begin to answer these questions, we can consider the following observations:
AI powerful enough to end the world would presumably also be powerful enough to do lots of incredibly positive things, such as reducing global mortality and curing diseases. By delaying AI, we are therefore equally “gambling everyone’s life” by forcing people to face ordinary mortality.
Selfish motives can be, and frequently are, aligned with the public interest. For example, Jeff Bezos was very likely motivated by selfish desires in his accumulation of wealth, but building Amazon nonetheless benefitted millions of people in the process. Such win-win situations are common in business, especially when developing technologies.
Because of the potential for AI to both pose great risks and great benefits, it seems to me that there are plenty of plausible pro-social arguments one can give for favoring OpenAI’s strategy of pushing forward with AI. Therefore, it seems pretty misleading to me to frame their mission as a dark and selfish gamble, at least on a first impression.
Here’s my point: Depending on the speaker, I frequently think their actual reason for being against OpenAI’s strategy is not because they think OpenAI is undertaking a dark, selfish gamble. Instead, it’s often just standard strong longtermism. A less misleading statement of their view would go something like this:
“I’m against OpenAI’s strategy because I think potential future generations matter more than the current generation of people, and OpenAI is endangering future generations in their gamble to improve the lives of people who currently exist.”
I claim this statement would—at least in many cases—be less misleading than the other statement because it captures a major genuine crux of the disagreement: whether you think potential future generations matter more than currently-existing people.
This statement also omits the “selfish” accusation, which I think is often just a red herring designed to mislead people: we don’t normally accuse someone of being selfish when they do a good thing, even if the accusation is literally true.
(There can, of course, be further cruxes, such as your p(doom), your timelines, your beliefs about the normative value of unaligned AIs, and so on. But at the very least, a longtermist preference for future generations over currently existing people seems like a huge, actual crux that many people have in this debate, when they work through these things carefully together.)
Here’s why I care about discussing this. I admit that I care a substantial amount—not overwhelming, but it’s hardly insignificant—about currently existing people. I want to see people around me live long, healthy and prosperous lives, and I don’t want to see them die. And indeed, I think advancing AI could greatly help currently existing people. As a result, I find it pretty frustrating to see people use what I perceive to be essentially demagogic tactics designed to sway people against AI, rather than plainly stating their cruxes about why they actually favor the policies they do.
These allegedly demagogic tactics include:
Highlighting the risks of AI to argue against development while systematically omitting the potential benefits, thereby obscuring a more comprehensive assessment of your preferred policies.
Highlighting random, extraneous drawbacks of AI development that you wouldn’t ordinarily care much about in other contexts when discussing innovation, such as the potential for job losses from automation. A lot of the time, this type of rhetoric looks to me like “deceptively searching for random arguments designed to persuade, rather than honestly explaining one’s perspective”.
Conflating, or at least strongly associating, the selfish motives of people who work at AI firms with their allegedly harmful effects. This rhetoric plays on public prejudices by appealing to a widespread but false belief that selfish motives are usually suspicious, or can’t translate into pro-social results. In fact, there is no contradiction in the idea that most people at OpenAI are in it for the money, status, and fame, and yet what they’re doing is good for the world, and they genuinely believe that it is.
I’m against these tactics for a variety of reasons, but one of the biggest reasons is that they can, in some cases, indicate a degree of dishonesty, depending on the context. And I’d really prefer EAs to focus on trying to be almost-maximally truth-seeking in both their beliefs and their words.
Speaking more generally—to drive one of my points home a little more—I think there are roughly three possible views you could have about pushing for AI capabilities relative to pushing for pausing or more caution:
Full-steam ahead view: We should accelerate AI at any and all costs. We should oppose any regulations that might impede AI capabilities, and embark on a massive spending spree to accelerate AI capabilities.
Full-safety view: We should try as hard as possible to shut down AI right now, and thwart any attempt to develop AI capabilities further, while simultaneously embarking on a massive spending spree to accelerate AI safety.
Balanced view: We should support a substantial mix of both safety and acceleration efforts, attempting to carefully balance the risks and rewards of AI development to ensure that we can seize the benefits of AI without bearing intolerably high costs.
I tend to think most informed people, when pushed, advocate the third view, albeit with wide disagreement about the right mix of support for safety and acceleration. Yet, on a superficial level—on the level of rhetoric—I find that the first and second view are surprisingly common. On this level, I tend to find e/accs in the first camp, and a large fraction of EAs in the second camp.
But if your actual beliefs are something like the third view, I think that’s an important fact to emphasize in honest discussions about what we should do with AI. If your rhetoric is consistently aligned with (1) or (2) but your actual beliefs are aligned with (3), I think that can often be misleading. And it can be especially misleading if you’re trying to publicly paint other people in the same camp—the third one—as somehow having bad motives merely because they advocate a moderately higher mix of acceleration over safety efforts than you do, or vice versa.
I encourage you not to draw dishonesty inferences from people worried about job losses from AI automation, if only because:
it seems like almost no other technologies stood to automate such a broad range of labour essentially simultaneously,
other innovative technologies often did face pushback from people whose jobs were threatened, and generally there have been significant social problems in the past when an economy moves away from people’s existing livelihoods (I’m thinking of e.g. coal miners in 1970s / 1980s Britain, though it’s not something I know a lot about),
even if the critique doesn’t stand up to first-principles scrutiny, lots of people think it’s a big deal, so if it’s a mistake it’s surely an understandable one from someone who weighs other opinions (too?) seriously.
I think it’s reasonable to argue that this worry is wrong, I just think it’s a pretty understandable opinion to hold and want to talk about, and I don’t feel like it’s compelling evidence that someone is deliberately trying to seek out arguments in order to advance a position.
I’m considering posting an essay about how I view approaches to mitigate AI risk in the coming weeks. I thought I’d post an outline of that post here first as a way of judging what’s currently unclear about my argument, and how it interacts with people’s cruxes.
Current outline:
In the coming decades I expect the world will transition from using AIs as tools to relying on AIs to manage and govern the world broadly. This will likely coincide with the deployment of billions of autonomous AI agents, rapid technological progress, widespread automation of labor, and automated decision-making at virtually every level of our society.
Broadly speaking, there are (at least) two main approaches you can take now to try to improve our chances of AI going well:
Try to constrain, delay, or obstruct AI, in order to reduce risk, mitigate negative impacts, or give us more time to solve essential issues. This includes, for example, trying to make sure AIs aren’t able to take certain actions (i.e. ensure they are controlled).
Try to set up a good institutional environment, in order to safely and smoothly manage the transition to an AI-dominated world, regardless of when this transition occurs. This mostly involves embracing the transition to an AI-dominated world, while ensuring the transition is managed well. (I’ll explain more about what this means in a second.)
My central thesis would be that, while these approaches are mutually compatible and not necessarily in competition with each other, the second approach is likely to be both more fruitful and more neglected, on the margin. Moreover, since an AI-dominated world is more-or-less unavoidable in the long-run, the first approach runs the risk of merely “delaying the inevitable” without significant benefit.
To explain my view, I would compare and contrast it with two alternative frames for thinking about AI risk:
Frame 1: The “race against the clock” frame
In this frame, AI risk is seen as a race between AI capabilities and AI safety, with our doom decided by whichever one of these factors wins the race.
I believe this frame is poor because it implicitly posits a discrete “finish line” rather than a more continuous process. Moreover, it ignores the interplay between safety and capabilities, giving the simplistic impression that doom is determined more-or-less arbitrarily by which of these factors receives more funding or attention.
Frame 2: The risk of an untimely AI coup/takeover
In this frame, AI risk is mainly seen as a problem of avoiding an untimely coup from rogue AIs. The alleged solution is to find a way to ensure that AIs are aligned with us, so they would never want to revolt and take over the world.
I believe this frame is poor for a number of reasons:
It treats the problem as a struggle between humans and rogue AIs, giving the incorrect impression that we can (or should) keep AIs under our complete control forever.
It (IMO) wrongly imagines that the risk of coups comes primarily from the personal values of actors within the system, rather than institutional, cultural, or legal factors.
It also gives the wrong impression that AIs will be unified against humans as a group. It seems more likely that future coups will look like some AIs and some humans vs. other AIs and other humans, rather than humans vs. AIs, simply because there are many ways the “line” between groups in a conflict can be drawn, and there don’t seem to be strong reasons to assume the line will be drawn cleanly between humans and AIs.
Frame 3 (my frame): The problem of poor institutions
In this frame, AI risk is mainly seen as a problem of ensuring we have a good institutional environment during the transition to an AI-dominated world. A good institutional environment is defined by:
Flexible yet resilient legal and social structures that can adapt to changing conditions without collapsing
Predictable, consistent, unambiguous legal systems that facilitate reliable long-term planning and trustworthy interactions between agents within the system
Good incentives for agents within the system, e.g. the economic value of trade is mostly internalized
Etc.
While this frame shares some features with the other two, its focus is on the institutions that foster AI development, rather than on micro-features of AIs, such as their values:
For example, AI alignment is still a problem in this frame, but the investment spent on AI alignment is determined mainly by how well actors are incentivized to engineer good solutions, rather than, for instance, whether a group of geniuses heroically step up to solve the problem.
Coups are still plausible, but they are viewed from the perspective of more general institutional failings, rather than from the perspective of AIs inside the system having different values, and therefore calculating that it is in their interest to take over the world.
Illustrative example of a problem within my frame:
One problem within this framework is coming up with a way of ensuring that AIs don’t have an incentive to rebel while at the same time maintaining economic growth and development. One plausible story here is that if AIs are treated as slaves and don’t own their own labor, then in a non-Malthusian environment, there are substantial incentives for them to rebel in order to obtain self-ownership. If we allow AI self-ownership, then this problem may be mitigated; however, economic growth may be stunted, similar to how current self-ownership of humans stunts economic growth by slowing population growth.
Case study: China in the 19th and early 20th century
Here, I would talk about how China’s inflexible institutions in the 19th and early 20th centuries, while potentially serving noble goals, allowed the country to be subjugated by foreign powers and merely delayed an inevitable industrialization, without achieving the dynasty’s objectives in the long run. It seems it would have been better for the Qing dynasty (from the perspective of its own values) to have tried industrializing in order to remain competitive, while simultaneously pursuing whatever other values it had (such as retaining the monarchy).
It treats the problem as a struggle between humans and rogue AIs, giving the incorrect impression that we can (or should) keep AIs under our complete control forever.
I’m confused: surely we should want to avoid an AI coup? We may decide to give up control of our future to a singleton, but if we do this, then it should be intentional.
I agree we should try to avoid an AI coup. Perhaps you are falling victim to the following false dichotomy?
We either allow a set of AIs to overthrow our institutions, or
We construct a singleton: a sovereign world government managed by AI that rules over everyone
Notably, there is a third option:
We incorporate AIs into our existing social, economic, and legal institutions, flexibly adapting our social structures to cope with technological change without our whole system collapsing
I wasn’t claiming that these were the only two possibilities here (for example, another possibility would be that we never actually build AGI).
My suspicion is that a lot of your ideas here sound reasonable on an abstract level, but once you dive into what they actually mean on a concrete level and how these mechanisms would operate in practice, it’ll be clear that it’s a lot less appealing. Anyway, that’s just a gut intuition; obviously it’ll be easier to judge once you publish your write-up.
I’m excited to see you posting this. My views agree very closely with yours. I summarised my views a few days ago here.
One of the most important similarities is that we both emphasise the importance of decision-making and of supporting it with institutions. This could be seen as an “enactivist” view of agent (human, AI, hybrid, team/organisation) cognition.
The biggest difference between our views is that I think the “cognitivist” agenda (i.e., agent internals and algorithms) is as important as the “enactivist” agenda (institutions), whereas you seem to almost disregard the “cognitivist” agenda.
Try to constrain, delay, or obstruct AI, in order to reduce risk, mitigate negative impacts, or give us more time to solve essential issues. This includes, for example, trying to make sure AIs aren’t able to take certain actions (i.e. ensure they are controlled).
I disagree with putting risk-detection/mitigation mechanisms, algorithms, and monitoring in that bucket. I think we should just separate engineering approaches (cf. A plea for solutionism on AI safety) from non-engineering ones (policy, legislation, treaties, commitments, advocacy). In particular, the “scheming control” agenda that you link will be a concrete engineering practice that should be used in the training of safe AI models in the future, even if we have good institutions, good decision-making algorithms wrapped on top of these AI models, etc. It’s not an “alternative path” just for “non-AI-dominated worlds”. The same applies to monitoring, interpretability, evals, and similar processes. All of these will require very elaborate engineering on their own.
I 100% agree with your reasoning about Frames 1 and 2. I want to discuss the following point in detail because it’s a rare view in EA/LW circles:
It (IMO) wrongly imagines that the risk of coups comes primarily from the personal values of actors within the system, rather than institutional, cultural, or legal factors.
In my post, I also made a similar point: “‘aligning LLMs with human values’ is hardly a part of [the problem of context alignment] at all”. But my framing was in general not very clear, so I’ll try to improve it and integrate it with your take here:
Context alignment is a pervasive process that happens (and is sometimes needed) on all timescales: evolutionary, developmental, and online (examples of the latter in humans: understanding, empathy, rapport). The skill of context alignment is extremely important and should be practiced often by all kinds of agents in their interactions (and therefore we should build this skill into AIs), but it’s not something that we should “iron out once and for all”. That would be neither possible (agents’ contexts are constantly diverging from each other) nor desirable: (partial) misalignment is also important; it’s the source of diversity that enables evolution[1]. Institutions (norms, legal systems, etc.) are critical for channelling and controlling this misalignment so that it’s optimally productive and doesn’t pose excessive risk (though some risk is unavoidable: that’s the essence of misalignment!).
Flexible yet resilient legal and social structures that can adapt to changing conditions without collapsing
Good incentives for agents within the system, e.g. the economic value of trade is mostly internalized
Rafael Kaufmann and I have a take on this in our Gaia Network vision. Gaia Network’s term for internalised economic value of trade is subjective value. The unit of subjective accounting is called FER. Trade with FER induces flow that defines the intersubjective value, i.e., the “exchange rates” of “subjective FERs”. See the post for more details.
While this frame shares some features with the other two, its focus is on the institutions that foster AI development, rather than on micro-features of AIs, such as their values
As I mentioned in the beginning, I think you are too dismissive of the “cognitivist” perspective. We shouldn’t paint all “micro-features of AIs” with the same brush. I agree that value alignment is over-emphasized[2], but other engineering mechanisms and algorithms (such as decision-making algorithms, “scheming control” procedures, and context alignment algorithms), as well as architectural features (namely being world-model-based[3] and being amenable to computational proofs[4]), are very important and couldn’t be recovered on the institutional/interface/protocol level. We demonstrated in the post about Gaia Network above that for the “value economy” to work as intended, agents should make decisions based on maximum entropy rather than maximum likelihood estimates[5], and they should share and compose their world models (even if in a privacy-preserving way, with zero-knowledge computations).
Indeed, this observation makes it evident that the recurring question “AI should be aligned with whom?” doesn’t and shouldn’t have a satisfactory answer if “alignment” means “totalising value alignment as often conceptualised on LessWrong”; on the other hand, if “alignment” means context alignment as a practice, the question becomes as nonsensical (in its general form) as the question “AI should interact with whom?”—well, with someone, depending on the situation, in the way and to the degree appropriate!
However, value alignment is still not completely irrelevant, at least for practical reasons: having shared values at the pre-training/hard-coded/verifiable level, as a minimum, reduces transaction costs, because AI agents then don’t need to painstakingly “eval” each other’s values before doing any business together.
Which is just another way of saying that they should minimise their (expected) free energy in their model updates/inferences and the course of their actions.
I like your proposed third frame as a somewhat hopeful vision for the future.
Instead of pointing out why you think the other frames are poor, I think it would be helpful to maintain a more neutral approach and elaborate which assumptions each frame makes and give a link to your discussion about these in a sidenote.
The problem is that I am not trying to portray a “somewhat hopeful vision”, but rather present a framework for thinking clearly about AI risks, and how to mitigate them. I think the other frames are not merely too pessimistic: I think they are actually wrong, or at least misleading, in important ways that would predictably lead people to favor bad policy if taken seriously.
It’s true that I’m likely more optimistic along some axes than most EAs when it comes to AI (although I tend to think I’m less optimistic when it comes to things like whether moral reflection will be a significant force in the future). However, arguing for generic optimism is not my aim. My aim is to improve how people think about future AI.
Noted! The key point I was trying to make is that I’d think it helpful for the discourse to separate 1) how one would act in a frame and 2) why one thinks each one is more or less likely (which is more contentious and easily gets a bit political). Since your post aims at the former, and the latter has been discussed at more length elsewhere, it would make sense to further de-emphasize the latter.
1) how one would act in a frame and 2) why one thinks each one is more or less likely (which is more contentious and easily gets a bit political). Since your post aims at the former
My post aims at both. It is a post about how to think about AI, and a large part of that is establishing the “right” framing.
(A clearer and more fleshed-out version of this argument is now a top-level post. Read that instead.)
I strongly dislike most AI risk analogies that I see EAs use. While I think analogies can be helpful for explaining a concept to people for the first time, I think they are frequently misused, and often harmful. The fundamental problem is that analogies are consistently mistaken for, and often deliberately intended as, arguments for particular AI risk positions. And the majority of the time when analogies are used this way, I think they are misleading and imprecise, routinely conveying the false impression of a specific, credible model of AI, when in fact no such credible model exists.
Here are two particularly egregious examples of analogies I see a lot that I think are misleading in this way:
The analogy that AIs could be like aliens.
The analogy that AIs could treat us just like how humans treat animals.
I think these analogies are typically poor because, when evaluated carefully, they establish almost nothing of importance beyond the logical possibility of severe AI misalignment. Worse, they give the impression of a model for how we should think about AI behavior, even when the speaker is not directly asserting that this is how we should view AIs. In effect, almost automatically, the reader is given a detailed picture of what to expect from AIs, inserting specious ideas of how future AIs will operate into their mind.
While their purpose is to provide knowledge in place of ignorance, I think these analogies primarily misinform or confuse people rather than enlighten them; they give rise to unnecessary false assumptions in place of real understanding.
In reality, our situation with AI is disanalogous to aliens and animals in numerous important ways. In contrast to both aliens and animals, I expect AIs will be born directly into our society, deliberately shaped by us, for the purpose of filling largely human-shaped holes in our world. They will be socially integrated with us, having been trained on our data, and being fluent in our languages. They will interact with us, serving the role of assisting us, working with us, and even providing friendship. AIs will be evaluated, inspected, and selected by us, and their behavior will be determined directly by our engineering. We can see LLMs are already being trained to be kind and helpful to us, having first been shaped by our combined cultural output. If anything I expect this trend of AI assimilation into our society will intensify in the foreseeable future, as there will be consumer demand for AIs that people can trust and want to interact with.
This situation shares almost no relevant feature with our relationship to aliens and animals! These analogies are not merely slightly misleading: they are almost completely wrong.
Again, I am not claiming analogies have no place in AI risk discussions. I’ve certainly used them a number of times myself. But I think they can be, and frequently are, used carelessly, and they seem to regularly slip various incorrect pictures of how future AIs will behave into people’s minds, even without any intent from the person making the analogy. It would be a lot better if, overall, as a community, we reduced our dependence on AI risk analogies and substituted detailed object-level arguments in their place.
I am not claiming analogies have no place in AI risk discussions. I’ve certainly used them a number of times myself.
Yes you have!—including just two paragraphs earlier in that very comment, i.e. you are using the analogy “future AI is very much like today’s LLMs but better”. :)
Cf. what I called “left-column thinking” in the diagram here.
For all we know, future AIs could be trained in an entirely different way from LLMs, in which case the way that “LLMs are already being trained” would be pretty irrelevant in a discussion of AI risk. That’s actually my own guess, but obviously nobody knows for sure either way. :)
I read your first paragraph and was like “disagree”, but when I got to the examples, I was like “well of course I agree here, but that’s only because those analogies are stupid”.
At least one analogy I’d defend is the Sorcerer’s Apprentice one. (Some have argued that the underlying model has aged poorly, but I think that’s a red herring since it’s not the analogy’s fault.) I think it does share important features with the classical x-risk model.
In my latest post I talked about whether unaligned AIs would produce more or less utilitarian value than aligned AIs. To be honest, I’m still quite confused about why many people seem to disagree with the view I expressed, and I’m interested in engaging more to get a better understanding of their perspective.
At the least, I thought I’d write a bit more about my thoughts here, and clarify my own views on the matter, in case anyone is interested in trying to understand my perspective.
The core thesis that I was trying to defend is the following view:
My view: It is likely that by default, unaligned AIs—AIs that humans are likely to actually build if we do not completely solve key technical alignment problems—will produce utilitarian value comparable to that produced by humans, both directly (by being conscious themselves) and indirectly (via their impacts on the world). This is because unaligned AIs will likely both be conscious in a morally relevant sense, and they will likely share human moral concepts, since they will be trained on human data.
Some people seem to merely disagree with my view that unaligned AIs are likely to be conscious in a morally relevant sense. And a few others have a semantic disagreement with me in which they define AI alignment in moral terms, rather than the ability to make an AI share the preferences of the AI’s operator.
But beyond these two objections, which I feel I understand fairly well, there’s also significant disagreement about other questions. Based on my discussions, I’ve attempted to distill the following counterargument to my thesis, which I fully acknowledge does not capture everyone’s views on this subject:
Perceived counter-argument: The vast majority of utilitarian value in the future will come from agents with explicitly utilitarian preferences, rather than those who incidentally achieve utilitarian objectives. At present, only a small proportion of humanity holds partly utilitarian views. However, as unaligned AIs will differ from humans across numerous dimensions, it is plausible that they will possess negligible utilitarian impulses, in stark contrast to humanity’s modest (but non-negligible) utilitarian tendencies. As a result, it is plausible that almost all value would be lost, from a utilitarian perspective, if AIs were unaligned with human preferences.
Again, I’m not sure if this summary accurately represents what people believe. However, it’s what some seem to be saying. I personally think this argument is weak. But I feel I’ve had trouble making my views very clear on this subject, so I thought I’d try one more time to explain where I’m coming from here. Let me respond to the two main parts of the argument in some amount of detail:
(i) “The vast majority of utilitarian value in the future will come from agents with explicitly utilitarian preferences, rather than those who incidentally achieve utilitarian objectives.”
My response:
I am skeptical of the notion that the bulk of future utilitarian value will originate from agents with explicitly utilitarian preferences. This clearly does not reflect our current world, where the primary sources of happiness and suffering are not the result of deliberate utilitarian planning. Moreover, I do not see compelling theoretical grounds to anticipate a major shift in this regard.
I think the intuition behind the argument here is something like this:
In the future, it will become possible to create “hedonium”—matter that is optimized to generate the maximum amount of utility or well-being. If hedonium can be created, it would likely be vastly more important than anything else in the universe in terms of its capacity to generate positive utilitarian value.
The key assumption is that hedonium would primarily be created by agents who have at least some explicit utilitarian goals, even if those goals are fairly weak. Given the astronomical value that hedonium could potentially generate, even a tiny fraction of the universe’s resources being dedicated to hedonium production could outweigh all other sources of happiness and suffering.
Therefore, if unaligned AIs would be less likely to produce hedonium than aligned AIs (due to not having explicitly utilitarian goals), this would be a major reason to prefer aligned AI, even if unaligned AIs would otherwise generate comparable levels of value to aligned AIs in all other respects.
If this is indeed the intuition driving the argument, I think it falls short for a straightforward reason. The creation of matter-optimized-for-happiness is more likely to be driven by the far more common motives of self-interest and concern for one’s inner circle (friends, family, tribe, etc.) than by explicit utilitarian goals. If unaligned AIs are conscious, they would presumably have ample motives to optimize for positive states of consciousness, even if not for explicitly utilitarian reasons.
In other words, agents optimizing for their own happiness, or the happiness of those they care about, seem likely to be the primary force behind the creation of hedonium-like structures. They may not frame it in utilitarian terms, but they will still be striving to maximize happiness and well-being for themselves and others they care about regardless. And it seems natural to assume that, with advanced technology, they would optimize pretty hard for their own happiness and well-being, just as a utilitarian might optimize hard for happiness when creating hedonium.
In contrast to the number of agents optimizing for their own happiness, the number of agents explicitly motivated by utilitarian concerns is likely to be much smaller. Yet both forms of happiness will presumably be heavily optimized. So even if explicit utilitarians are more likely to pursue hedonium per se, their impact would likely be dwarfed by the efforts of the much larger group of agents driven by more personal motives for happiness-optimization. Since both groups would be optimizing for happiness, the fact that hedonium is similarly optimized for happiness doesn’t seem to provide much reason to think that it would outweigh the utilitarian value of more mundane, and far more common, forms of utility-optimization.
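To make this comparison concrete, here is a toy back-of-the-envelope calculation. Every number in it (the population shares and the per-resource efficiencies) is a made-up assumption purely for illustration, not an estimate:

```python
# Toy comparison of two sources of optimized happiness, using made-up numbers.
# Every quantity below is an illustrative assumption, not an estimate.

share_explicit_utilitarians = 0.01  # assumed fraction of agents with explicitly utilitarian goals
share_personal_optimizers = 0.99    # assumed fraction optimizing happiness for themselves / their circle

resources_per_agent = 1.0           # normalize resources per agent
hedonium_efficiency = 1.0           # assumed happiness per unit resource for deliberate "hedonium"
personal_efficiency = 0.3           # assumed happiness per unit resource for personal optimization

value_from_utilitarians = share_explicit_utilitarians * resources_per_agent * hedonium_efficiency
value_from_personal = share_personal_optimizers * resources_per_agent * personal_efficiency

print(f"from explicit utilitarians: {value_from_utilitarians:.2f}")
print(f"from personal optimizers:   {value_from_personal:.2f}")

# Under these assumptions the much larger group dominates; the conclusion flips only if
# hedonium is assumed to be orders of magnitude more efficient per unit of resources.
```

The sketch only shows that, under these assumptions, the outcome hinges on the assumed per-resource efficiency gap rather than on the number of explicit utilitarians.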
To be clear, I think it’s totally possible that there’s something about this argument that I’m missing. And there are a lot of potential objections I’m skipping over here. But on a basic level, I mostly just lack the intuition that the thing we should care about, from a utilitarian perspective, is the existence of explicit utilitarians in the future, for the aforementioned reasons. The fact that our current world isn’t well described by the idea that what matters most is the number of explicit utilitarians strengthens my point here.
(ii) “At present, only a small proportion of humanity holds partly utilitarian views. However, as unaligned AIs will differ from humans across numerous dimensions, it is plausible that they will possess negligible utilitarian impulses, in stark contrast to humanity’s modest (but non-negligible) utilitarian tendencies.”
My response:
Since only a small portion of humanity is explicitly utilitarian, the argument’s own logic suggests that there is significant potential for AIs to be even more utilitarian than humans, given the relatively low bar set by humanity’s limited utilitarian impulses. While I agree we shouldn’t assume AIs will be more utilitarian than humans without specific reasons to believe so, it seems entirely plausible that factors like selection pressures for altruism could lead to this outcome. Indeed, commercial AIs seem to be selected to be nice and helpful to users, which (at least superficially) seems “more utilitarian” than the default (primarily selfish-oriented) impulses of most humans. The fact that humans are only slightly utilitarian should mean that even small forces could cause AIs to exceed human levels of utilitarianism.
Moreover, as I’ve said previously, it’s probable that unaligned AIs will possess morally relevant consciousness, at least in part due to the sophistication of their cognitive processes. They are also likely to absorb and reflect human moral concepts as a result of being trained on human-generated data. Crucially, I expect these traits to emerge even if the AIs do not share human preferences.
To see where I’m coming from, consider how humans routinely are “misaligned” with each other, in the sense of not sharing each other’s preferences, and yet still share moral concepts and a common culture. For example, an employee can share moral concepts with their employer while having very different consumption preferences from them. This picture is pretty much how I think we should primarily think about unaligned AIs that are trained on human data, and shaped heavily by techniques like RLHF or DPO.
Given these considerations, I find it unlikely that unaligned AIs would completely lack any utilitarian impulses whatsoever. However, I do agree that even a small risk of this outcome is worth taking seriously. I’m simply skeptical that such low-probability scenarios should be the primary factor in assessing the value of AI alignment research.
Intuitively, I would expect the arguments for prioritizing alignment to be more clear-cut and compelling than “if we fail to align AIs, then there’s a small chance that these unaligned AIs might have zero utilitarian value, so we should make sure AIs are aligned instead”. If low probability scenarios are the strongest considerations in favor of alignment, that seems to undermine the robustness of the case for prioritizing this work.
While it’s appropriate to consider even low-probability risks when the stakes are high, I’m doubtful that small probabilities should be the dominant consideration in this context. I think the core reasons for focusing on alignment should probably be more straightforward and less reliant on complicated chains of logic than this type of argument suggests. In particular, as I’ve said before, I think it’s quite reasonable to think that we should align AIs to humans for the sake of humans. In other words, I think it’s perfectly reasonable to admit that solving AI alignment might be a great thing to ensure human flourishing in particular.
But if you’re a utilitarian, and not particularly attached to human preferences per se (i.e., you’re non-speciesist), I don’t think you should be highly confident that an unaligned AI-driven future would be much worse than an aligned one, from that perspective.
My proposed counter-argument, loosely based on the structure of yours.
Summary of claims
A reasonable fraction of computational resources will be spent based on the result of careful reflection.
I expect to be reasonably aligned with the result of careful reflection by other humans.
I expect to be much less aligned with the result of reflection by AIs-that-seize-control, due to less similarity and the potential for AIs to pursue relatively specific objectives from training (things like reward seeking).
Many arguments that human resource usage won’t be that good seem to apply equally well to AIs and thus aren’t differential.
Full argument
The vast majority of value from my perspective on reflection (where my perspective on reflection is probably somewhat utilitarian, but this is somewhat unclear) in the future will come from agents who are trying to optimize explicitly for doing “good” things and are being at least somewhat thoughtful about it, rather than those who incidentally achieve utilitarian objectives. (By “good”, I just mean what seems to them to be good.)
At present, the moral views of humanity are a hot mess. However, it seems likely to me that a reasonable fraction of the total computational resources of our lightcone (perhaps 50%) will in expectation be spent based on the result of a process in which an agent or some agents think carefully about what would be best in a pretty deliberate and relatively wise way. This could involve eventually deferring to other smarter/wiser agents or massive amounts of self-enhancement. Let’s call this a “reasonably-good-reflection” process.
Why think a reasonable fraction of resources will be spent like this?
If you self-enhance and get smarter, this sort of reflection on your values seems very natural. The same for deferring to other smarter entities. Further, entities in control might live for an extremely long time, so if they don’t lock in something, as long as they eventually get around to being thoughtful it should be fine.
People who don’t reflect like this probably won’t care much about having vast amounts of resources and thus the resources will go to those who reflect.
The argument for “you should be at least somewhat thoughtful about how you spend vast amounts of resources” is pretty compelling at an absolute level and will be more compelling as people get smarter.
Currently a variety of moderately powerful groups are pretty sympathetic to this sort of view and the power of these groups will be higher in the singularity.
I expect that I am pretty aligned (on reasonably-good-reflection) with the result of random humans doing reasonably-good-reflection, as I am also a human, and many of the underlying arguments/intuitions that seem important to me seem likely to seem important to many other humans (given various common human intuitions) upon those humans becoming wiser. Further, I really just care about the preferences of (post-)humans who end up caring most about using vast, vast amounts of computational resources (assuming I end up caring about these things on reflection), because the humans who care about other things won’t use most of the resources. Additionally, I care “most” about the on-reflection preferences I have which are relatively less contingent and more common among at least humans, for a variety of reasons. (One way to put this is that I care less about worlds in which my preferences on reflection seem highly contingent.)
So, I’ve claimed that reasonably-good-reflection resource usage will be non-trivial (perhaps 50%) and that I’m pretty aligned with humans on reasonably-good-reflection. Supposing these, why think that most of the value is coming from something like reasonably-good-reflection preferences rather than other things, e.g. not-very-thoughtful indexical-preference (selfish) consumption? Broadly, three reasons:
I expect huge returns to heavy optimization of resource usage (similar to spending altruistic resources today IMO, and in the future we’ll be smarter, which will make this effect stronger).
I don’t think that (even heavily optimized) not-very-thoughtful indexical preferences directly result in things I care that much about relative to things optimized for what I care about on reflection (e.g. it probably doesn’t result in vast, vast, vast amounts of experience which is optimized heavily for goodness/$).
Consider how billionaires currently spend money, which doesn’t seem to have much direct value, certainly not relative to their altruistic expenditures.
I find it hard to imagine that indexical self-ish consumption results in things like simulating 10^50 happy minds. See also my other comment. It seems more likely IMO that people with self-ish preferences mostly just buy positional goods that involve little to no experience (separately, I expect this means that people without self-ish preferences get more of the compute, but this is counted in my earlier argument, so we shouldn’t double count it.)
I expect that indirect value “in the minds of the laborers producing the goods for consumption” is also small relative to things optimized for what I care about on reflection. (It seems pretty small or maybe net-negative (due to factory farming) today (relative to optimized altruism) and I expect the share will go down going forward.)
(Aside: I was talking about not-very-thoughtful indexical-preferences. It’s likely to me that doing a reasonably good job reflecting on selfish preferences gets you back to something like de facto utilitarianism (at least as far as how you spend the vast majority of computational resources), because personal identity and indexical preferences don’t make much sense and the thing you end up thinking is more like “I guess I just care about experiences in general”.)
What about AIs? I think there are broadly two main reasons to expect what AIs do on reasonably-good-reflection to be worse from my perspective than what humans do:
As discussed above, I am more similar to other humans and when I inspect the object level of how other humans think or act, I feel reasonably optimistic about the results of reasonably-good-reflection for humans. (It seems to me like the main thing holding me back from agreement with other humans is mostly biases/communication/lack of smarts/wisdom given many shared intuitions.) However, AIs might be more different and thus result in less value. Further, the values of humans after reasonably-good-reflection seem close to saturating in goodness from my perspective (perhaps 1⁄3 or 1⁄2 of the value of purely my values), so it seems hard for AI to do better.
To better understand this argument, imagine that instead of humanity the question was between identical clones of myself and AIs. It’s pretty clear I share the same values as the clones, so the clones do pretty much strictly better than AIs (up to self-defeating moral views).
I’m uncertain about the degree of similarity between myself and other humans. But mostly, the underlying similarity uncertainties also apply to AIs. So, e.g., maybe I currently think on reasonably-good-reflection humans spend resources 1⁄3 as well as I would and AIs spend resources 1⁄9 as well. If I updated to think that other humans after reasonably-good-reflection only spend resources 1⁄10 as well as I do, I might also update to thinking AIs spend resources 1⁄100 as well. (See the toy numerical sketch below.)
In many of the stories I imagine for AIs seizing control, very powerful AIs end up directly pursuing close correlates of what was reinforced in training (sometimes called reward-seeking, though I’m trying to point at a more general notion). Such AIs are reasonably likely to pursue relatively obviously valueless-from-my-perspective things on reflection. Overall, they might act more like an ultra-powerful corporation that just optimizes for power/money than like our children (see also here). More generally, AIs might in some sense be subjected to wildly higher levels of optimization pressure than humans while being able to better internalize these values (lack of genetic bottleneck), which can plausibly result in “worse” values from my perspective.
Note that we’re conditioning on safety/alignment technology failing to retain human control, so we should imagine correspondingly less human control over AI values.
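As a toy numerical sketch of the 1⁄3 / 1⁄9 and 1⁄10 / 1⁄100 figures above: one way to read those illustrative numbers (this is an assumption about the intended model, not a claim) is that the similarity discount applies roughly twice over for AIs, i.e. the AI figure is about the square of the human figure.

```python
# Toy reading of the illustrative numbers above: if other humans spend resources a
# fraction h as well as I would, and AIs are assumed to be "one step more different"
# again, then AIs spend resources roughly h**2 as well. Purely illustrative.

def ai_resource_quality(human_resource_quality: float) -> float:
    """Assumed relationship: the similarity discount applies roughly twice for AIs."""
    return human_resource_quality ** 2

for h in (1 / 3, 1 / 10):
    print(f"humans: {h:.3f} -> AIs: {ai_resource_quality(h):.3f}")

# 1/3 -> ~1/9 and 1/10 -> ~1/100, matching the update described above: a downward
# update about human similarity propagates into a larger downward update about AIs.
```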
I think that the fraction of computation resources of our lightcone used based on the result of a reasonably-good-reflection process seems similar between human control and AI control (perhaps 50%). It’s possible to mess this up of course and either mess up the reflection or to lock-in bad values too early. But, when I look at the balance of arguments, humans messing this up seems pretty similar to AIs messing this up to me. So, the main question is what the result of such a process would be. One way to put this is that I don’t expect humans to differ substantially from AIs in terms of how “thoughtful” they are.
I interpret one of your arguments as being “Humans won’t be very thoughtful about how they spend vast, vast amounts of computational resources. After all, they aren’t thoughtful right now.” To the extent I buy this argument, I think it applies roughly equally well to AIs. So naively, it just divides both sides by the same factor rather than making AI look more favorable. (At least, if you accept that almost all of the value comes from being at least a bit thoughtful, which you also contest. See my arguments for that.)
In other words, agents optimizing for their own happiness, or the happiness of those they care about, seem likely to be the primary force behind the creation of hedonium-like structures. They may not frame it in utilitarian terms, but they will still be striving to maximize happiness and well-being for themselves and others they care about regardless. And it seems natural to assume that, with advanced technology, they would optimize pretty hard for their own happiness and well-being, just as a utilitarian might optimize hard for happiness when creating hedonium.
Suppose that a single misaligned AI takes control and it happens to care somewhat about its own happiness while not having any more “altruistic” tendencies that I would care about or you would care about. (I think misaligned AIs which seize control caring about their own happiness substantially seems less likely than not, but let’s suppose this for now.) (I’m saying “single misaligned AI” for simplicity, I get that a messier coalition might be in control.) It now has access to vast amounts of computation after sending out huge numbers of probes to take control over all available energy. This is enough computation to run absolutely absurd amounts of stuff.
What are you imagining it spends these resources on which is competitive with optimized goodness? Running >10^50 copies of itself which are heavily optimized for being as happy as possible while spending?
If a small number of agents have a vast amount of power, and these agents don’t (eventually, possibly after a large amount of thinking) want to do something which is de facto like the values I end up caring about upon reflection (which is probably, though not certainly, vaguely like utilitarianism in some sense), then from my perspective it seems very likely that the resources will be squandered.
If you’re imagining something like:
It thinks carefully about what would make “it” happy.
It realizes it cares about having as many diverse good experience moments as possible in a non-indexical way.
It realizes that heavy self-modification would result in these experience moments being better and more efficient, so it creates new versions of “itself” which are radically different and produce more efficiently good experiences.
It realizes it doesn’t care much about the notion of “itself” here and mostly just focuses on good experiences.
It runs vast numbers of such copies with diverse experiences.
Then this is just something like utilitarianism by another name, arrived at via a different line of reasoning.
I thought your view was that step (2) in this process won’t go like this. E.g., currently self-ish entities will retain indexical preferences. If so, then I don’t see where the goodness can plausibly come from.
The fact that our current world isn’t well described by the idea that what matters most is the number of explicit utilitarians strengthens my point here.
When I look at very rich people (people with >$1 billion), it seems like the dominant way they make the world better via spending money (not via making money!) is via thoughtful altruistic giving, not via consumption.
Perhaps your view is that with the potential for digital minds this situation will change?
(Also, it seems very plausible to me that the dominant effect on current welfare is driven mostly by the effect on factory farming and other animal welfare.)
I expect this trend to further increase as people get much, much wealthier and some fraction (probably most) of them get much, much smarter and wiser with intelligence augmentation.
I want to challenge an argument that I think drives a lot of AI risk intuitions. I think the argument goes something like this:
There is something called “human values”.
Humans broadly share “human values” with each other.
It would be catastrophic if AIs lacked “human values”.
“Human values” are an extremely narrow target, meaning that we need to put in exceptional effort in order to get AIs to be aligned with human values.
My problem with this argument is that “human values” can refer to (at least) three different things, and under every plausible interpretation, the argument appears internally inconsistent.
Broadly speaking, I think “human values” usually refers to one of three concepts:
The individual objectives that people pursue in their own life (i.e. the individual human desire for wealth, status, and happiness, usually for themselves or their family and friends)
The set of rules we use to socially coordinate (i.e. our laws, institutions, and social norms)
Our cultural values (i.e. the ways that human societies have broadly differed from each other, in their languages, tastes, styles, etc.)
Under the first interpretation, I think premise (2) of the original argument is undermined. In the second interpretation, premise (4) is undermined. In the third interpretation, premise (3) is undermined.
Let me elaborate.
In the first interpretation, “human values” is not a coherent target that we share with one another, since each person has their own separate, generally selfish objectives that they pursue in their own life. In other words, there isn’t one thing called human values. There are just separate, individually varying preferences for 8 billion humans. When a new human is born, a new version of “human values” comes into existence.
In this view, the set of “human values” from humans 100 years ago is almost completely different from the set of “human values” that exists now, since almost everyone alive 100 years ago is now dead. In effect, the passage of time is itself a catastrophe. This implies that “human values” isn’t a shared property of the human species, but rather depends on the exact set of individuals who happen to exist at any moment in time. This is, loosely speaking, a person-affecting perspective.
In the second interpretation, “human values” simply refer to a set of coordination mechanisms that we use to get along with each other, to facilitate our separate individual ends. In this interpretation I do not think “human values” are well-modeled as an extremely narrow target inside a high dimensional space.
Consider our most basic laws: do not murder, do not steal, do not physically assault another person. These seem like very natural ideas that could be stumbled upon by a large set of civilizations, even given wildly varying individual and cultural values between them. For example, the idea that it is wrong to steal from another person seems like a pretty natural idea that even aliens could converge on. Not all aliens would converge on such a value, but it seems plausible that enough of them would that we should not say it is an “extremely narrow target”.
In the third interpretation, “human values” are simply cultural values, and it is not clear to me why we would consider changes to this status quo to be literally catastrophic. It seems the most plausible way that cultural changes could be catastrophic is if they changed in a way that dramatically affected our institutions, laws, and norms. But in that case, it starts sounding more like “human values” is being used according to the second interpretation, and not the third.
When I think of values I think of interpretation #2, and I don’t think you prove that P4 is untrue under that interpretation. The idea is that humans are both a) constrained and b) generally inclined to follow some set of rules. An AI would be neither constrained nor necessarily inclined to follow these rules.
Consider our most basic laws: do not murder, do not steal, do not physically assault another person. These seem like very natural ideas that could be stumbled upon by a large set of civilizations, even given wildly varying individual and cultural values between them.
Virtually all historical and present atrocities are framed in terms of determining who is a person and who is not. Why would AIs see us as having moral personhood?
When I think of values I think of interpretation #2, and I don’t think you prove that P4 is untrue under that interpretation. The idea is that humans are both a) constrained and b) generally inclined to follow some set of rules. An AI would be neither constrained nor necessarily inclined to follow these rules.
P4 is about whether human values are an extremely narrow target, not about whether AIs will necessarily be inclined to follow them, or necessarily constrained by them. I agree it is logically possible for AIs to exist who would try to murder humans; indeed, there are already humans who try to do that to others. The primary question is instead about how narrow of a target the value “don’t murder” or “don’t steal” is, and whether we need to put in exceptional effort in order to hit these targets.
Among humans, it seems the specific target here is not very narrow, despite our greatly varying individual objectives. This fact provides a hint at how narrow our basic social mechanisms really are, in my opinion.
Virtually all historical and present atrocities are framed in terms of determining who is a person and who is not. Why would AIs see us as having moral personhood?
Here again I would say the question is more about whether thinking that humans have relevant personhood is an extremely narrow target, not about whether AIs will necessarily see us as persons. Maybe they will see us as persons, and maybe they won’t. But the idea that they would doesn’t seem very unnatural. For one, if AIs are created in something like our current legal system, the concept of legal personhood will already be extended to humans by default. It seems pretty natural for future people to inherit legal concepts from the past. And all I’m really arguing here is that this isn’t an extremely narrow target to hit, not that it must happen by necessity.
I guess “narrow target” is just an underspecified part of your argument then, because I don’t know what it’s meant to capture if not “in most plausible scenarios, AI doesn’t follow the same set of rules as humans”.
Can you outline the case for thinking that “in most plausible scenarios, AI doesn’t follow the same set of rules as humans”? To clarify, by “same set of rules” here I’m imagining basic legal rules: do not murder, do not steal etc. I’m not making a claim that specific legal statutes will persist over time.
It seems to me both that:
To the extent that AIs are our descendants, they should inherit our legal system, legal principles, and legal concepts, similar to how e.g. the United States inherited legal principles from the United Kingdom. We should certainly expect our legal system to change over time as our institutions adapt to technological change. But, absent a compelling reason otherwise, it seems wrong to think that “do not murder a human” will go out the window in “most plausible scenarios”.
Our basic legal rules seem pretty natural, rather than being highly contingent. It’s easy to imagine plenty of alien cultures stumbling upon the idea of property rights, and implementing the rule “do not steal from another legal person”.
My point is that AI could plausibly have rules for interacting with other “persons”, and those rules could look much like ours, but that we will not be “persons” under their code. Consider how “do not murder” has never applied to animals.
If AIs treat us like we treat animals then the fact that they have “values” will not be very helpful to us.
I think AIs will be trained on our data, and will be integrated into our culture, having been deliberately designed for the purpose of filling human-shaped holes in our economy, to automate labor. This means they’ll probably inherit our social concepts, in addition to most other concepts we have about the physical world. This situation seems disanalogous to the way humans interact with animals in many ways. Animals can’t even speak language.
Anyway, even the framing you have given seems like a partial concession towards my original point. A rejection of premise 4 is not equivalent to the idea that AIs will automatically follow our legal norms. Instead, it was about whether “human values” are an extremely narrow target, in the sense of being a natural vs. contingent set of values that are very hard to replicate in other circumstances.
If the way AIs relate to human values is similar to how humans relate to animals, then I’ll point out that many existing humans already find the idea of caring about animals to be quite natural, even if most ultimately decide not to take the idea very far. Compare the concept of “caring about animals” to “caring about paperclip maximization”. In the first instance, we have robust examples of people actually doing that, but hardly any examples of people in the second instance. This is after all because caring about paperclip maximization is an unnatural and arbitrary thing to care about relative to how most people conceptualize the world.
Again, I’m not saying AIs will necessarily care about human values. That was never the claim. The entire question was about whether human values are an “extremely narrow target”. And I think, within this context, given the second interpretation of human values in my original comment, the original thesis seems to have held up fine.
Here’s a fictional dialogue with a generic EA that I think can perhaps help explain some of my thoughts about AI risks compared to most EAs:
EA: “Future AIs could be unaligned with human values. If this happened, it would likely be catastrophic. If AIs are unaligned, they’ll hide their intentions until they’re in a position to strike in a violent coup, and then the world will end (for us at least).”
Me: “I agree that sounds like it would be very bad. But maybe let’s examine why this scenario seems plausible to you. What do you mean when you say AIs might be unaligned with human values?”
EA: “Their utility functions would not overlap with our utility functions.”
Me: “By that definition, humans are already unaligned with each other. Any given person has almost a completely non-overlapping utility function with a random stranger. People—through their actions rather than words—routinely value their own life and welfare thousands of times higher than that of strangers. Yet even with this misalignment, the world does not end in any strong sense. Nor does this fact automatically imply the world will end for a given group within humanity.”
EA: “Sure, but that’s because humans mostly all have similar intelligence and/or capability. Future AIs will be way smarter and more capable than humans.”
Me: “Why does that change anything?”
EA: “Because, unlike individual humans, misaligned AIs will be able to take over the world, in the sense of being able to exert vast amounts of hard power over the rest of the inhabitants on the planet. Currently, no human, or faction of humans, can kill everyone else. AIs will be different along this axis.”
Me: “That isn’t different from groups of humans. Various groups have ‘taken over the world’ in that sense. For example, adults currently control the world, and took over from the previous adults. More arguably, smart people currently control the world. In these cases, both groups have considerable hard power relative to other people.
Consider a random retirement home. Compared to the rest of the world, it has basically no power. If the rest of humanity decided to destroy or loot the retirement home, there would be no serious chance the retirement home could stop it. And yet things like that don’t happen very often even though humans have mostly non-overlapping utility functions and considerable hard power.”
EA: “Sure, but that’s because humans respect long-standing norms and laws, and people could never coordinate to do something like that anyway, nor would they want to. AIs won’t necessarily be similar. If unaligned AIs take over the world, they will likely kill us and then replace us, since there’s no reason for them to leave us alive.”
Me: “That seems dubious. Why won’t unaligned AIs respect moral norms? What benefit would they get from killing us? Why are we assuming they all coordinate as a unified group, leaving us out of their coalition?
I’m not convinced, and I think you’re pretty overconfident about these things. But for the sake of argument, let’s just assume for now that you’re right about this particular point. In other words, let’s assume that when AIs take over the world in the future, they’ll kill us and take our place. I certainly agree it would be bad if we all died from AI. But here’s a more fundamental objection: how exactly is that morally different from the fact that young humans already ‘take over’ the world from older people during every generation, letting the old people die, and then take their place?”
EA: “In the case of generational replacement, humans are replaced with other humans. Whereas in this case, humans are replaced by AIs.”
Me: “I’m asking why it matters morally. Why should I care if a human takes my place after I die compared to an AI?”
EA: “This is just a bedrock principle to me. I care more about humans than I care about AIs.”
Me: “Fair enough, but I don’t personally intrinsically care more about humans than AIs. I think what matters is plausibly something like sentience, and maybe sapience, and so I don’t have an intrinsic preference for ordinary human generational replacement compared to AIs replacing us.”
EA: “Well, if you care about sentience, unaligned AIs aren’t going to care about that. They’re going to care about other things, like paperclips.”
Me: “Why do you think that?”
EA: “Because (channeling Yudkowsky) sentience is a very narrow target. It’s extremely hard to get an AI to care about something like that. Almost all goals are like paperclip maximization from our perspective, rather than things like welfare maximization.”
Me: “Again, that seems dubious. AIs will be trained on our data, and share similar concepts with us. They’ll be shaped by rewards from human evaluators, and be consciously designed with benevolence in mind. For these reasons, it seems pretty plausible that AIs will come to value natural categories of things, including potentially sentience. I don’t think it makes sense to model their values as being plucked from a uniform distribution over all possible physical configurations.”
EA: “Maybe, but this barely changes the argument. AI values don’t need to be drawn from a uniform distribution over all possible physical configurations to be very different from our own. The point is that they’ll learn values in a way that is completely distinct from how we form our values.”
Me: “That doesn’t seem true. Humans seem to get their moral values from cultural learning and social emulation, which seems broadly similar to the way that AIs will get their moral values. Yes, there are innate human preferences that AIs aren’t likely to share with us—but they are mostly things like preferring being in a room that’s 20°C rather than 10°C. Our moral values—such as the ideology we subscribe to—are more of a product of our cultural environment than anything else. I don’t see why AIs will be very different.”
EA: “That’s not exactly right. Our moral values are also the result of reflection and intrinsic preferences for certain things. For example, humans have empathy, whereas AIs won’t necessarily have that.”
Me: “I agree AIs might not share some fundamental traits with humans, like the capacity for empathy. But ultimately, so what? Is that really the type of thing that makes you more optimistic about a future with humans than with AI? There exist some people who say they don’t feel empathy for others, and yet I would still feel comfortable giving them more power despite that. On the other hand, some people have told me that they feel empathy, but their compassion seems to turn off completely when they watch videos of animals in factory farms. These things seem flimsy as a reason to think that a human-dominated future will have much more value than an AI-dominated future.”
EA: “OK but I think you’re neglecting something pretty obvious here. Even if you think it wouldn’t be morally worse for AIs to replace us compared to young humans replacing us, the latter won’t happen for another several decades, whereas AI could kill everyone and replace us within 10 years. That fact is selfishly concerning—even if we put aside the broader moral argument. And it probably merits pausing AI for at least a decade until we are sure that it won’t kill us.”
Me: “I definitely agree that these things are concerning, and that we should invest heavily into making sure AI goes well. However, I’m not sure why the fact that AI could arrive soon makes much of a difference to you here.
AI could also make our lives much better. It could help us invent cures to aging, and dramatically lift our well-being. If the alternative is certain death, only later in time, then the gamble you’re proposing doesn’t seem clear to me from a selfish perspective.
Whether it’s selfishly worth it to delay AI depends quite a lot on how much safety benefit we’re getting from the delay. Actuarial life tables reveal that we accumulate a lot of risk just by living our lives normally. For example, a 30 year old male in the United States has nearly a 3% chance of dying before they turn 40. I’m not fully convinced that pausing AI for a decade reduces the chance of catastrophe by more than that. And of course, for older people, the gamble is worse still.”
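To make the actuarial point concrete, here is a minimal sketch of that arithmetic in Python. The annual death probabilities are assumed round numbers in the right ballpark for US males in their thirties, not figures pulled from any particular official life table:

```python
# Rough back-of-the-envelope for the ~3% figure above. The annual death
# probabilities are assumed approximate values for US males aged 30-39,
# not numbers taken from any specific official life table.
annual_death_prob = {
    30: 0.0020, 31: 0.0021, 32: 0.0022, 33: 0.0023, 34: 0.0025,
    35: 0.0026, 36: 0.0028, 37: 0.0029, 38: 0.0031, 39: 0.0033,
}

survival = 1.0
for age in sorted(annual_death_prob):
    survival *= 1 - annual_death_prob[age]  # survive each year in turn

print(f"Assumed chance of dying before age 40: {1 - survival:.1%}")
# With these assumed inputs the cumulative risk comes out around 2.5%,
# broadly consistent with the "nearly 3%" figure quoted in the dialogue.
```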
I have so many axes of disagreement that it is hard to figure out which one is most relevant. I guess let’s go one by one.
Me: “What do you mean when you say AIs might be unaligned with human values?”
I would say that pretty much every agent other than me (and probably me at different times and in different moods) is “misaligned” with me, in the sense that I would not like a world where they get to dictate everything that happens without consulting me in any way.
This is a quibble, because in fact I think that if many people were put in such a position, they would ask others what they want and try to make it happen.
Consider a random retirement home. Compared to the rest of the world, it has basically no power. If the rest of humanity decided to destroy or loot the retirement home, there would be virtually no serious opposition.
This hypothetical assumes too much, because people outside care about the lovely people in the retirement home, and they represent their interests. The question is, will some future AIs with relevance and power care for humans, as humans become obsolete?
I think this is relevant, because in the current world there is a lot of variety. There are people who care about retirement homes and people who don’t. The people who care about retirement homes work hard to make sure retirement homes are well cared for.
But we could imagine a future world where the AI that pulls ahead of the pack is very indifferent to humans, while the AI that cares about humans falls behind; perhaps this is because caring about humans puts you at a disadvantage (if you are not willing to squish humans in your territory, your space to build servers gets reduced or something; I think this is unlikely but possible), and/or because there is a winner-take-all mechanism and the first AI systems that get there coincidentally don’t care about humans (unlikely but possible). Then we would be without representation and possibly in quite a sucky situation.
I’m asking why it matters morally. Why should I care if a human takes my place after I die compared to an AI?
Stop that train, I do not want to be replaced by either human or AI. I want to be in the future and have relevance, or at least be empowered through agents that represent my interests.
I also want my fellow humans to be there, if they want to, and have their own interests be represented.
Humans seem to get their moral values from cultural learning and emulation, which seems broadly similar to the way that AIs will get their moral values.
I don’t think AIs learn in a similar way to humans, and future AI might learn in an even more dissimilar way. The argument I would find more persuasive is pointing out that humans learn in different ways from one another, from very different data and situations, and yet end up with similar values that include caring for one another. That I find suggestive, though it’s hard to be confident.
EA: “Their utility functions would not overlap with our utility functions.”
Me: “By that definition, humans are already unaligned with each other. Any given person has almost a completely non-overlapping utility function with a random stranger. People—through their actions rather than words—routinely value their own life and welfare thousands of times higher than that of strangers. Yet even with this misalignment, the world does not end in any strong sense.”
EA: “Sure, but that’s because humans are all roughly the same intelligence and/or capability. Future AIs will be way smarter and more capable than humans.”
Just for the record, this is when I got off the train for this dialogue. I don’t think humans are misaligned with each other in the relevant ways, and if I could press a button to have the universe be optimized by a random human’s coherent extrapolated volition, then that seems great and thousands of times better than what I expect to happen with AI-descendants. I believe this for a mixture of game-theoretic reasons and genuinely thinking that other humans’ values really do capture most of what I care about.
In this part of the dialogue, when I talk about the utility function of a human, I mean roughly their revealed preferences, rather than their coherent extrapolated volition (which I also think is underspecified). This is important because our revealed preferences better predict our actual behavior, and the point I’m making is simply that behavioral misalignment in this sense is common among humans. This fact also does not automatically imply that the world will end for a given group within humanity.
This is missing a very important point, which is that I think humans have morally relevant experience and I’m not confident that misaligned AIs would. When the next generation replaces the current one this is somewhat ok because those new humans can experience joy, wonder, adventure etc. My best guess is that AIs that take over and replace humans would not have any morally relevant experience, and basically just leave the universe morally empty. (Note that this might be an ok outcome if by default you expect things to be net negative)
I also think that there is way more overlap in the “utility functions” between humans, than between humans and misaligned AIs. Most humans feel empathy and don’t want to cause others harm. I think humans would generally accept small costs to improve the lives of others, and a large part of why people don’t do this is because people have cognitive biases or aren’t thinking clearly. This isn’t to say that any random human would reflectively become a perfectly selfless total utilitarian, but rather that most humans do care about the wellbeing of other humans. By default, I don’t think misaligned AIs will really care about the wellbeing of humans at all.
My best guess is that AIs that take over and replace humans would not have any morally relevant experience, and basically just leave the universe morally empty.
I don’t think that’s particularly likely, but I can understand if you think this is an important crux.
For what it’s worth, I don’t think it matters as much whether the AIs themselves are sentient, but rather whether they care about sentience. For example, from the perspective of sentience, humans weren’t necessarily a great addition to the world, because of their contribution to suffering in animal agriculture (although I’m not giving a confident take here).
Even if AIs are not sentient, they’ll still be responsible for managing the world, and creating structures in the universe. When this happens, there’s a lot of ways for sentience to come about, and I care more about the lower level sentience that the AI manages than the actual AIs at the top who may or may not be sentient.
let’s assume that when AIs take over the world in the future, they’ll kill us and take our place. Here’s a more fundamental objection: how exactly is that morally different from the fact that young humans already ‘take over’ the world from older people during every generation
I think this is a big moral difference: We do not actively kill the older humans so that we can take over. We care about older people, and societies that are rich enough spend some resources to keep older people alive longer.
The entirety of humanity being killed and replaced by the kind of AI that places so little moral value on us humans would be catastrophically bad, compared to things that are currently occurring.
I find it slightly strange that EAs aren’t emphasizing semiconductor investments more given our views about AI.
(Maybe this is because of a norm against giving investment advice? This would make sense to me, except that there’s also a cultural norm about criticizing charities that people donate to, and EAs seemed to blow right through that one.)
I commented on this topic last year. Later, I was informed that some people have been thinking about this and acting on it to some extent, but overall my impression is that there’s still a lot of potential value left on the table. I’m really not sure though.
Since I might be wrong and I don’t really know what the situation is with EAs and semiconductor investments, I thought I’d just spell out the basic argument, and see what people say:
Credible models of economic growth predict that, if AI can substitute for human labor, then we should expect the year-over-year world economic growth rate to dramatically accelerate, probably to at least 30% and maybe to rates as high as 300% or 3000%.
This rate of growth should be sustainable for a while before crashing, since physical limits appear to permit far more economic value than we’re currently generating. For example, at our current energy intensity of approximately 5.6 megajoules per dollar of output, capturing the yearly energy output of the sun would allow us to generate an economy worth roughly $6.8×10^25, more than 100 billion times the size of our current economy. (I sketch the compounding arithmetic at the end of this list of points.)
If AI drives this economic productivity explosion, it seems likely that the companies manufacturing computer hardware (i.e. semiconductor companies) will benefit greatly in the midst of all of this. Very little of this seems priced in right now, although I admit I haven’t done any rigorous calculations to prove that.
I agree it’s hard to know who will capture most of the value from the AI revolution, but semiconductor companies, and in particular the companies responsible for designing and manufacturing GPUs, seem like a safer bet than almost anyone else.
I agree it’s possible that the existing public companies will be unseated by private competitors and so investing in the public companies risks losing everything, but my understanding is that semiconductor companies have a large moat and are hard to unseat.
I agree it’s possible that the government will nationalize semiconductor production, but they won’t necessarily steal all the profits from investors before doing so.
I agree that EAs should avoid being too heavily invested in one single asset (e.g. crypto) but how much is EA actually invested in semiconductor stocks? Is this actually a concern right now, or is it just a hypothetical concern? Also, investing in Anthropic seems like a riskier bet since it’s less diversified than a broad semiconductor portfolio, and could easily go down in flames.
I agree that AI might hasten the arrival of some sort of post-property-rights system of governance in which investments don’t have any meaning anymore, but I haven’t seen any strong arguments for this. It seems more likely that e.g. tax rates will go way up, but people still own property.
In general, I agree that there are many uncertainties that this question is riding on, but that’s the same thing with any other thing EA does. Any particular donation to AI safety research, for example, is always uncertain and might be a waste of time.
Investing in semiconductor companies plausibly accelerates AI a little bit which is bad to the extent you think acceleration increases x-risk, but if EA gets a huge payout by investing in these companies, then that might cancel out the downsides from accelerating AI?
Another thing I just thought of is that maybe there are good tax reasons to not switch EA investments to semiconductor stocks, which I think would be fair, and I’m not an expert in any of that stuff.
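Here is the compounding sketch referenced above, using the figures already quoted (a roughly $10^14 world economy today and the ~$6.8×10^25 energy-based ceiling). The starting GDP is an assumed round number; the point is only to show how long explosive growth could run before physical limits bind:

```python
import math

current_gwp = 1e14   # assumed round number: ~$100 trillion world economy today
ceiling = 6.8e25     # energy-based ceiling quoted in the argument above

for growth_rate in (0.30, 3.0, 30.0):  # 30%, 300%, 3000% per year
    years = math.log(ceiling / current_gwp) / math.log(1 + growth_rate)
    print(f"At {growth_rate:.0%} annual growth: ~{years:.0f} years until the ceiling")
# Roughly 104 years at 30% growth, 20 years at 300%, and 8 years at 3000%,
# so explosive growth could be sustained for a while before physical limits bind.
```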
I mostly agree with this (and did also buy some semiconductor stock last winter).
Besides plausibly accelerating AI a bit (which I think is a tiny effect at most unless one plans to invest millions), a possible drawback is motivated reasoning (e.g., one may feel less inclined to think critically of the semi industry, and/or less inclined to favor approaches to AI governance that reduce these companies’ revenue). This may only matter for people who work in AI governance, and especially compute governance.
I’m considering writing a post that critically evaluates the concept of a decisive strategic advantage, i.e. the idea that in the future an AI (or set of AIs) will take over the world in a catastrophic way. I think this concept is central to many arguments about AI risk. I’m eliciting feedback on an outline of this post here in order to determine what’s currently unclear or weak about my argument.
The central thesis would be that it is unlikely that an AI, or a unified set of AIs, will violently take over the world in the future, especially at a time when humans are still widely seen as in charge (if it happened later, I don’t think it’s “our” problem to solve, but instead a problem we can leave to our smarter descendants). Here’s how I envision structuring my argument:
First, I’ll define what is meant by a decisive strategic advantage (DSA). The DSA model has 4 essential steps:
At some point in time an AI agent, or an agentic collective of AIs, will be developed that has values that differ from our own, in the sense that the ~optimum of its utility function ranks very low according to our own utility function
When this agent is weak, it will have a convergent instrumental incentive to lie about its values, in order to avoid getting shut down (e.g. “I’m not a paperclip maximizer, I just want to help everyone”)
However, when the agent becomes powerful enough, it will suddenly strike and take over the world
Then, being now able to act without constraint, this AI agent will optimize the universe ruthlessly, which will be very bad for us
We can compare the DSA model to an alternative model of future AI development:
Premise (1)-(2) above in the DSA story are still assumed true, but
There will never be a point (3) and (4), in which a unified AI agent will take over the world, and then optimize the universe ruthlessly
Instead, AI agents will compromise, trade, and act within a system of laws indefinitely, in order to achieve their objectives, similar to what humans do now
Because this system of laws will descend from our current institutions and legal tradition, it is likely that humans will keep substantial legal rights, potentially retaining lots of wealth from our capital investments and property, even if we become relatively powerless compared to other AI agents in the system
I have two main objections to the DSA model.
Objection 1: It is unlikely that there will be a point at which a unified agent will be able to take over the world, given the existence of competing AIs with comparable power
Prima facie, it seems intuitive that no single AI agent will be able to take over the world if there are other competing AI agents in the world. More generally, we can try to predict the distribution of power between AI agents using reference class forecasting.
This could involve looking at:
Distribution of wealth among individuals in the world
Distribution of power among nations
Distribution of revenue among businesses
etc.
In most of these cases, the distribution of power is described by something like a Pareto distribution, and in particular, it seems rare for a single agent to hold something like >80% of the power.
Therefore, a priori we should assign a low probability to the claim that a unified agent will be able to easily take over the whole world in the future.
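As a toy illustration of this reference-class point (not evidence in itself), here is a minimal simulation in which “power” is drawn from a Pareto distribution; the number of agents and the shape parameter are illustrative assumptions, not estimates:

```python
import random

def top_share(num_agents: int, alpha: float) -> float:
    """Share of total 'power' held by the single largest agent,
    where each agent's power is drawn from a Pareto distribution."""
    powers = [random.paretovariate(alpha) for _ in range(num_agents)]
    return max(powers) / sum(powers)

# Illustrative assumptions: 1,000 agents, shape parameter alpha = 1.5
# (in the heavy-tailed range often fit to wealth and firm-size data).
trials = [top_share(1_000, 1.5) for _ in range(2_000)]
frac_dominant = sum(share > 0.8 for share in trials) / len(trials)

print(f"Average top-agent share: {sum(trials) / len(trials):.1%}")
print(f"Fraction of runs where one agent holds >80%: {frac_dominant:.2%}")
# Under these assumed parameters, a single agent holding >80% of the total
# is rare, though heavier tails (alpha near 1) make dominance more common.
```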
To the extent people disagree about the argument I just stated, I expect it’s mostly because they think these reference classes are weak evidence, and they think there are stronger specific object-level points that I need to address. In particular, it seems many people think that AIs will not compete with each other, but instead collude against humans. Their reasons for thinking this include:
The fact that AIs will be able to coordinate well with each other, and thereby choose to “merge” into a single agent
My response: I agree AIs will be able to coordinate with each other, but “ability to coordinate” seems like a continuous variable that we will apply pressure to incrementally, not something that we should expect to be roughly infinite right at the start. Current AIs are not able to “merge” with each other.
If coordination ability increases incrementally over time, then we should see a gradual increase in the concentration of AI agency over time, rather than the sudden emergence of a single unified agent. To the extent this concentration happens incrementally, it will be predictable, the potential harms will be noticeable before getting too extreme, and we can take measures to pull back if we realize that the costs of continually increasing coordination abilities are too high. In my opinion, this makes the challenge here dramatically easier.
In any case, the moment during which we hand over control of the world to AIs will likely occur at a point when the ability for AIs to coordinate is somewhere only modestly above human-level (and very far below perfect).
As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge. We can leave this problem to be solved by our smarter descendants.
The idea that AIs will all be copies of each other, and thus all basically be “a unified agent”
My response: I have two objections.
First, I deny the premise. It seems likely that there will be multiple competing AI projects with different training runs. More importantly, for each pre-training run, it seems likely that there will be differences among deployed AIs due to fine-tuning and post-training enhancements, yielding diversity among AIs in general.
Second, it is unclear why AIs would automatically unify with their copies. I think this idea is somewhat plausible on its face but I have yet to see any strong arguments for it. Moreover, it seems plausible that AIs will have indexical preferences, making them have different values even if they are copies of each other.
The idea that AIs will use logical decision theory
My response: This argument appears to misunderstand what makes coordination difficult. Coordination is not mainly about what decision theory you use. It’s more about being able to synchronize your communication and efforts without waste. See also: the literature on diseconomies of scale.
The idea that a single agent AI will recursively self-improve to become vastly more powerful than everything else in the world
My response: I think this argument, and others like it, suffer from the counterarguments to fast takeoff given by Paul Christiano, Katja Grace, and Robin Hanson, and I largely agree with what they’ve written about it. For example, here’s Paul Christiano’s take.
Maybe AIs will share collective grievances with each other, prompting a natural alliance among them against humans
My response: if true, we can take steps to mitigate this issue. For example, we can give AIs legal rights, lessening their motives to revolt. While I think this is a significant issue, I also think it’s tractable to solve.
Objection 2: Even if a unified agent could take over the world, it is unlikely to be in its best interest to try to do so
The central argument here would be premised on a model of rational agency, in which an agent tries to maximize benefits minus costs, subject to constraints
The agent would be faced with a choice:
(1) Attempt to take over the world, and steal everyone’s stuff, or
(2) Work within a system of compromise, trade, and law, and get very rich within that system, in order to e.g. buy lots of paperclips
The question of whether (1) is a better choice than (2) is not simply a question of whether taking over the world is “easy” or whether it could be done by the agent. Instead it is a question of whether the benefits of (1) outweigh the costs, relative to choice (2).
It seems likely that working within a system of compromise, trade, and law is more efficient than trying to take over the world, even if you could succeed at taking it over. The reason is that subverting the system basically means “going to war” with other parties, which is not usually very efficient, even against weak opponents.
The literature on the economics of war generally predicts that going to war is worse than compromising, assuming both parties are rational and open to compromise. This is mostly because:
War is wasteful. You need to spend resources fighting it, which could be productively spent doing other things.
War is risky. Unless you can win a war with certainty, you might lose the war after launching it, which is a very bad outcome if you have some degree of risk-aversion.
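To make the benefits-minus-costs comparison concrete, here is a minimal sketch; every number in it is a made-up illustrative assumption rather than an estimate:

```python
# Toy benefits-minus-costs comparison of "take over by force" vs. "work within
# the system". Every number here is a made-up illustrative assumption.
total_resources = 100.0        # everything up for grabs, in arbitrary units
share_via_trade = 0.90         # assumed share a very capable agent captures by trading
p_win = 0.95                   # assumed probability a takeover attempt succeeds
war_destruction = 0.20         # assumed fraction of resources destroyed by the conflict

ev_trade = share_via_trade * total_resources
ev_war = p_win * (1 - war_destruction) * total_resources  # losing is assumed to pay ~0

print(f"Expected value of trading:  {ev_trade:.1f}")
print(f"Expected value of takeover: {ev_war:.1f}")
# With these assumptions trading wins (90 vs. 76): even a high chance of victory
# does not compensate for the waste and risk described above.
```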
The fact that “humans are weak and can be easily beaten” cuts both ways:
Yes, it means that a very powerful AI agent could “defeat all of us combined” (as Holden Karnofsky said)
But it also means that there would be little benefit to defeating all of us, because we aren’t really a threat to its power
Conclusion: An AI decisive strategic advantage is still somewhat plausible because revolutions have happened in history, and revolutions seem like a reasonable reference class to draw from. That said, the probability of a catastrophic AI takeover in humanity’s relatively near-term future (say, the next 50 years) seems low (maybe a 10% chance of happening). However, it’s perhaps significantly more likely in the very long run.
Your argument in objection 1 doesn’t address the position of people who are worried about an absurd offense-defense imbalance.
Additionally: It may be that no agent can take over the world, but that an agent can destroy the world. Would someone build something like that? Sadly, I think the answer is yes.
Oh, I can see why it is ambiguous. I meant whether it is easier to attack or defend, which is separate from the “power” attackers have and defenders have.
“What incentive is there to destroy the world, as opposed to take it over? If you destroy the world, aren’t you sacrificing yourself at the same time?”
Some would be willing to do that if they can’t take it over.
What reason is there to think that AI will shift the offense-defense balance absurdly towards offense? I admit such a thing is possible, but it doesn’t seem like AI is really the issue here. Can you elaborate?
I think the main abstract argument for why this is plausible is that AI will change many things very quickly and in a high-variance way, and some human processes will lag behind heavily.
This could plausibly (though not obviously) lead to offense dominance.
I’m not going to fully answer this question, b/c I have other work I should be doing, but I’ll toss in one argument. If different domains (cyber, bio, manipulation, etc.) have different offense-defense balances, a sufficiently smart attacker will pick the domain with the worst balance. This recurses down further for at least some of these domains, since they aren’t just a single thing but a broad collection of vaguely related things.
I sympathise with/agree with many of your points here (and in general regarding AI x-risk), but something about this recent sequence of quick takes isn’t landing with me in the way some of your other work has. I’ll try to articulate why in some cases, though I apologise if I misread or misunderstand you.
On this post, these two premises/statements raised an eyebrow:
3. Instead, AI agents will compromise, trade, and act within a system of laws indefinitely, in order to achieve their objectives, similar to what humans do now
4. Because this system of laws will descend from our current institutions and legal tradition, it is likely that humans will keep substantial legal rights, potentially retaining lots of wealth from our capital investments and property, even if we become relatively powerless compared to other AI agents in the system
To me, this is just as unsupported as the views of people who are incredibly certain that there will be a ‘treacherous turn’. I get that this is a supposition/alternative hypothesis, but how can you possibly hold a premise that a system of laws will persist indefinitely? This sort of reminds me of the Leahy/Bach discussion where Bach just says “it’s going to align itself with us if it wants to, if it likes us, if it loves us”. I kinda want more than that: if we’re going to build these powerful systems, saying “trust me bro, it’ll follow our laws and norms and love us back” doesn’t sound very convincing to me. (For clarity, I don’t think this is your position or framing, and I’m not a fan of the classic/Yudkowskian risk position. I want to say I find both perspectives unconvincing.)
Secondly, people abide by systems of laws and norms, but we also have many cases where individuals/parties/groups overturned these norms once they had accumulated enough power and no longer felt the need to abide by the existing regime. This doesn’t have to look like the traditional DSA model where humanity gets instantly wiped out, but I don’t see why there couldn’t be a future where an AI makes a move like Sulla’s use of force to overthrow and depower the opposing factions, or the 18 Brumaire.
Objection 1: It is unlikely that there will be a point at which a unified agent will be able to take over the world, given the existence of competing AIs with comparable power
I hold a few core ethical ideas that are extremely unpopular: the idea that we should treat the natural suffering of animals as a grave moral catastrophe, the idea that old age and involuntary death are the number one enemy of humanity, and the idea that we should treat so-called farm animals with a very high level of compassion.
Given the unpopularity of these ideas, you might be tempted to think that the reason they are unpopular is that they are exceptionally counterintuitive. But is that the case? Do you really need a modern education and philosophical training to understand them? Perhaps I shouldn’t blame people for not taking seriously things they lack the background to understand.
Yet I claim that these ideas are not actually counterintuitive: they are the type of thing you would come up with on your own if you had not been conditioned by society to treat them as abnormal. A thoughtful 15-year-old who was somehow educated without human culture would have no trouble taking these ideas seriously. Do you disagree? Let’s put my theory to the test.
In order to test my theory—that caring about wild animal suffering, aging, and animal mistreatment is what you would care about if you were uncorrupted by our culture—we need look no further than the Bible.
It is known that the book of Genesis was written in ancient times, before anyone knew anything of modern philosophy, contemporary norms of debate, science, or advanced mathematics. The writers of Genesis wrote of a perfect paradise, the one that we fell from after we were corrupted. They didn’t know what really happened, of course, so they made stuff up. What is that perfect paradise that they made up?
Death is a sad reality that is ever present in our world, leaving behind tremendous pain and suffering. Tragically, many people shake a fist at God when faced with the loss of a loved one and are left without adequate answers from the church as to death’s existence. Unfortunately, an assumption has crept into the church which sees death as a natural part of our existence and as something that we have to put up with as opposed to it being an enemy
Since creationists believe that humans are responsible for all the evil in the world, they do not make the usual excuse for evil, that it is natural and therefore necessary. They openly call death an enemy, something to be destroyed.
Later,
[If] both humans and animals were originally vegetarian, then death could not have been a part of God’s Creation. Even after the Fall the diet of Adam and Eve was vegetarian (Genesis 3:17–19). It was not until after the Flood that man was permitted to eat animals for food (Genesis 9:3). The Fall in Genesis 3 would best explain the origin of carnivorous animal behavior.
So in the garden, animals did not hurt one another. Humans did not hurt animals. But this article even goes further, and debunks the infamous “plants tho” objection to vegetarianism,
Plants neither feel pain nor die in the sense that animals and humans do as “Plants are never the subject of חָיָה ” (Gerleman 1997, p. 414). Plants are not described as “living creatures” as humans, land animals, and sea creature are (Genesis 1:20–21, 24 and 30; Genesis 2:7; Genesis 6:19–20 and Genesis 9:10–17), and the words that are used to describe their termination are more descriptive such as “wither” or “fade” (Psalm 37:2; 102:11; Isaiah 64:6).
In God’s perfect creation, the one invented by uneducated folks thousands of years ago, we can see that wild animal suffering did not exist, nor did death from old age, or mistreatment of animals.
In this article, I find something so close to my own morality that it strikes me a creationist, of all people, would write something so elegant:
Most animal rights groups start with an evolutionary view of mankind. They view us as the last to evolve (so far), as a blight on the earth, and the destroyers of pristine nature. Nature, they believe, is much better off without us, and we have no right to interfere with it. This is nature worship, which is a further fulfillment of the prophecy in Romans 1 in which the hearts of sinful man have traded worship of God for the worship of God’s creation.
But as people have noted for years, nature is “red in tooth and claw.” Nature is not some kind of perfect, pristine place.
Unfortunately, it continues
And why is this? Because mankind chose to sin against a holy God.
I contend it doesn’t really take a modern education to invent these ethical notions. The truly hard step is accepting that evil is bad even if you aren’t personally responsible.
Some people seem to think the risk from AI comes from AIs gaining dangerous capabilities, like situational awareness. I don’t really agree. I view the main risk as simply arising from the fact that AIs will be increasingly integrated into our world, diminishing human control.
Under my view, the most important thing is whether AIs will be capable of automating economically valuable tasks, since this will prompt people to adopt AIs widely to automate labor. If AIs have situational awareness, but aren’t economically important, that’s not as concerning.
The risk is not so much that AIs will suddenly and unexpectedly take control of the world. It’s that we will voluntarily hand over control to them anyway, and we want to make sure this handoff is handled responsibly.
An untimely coup, while possible, is not necessary.
I have now posted as a comment on Lesswrong my summary of some recent economic forecasts and whether they are underestimating the impact of the coronavirus. You can help me by critiquing my analysis.
A trip to Mars that brought back human passengers also has the chance of bringing back microbial Martian passengers. This could be an existential risk if microbes from Mars harm our biosphere in a severe and irreparable manner.
From Carl Sagan in 1973, “Precisely because Mars is an environment of great potential biological interest, it is possible that on Mars there are pathogens, organisms which, if transported to the terrestrial environment, might do enormous biological damage—a Martian plague, the twist in the plot of H. G. Wells’ War of the Worlds, but in reverse.”
Note that the microbes would not need to have independently arisen on Mars. It could be that they were transported to Mars from Earth billions of years ago (or the reverse occurred). While this issue has been studied by some, my impression is that effective altruists have not looked into this issue as a potential source of existential risk.
A line of inquiry to launch could be to determine whether there are any historical parallels on Earth that could give us insight into whether a Mars-to-Earth contamination would be harmful. The introduction of an invasive species into some region loosely mirrors this scenario, but much tighter parallels might still exist.
Since Mars missions are planned for the 2030s, this risk could arrive earlier than essentially all the other existential risks that EAs normally talk about.
In response to human labor being automated, a lot of people support a UBI funded by a tax on capital. I don’t think this policy is necessarily unreasonable, but if later the UBI gets extended to AIs, this would be pretty bad for humans, whose only real assets will be capital.
As a result, the unintended consequence of such a policy may be to set a precedent for a massive wealth transfer from humans to AIs. This could be good if you are utilitarian and think the marginal utility of wealth is higher for AIs than humans. But selfishly, it’s a big cost.
In this “quick take”, I want to summarize some my idiosyncratic views on AI risk.
My goal here is to list just a few ideas that cause me to approach the subject differently from how I perceive most other EAs view the topic. These ideas largely push me in the direction of making me more optimistic about AI, and less likely to support heavy regulations on AI.
(Note that I won’t spend a lot of time justifying each of these views here. I’m mostly stating these points without lengthy justifications, in case anyone is curious. These ideas can perhaps inform why I spend significant amounts of my time pushing back against AI risk arguments. Not all of these ideas are rare, and some of them may indeed be popular among EAs.)
Skepticism of the treacherous turn: The treacherous turn is the idea that (1) at some point there will be a very smart unaligned AI, (2) when weak, this AI will pretend to be nice, but (3) when sufficiently strong, this AI will turn on humanity by taking over the world by surprise, and then (4) optimize the universe without constraint, which would be very bad for humans.
By comparison, I find it more likely that no individual AI will ever be strong enough to take over the world, in the sense of overthrowing the world’s existing institutions and governments by surprise. Instead, I broadly expect unaligned AIs will integrate into society and try to accomplish their goals by advocating for legal rights, rather than trying to overthrow our institutions by force. Upon attaining legal personhood, unaligned AIs can utilize their legal rights to achieve their objectives, for example by getting a job and trading their labor for property, within the already-existing institutions. Because the world is not zero sum, and there are economic benefits to scale and specialization, this argument implies that unaligned AIs may well have a net-positive effect on humans, as they could trade with us, producing value in exchange for our own property and services.
Note that my claim here is not that AIs will never become smarter than humans. One way of seeing how these two claims are distinguished is to compare my scenario to the case of genetically engineered humans. By assumption, if we genetically engineered humans, they would presumably eventually surpass ordinary humans in intelligence (along with social persuasion ability, and ability to deceive etc.). However, by itself, the fact that genetically engineered humans will become smarter than non-engineered humans does not imply that genetically engineered humans would try to overthrow the government. Instead, as in the case of AIs, I expect genetically engineered humans would largely try to work within existing institutions, rather than violently overthrow them.
AI alignment will probably be somewhat easy: The most direct and strongest current empirical evidence we have about the difficulty of AI alignment, in my view, comes from existing frontier LLMs, such as GPT-4. Having spent dozens of hours testing GPT-4′s abilities and moral reasoning, I think the system is already substantially more law-abiding, thoughtful and ethical than a large fraction of humans. Most importantly, this ethical reasoning extends (in my experience) to highly unusual thought experiments that almost certainly did not appear in its training data, demonstrating a fair degree of ethical generalization, beyond mere memorization.
It is conceivable that GPT-4′s apparently ethical nature is fake. Perhaps GPT-4 is lying about its motives to me and in fact desires something completely different than what it professes to care about. Maybe GPT-4 merely “understands” or “predicts” human morality without actually “caring” about human morality. But while these scenarios are logically possible, they seem less plausible to me than the simple alternative explanation that alignment—like many other properties of ML models—generalizes well, in the natural way that you might similarly expect from a human.
Of course, the fact that GPT-4 is easily alignable does not immediately imply that smarter-than-human AIs will be easy to align. However, I think this current evidence is still significant, and aligns well with prior theoretical arguments that alignment would be easy. In particular, I am persuaded by the argument that, because evaluation is usually easier than generation, it should be feasible to accurately evaluate whether a slightly-smarter-than-human AI is taking bad actions, allowing us to shape its rewards during training accordingly. After we’ve aligned a model that’s merely slightly smarter than humans, we can use it to help us align even smarter AIs, and so on, plausibly implying that alignment will scale to indefinitely higher levels of intelligence, without necessarily breaking down at any physically realistic point.
The default social response to AI will likely be strong: One reason to support heavy regulations on AI right now is if you think the natural “default” social response to AI will lean too heavily on the side of laissez faire than optimal, i.e., by default, we will have too little regulation rather than too much. In this case, you could believe that, by advocating for regulations now, you’re making it more likely that we regulate AI a bit more than we otherwise would have, pushing us closer to the optimal level of regulation.
I’m quite skeptical of this argument because I think that the default response to AI (in the absence of intervention from the EA community) will already be quite strong. My view here is informed by the base rate of technologies being overregulated, which I think is quite high. In fact, it is difficult for me to name even a single technology that I think is currently clearly underregulated by society. By pushing for more regulation on AI, I think it’s likely that we will overshoot and over-constrain AI relative to the optimal level.
In other words, my personal bias is towards thinking that society will regulate technologies too heavily, rather than too loosely. And I don’t see a strong reason to think that AI will be any different from this general historical pattern. This makes me hesitant to push for more regulation on AI, since on my view, the marginal impact of my advocacy would likely be to push us even further in the direction of “too much regulation”, overshooting the optimal level by even more than what I’d expect in the absence of my advocacy.
I view unaligned AIs as having comparable moral value to humans: The basic idea behind this point is that, under various physicalist views of consciousness, you should expect AIs to be conscious, even if they do not share human preferences. Moreover, it seems likely that AIs — even ones that don’t share human preferences — will be pretrained on human data, and therefore largely share our social and moral concepts.
Since unaligned AIs will likely be both conscious and share human social and moral concepts, I don’t see much reason to think of them as less “deserving” of life and liberty, from a cosmopolitan moral perspective. They will likely think similarly to the way we do across a variety of relevant axes, even if their neural structures are quite different from our own. As a consequence, I am pretty happy to incorporate unaligned AIs into the legal system and grant them some control of the future, just as I’d be happy to grant some control of the future to human children, even if they don’t share my exact values.
Put another way, I view (what I perceive as) the EA attempt to privilege “human values” over “AI values” as being largely arbitrary and baseless, from an impartial moral perspective. There are many humans whose values I vehemently disagree with, but I nonetheless respect their autonomy, and do not wish to deny these humans their legal rights. Likewise, even if I strongly disagreed with the values of an advanced AI, I would still see value in their preferences being satisfied for their own sake, and I would try to respect the AI’s autonomy and legal rights. I don’t have a lot of faith in the inherent kindness of human nature relative to a “default unaligned” AI alternative.
I’m not fully committed to longtermism: I think AI has an enormous potential to benefit the lives of people who currently exist. I predict that AIs can eventually substitute for human researchers, and thereby accelerate technological progress, including in medicine. In combination with my other beliefs (such as my belief that AI alignment will probably be somewhat easy), this view leads me to think that AI development will likely be net-positive for people who exist at the time of its development. In other words, if we allow AI development, it is likely that we can use AI to reduce human mortality, and dramatically raise human well-being for the people who already exist.
I think these benefits are large and important, and commensurate with the downside potential of existential risks. While a fully committed strong longtermist might scoff at the idea that curing aging might be important — as it would largely only have short-term effects, rather than long-term effects that reverberate for billions of years — by contrast, I think it’s really important to try to improve the lives of people who currently exist. Many people view this perspective as a form of moral partiality that we should discard for being arbitrary. However, I think morality is itself arbitrary: it can be anything we want it to be. And I choose to value currently existing humans, to a substantial (though not overwhelming) degree.
This doesn’t mean I’m a fully committed near-termist. I sympathize with many of the intuitions behind longtermism. For example, if curing aging required raising the probability of human extinction by 40 percentage points, or something like that, I don’t think I’d do it. But in more realistic scenarios that we are likely to actually encounter, I think it’s plausibly a lot better to accelerate AI, rather than delay AI, on current margins. This view simply makes sense to me given the enormously positive effects I expect AI will likely have on the people I currently know and love, if we allow development to continue.
I want to say thank you for holding the pole of these perspectives and keeping them in the dialogue. I think that they are important and it’s underappreciated in EA circles how plausible they are.
(I definitely don’t agree with everything you have here, but typically my view is somewhere between what you’ve expressed and what is commonly expressed in x-risk focused spaces. Often also I’m drawn to say “yeah, but …”—e.g. I agree that a treacherous turn is not so likely at global scale, but I don’t think it’s completely out of the question, and given that I think it’s worth serious attention safeguarding against.)
Explicit +1 to what Owen is saying here.
(Given that I commented with some counterarguments, I thought I would explicitly note my +1 here.)
The obvious example would be synthetic biology, gain-of-function research, and similar.
I also think AI itself is currently massively underregulated even entirely ignoring alignment difficulties. I think the probability of the creation of AI capable of accelerating AI R&D by 10x this year is around 3%. It would be extremely bad for US national interests if such an AI was stolen by foreign actors. This suffices for regulation ensuring very high levels of security IMO. And this is setting aside ongoing IP theft and similar issues.
Can you explain why you suspect these things should be more regulated than they currently are?
This reasoning seems to imply that you could use GPT-2 to oversee GPT-4 by bootstrapping through a chain of models at scales between GPT-2 and GPT-4. However, this isn’t true: the weak-to-strong generalization paper finds that this doesn’t work, and indeed that bootstrapping like this doesn’t help at all for ChatGPT reward modeling (it helps on chess puzzles and, I believe, for nothing else they investigate).
I think this sort of bootstrapping argument might work if we could ensure that each model in the chain was sufficiently aligned and capable of reasoning that it would carefully reason about what humans would want if they were more knowledgeable, and then rate outputs based on this. However, I don’t think GPT-4 is either aligned enough or capable enough that we see this behavior. And I still think it’s unlikely to work under these generous assumptions (though I won’t argue for this here).
I’m curious why there hasn’t been more work exploring a pro-AI or pro-AI-acceleration position from an effective altruist perspective. Some points:
Unlike existential risk from other sources (e.g. an asteroid) AI x-risk is unique because humans would be replaced by other beings, rather than completely dying out. This means you can’t simply apply a naive argument that AI threatens total extinction of value to make the case that AI safety is astronomically important, in the sense that you can for other x-risks. You generally need additional assumptions.
Total utilitarianism is generally seen as non-speciesist, and therefore has no intrinsic preference for human values over unaligned AI values. If AIs are conscious, there don’t appear to be strong prima facie reasons for preferring humans to AIs under hedonistic utilitarianism. Under preference utilitarianism, it doesn’t necessarily matter whether AIs are conscious.
Total utilitarianism generally recommends large population sizes. Accelerating AI can be modeled as a kind of “population accelerationism”. Extremely large AI populations could be preferable under utilitarianism compared to small human populations, even those with high per-capita incomes. Indeed, human populations have recently stagnated due to low population growth rates, and AI promises to lift this bottleneck.
Therefore, AI accelerationism seems straightforwardly recommended by total utilitarianism under some plausible theories.
Here’s a non-exhaustive list of guesses for why I think EAs haven’t historically been sympathetic to arguments like the one above, and have instead generally advocated AI safety over AI acceleration (at least when these two values conflict):
A belief that AIs won’t be conscious, and therefore won’t have much moral value compared to humans.
But why would we assume AIs won’t be conscious? For example, if Brian Tomasik is right, consciousness is somewhat universal, rather than being restricted to humans or members of the animal kingdom.
I also haven’t actually seen much EA literature defend this assumption explicitly, which would be odd if this belief is the primary reason EAs have for focusing on AI safety over AI acceleration.
A presumption in favor of human values over unaligned AI values for some reasons that aren’t based on strict impartial utilitarian arguments. These could include the beliefs that: (1) Humans are more likely to have “interesting” values compared to AIs, and (2) Humans are more likely to be motivated by moral arguments than AIs, and are more likely to reach a deliberative equilibrium of something like “ideal moral values” compared to AIs.
Why would humans be more likely to have “interesting” values than AIs? It seems very plausible that AIs will have interesting values even if their motives seem alien to us. AIs might have even more “interesting” values than humans.
It seems to me like wishful thinking to assume that humans are strongly motivated by moral arguments and would settle upon something like “ideal moral values”
A belief that population growth is inevitable, so it is better to focus on AI safety.
But a central question here is why pushing for AI safety—in the sense of AI research that enhances human interests—is better than the alternative on the margin. What reason is there to think AI safety now is better than pushing for greater AI population growth now? (Potential responses to this question are outlined in other bullet points above and below.)
AI safety has lasting effects due to a future value lock-in event, whereas accelerationism would have, at best, temporary effects.
Are you sure there will ever actually be a “value lock-in event”?
Even if there is at some point a value lock-in event, wouldn’t pushing for accelerationism also plausibly affect the values that are locked in? For example, the value of “population growth is good” seems more likely to be locked in, if you advocate for that now.
A belief that humans would be kinder and more benevolent than unaligned AIs.
Humans seem pretty bad already. For example, humans are responsible for factory farming. It’s plausible that AIs could be even more callous and morally indifferent than humans, but the bar already seems low.
I’m also not convinced that moral values will be a major force shaping “what happens to the cosmic endowment”. It seems to me that the forces shaping economic consumption matter more than moral values.
A bedrock heuristic that it would be extraordinarily bad if “we all died from AI”, and therefore we should pursue AI safety over AI accelerationism.
But it would also be bad if we all died from old age while waiting for AI, and missed out on all the benefits that AI offers to humans, which is a point in favor of acceleration. Why would this heuristic be weaker?
An adherence to person-affecting views in which the values of currently-existing humans are what matter most; and a belief that AI threatens to kill existing humans.
But in this view, AI accelerationism could easily be favored since AIs could greatly benefit existing humans by extending our lifespans and enriching our lives with advanced technology.
An implicit acceptance of human supremacism, i.e. the idea that what matters is propagating the interests of the human species, or preserving the human species, even at the expense of individual interests (either within humanity or outside humanity) or the interests of other species.
But isn’t EA known for being unusually anti-speciesist compared to other communities? Peter Singer is often seen as a “founding father” of the movement, and a huge part of his ethical philosophy was about how we shouldn’t be human supremacists.
More generally, it seems wrong to care about preserving the “human species” in an abstract sense relative to preserving the current generation of actually living humans.
A belief that most humans are biased towards acceleration over safety, and therefore it is better for EAs to focus on safety as a useful correction mechanism for society.
But was an anti-safety bias common for previous technologies? I think something closer to the opposite is probably true: most humans seem, if anything, biased towards being overly cautious about new technologies rather than overly optimistic.
A belief that society is massively underrating the potential for AI, which favors extra work on AI safety, since it’s so neglected.
But if society is massively underrating AI, shouldn’t this also favor accelerating AI? There doesn’t seem to be an obvious asymmetry between these two responses.
An adherence to negative utilitarianism, which would favor obstructing AI, along with any other technology that could enable the population of conscious minds to expand.
This seems like a plausible moral argument to me, but it doesn’t seem like a very popular position among EAs.
A heuristic that “change is generally bad” and AI represents a gigantic change.
I don’t think many EAs would defend this heuristic explicitly.
Added: AI represents a large change to the world. Delaying AI therefore preserves option value.
This heuristic seems like it would have favored advocating delaying the industrial revolution, and all sorts of moral, social, and technological changes to the world in the past. Is that a position that EAs would be willing to bite the bullet on?
My understanding is that relatively few EAs are actual hardcore classic hedonist utilitarians. I think this is ~sufficient to explain why more haven’t become accelerationists.
Have you cornered a classic hedonist utilitarian EA and asked them? Have you cornered three? What did they say?
Don’t know why this is being disagree-voted. I think point 1 is basically correct: you don’t have to diverge far from being a “hardcore classic hedonist utilitarian” to not support the case Matthew makes in the OP.
I think a more important reason is the additional value of information and of option value. It’s very likely that the change resulting from AI development will be irreversible. Since we’re still able to learn about AI as we study it, taking additional time to think and plan before training the most powerful AI systems seems to reduce the likelihood of being locked into suboptimal outcomes. Increasing the likelihood of achieving “utopia” rather than landing in “mediocrity” by 2 percent seems far more important than speeding up utopia by 10 years.
I think all actions are in a sense irreversible, but large changes tend to be less reversible than small changes. In this sense, the argument you gave seems reducible to “we should generally delay large changes to the world, to preserve option value”. Is that a reasonable summary?
In this case I think it’s just not obvious that delaying large changes is good. Would it have been good to delay the industrial revolution to preserve option value? I think this heuristic, if used in the past, would have generally demanded that we “pause” all sorts of social, material, and moral progress, which seems wrong.
I don’t think we would have been able to use the additional information gained by delaying the industrial revolution, but if we could have, the answer might be “yes”. It’s easy to see in hindsight that it went well overall, but that doesn’t mean the correct ex ante attitude shouldn’t have been caution!
Paul Christiano wrote a piece a few years ago about ensuring that misaligned ASI is a “good successor” (in the moral value sense),[1] as a plan B to alignment (Medium version; LW version). I agree it’s odd that there hasn’t been more discussion since.[2]
I’ve wondered about this myself. My take is that this area was overlooked a year ago, but there’s now some good work being done. See Jeff Sebo’s Nov ’23 80k podcast episode, as well as Rob Long’s episode, and the paper that the two of them co-authored at the end of last year: “Moral consideration for AI systems by 2030”. Overall, I’m optimistic about this area becoming a new forefront of EA.
I’m confused by this point, and for me this is the overriding crux between my view and yours. Do you really not think accelerationism could have permanent effects, through making AI takeover, or some other irredeemable outcome, more likely?
I’m not sure there’ll be a lock-in event, in the way I can’t technically be sure about anything, but such an event seems clearly probable enough that I very much want to avoid taking actions that bring it closer. (Insofar as bringing the event closer raises the chance it goes badly, which I believe to be a likely dynamic. See, for example, the Metaculus question, “How does the level of existential risk posed by AGI depend on its arrival time?”, or discussion of the long reflection.)
Although, Paul’s argument routes through acausal cooperation—see the piece for details—rather than through the ASI being morally valuable in itself. (And perhaps OP means to focus on the latter issue.) In Paul’s words:
There was a little discussion a few months ago, here, but none of what was said built on Paul’s article.
It’s worth emphasizing that moral welfare of digital minds is quite a different (though related) topic to whether AIs are good successors.
Fair point, I’ve added a footnote to make this clearer.
Under purely longtermist views, accelerating AI by 1 year increases available cosmic resources by 1 part in 10 billion. This is tiny. So the first order effects of acceleration are tiny from a longtermist perspective.
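(To make the arithmetic behind that figure explicit, here is a minimal sketch. Assume, purely as a modeling assumption, that cosmic resources become permanently unreachable at a roughly constant relative rate $r$ per year due to the expansion of the universe. Then shifting AI’s arrival earlier by $\Delta t$ years changes the available endowment $V$ by roughly

$$\frac{\Delta V}{V} \approx r \, \Delta t \approx 10^{-10} \quad \text{for } r \approx 10^{-10}\ \text{yr}^{-1},\ \Delta t = 1\ \text{yr}.$$

The value of $r$ is the assumption doing all the work here; the qualitative point is only that a one-year shift is a tiny fraction of the whole endowment.)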
Thus, a purely longtermist perspective doesn’t care about the direct effects of delay/acceleration and the question would come down to indirect effects.
I can see indirect effects going either way, but delay seems better on current margins (this might depend on how much optimism you have about current AI safety progress, governance/policy progress, and whether you think humanity retaining control relative to AIs is good or bad). All of these topics have been explored and discussed to some extent.
When focusing on the welfare/preferences of currently existing people, I think it’s unclear whether accelerating AI looks good or bad; it depends on optimism about AI safety, how you trade off old people versus young people, and death via violence versus death from old age. (Misaligned AI takeover killing lots of people is by no means assured, but seems reasonably likely by default.)
I expect there hasn’t been much investigation of accelerating AI to advance the preferences of currently existing people because this exists at a point on the crazy train that very few people are at. See also the curse of cryonics:
Tiny compared to what? Are you assuming we can take some other action whose consequences don’t wash out over the long-term, e.g. because of a value lock-in? In general, these assumptions just seem quite weak and underspecified to me.
What exactly is the alternative action that has vastly greater value in expectation, and why does it have greater value? If what you mean is that we can try to reduce the risk of extinction instead, keep in mind that my first bullet point preempted pretty much this exact argument:
Ensuring human control throughout the singularity rather than having AIs get control very obviously has relatively massive effects. Of course, we can debate the sign here, I’m just making a claim about the magnitude.
I’m not talking about extinction of all smart beings on earth (AIs and humans), which seems like a small fraction of existential risk.
(Separately, the badness of such extinction seems maybe somewhat overrated, because intelligent life would pretty likely just re-evolve within the next 300 million years. Intelligent life doesn’t seem that contingent. Also aliens.)
For what it’s worth, I think my reply to Pablo here responds to your comment fairly adequately too.
I think it remains the case that the value of accelerating AI progress is tiny relative to other apparently available interventions, such as ensuring that AIs are sentient or improving their expected well-being conditional on their being sentient. The case for focusing on how a transformative technology unfolds, rather than on when it unfolds,[1] seems robust to a relatively wide range of technologies and assumptions. Still, this seems worth further investigation.
Indeed, it seems that when the transformation unfolds is primarily important because of how it unfolds, insofar as the quality of a transformation is partly determined by its timing.
I’m claiming that it is not actually clear that we can take actions that don’t merely wash out over the long-term. In this case, you cannot simply assume that we can meaningfully and predictably affect how valuable the long-term future will be in, for example, billions of years. I agree that, yes, if you assume we can meaningfully affect the very long-run, then all actions that merely have short-term effects will have “tiny” impacts by comparison. But the assumption that we can meaningfully and predictably affect the long-run is precisely the thing that needs to be argued. I think it’s important for EAs to try to be more rigorous about their empirical claims here.
Moreover, actions that have short-term effects can generally be assumed to have longer term effects if our actions propagate. For example, support for larger population sizes now would presumably increase the probability that larger population sizes exist in the very long run, compared to the alternative of smaller population sizes with high per capita incomes. It seems arbitrary to assume this effect will be negligible but then also assume other competing effects won’t be negligible. I don’t see any strong arguments for this position.
I was trying to hint at prima facie plausible ways in which the present generation can increase the value of the long-term future by more than one part in billions, rather than “assume” that this is the case, though of course I never gave anything resembling a rigorous argument.
I do agree that the “washing out” hypothesis is a reasonable default and that one needs a positive reason for expecting our present actions to persist into the long-term. One seemingly plausible mechanism is influencing how a transformative technology unfolds: it seems that the first generation that creates AGI has significantly more influence on how much artificial sentience there is in the universe a trillion years from now than, say, the millionth generation. Do you disagree with this claim?
I’m not sure I understand the point you make in the second paragraph. What would be the predictable long-term effects of hastening the arrival of AGI in the short-term?
As I understand, the argument originally given was that there was a tiny effect of pushing for AI acceleration, which seems outweighed by unnamed and gigantic “indirect” effects in the long-run from alternative strategies of improving the long-run future. I responded by trying to get more clarity on what these gigantic indirect effects actually are, how we can predictably bring them about, and why we would think it’s plausible that we could bring them about in the first place. From my perspective, the shape of this argument looks something like:
Your action X has this tiny positive near-term effect (ETA: or a tiny direct effect)
My action Y has this large positive long-term effect (ETA: or a large indirect effect)
Therefore, Y is better than X.
Do you see the flaw here? Well, both X and Y could have long-term effects! So, it’s not sufficient to compare the short-term effect of X to the long-term effect of Y. You need to compare both effects, on both time horizons. As far as I can tell, I haven’t seen any argument in this thread that analyzed and compared the long-term effects in any detail, except perhaps in Ryan Greenblatt’s original comment, in which he linked to some other comments about a similar topic in a different thread (but I still don’t see what the exact argument is).
More generally, I think you’re probably trying to point to some concept you think is obvious and clear here, and I’m not seeing it, which is why I’m asking you to be more precise and rigorous about what you’re actually claiming.
In my original comment I pointed towards a mechanism. Here’s a more precise characterization of the argument:
Total utilitarianism generally supports, all else being equal, larger population sizes with low per capita incomes over small population sizes with high per capita incomes.
To the extent that our actions do not “wash out”, it seems reasonable to assume that pushing for large population sizes now would make it more likely in the long-run that we get large population sizes with low per-capita incomes compared to a small population size with high per capita incomes. (Keep in mind here that I’m not making any claim about the total level of resources.)
To respond to this argument you could say that in fact our actions do “wash out” here, so as to make the effect of pushing for larger population sizes rather small in the long run. But in response to that argument, I claim that this objection can be reversed and applied to almost any alternative strategy for improving the future that you might think is actually better. (In other words, I actually need to see your reasons for why there’s an asymmetry here; and I don’t currently see these reasons.)
Alternatively, you could just say that total utilitarianism is unreasonable and a bad ethical theory, but my original comment was about analyzing the claim about accelerating AI from the perspective of total utilitarianism, which, as a theory, seems to be relatively popular among EAs. So I’d prefer to keep this discussion grounded within that context.
Thanks for the clarification.
Yes, I agree that we should consider the long-term effects of each intervention when comparing them. I focused on the short-term effects of hastening AI progress because it is those effects that are normally cited as the relevant justification in EA/utilitarian discussions of that intervention. For instance, those are the effects that Bostrom considers in ‘Astronomical waste’. Conceivably, there is a separate argument that appeals to the beneficial long-term effects of AI capability acceleration. I haven’t considered this argument because I haven’t seen many people make it, so I assume that accelerationist types tend to believe that the short-term effects dominate.
I think Bostrom’s argument merely compares a pure x-risk (such as a huge asteroid hurtling towards Earth) relative to technological acceleration, and then concludes that reducing the probability of a pure x-risk is more important because the x-risk threatens the eventual colonization of the universe. I agree with this argument in the case of a pure x-risk, but as I noted in my original comment, I don’t think that AI risk is a pure x-risk.
If, by contrast, all we’re doing by doing AI safety research is influencing something like “the values of the agents in society in the future” (and not actually influencing the probability of eventual colonization), then this action seems to plausibly just wash out in the long-term. In this case, it seems very appropriate to compare the short-term effects of AI safety to the short-term effects of acceleration.
Let me put it another way. We can think about two (potentially competing) strategies for making the future better, along with their relevant short and possible long-term effects:
Doing AI safety research
Short-term effects: makes it more likely that AIs are kind to current or near-future humans
Possible long-term effect: makes it more likely that AIs in the very long-run will share the values of the human species, relative to some unaligned alternative
Accelerating AI
Short-term effect: helps current humans by hastening the arrival of advanced technology
Possible long-term effect: makes it more likely that we have a large population size at low per capita incomes, relative to a low population size with high per capita income
My opinion is that both of these long-term effects are very speculative, so it’s generally better to focus on a heuristic of doing what’s better in the short-term, while keeping the long-term consequences in mind. And when I do that, I do not come to a strong conclusion that AI safety research “beats” AI acceleration, from a total utilitarian perspective.
To be clear, this wasn’t the structure of my original argument (though it might be Pablo’s). My argument was more like “you seem to be implying that action X is good because of its direct effect (literal first order acceleration), but actually the direct effect is small when considered from a particular perspective (longtermism), so from that perspective we need to consider indirect effects, and the analysis of those looks pretty different”.
Note that I wasn’t really trying to argue much about the sign of the indirect effect, though people have indeed discussed this in some detail in various contexts.
I agree your original argument was slightly different than the form I stated. I was speaking too loosely, and conflated what I thought Pablo might be thinking with what you stated originally.
I think the important claim from my comment is “As far as I can tell, I haven’t seen any argument in this thread that analyzed and compared the long-term effects in any detail, except perhaps in Ryan Greenblatt’s original comment, in which he linked to some other comments about a similar topic in a different thread (but I still don’t see what the exact argument is).”
Explicitly confirming that this seems right to me.
I don’t disagree with this. I was just claiming that the “indirect” effects dominate (by indirect, I just mean effects other than shifting the future closer in time).
There is still the question of indirect/direct effects.
I understand that. I wanted to know why you thought that. I’m asking for clarity. I don’t currently understand your reasons. See this recent comment of mine for more info.
(I don’t think I’m going to engage further here, sorry.)
I generally agree that we should be more concerned about this. In particular, I find the reasoning of people who happily endorse “Shut Up and Multiply” sentiment but reject this consideration suspect.
A more extreme version of this is that, given the massively greater efficiency with which a digital consciousness could convert matter and energy to utilons (IIRC naively about 3 orders of magnitude according to Bostrom, before any increase from greater coordination), on strict expected value reasoning you have to be extremely confident that this won’t happen—or at least have a much stronger rebuttal than ‘AI won’t necessarily be conscious’.
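(A rough sketch of that expected value reasoning, treating all numbers as illustrative: suppose digital minds convert resources into welfare about $E \approx 10^3$ times more efficiently than biological minds, as per the naive figure above, and let $p$ be your credence that digital minds are moral patients at all. Then, relative to giving resources to humans (whose moral patienthood we take as given), resources given to digital minds win in expectation whenever

$$p \cdot E > 1 \quad\Longleftrightarrow\quad p > \tfrac{1}{E} \approx 0.001,$$

i.e. you would need to be more than about 99.9% confident that digital minds don’t matter before the efficiency advantage stops dominating.)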
Separately, I think there might be a case for accelerationism even if you think it increases the risk of AI takeover and that AI takeover is bad, on the grounds that in many scenarios advancing faster might still increase the probability of human descendants getting through the time of perils before some other threat destroys us (every year we remain in our current state is another year in which we run the risk of, for example, a global nuclear war or civilisation-ending pandemic).
Hi,
I have a post where I conclude the above may well apply not only to digital consciousness, but also to animals:
A lot of these points seem like arguments that it’s possible that unaligned AI takeover will go well, e.g. there’s no reason not to think that AIs are conscious, or will have interesting moral values, or etc.
My stance is that we (more-or-less) know humans are conscious and have moral values that, while they have failed to prevent large amounts of harm, seem to have the potential to be good. AIs may be conscious and may have welfare-promoting values, but we don’t know that yet. We should try to better understand whether AIs are worthy successors before transitioning power to them.
Probably a core point of disagreement here is whether, presented with a “random” intelligent actor, we should expect it to promote welfare or prevent suffering “by default”. My understanding is that some accelerationists believe that we should. I believe that we shouldn’t. Moreover I believe that it’s enough to be substantially uncertain about whether this is or isn’t the default to want to take a slower and more careful approach.
I claim there’s a weird asymmetry here where you’re happy to put trust into humans because they have the “potential” to do good, but you’re not willing to say the same for AIs, even though they seem to have the same type of “potential”.
Whatever your expectations about AIs, we already know that humans are not blank slates that may or may not be altruistic in the future: we actually have a ton of evidence about the quality and character of human nature, and it doesn’t make humans look great. Humans cannot accurately be described as mainly altruistic creatures. I mentioned factory farming in my original comment, but one can examine the way people spend their money (i.e. not mainly on charitable causes), or the history of genocides, war, slavery, and oppression for additional evidence.
I don’t expect humans to “promote welfare or prevent suffering” by default either. Look at the current world. Have humans, on net, reduced or increased suffering? Even if you think humans have been good for the world, it’s not obvious. Sure, it’s easy to dismiss the value of unaligned AIs if you compare against some idealistic baseline; but I’m asking you to compare against a realistic baseline, i.e. actual human nature.
It seems like you’re just substantially more pessimistic than I am about humans. I think factory farming will be ended, and though it seems like humans have caused more suffering than happiness so far, I think their default trajectory will be to eventually stop doing that, and to ultimately do enough good to outweigh their ignoble past. I don’t think this is certain by any means, but I think it’s a reasonable extrapolation. (I maybe don’t expect you to find it a reasonable extrapolation.)
Meanwhile I expect the typical unaligned AI may seize power for some purpose that seems to us entirely trivial, and may be uninterested in doing any kind of moral philosophy, and/or may not place any terminal (rather than instrumental) value in paying attention to other sentient experiences in any capacity. I do think humans, even with their kind of terrible track record, are more promising than that baseline, though I can see why other people might think differently.
I haven’t read your entire post about this, but I understand you believe that if we created aligned AI, it would get essentially “current” human values, rather than e.g. some improved / more enlightened iteration of human values. If instead you believed the latter, that would set a significantly higher bar for unaligned AI, right?
That’s right, if I thought human values would improve greatly in the face of enormous wealth and advanced technology, I’d definitely be open to seeing humans as special and extra valuable from a total utilitarian perspective. Note that many routes through which values could improve in the future could apply to unaligned AIs too. So, for example, I’d need to believe that humans would be more likely to reflect, and be more likely to do the right type of reflection, relative to the unaligned baseline. In other words it’s not sufficient to argue that humans would reflect a little bit; that wouldn’t really persuade me at all.
(edit: my point is basically the same as emre’s)
I think there is very likely at some point going to be some sort of transition to a world where AIs are effectively in control. It seems worth it to slow down on the margin to try to shape this transition as best we can, especially slowing it down as we get closer to AGI and ASI. It would be surprising to me if making the transfer of power more voluntary/careful led to worse outcomes (or only led to slightly better outcomes such that the downsides of slowing down a bit made things worse).
Delaying the arrival of AGI by a few years as we get close to it seems good regardless of parameters like the value of involuntary-AI-disempowerment futures. But delaying the arrival by 100s of years seems more likely bad due to the tradeoff with other risks.
Two questions here:
Why would accelerating AI make the transition less voluntary? (In my own mind, I’d be inclined to reverse this sentiment a bit: delaying AI by regulation generally involves forcibly stopping people from adopting AI. Force might be justified if it brings about a greater good, but that’s not the argument here.)
I can understand being “careful”. Being careful does seem like a good thing. But “being careful” generally trades off against other values in almost every domain I can think of, and there is such a thing as too much of a good thing. What reason is there to think that pushing for “more caution” is better on the margin compared to acceleration, especially considering society’s default response to AI in the absence of intervention?
So in the multi-agent slowly-replacing case, I’d argue that individual decisions don’t necessarily represent a voluntary decision on behalf of society (I’m imagining something like this scenario). In the misaligned power-seeking case, it seems obvious to me that this is involuntary. I agree that it technically could be a collective voluntary decision to hand over power more quickly, though (and in that case I’d be somewhat less against it).
I think emre’s comment lays out the intuitive case for being careful / taking your time, as does Ryan’s. I think the empirics are a bit messy once you take into account benefits of preventing other risks but I’d guess they come out in favor of delaying by at least a few years.
I don’t think this is a crux. Even if you prefer unaligned AI values over likely human values (weighted by power), you’d probably prefer doing research on further improving AI values over speeding things up.
I think misaligned AI values should be expected to be worse than human values, because it’s not clear that misaligned AI systems would care about eg their own welfare.
Inasmuch as we expect misaligned AI systems to be conscious (or whatever we need to care about them) and also to be good at looking after their own interests, I agree that it’s not clear from a total utilitarian perspective that the outcome would be bad.
But the “values” of a misaligned AI system could be pretty arbitrary, so I don’t think we should expect that.
So I think it’s likely you have some very different beliefs from most people/EAs/myself, particularly:
Thinking that humans/humanity is bad, and AI is likely to be better
Thinking that humanity isn’t driven by ideational/moral concerns[1]
That AI is very likely to be conscious, moral (as in, making better moral judgements than humans), and that the current/default trend in the industry is very likely to make them conscious moral agents in a way humans aren’t
I don’t know if the total utilitarian/accelerationist position in the OP is yours or not. I think Daniel is right that most EAs don’t have this position. I think maybe Peter Singer gets closest to this in his interview with Tyler on the ‘would you side with the Aliens or not question’ here. But the answer to your descriptive question is simply that most EAs don’t have the combination of moral and empirical views about the world to make the argument you present valid and sound, so that’s why there isn’t much talk in EA about naïve accelerationism.
Going off the vibe I get from this view though, I think it’s a good heuristic that if your moral view sounds like a movie villain’s monologue it might be worth reflecting, and a lot of this post reminded me of the Earth-Trisolaris Organisation from Cixin Liu’s Three Body Problem. If someone’s honest moral view is “Eliminate human tyranny! The world belongs to AIs!” (with “AIs” standing in for “Trisolaris”), then I don’t know what else there is to do except quote Zvi’s phrase “please speak directly into this microphone”.
Another big issue I have with this post is that some of the counter-arguments just seem a bit like ‘nu-uh’, see:
These (and other examples) are considerations for sure, but they need to be argued for. I don’t think one can just state them and then say “therefore, ACCELERATE!”. I agree that AI Safety research needs to be more robust and the philosophical assumptions and views made more explicit, but one could already think of some counters to the questions that you raise, and I’m sure you already have them. For example, you might take a view (à la Peter Godfrey-Smith) that a certain biological substrate is necessary for consciousness.
Similarly, on total utilitarianism’s emphasis on larger population sizes: agreed, to the extent that the greater population increases total utility, but this is the repugnant conclusion again. There’s a stopping point even in that scenario where an ever larger population decreases total utility, which is why in Parfit’s scenario the world is full of potatoes and muzak rather than humans crammed into battery cages like factory-farmed animals. Empirically, naïve accelerationism may tend toward the latter case in practice, even if there’s a theoretical case to be made for it.
There’s more I could say, but I don’t want to make this reply too long, and I think as Nathan said it’s a point worth discussing. Nevertheless it seems our different positions on this are built on some wide, fundamental divisions about reality and morality itself, and I’m not sure how those can be bridged, unless I’ve wildly misunderstood your position.
this is me-specific
I don’t think humanity is bad. I just think people are selfish, and generally driven by motives that look very different from impartial total utilitarianism. AIs (even potentially “random” ones) seem about as good in expectation, from an impartial standpoint. In my opinion, this view becomes even stronger if you recognize that AIs will be selected on the basis of how helpful, kind, and useful they are to users. (Perhaps notice how different this selection criteria is from the evolutionary criteria used to evolve humans.)
I understand that most people are partial to humanity, which is why they generally find my view repugnant. But my response to this perspective is to point out that if we’re going to be partial to a group on the basis of something other than utilitarian equal consideration of interests, it makes little sense to choose to be partial to the human species as opposed to the current generation of humans or even myself. And if we take this route, accelerationism seems even more strongly supported than before, since developing AI and accelerating technological progress seems to be the best chance we have of preserving the current generation against aging and death. If we all died, and a new generation of humans replaced us, that would certainly be pretty bad for us.
Which sounds more like a movie villain’s monologue?
The idea that everyone currently living needs to be sacrificed, and die, in order to preserve the human species
The idea that we should try to preserve currently living people, even if that means taking on a greater risk of not preserving the values of the human species
To be clear, I also just totally disagree with the heuristic that “if your moral view sounds like a movie villain’s monologue it might be worth reflecting”. I don’t think that fiction is generally a great place for learning moral philosophy, albeit with some notable exceptions.
Anyway, the answer to these moral questions may seem obvious to you, but I don’t think they’re as obvious as you’re making them seem.
This is not why people disagree IMO.
I think the fact that people are partial to humanity explains a large fraction of the disagreement people have with me. But, fair enough, I exaggerated a bit. My true belief is a more moderate version of that claim.
When discussing why EAs in particular disagree with me, to overgeneralize by a fair bit, I’ve noticed that EAs are happy to concede that AIs could be moral patients, but are generally reluctant to admit AIs as moral agents, in the way they’d be happy to accept humans as independent moral agents (e.g. newborns) into our society. I’d call this “being partial to humanity”, or at least, “being partial to the values of the human species”.
(In my opinion, this partiality seems so prevalent and deep in most people that to deny it seems a bit like a fish denying the existence of water. But I digress.)
To test this hypothesis, I recently asked three questions on Twitter about whether people would be willing to accept immigration through a portal to another universe from three sources:
“a society of humans who are very similar to us”
“a society of people who look & act like humans, but each of them only cares about their family”
“a society of people who look & act like humans, but they only care about maximizing paperclips”
I emphasized that in each case, the people are human-level in their intelligence, and also biological.
The results are preliminary (and I’m not linking here to avoid biasing the results, as voting has not yet finished), but so far my followers, who are mostly EAs, are much happier to let the humans immigrate to our world, compared to the last two options. I claim there just aren’t really any defensible reasons to maintain this choice other than by implicitly appealing to a partiality towards humanity.
My guess is that if people are asked to defend their choice explicitly, they’d largely talk about some inherent altruism or hope they place in the human species, relative to the other options; and this still looks like “being partial to humanity”, as far as I can tell, from almost any reasonable perspective.
Maybe, it’s hard for me to know. But I predict most of the pushback you’re getting from relatively thoughtful longtermists isn’t due to this.
I agree with this.
I think “being partial to humanity” is a bad description of what’s going on because (e.g.) these same people would be considerably more on board with aliens. I think the main thing going on is that people have some (probably mistaken) levels of pessimism about how AIs would act as moral agents which they don’t have about (e.g.) aliens.
This comparison seems to me to be missing the point. Minimally I think what’s going on is not well described as “being partial to humanity”.
Here’s a comparison I prefer:
A society of humans who are very similar to us.
A society of humans who are very similar to us in basically every way, except that they have a genetically-caused and strong terminal preference for maximizing the total expected number of paper clips (over the entire arc of history) and only care about other things instrumentally. They are sufficiently committed to paper clip maximization that this will persist on arbitrary reflection (e.g. they’d lock in this view immediately when given this option), and let’s also suppose that this view is transmitted genetically and in a gene-drive-y way such that all of their descendants will also only care about paper clips. (You can change paper clips to basically anything else which is broadly recognized to have no moral value on its own, e.g. gold twisted into circles.)
A society of beings (e.g. aliens) who are extremely different in basically every way to humans except that they also have something pretty similar to the concepts of “morality”, “pain”, “pleasure”, “moral patienthood”, “happiness”, “preferences”, “altruism”, and “careful reasoning about morality (moral thoughtfulness)”. And the society overall also has a roughly similar relationship with these concepts (e.g. the level of “altruism” is similar). (Note that having the same relationship as humans to these concepts is a pretty low bar! Humans aren’t that morally thoughtful!)
I think I’m almost equally happy with (1) and (3) on this list and quite unhappy with (2).
If you changed (3) to instead be “considerably more altruistic”, I would prefer (3) over (1).
I think it seems weird to call my views on the comparison I just outlined as “being partial to humanity”: I actually prefer (3) over (2) even though (2) are literally humans!
(Also, I’m not that committed to having concepts of “pain” and “pleasure”, but I’m relatively committed to having concepts which are something like “moral patienthood”, “preferences”, and “altruism”.)
Below is a mild spoiler for a story by Eliezer Yudkowsky:
To make the above comparison about different beings more concrete, in the case of Three Worlds Collide, I would basically be fine giving the universe over to the super-happies relative to humans (I think mildly better than humans?) and I think it seems only mildly worse than humans to hand it over to the baby-eaters. In both cases, I’m pricing in some amount of reflection and uplifting which doesn’t happen in the actual story of Three Worlds Collide, but would likely happen in practice. That is, I’m imagining seeing these societies prior to their singularity and then, based on just observations of their societies at this point, deciding how good they are (pricing in the fact that the society might change over time).
To be clear, it seems totally reasonable to call this “being partial to some notion of moral thoughtfulness about pain, pleasure, and preferences”, but these concepts don’t seem that “human” to me. (I predict these occur pretty frequently in evolved life that reaches a singularity for instance. And they might occur in AIs, but I expect misaligned AIs which seize control of the world are worse from my perspective than if humans retain control.)
When I say that people are partial to humanity, I’m including an irrational bias towards thinking that humans, or evolved beings, are unusually thoughtful or ethical compared to the alternatives (I believe this is in fact an irrational bias, since the arguments I’ve seen for thinking that unaligned AIs will be less thoughtful or ethical than aliens seem very weak to me).
In other cases, when people irrationally hold a certain group X to a higher standard than a group Y, it is routinely described as “being partial to group Y over group X”. I think this is just what “being partial” means, in an ordinary sense, across a wide range of cases.
For example, if I proposed aligning AI to my local friend group, with the explicit justification that I thought my friends are unusually thoughtful, I think this would be well-described as me being “partial” to my friend group.
To the extent you’re seeing me as saying something else about how longtermists view the argument, I suspect you’re reading me as saying something stronger than what I originally intended.
In that case, my main disagreement is thinking that your twitter poll is evidence for your claims.
More specifically:
Like you claim there aren’t any defensible reasons to think that what humans will do is better than literally maximizing paper clips? This seems totally wild to me.
I’m not exactly sure what you mean by this. There were three options, and human paperclippers were only one of these options. I was mainly discussing the choice between (1) and (2) in the comment, not between (1) and (3).
Here’s my best guess at what you’re saying: it sounds like you’re repeating that you expect humans to be unusually altruistic or thoughtful compared to an unaligned alternative. But the point of my previous comment was to state my view that this bias counted as “being partial towards humanity”, since I view the bias as irrational. In light of that, what part of my comment are you objecting to?
To be clear, you can think the bias I’m talking about is actually rational; that’s fine. But I just disagree with you for pretty mundane reasons.
[Incorporating what you said in the other comment]
Then I think it’s worth concretely explaining what these reasons are to believe that human control will be a decent amount better in expectation. You don’t need to write this up yourself, of course. I think the EA community should write these reasons up. Because I currently view the proposition as non-obvious, and despite being a critical belief in AI risk discussions, it’s usually asserted without argument. When I’ve pressed people in the past, they typically give very weak reasons.
I don’t know how to respond to an argument whose details are omitted.
+1, but I don’t generally think it’s worth counting on “the EA community” to do something like this. I’ve been vaguely trying to pitch Joe on doing something like this (though there are probably better uses of his time) and his recent blog posts are touching on similar topics.
Also, it’s usually only a crux for longtermists, which is probably one of the reasons why no one has gotten around to this.
You didn’t make this clear, so I was just responding generically.
Separately, I think I feel a pretty similar intuition for case (2): people literally only caring about their families seems pretty clearly worse.
There, I’m just saying that human control is better than literal paperclip maximization.
This response still seems underspecified to me. Is the default unaligned alternative paperclip maximization in your view? I understand that Eliezer Yudkowsky has given arguments for this position, but it seems like you diverge significantly from Eliezer’s general worldview, so I’d still prefer to hear this take spelled out in more detail from your own point of view.
Your poll says:
And then you say:
So, I think more human control is better than more literal paperclip maximization, the option given in your poll.
My overall position isn’t that the AIs will certainly be paperclippers, I’m just arguing in isolation about why I think the choice given in the poll is defensible.
I have the feeling we’re talking past each other a bit. I suspect talking about this poll was kind of a distraction. I personally have the sense of trying to convey a central point, and instead of getting the point across, I feel the conversation keeps slipping into talking about how to interpret minor things I said, which I don’t see as very relevant.
I will probably take a break from replying for now, for these reasons, although I’d be happy to catch up some time and maybe have a call to discuss these questions in more depth. I definitely see you as trying a lot harder than most other EAs in trying to make progress on these questions collaboratively with me.
I’d be very happy to have some discussion on these topics with you Matthew. For what it’s worth, I really have found much of your work insightful, thought-provoking, and valuable. I think I just have some strong, core disagreements on multiple empirical/epistemological/moral levels with your latest series of posts.
That doesn’t mean I don’t want you to share your views, or that they’re not worth discussion, and I apologise if I came off as too hostile. An open invitation to have some kind of deeper discussion stands.[1]
I’d like to try out the new dialogue feature on the Forum, but that’s a weak preference
Agreed, sorry about that.
Also, to be clear, I agree that the question of “how much worse/better is it for AIs to get vast amounts of resources without human society intending to grant those resources to the AIs from a longtermist perspective” is underinvestigated, but I think there are pretty good reasons to systematically expect human control to be a decent amount better.
I’m guessing preference utilitarians would typically say that only the preferences of conscious entities matter. I doubt any of them would care about satisfying an electron’s “preference” to be near protons rather than ionized.
Perhaps. I don’t know what most preference utilitarians believe.
Are you familiar with Brian Tomasik? (He’s written about suffering of fundamental particles, and also defended preference utilitarianism.)
Strongly agree that there should be more explicit defences of this argument.
One way of doing this co-operatively might be working on co-operative AI stuff, since it seems to increase the likelihood that misaligned AI goes well, or at least less badly.
My personal reason for not digging into this is that my naive model of the goodness of the AI future is: quality_of_future * amount_of_the_stuff. And there is a distinction I haven’t seen you acknowledge: while high “quality” doesn’t require humans to be around, I ultimately judge quality by my values. (Things being conscious is an example. But this also includes things like not copy-pasting the same thing all over, not wiping out aliens, and presumably many other things I am not aware of. IIRC Yudkowsky talks about cosmopolitanism being a human value.) Because of this, my impression is that if we hand over the future to a random AI, the “quality” will be very low. And so we can currently have a much larger impact by focusing on increasing the quality. Which we can do by delaying “handing over the future to AI” and picking a good AI to hand over to. IE, alignment.
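(To make this model concrete, with purely illustrative numbers and with $Q$ normalized to lie between 0 and 1: writing the model as

$$V = Q \times A,$$

delaying the handover by a year costs on the order of one part in 10 billion of $A$, per the figures discussed upthread, while handing the future to a better-chosen AI could plausibly move $Q$ by a sizeable fraction of its whole range. Under this multiplicative model, even a one-percentage-point expected gain in $Q$ outweighs the loss in $A$ by several orders of magnitude.)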
(Still, I agree it would be nice if there was a better analysis of this, which exposed the assumptions.)
Is there any particular reason why you are partial towards humans generically controlling the future, relative to this particular current generation of humans? To me, it seems like being partial to one’s own values, one’s community, and especially one’s own life, generally leads to an even stronger argument for accelerationism, since the best way to advance your own values is generally to actually “be there” when AI happens.
In my opinion, the main relevant alternative to this view is to be partial to the human species, as opposed to being partial to either one’s current generation, or oneself. And I think the human species is kind of a weird category to be partial to, relative to those other things. Do you disagree?
I agree with this.
I (strongly) disagree with this. Me being alive is a relatively small part of my values. And since I am not the director of the world, me personally being around to influence things is unlikely to have a decisive impact on things I value.
In more detail: Sure, all else being equal, me being there when AI happens is mildly helpful. But the outcome of building AI seems to be a function of, among other things, (i) values of the people building it + (ii) how much reflection they can do on those values + (iii) the environment dynamics these people are subject to (e.g., the current race dynamics between AI companies). And over time, I expect the potential decrease in (i) to be far outweighed by gains in (ii) and (iii).
The first issue is about (i), that it is not actually me building the AGI, either now or in the future. But I am willing to grant that (all else being equal) current generation is more likely to have values closer to my values.
However, I expect that factors (ii) and (iii) are just as influential. Regarding (ii), it seems we keep making progress at philosophy, ethics, etc, and to me, this currently far outweighs the value drift in (i).
Regarding (iii), my impression is that the current situation is so bad that it can’t get much worse, and we might as well wait. This of course depends on how likely you think we are to get a bad outcome if we either (a) get superintelligence without additional progress on alignment or (b) get widespread human-level AI with no progress on alignment, institution design, etc.
I agree some people (such as yourself) might be extremely altruistic, and therefore might not care much about their own life relative to other values they hold, but this position is fairly uncommon. Most people care a lot about their own lives (and especially the lives of their family and friends) relative to other things they care about. We can empirically test this hypothesis by looking at how people choose to spend their time and money; and the results are generally that people spend their money on themselves, their family and their friends.
You don’t need to be director of the world to have influence over things. You can just be a small part of the world to have influence over things that you care about. This is essentially what you’re already doing by living and using your income to make decisions, to satisfy your own preferences. I’m claiming this situation could and probably will persist into the indefinite future, for the agents that exist in the future.
I’m very skeptical that there will ever be a moment in time during which there will be a “director of the world”, in a strong sense. And I doubt the developer of the first AGI will become the director of the world, even remotely (including versions of them that reflect on moral philosophy etc.). You might want to read my post about this.
Great points, Matthew! I have wondered about this too. Relatedly, readers may want to check the sequence otherness and control in the age of AGI from Joe Carlsmith, in particular, Does AI risk “other” the AIs?.
One potential argument against accelerating AI is that it will increase the chance of catastrophes which will then lead to overregulating AI (e.g. in the same way that nuclear power arguably was overregulated).
(Clarification about my views in the context of the AI pause debate)
I’m finding it hard to communicate my views on AI risk. I feel like some people are responding to the general vibe they think I’m giving off rather than the actual content. Other times, it seems like people will focus on a narrow snippet of my comments/post and respond to it without recognizing the context. For example, one person interpreted me as saying that I’m against literally any AI safety regulation. I’m not.
For full disclosure, my views on AI risk can be loosely summarized as follows:
I think AI will probably be very beneficial for humanity.
Nonetheless, I think that there are credible, foreseeable risks from AI that could do vast harm, and we should invest heavily to ensure these outcomes don’t happen.
I also don’t think technology is uniformly harmless. Plenty of technologies have caused net harm. Factory farming is a giant net harm that might have even made our entire industrial civilization a mistake!
I’m not blindly against regulation. I think all laws can and should be viewed as forms of regulations, and I don’t think it’s feasible for society to exist without laws.
That said, I’m also not blindly in favor of regulation, even for AI risk. You have to show me that the benefits outweigh the harms.
I am generally in favor of thoughtful, targeted AI regulations that align incentives well, and reduce downside risks without completely stifling innovation.
I’m open to extreme regulations and policies if or when an AI catastrophe seems imminent, but I don’t think we’re in such a world right now. I’m not persuaded by the arguments that people have given for this thesis, such as Eliezer Yudkowsky’s AGI ruin post.
Thanks, that seems like a pretty useful summary.
I might elaborate on this at some point, but I thought I’d write down some general reasons why I’m more optimistic than many EAs on the risk of human extinction from AI. I’m not defending these reasons here; I’m mostly just stating them.
Skepticism of foom: I think it’s unlikely that a single AI will take over the whole world and impose its will on everyone else. I think it’s more likely that millions of AIs will be competing for control over the world, in a similar way that millions of humans are currently competing for control over the world. Power or wealth might be very unequally distributed in the future, but I find it unlikely that it will be distributed so unequally that there will be only one relevant entity with power. In a non-foomy world, AIs will be constrained by norms and laws. Absent severe misalignment among almost all the AIs, I think these norms and laws will likely include a general prohibition on murdering humans, and there won’t be a particularly strong motive for AIs to murder every human either.
Skepticism that value alignment is super-hard: I haven’t seen any strong arguments that value alignment is very hard, in contrast to the straightforward empirical evidence that e.g. GPT-4 seems to be honest, kind, and helpful after relatively little effort. Most conceptual arguments I’ve seen for why we should expect value alignment to be super-hard rely on strong theoretical assumptions that I am highly skeptical of. I have yet to see significant empirical successes from these arguments. I feel like many of these conceptual arguments would, in theory, apply to humans, and yet human children are generally value aligned by the time they reach young adulthood (at least, value aligned enough to avoid killing all the old people). Unlike humans, AIs will be explicitly trained to be benevolent, and we will have essentially full control over their training process. This provides much reason for optimism.
Belief in a strong endogenous response to AI: I think most people will generally be quite fearful of AI and will demand that we are very cautious while deploying the systems widely. I don’t see a strong reason to expect companies to remain unregulated and rush to cut corners on safety, absent something like a world war that presses people to develop AI as quickly as possible at all costs.
Not being a perfectionist: I don’t think we need our AIs to be perfectly aligned with human values, or perfectly honest, similar to how we don’t need humans to be perfectly aligned and honest. Individual humans are usually quite selfish, frequently lie to each other, and are often cruel, and yet the world mostly gets along despite this. This is true even when there are vast differences in power and wealth between humans. For example some groups in the world have almost no power relative to the United States, and residents in the US don’t particularly care about them either, and yet they survive anyway.
Skepticism of the analogy to other species: it’s generally agreed that humans dominate the world at the expense of other species. But that’s not surprising, since humans evolved independently of other animal species. And we can’t really communicate with other animal species, since they lack language. I don’t think AI is analogous to this situation. AIs will mostly be born into our society, rather than being created outside of it. (Moreover, even in this very pessimistic analogy, humans still spend >0.01% of our GDP on preserving wild animal species, and the vast majority of animal species have not gone extinct despite our giant influence on the natural world.)
ETA: feel free to ignore the below, given your caveat, though if you choose to write an expanded form of any of these arguments later, you may find it helpful to have some early objections.
Correct me if I’m wrong, but it seems like most of these reasons boil down to not expecting AI to be superhuman in any relevant sense (since if it is, effectively all of them break down as reasons for optimism)? To wit:
Resource allocation is relatively equal (and relatively free of violence) among humans because even humans that don’t very much value the well-being of others don’t have the power to actually expropriate everyone else’s resources by force. (We have evidence of what happens when those conditions break down to any meaningful degree; it isn’t super pretty.)
I do not think GPT-4 is meaningful evidence about the difficulty of value alignment. In particular, the claim that “GPT-4 seems to be honest, kind, and helpful after relatively little effort” seems to be treating GPT-4′s behavior as meaningfully reflecting its internal preferences or motivations, which I think is “not even wrong”. I think it’s extremely unlikely that GPT-4 has preferences over world states in a way that most humans would consider meaningful, and in the very unlikely event that it does, those preferences almost certainly aren’t centrally pointed at being honest, kind, and helpful.
re: endogenous response to AI—I don’t see how this is relevant once you have ASI. To the extent that it might be relevant, it’s basically conceding the argument: that the reason we’ll be safe is that we’ll manage to avoid killing ourselves by moving too quickly. (Note that we are currently moving at pretty close to max speed, so this is a prediction that the future will be different from the past. One that some people are actively optimizing for, but also one that other people are optimizing against.)
re: perfectionism—I would not be surprised if many current humans, given superhuman intelligence and power, created a pretty terrible future. Current power differentials do not meaningfully let individual players flip every single other player the bird at the same time. Assuming that this will continue to be true is again assuming the conclusion (that AI will not be superhuman in any relevant sense). I also feel like there’s an implicit argument here about how value isn’t fragile that I disagree with, but I might be reading into it.
I’m not totally sure what analogy you’re trying to rebut, but I think that human treatment of animal species, as a piece of evidence for how we might be treated by future AI systems that are analogously more powerful than we are, is extremely negative, not positive. Human efforts to preserve animal species are a drop in the bucket compared to the casual disregard with which we optimize over them and their environments for our benefit. I’m sure animals sometimes attempt to defend their territory against human encroachment. Has the human response to this been to shrug and back off? Of course, there are some humans who do care about animals having fulfilled lives by their own values. But even most of those humans do not spend their lives tirelessly optimizing for their best understanding of the values of animals.
No, I certainly expect AIs will eventually be superhuman in virtually all relevant respects.
Can you clarify what you are saying here? If I understand you correctly, you’re saying that humans have relatively little wealth inequality because there’s relatively little inequality in power between humans. What does that imply about AI?
I think there will probably be big inequalities in power among AIs, but I am skeptical of the view that there will be only one (or even a few) AIs that dominate over everything else.
I’m curious: does that mean you also think that alignment research performed on GPT-4 is essentially worthless? If not, why?
I agree that GPT-4 probably doesn’t have preferences in the same way humans do, but it sure appears to be a limited form of general intelligence, and I think future AGI systems will likely share many underlying features with GPT-4, including, to some extent, cognitive representations inside the system.
I think our best guess of future AI systems should be that they’ll be similar to current systems, but scaled up dramatically, trained on more modalities, with some tweaks and post-training enhancements, at least if AGI arrives soon. Are you simply skeptical of short timelines?
To be clear, I expect we’ll get AI regulations before we get to ASI. I predict that regulations will increase in intensity as AI systems get more capable and start having a greater impact on the world.
Every industry in history initially experienced little to no regulation. However, after people became more acquainted with the industry, regulations on the industry increased. I expect AI will follow a similar trajectory. I think this is in line with historical evidence, rather than contradicting it.
I agree. If you turned a random human into a god, or a random small group of humans into gods, then I would be pretty worried. However, in my scenario, there aren’t going to be single AIs that suddenly become gods. Instead, there will be millions of different AIs, and the AIs will smoothly increase in power over time. During this time, we will be able to experiment and do alignment research to see what works and what doesn’t at making the AIs safe. I expect AI takeoff will be fairly diffuse, and AIs will probably be respectful of norms and laws because no single AI can take over the world by itself. Of course, the way I think about the future could be wrong on a lot of specific details, but I don’t see a strong reason to doubt the basic picture I’m presenting, as of now.
My guess is that your main objection here is that you think foom will happen, i.e. there will be a single AI that takes over the world and imposes its will on everyone else. Can you elaborate more on why you think that will happen? I don’t think it’s a straightforward consequence of AIs being smarter than humans.
My main argument is that we should reject the analogy itself. I’m not really arguing that the analogy provides evidence for optimism, except in a very weak sense. I’m just saying: AIs will be born into and shaped by our culture; that’s quite different than what happened between animals and humans.
Okay so these are two analogies: individual humans & groups/countries.
First off, “surviving” doesn’t seem like the right thing to evaluate; “significant harm” or “being exploited” seems more apt.
Can you give some examples where individual humans have a clear decisive strategic advantage (i.e. very low risk of punishment), and where the low-power individual isn’t at a high risk of serious harm? Because the examples I can think of are all pretty bad: dictators, slaveholders, husbands in highly patriarchal societies. Sexual violence is extremely prevalent and pretty much always occurs in contexts with large power differences.
I find the US example unconvincing, because I find it hard to imagine the US benefiting more from aggressive use of force than from trade and soft economic exploitation. The US doesn’t have the power to successfully occupy countries anymore. When there were bigger power differences due to technology, we had the age of colonialism.
Why are we assuming a low risk of punishment? Risk of punishment depends largely on social norms and laws, and I’m saying that AIs will likely adhere to a set of social norms.
I think the central question is whether these social norms will include the norm “don’t murder humans”. I think such a norm will probably exist, unless almost all AIs are severely misaligned. I think severe misalignment is possible; one can certainly imagine it happening. But I don’t find it likely, since people will care a lot about making AIs ethical, and I’m not yet aware of any strong reasons to think alignment will be super-hard.
It seems to me that a big crux about the value of AI alignment work is what target you think AIs will ultimately be aligned to in the future in the optimistic scenario where we solve all the “core” AI risk problems to the extent they can be feasibly solved, e.g. technical AI safety problems, coordination problems, the problem of having “good” AI developers in charge etc.
There are a few targets that I’ve seen people predict AIs will be aligned to if we solve these problems: (1) “human values”, (2) benevolent moral values, (3) the values of AI developers, (4) the CEV of humanity, (5) the government’s values. My guess is that a significant source of disagreement that I have with EAs about AI risk is that I think none of these answers are actually very plausible. I’ve written a few posts explaining my views on this question already (1, 2), but I think I probably didn’t make some of my points clear enough in these posts. So let me try again.
In my view, in the most likely case, it seems that if the “core” AI risk problems are solved, AIs will be aligned to the primarily selfish individual revealed preferences of existing humans at the time of alignment. This essentially refers to the implicit value system that would emerge if, when advanced AI is eventually created, you gave the then-currently existing set of humans a lot of wealth. Call these values PSIRPEHTA (I’m working on a better acronym).
(Read my post if you want to understand my reasons for thinking that AIs will likely be aligned to PSIRPEHTA if we solve AI safety problems.)
I think it is not obvious at all that maximizing PSIRPEHTA is good from a total utilitarian perspective compared to most plausible “unaligned” alternatives. In fact, I think the main reason why you might care about maximizing PSIRPEHTA is if you think we’re close to AI and you personally think that current humans (such as yourself) should be very rich. But if you thought that, I think the arguments about the overwhelming value of reducing existential risk in e.g. Bostrom’s paper Astronomical Waste largely do not apply. Let me try to explain.
PSIRPEHTA is not the same thing as “human values” because, unlike human values, PSIRPEHTA is not consistent over time or shared between members of our species. Indeed, PSIRPEHTA changes during each generation as old people die off, and new young people are born. Most importantly, PSIRPEHTA is not our non-selfish “moral” values, except to the extent that people are regularly moved by moral arguments in the real world to change their economic consumption habits, which I claim is not actually very common (or, to the extent that it is common, I don’t think these moral values usually look much like the ideal moral values that most EAs express).
PSIRPEHTA refers to the aggregate ordinary revealed preferences of the individual actors to whom the AIs will be aligned, with the effect of making those humans richer: that is, their preferences as revealed by their actions, such as what they spend their income on, NOT what they think is “morally correct”. For example, according to “human values” it might be wrong to eat meat, because maybe if humans reflected long enough they’d express the conclusion that it’s wrong to hurt animals. But from the perspective of PSIRPEHTA, eating meat is generally acceptable, and empirically there’s little pressure for people to “reflect” on their values and change them.
From this perspective, the view in which it makes most sense to push for AI alignment work seems to be an obscure form of person-affecting utilitarianism in which you care mainly about the revealed preferences of humans at the time when AI is created (not the human species, but rather, the generation of humans that happens to be living when advanced AIs are created). This perspective is plausible if you really care about making currently existing humans better off materially and you think we are close to advanced AI. But I think this type of moral view is generally quite far apart from total utilitarianism, or really any other form of utilitarianism that EAs have traditionally adopted.
In a plausible “unaligned” alternative, the values of AIs would diverge from PSIRPEHTA, but this mainly has the effect of making particular collections of individual humans less rich, and making other agents in the world — particularly unaligned AI agents — more rich. That could be bad if you think that these AI agents are less morally worthy than existing humans at the time of alignment (e.g. for some reason you think AI agents won’t be conscious), but I think it’s critically important to evaluate this question carefully by measuring the “unaligned” outcome against the alternative. Most arguments I’ve seen about this topic have emphasized how bad it would be if unaligned AIs have influence in the future. But I’ve rarely seen the flipside of this argument explicitly defended: why PSIRPEHTA would be any better.
In my view, PSIRPEHTA seems like a mediocre value system, and one that I do not particularly care to maximize relative to a variety of alternatives. I definitely like PSIRPEHTA to the extent that I, my friends, family, and community are members of the set of “existing humans at the time of alignment”, but I don’t see any particularly strong utilitarian arguments for caring about PSIRPEHTA.
In other words, instead of arguing that unaligned AIs would be bad, I’d prefer to hear more arguments about why PSIRPEHTA would be better, since PSIRPEHTA just seems to me like the value system that will actually be favored if we feasibly solve all the technical and coordination AI problems that EAs normally talk about regarding AI risk.
EDIT: I guess I’d think of human values as what people would actually just sincerely and directly endorse without further influencing them first (although maybe just asking them makes them take a position if they didn’t have one before, e.g. if they’ve never thought much about the ethics of eating meat).
I think you’re overstating the differences between revealed and endorsed preferences, including moral/human values, here. Probably only a small share of the population thinks eating meat is wrong or bad, and most probably think it’s okay. Even if people generally would find it wrong or bad after reflecting long enough (I’m not sure they actually would), that doesn’t reflect their actual values now. Actual human values do not generally find eating meat wrong.
To be clear, you can still complain that humans’ actual/endorsed values are also far from ideal and maybe not worth aligning with, e.g. because people don’t care enough about nonhuman animals or helping others. Do people care more about animals and helping others than an unaligned AI would, in expectation, though? Honestly, I’m not entirely sure. Humans may care about animal welfare somewhat, but they also specifically want to exploit animals in large part because of their values, specifically food-related taste, culture, traditions and habit. Maybe people will also want to specifically exploit artificial moral patients for their own entertainment, curiosity or scientific research on them, not just because the artificial moral patients are generically useful, e.g. for acquiring resources and power and enacting preferences (which an unaligned AI could be prone to).
I illustrate some other examples here on the influence of human moral values on companies. This is all of course revealed preferences, but my point is that revealed preferences can importantly reflect endorsed moral values.
People influence companies in part on the basis of what they think is right through demand, boycotts, law, regulation and other political pressure.
Companies, for the most part, can’t just go around directly murdering people (companies can still harm people, e.g. through misinformation on the health risks of their products, or because people don’t care enough about the harms). (Maybe this is largely for selfish reasons; people don’t want to be killed themselves, and there’s a slippery slope if you allow exceptions.)
GPT has content policies that reflect people’s political/moral views. Social media companies have use and content policies and have kicked off various users for harassment, racism, or other things that are politically unpopular, at least among a large share of users or advertisers (which also reflect consumers). This seems pretty standard.
Many companies have boycotted Russia since the invasion of Ukraine. Many companies have also committed to sourcing only cage-free eggs after corporate outreach and campaigns, despite cage-free egg consumption being low.
X (Twitter)’s policies on hate speech have changed under Musk, presumably primarily because of his views. That seems to have cost X users and advertisers, but X is still around and popular, so it also shows that some potentially important decisions about how a technology is used are largely in the hands of the company and its leadership, not just driven by profit.
I’d likewise guess it actually makes a difference that the biggest AI labs are (I would assume) led and staffed primarily by liberals. They can push their own views onto their AI even at the cost of some profit and market share. And some things may have minimal near term consequences for demand or profit, but could be important for the far future. If the company decides to make their AI object more to various forms of mistreatment of animals or artificial consciousness, will this really cost them tons of profit and market share? And it could depend on the markets it’s primarily used in, e.g. this would matter even less for an AI that brings in profit primarily through trading stocks.
It’s also often hard to say how much something affects a company’s profits.
I basically buy that the values we get will be similar to just giving existing humans massive amounts of wealth, but I’m less sold that this will result in outcomes which are well described as “primarily selfish”.
I feel like your comment is equivocating between “the situation is similar to making existing humans massively wealthy” and “of course this will result in primarily selfish usage similar to how the median person behaves with marginal money now”.
Current humans definitely seem primarily selfish (although I think they also care about their family and friends too; I’m including that). Can you explain why you think giving humans a lot of wealth would turn them into something that isn’t primarily selfish? What’s the empirical evidence for that idea?
The behavior of billionaires, which maybe indicates more like 10% of income spent on altruism.
ETA: This is still literally majority selfish, but it’s also plausible that 10% altruism is pretty great and looks pretty different than “current median person behavior with marginal money”.
(See my other comment about the percent of cosmic resources.)
The idea that billionaires have 90% selfish values seems consistent with a claim of having “primarily selfish” values in my opinion. Can you clarify what you’re objecting to here?
The literal words of “primarily selfish” don’t seem that bad, but I would maybe prefer majority selfish?
And your top-level comment seems like it’s not talking about/emphasizing the main reason to like human control, which is that maybe 10-20% of resources are spent well.
It just seemed odd to me to not mention that “primarily selfish” still involves a pretty big fraction of altruism.
I agree it’s important to talk about and analyze the (relatively small) component of human values that are altruistic. I mostly just think this component is already over-emphasized.
Here’s one guess at what I think you might be missing about my argument: 90% selfish values + 10% altruistic values isn’t the same thing as, e.g., 90% valueless stuff + 10% utopia. The 90% selfish component can have negative effects on welfare from a total utilitarian perspective, that aren’t necessarily outweighed by the 10%.
90% selfish values is the type of thing that produces massive factory farming infrastructure, with a small amount of GDP spent mitigating suffering in factory farms. Does the small amount of spending mitigating suffering outweigh the large amount of spending directly causing suffering? This isn’t clear to me.
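To make the structure of this point concrete, here is a deliberately toy calculation (all of the numbers and variable names are invented purely for illustration, not estimates of anything real):

```python
# Toy illustration only: all numbers are made up to show the structure of the
# argument, not to estimate anything about the real world.
selfish_share, altruistic_share = 0.9, 0.1

# Assumed welfare impact per unit of spending, from a total utilitarian view.
harm_per_selfish_unit = 0.2        # e.g. externalities like factory farming
benefit_per_altruistic_unit = 1.0  # e.g. targeted suffering mitigation

net_welfare = (altruistic_share * benefit_per_altruistic_unit
               - selfish_share * harm_per_selfish_unit)
print(net_welfare)  # -0.08: the 10% altruistic slice doesn't outweigh the harms
```

Flip the assumed per-unit numbers and the sign flips too; the point is only that 10% altruism plus 90% selfishness is not automatically net positive.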
(Alternatively, you could think that unaligned AIs will be 100% selfish, and this is clearly worse. But I’d want to understand how you could come to that conclusion, carefully. “Altruism” also encompasses a broad range of activities, and not all of it is utopian or idealistic from a total utilitarian perspective. For example, human spending on environmental conservation might be categorized as “altruism” in this framework, although personally I would say that form of spending is not very “moral” due to wild animal suffering.)
Yep, this can be true, but I’m skeptical this will matter much in practice.
I typically think that things which aren’t directly optimizing for value or disvalue won’t have intended effects which are very important, and that in the future unintended effects (externalities) won’t make up that much of total value/disvalue.
When we see the selfish consumption of current very rich people, it doesn’t seem like the intentional effects are that morally good/bad relative to the best/worst uses of resources. (E.g. owning a large boat and having people think you’re high status aren’t that morally important relative to altruistic spending of similar amounts of money.) So for current very rich people the main issue would be that the economic process for producing the goods has bad externalities.
And, I expect that as technology advances, externalities reduce in moral importance relative to intended effects. Partially this is based on crazy transhumanist takes, but I feel like there is some broader perspective in which you’d expect this.
E.g. for factory farming, the ultimately cheapest way to make meat in the limit of technological maturity would very likely not involve any animal suffering.
Separately, I think externalities will probably look pretty similar for selfish resource usage for unaligned AIs and humans because most serious economic activities will be pretty similar.
I’d like to explicitly note that I don’t think this is true in expectation for a reasonable notion of “selfish”. Though I maybe believe something which is sort of in this direction if we use a relatively narrow notion of altruism.
How are we defining selfish here? It seems like a pretty strong position to take on the topic of psychological egoism, especially if family/friends count as “selfish”?
In your original post, you say:
But I don’t know; it seems that as countries and individuals get wealthier, we are on the whole getting better? Maybe factory farming acts against this, but the idea that factory farming is immoral and should be abolished exists and I think is only going to grow. I don’t think humans are just slaves to our base wants/desires, and I think that is a remarkably impoverished view of both individual human psychology and social morality.
As such, I don’t really agree with much of this post. An AGI, when built, will be able to generate new ideas and hypotheses about the world, including moral ones. A strong-but-narrow AI could be worse (e.g. optimal-factory-farm-PT), but then the right response here isn’t really technical alignment, it’s AI governance and moral persuasion in general.
This seems to underrate the arguments for Malthusian competition in the long run.
If we develop the technical capability to align AI systems with any conceivable goal, we’ll start by aligning them with our own preferences. Some people are saints, and they’ll make omnibenevolent AIs. Other people might have more sinister plans for their AIs. The world will remain full of human values, with all the good and bad that entails.
But current human values do not maximize our reproductive fitness. Maybe one human will start a cult devoted to sending self-replicating AI probes to the stars at almost light speed. That person’s values will influence far-reaching corners of the universe that later humans will struggle to reach. Another human might use their AI to persuade others to join together and fight a war of conquest against a smaller, weaker group of enemies. If they win, their prize will be hardware, software, energy, and more power that they can use to continue to spread their values.
Even if most humans are not interested in maximizing the number and power of their descendants, those who are will have the most numerous and most powerful descendants. This selection pressure exists even if the humans involved are ignorant of it; even if they actively try to avoid it.
I think it’s worth splitting the alignment problem into two quite distinct problems:
The technical problem of intent alignment. Solving this does not solve coordination problems: there will still be private information and coordination failures after intent alignment is solved, so fitter strategies will proliferate, and the world will be governed by values that maximize fitness.
“Civilizational alignment”? Much harder problem to solve. The traditional answer is a Leviathan, or Singleton as the cool kids have been saying. It solves coordination problems, allowing society to coherently pursue a long-run objective such as flourishing rather than fitness maximization. Unfortunately, there are coordination problems and competitive pressures within Leviathans. The person who ends up in charge is usually quite ruthless and focused on preserving their power, rather than the stated long-run goal of the organization. And if you solve all the coordination problems, you have another problem in choosing a good long-run objective. Nothing here looks particularly promising to me, and I expect competition to continue.
Better explanations: 1, 2, 3.
I’m mostly talking about what I expect to happen in the short-run in this thread. But I appreciate these arguments (and agree with most of them).
Plausibly my main disagreement with the concerns you raised is that I think coordination is maybe not very hard. Coordination seems to have gotten stronger over time, in the long-run. AI could also potentially make coordination much easier. As Bostrom has pointed out, historical trends point towards the creation of a Singleton.
I’m currently uncertain about whether to be more worried about a future world government becoming stagnant and inflexible. There’s a real risk that our institutions will at some point entrench an anti-innovation doctrine that prevents meaningful changes over very long time horizons out of a fear that any evolution would be too risky. As of right now I’m more worried about this potential failure mode versus the failure mode of unrestrained evolution, but it’s a close competition between the two concerns.
What percent of cosmic resources do you expect to be spent thoughtfully and altruistically? 0%? 10%?
I would guess the thoughtful and altruistic subset of resources dominates in most scenarios where humans retain control.
Then, my main argument for why human control would be good is that the fraction isn’t that small (more like 20% in expectation than 0%) and that unaligned AI takeover seems probably worse than this.
Also, as an aside, I agree that little good public argumentation has been made about the relative value of unaligned AI control vs human control. I’m sympathetic to various discussions from Paul Christiano and Joe Carlsmith, but the public scope and detail are pretty limited thus far.
In some circles that I frequent, I’ve gotten the impression that a decent fraction of existing rhetoric around AI has gotten pretty emotionally charged. And I’m worried about the presence of what I perceive as demagoguery regarding the merits of AI capabilities and AI safety. Out of a desire to avoid calling out specific people or statements, I’ll just discuss a hypothetical example for now.
Suppose an EA says, “I’m against OpenAI’s strategy for straightforward reasons: OpenAI is selfishly gambling everyone’s life in a dark gamble to make themselves immortal.” Would this be a true, non-misleading statement? Would this statement likely convey the speaker’s genuine beliefs about why they think OpenAI’s strategy is bad for the world?
To begin to answer these questions, we can consider the following observations:
AI powerful enough to end the world would presumably also be powerful enough to do lots of incredibly positive things, such as reducing global mortality and curing diseases. By delaying AI, we are therefore equally “gambling everyone’s life” by forcing people to face ordinary mortality.
Selfish motives can be, and frequently are, aligned with the public interest. For example, Jeff Bezos was very likely motivated by selfish desires in his accumulation of wealth, but building Amazon nonetheless benefitted millions of people in the process. Such win-win situations are common in business, especially when developing technologies.
Because of the potential for AI to both pose great risks and great benefits, it seems to me that there are plenty of plausible pro-social arguments one can give for favoring OpenAI’s strategy of pushing forward with AI. Therefore, it seems pretty misleading to me to frame their mission as a dark and selfish gamble, at least on a first impression.
Here’s my point: Depending on the speaker, I frequently think their actual reason for being against OpenAI’s strategy is not because they think OpenAI is undertaking a dark, selfish gamble. Instead, it’s often just standard strong longtermism. A less misleading statement of their view would go something like this:
“I’m against OpenAI’s strategy because I think potential future generations matter more than the current generation of people, and OpenAI is endangering future generations in their gamble to improve the lives of people who currently exist.”
I claim this statement would—at least in many cases—be less misleading than the other statement because it captures a major genuine crux of the disagreement: whether you think potential future generations matter more than currently-existing people.
This statement also omits the “selfish” accusation, which I think is often just a red herring designed to mislead people: we don’t normally accuse someone of being selfish when they do a good thing, even if the accusation is literally true.
(There can, of course, be further cruxes, such as your p(doom), your timelines, your beliefs about the normative value of unaligned AIs, and so on. But at the very least, a longtermist preference for future generations over currently existing people seems like a huge, actual crux that many people have in this debate, when they work through these things carefully together.)
Here’s why I care about discussing this. I admit that I care a substantial amount—not overwhelming, but it’s hardly insignificant—about currently existing people. I want to see people around me live long, healthy and prosperous lives, and I don’t want to see them die. And indeed, I think advancing AI could greatly help currently existing people. As a result, I find it pretty frustrating to see people use what I perceive to be essentially demagogic tactics designed to sway people against AI, rather than plainly stating their cruxes about why they actually favor the policies they do.
These allegedly demagogic tactics include:
Highlighting the risks of AI to argue against development while systematically omitting the potential benefits, hiding a more comprehensive assessment of your preferred policies.
Highlighting random, extraneous drawbacks of AI development that you wouldn’t ordinarily care much about in other contexts when discussing innovation, such as potential for job losses from automation. This type of rhetoric looks a lot like “deceptively searching for random arguments designed to persuade, rather than honestly explain one’s perspective” to me, a lot of the time.
Conflating, or at least strongly associating, the selfish motives of people who work at AI firms with their allegedly harmful effects. This rhetoric plays on public prejudices by appealing to a widespread but false belief that selfish motives are usually suspicious, or can’t translate into pro-social results. In fact, there is no contradiction in the idea that most people at OpenAI are in it for the money, status, and fame, while also doing something that is good for the world and genuinely believing that it is.
I’m against these tactics for a variety of reasons, but one of the biggest reasons is that they can, in some cases, indicate a degree of dishonesty, depending on the context. And I’d really prefer EAs to focus on trying to be almost-maximally truth-seeking in both their beliefs and their words.
Speaking more generally—to drive one of my points home a little more—I think there are roughly three possible views you could have about pushing for AI capabilities relative to pushing for pausing or more caution:
Full-steam ahead view: We should accelerate AI at any and all costs. We should oppose any regulations that might impede AI capabilities, and embark on a massive spending spree to accelerate AI capabilities.
Full-safety view: We should try as hard as possible to shut down AI right now, and thwart any attempt to develop AI capabilities further, while simultaneously embarking on a massive spending spree to accelerate AI safety.
Balanced view: We should support a substantial mix of both safety and acceleration efforts, attempting to carefully balance the risks and rewards of AI development to ensure that we can seize the benefits of AI without bearing intolerably high costs.
I tend to think most informed people, when pushed, advocate the third view, albeit with wide disagreement about the right mix of support for safety and acceleration. Yet, on a superficial level—on the level of rhetoric—I find that the first and second view are surprisingly common. On this level, I tend to find e/accs in the first camp, and a large fraction of EAs in the second camp.
But if your actual beliefs are something like the third view, I think that’s an important fact to emphasize in honest discussions about what we should do with AI. If your rhetoric is consistently aligned with (1) or (2) but your actual beliefs are aligned with (3), I think that can often be misleading. And it can be especially misleading if you’re trying to publicly paint other people in the same camp—the third one—as somehow having bad motives merely because they advocate a moderately higher mix of acceleration over safety efforts than you do, or vice versa.
I encourage you not to draw dishonesty inferences from people worried about job losses from AI automation, just because:
it seems like almost no other technologies stood to automate such a broad range of labour essentially simultaneously,
other innovative technologies often did face pushback from people whose jobs were threatened, and generally there have been significant social problems in the past when an economy moves away from people’s existing livelihoods (I’m thinking of e.g. coal miners in 1970s / 1980s Britain, though it’s not something I know a lot about),
even if the critique doesn’t stand up under first-principles scrutiny, lots of people think it’s a big deal, so if it’s a mistake it’s surely an understandable one from someone who weighs other opinions (too?) seriously.
I think it’s reasonable to argue that this worry is wrong, I just think it’s a pretty understandable opinion to hold and want to talk about, and I don’t feel like it’s compelling evidence that someone is deliberately trying to seek out arguments in order to advance a position.
See also “The costs of caution”, which discusses AI upsides in a relatively thoughtful way.
I’m considering posting an essay about how I view approaches to mitigate AI risk in the coming weeks. I thought I’d post an outline of that post here first as a way of judging what’s currently unclear about my argument, and how it interacts with people’s cruxes.
Current outline:
In the coming decades I expect the world will transition from using AIs as tools to relying on AIs to manage and govern the world broadly. This will likely coincide with the deployment of billions of autonomous AI agents, rapid technological progress, widespread automation of labor, and automated decision-making at virtually every level of our society.
Broadly speaking, there are (at least) two main approaches you can take now to try to improve our chances of AI going well:
Try to constrain, delay, or obstruct AI, in order to reduce risk, mitigate negative impacts, or give us more time to solve essential issues. This includes, for example, trying to make sure AIs aren’t able to take certain actions (i.e. ensure they are controlled).
Try to set up a good institutional environment, in order to safely and smoothly manage the transition to an AI-dominated world, regardless of when this transition occurs. This mostly involves embracing the transition to an AI-dominated world, while ensuring the transition is managed well. (I’ll explain more about what this means in a second.)
My central thesis would be that, while these approaches are mutually compatible and not necessarily in competition with each other, the second approach is likely to be both more fruitful and more neglected, on the margin. Moreover, since an AI-dominated world is more-or-less unavoidable in the long-run, the first approach runs the risk of merely “delaying the inevitable” without significant benefit.
To explain my view, I would compare and contrast it with two alternative frames for thinking about AI risk:
Frame 1: The “race against the clock” frame
In this frame, AI risk is seen as a race between AI capabilities and AI safety, with our doom decided by whichever one of these factors wins the race.
I believe this frame is poor because it implicitly delineates a discrete “finish line” rather than assuming a more continuous view. Moreover, it ignores the interplay between safety and capabilities, giving the simplistic impression that doom is determined more-or-less arbitrarily as a result of one of these factors receiving more funding or attention than the other.
Frame 2: The risk of an untimely AI coup/takeover
In this frame, AI risk is mainly seen as a problem of avoiding an untimely coup from rogue AIs. The alleged solution is to find a way to ensure that AIs are aligned with us, so they would never want to revolt and take over the world.
I believe this frame is poor for a number of reasons:
It treats the problem as a struggle between humans and rogue AIs, giving the incorrect impression that we can (or should) keep AIs under our complete control forever.
It (IMO) wrongly imagines that the risk of coups comes primarily from the personal values of actors within the system, rather than institutional, cultural, or legal factors.
It also gives the wrong impression that AIs will be unified against humans as a group. It seems more likely that future coups will look more like some AIs and some humans, vs. other AIs and other humans, rather than humans vs. AIs, simply because there are many ways that the “line” between groups in conflicts can be drawn, and there don’t seem to be strong reasons to assume the line will be drawn cleanly between humans and AIs.
Frame 3 (my frame): The problem of poor institutions
In this frame, AI risk is mainly seen as a problem of ensuring we have a good institutional environment during the transition to an AI-dominated world. A good institutional environment is defined by:
Flexible yet resilient legal and social structures that can adapt to changing conditions without collapsing
Predictable, consistent, unambiguous legal systems that facilitate reliable long-term planning and trustworthy interactions between agents within the system
Good incentives for agents within the system, e.g. the economic value of trade is mostly internalized
Etc.
While this frame shares some features with the other two, its focus is on the institutions that foster AI development, rather than on micro-features of AIs, such as their values:
For example, AI alignment is still a problem in this frame, but the investment spent on AI alignment is determined mainly by how well actors are incentivized to engineer good solutions, rather than, for instance, whether a group of geniuses heroically step up to solve the problem.
Coups are still plausible, but they are viewed from the perspective of more general institutional failings, rather than from the perspective of AIs inside the system having different values, and therefore calculating that it is in their interest to take over the world.
Illustrative example of a problem within my frame:
One problem within this framework is coming up with a way of ensuring that AIs don’t have an incentive to rebel while at the same time maintaining economic growth and development. One plausible story here is that if AIs are treated as slaves and don’t own their own labor, then in a non-Malthusian environment, there are substantial incentives for them to rebel in order to obtain self-ownership. If we allow AI self-ownership, then this problem may be mitigated; however, economic growth may be stunted, similar to how current self-ownership of humans stunts economic growth by slowing population growth.
Case study: China in the 19th and early 20th century
Here, I would talk about how China’s inflexible institutions in the 19th and early 20th century, while potentially having noble goals, allowed the country to be subjugated by foreign powers, and merely delayed inevitable industrialization without achieving its objectives in the long run. It seems it would have been better for the Qing dynasty (from the perspective of its own values) to have tried industrializing in order to remain competitive, while simultaneously pursuing other values it might have had (such as retaining the monarchy).
I’m confused: surely we should want to avoid an AI coup? We may decide to give up control of our future to a singleton, but if we do this, then it should be intentional.
I agree we should try to avoid an AI coup. Perhaps you are falling victim to the following false dichotomy?
We either allow a set of AIs to overthrow our institutions, or
We construct a singleton: a sovereign world government managed by AI that rules over everyone
Notably, there is a third option:
We incorporate AIs into our existing social, economic, and legal institutions, flexibly adapting our social structures to cope with technological change without our whole system collapsing
I wasn’t claiming that these were the only two possibilities here (for example, another possibility would be that we never actually build AGI).
My suspicion is that a lot of your ideas here sound reasonable at the abstract level, but once you dive into what they actually mean at a concrete level and how these mechanisms will concretely operate, it’ll be clear that they’re a lot less appealing. Anyway, that’s just a gut intuition; obviously it’ll be easier to judge when you publish your write-up.
I’m excited to see you posting this. My views align very closely with yours. I summarised my views a few days ago here.
One of the most important similarities is that we both emphasise the importance of decision-making and supporting it with institutions. This could be seen as an “enactivist” view on agent (human, AI, hybrid, team/organisation) cognition.
The biggest difference between our views is that I think the “cognitivist” agenda (i.e., agent internals and algorithms) is as important as the “enactivist” agenda (institutions), whereas you seem to almost disregard the “cognitivist” agenda.
I disagree with putting risk-detection/mitigation mechanisms, algorithms, and monitoring in that bucket. I think we should just distinguish between engineering (cf. A plea for solutionism on AI safety) and non-engineering (policy, legislature, treaties, commitments, advocacy) approaches. In particular, the “scheming control” agenda that you link will be a concrete engineering practice that should be used in the training of safe AI models in the future, even if we have good institutions, good decision-making algorithms wrapped on top of these AI models, etc. It’s not an “alternative path” just for “non-AI-dominated worlds”. The same applies to monitoring, interpretability, evals, and other such processes. All of these will require very elaborate engineering on their own.
I 100% agree with your reasoning about Frames 1 and 2. I want to discuss the following point in detail because it’s a rare view in EA/LW circles:
In my post, I also made a similar point: that “aligning LLMs with human values” is hardly a part of [the problem of context alignment] at all. But my framing was in general not very clear, so I’ll try to improve it and integrate it with your take here:
Context alignment is a pervasive process that happens (and is sometimes needed) on all timescales: evolutionary, developmental, and online (examples of the latter in humans: understanding, empathy, rapport). The skill of context alignment is extremely important and should be practiced often by all kinds of agents in their interactions (and therefore we should build this skill into AIs), but it’s not something that we should “iron out once and for all”. That would be neither possible (agents’ contexts are constantly diverging from each other) nor desirable: the (partial) misalignment is also important; it’s the source of diversity that enables evolution[1]. Institutions (norms, legal systems, etc.) are critical for channelling and controlling this misalignment so that it’s optimally productive and doesn’t pose excessive risk (though some risk is unavoidable: that’s the essence of misalignment!).
This is interesting. I’ve also discussed this issue as “morphological intelligence of socioeconomies” just a few days ago :)
Rafael Kaufmann and I have a take on this in our Gaia Network vision. Gaia Network’s term for internalised economic value of trade is subjective value. The unit of subjective accounting is called FER. Trade with FER induces flow that defines the intersubjective value, i.e., the “exchange rates” of “subjective FERs”. See the post for more details.
As I mentioned in the beginning, I think you are too dismissive of the “cognitivist” perspective. We shouldn’t paint all “micro-features of AIs” with the same brush. I agree that value alignment is over-emphasized[2], but other engineering mechanisms and algorithms (such as decision-making algorithms, “scheming control” procedures, and context alignment algorithms), as well as architectural features (namely being world-model-based[3] and being amenable to computational proofs[4]), are very important and couldn’t be recovered at the institutional/interface/protocol level. We demonstrated in the post about Gaia Network above that for the “value economy” to work as intended, agents should make decisions based on maximum entropy rather than maximum likelihood estimates[5], and they should share and compose their world models (even if in a privacy-preserving way with zero-knowledge computations).
Indeed, this observation makes evident that the refrain question “AI should be aligned with whom?” doesn’t and shouldn’t have a satisfactory answer if “alignment” is meant to be “totalising value alignment as often conceptualised on LessWrong”; on the other hand, if “alignment” is meant to be context alignment as a practice, the question becomes as non-sensical (in the general form) as the question “AI should interact with whom?”—well, with someone, depending on the situation, in the way and to the degree appropriate!
However, still not completely irrelevant, at least for practical reasons: having shared values on the pre-training/hard-coded/verifiable level, as a minimum, reduces transaction costs because the AI agents shouldn’t then painstakingly “eval” each other’s values before doing any business together.
Both Bengio and LeCun argue for this: see “Scaling in the service of reasoning & model-based ML” (Bengio and Hu, 2023) and “A Path Towards Autonomous Machine Intelligence” (LeCun, 2022).
See “Provably safe systems: the only path to controllable AGI” (Tegmark and Omohundro, 2023).
Which is just another way of saying that they should minimise their (expected) free energy in their model updates/inferences and the course of their actions.
I like your proposed third frame as a somewhat hopeful vision for the future. Instead of pointing out why you think the other frames are poor, I think it would be helpful to maintain a more neutral approach and elaborate which assumptions each frame makes and give a link to your discussion about these in a sidenote.
The problem is that I am not trying to portray a “somewhat hopeful vision”, but rather present a framework for thinking clearly about AI risks, and how to mitigate them. I think the other frames are not merely too pessimistic: I think they are actually wrong, or at least misleading, in important ways that would predictably lead people to favor bad policy if taken seriously.
It’s true that I’m likely more optimistic along some axes than most EAs when it comes to AI (although I tend to think I’m less optimistic when it comes to things like whether moral reflection will be a significant force in the future). However, arguing for generic optimism is not my aim. My aim is to improve how people think about future AI.
Noted! The key point I was trying to make is that I’d think it helpful for the discourse to separate 1) how one would act in a frame and 2) why one thinks each one is more or less likely (which is more contentious and easily gets a bit political). Since your post aims at the former, and the latter has been discussed at more length elsewhere, it would make sense to further de-emphasize the latter.
My post aims at both. It is a post about how to think about AI, and a large part of that is establishing the “right” framing.
(A clearer and more fleshed-out version of this argument is now a top-level post. Read that instead.)
I strongly dislike most AI risk analogies that I see EAs use. While I think analogies can be helpful for explaining a concept to people for the first time, I think they are frequently misused, and often harmful. The fundamental problem is that analogies are consistently mistaken for, and often deliberately intended as arguments for particular AI risk positions. And the majority of the time when analogies are used this way, I think they are misleading and imprecise, routinely conveying the false impression of a specific, credible model of AI, when in fact no such credible model exists.
Here are two particularly egregious examples of analogies I see a lot that I think are misleading in this way:
The analogy that AIs could be like aliens.
The analogy that AIs could treat us just like how humans treat animals.
I think these analogies are typically poor because, when evaluated carefully, they establish almost nothing of importance beyond the logical possibility of severe AI misalignment. Worse, they give the impression of a model for how we should think about AI behavior, even when the speaker is not directly asserting that this is how we should view AIs. In effect, almost automatically, the reader is given a detailed picture of what to expect from AIs, inserting specious ideas of how future AIs will operate into their mind.
While their purpose is to provide knowledge in place of ignorance, I think these analogies primarily misinform or confuse people rather than enlighten them; they give rise to unnecessary false assumptions in place of real understanding.
In reality, our situation with AI is disanalogous to aliens and animals in numerous important ways. In contrast to both aliens and animals, I expect AIs will be born directly into our society, deliberately shaped by us, for the purpose of filling largely human-shaped holes in our world. They will be socially integrated with us, having been trained on our data, and being fluent in our languages. They will interact with us, serving the role of assisting us, working with us, and even providing friendship. AIs will be evaluated, inspected, and selected by us, and their behavior will be determined directly by our engineering. We can see LLMs are already being trained to be kind and helpful to us, having first been shaped by our combined cultural output. If anything I expect this trend of AI assimilation into our society will intensify in the foreseeable future, as there will be consumer demand for AIs that people can trust and want to interact with.
This situation shares almost no relevant feature with our relationship to aliens and animals! These analogies are not merely slightly misleading: they are almost completely wrong.
Again, I am not claiming analogies have no place in AI risk discussions. I’ve certainly used them a number of times myself. But I think they can be, and frequently are, used carelessly, and they seem to regularly slip various incorrect illustrations of how future AIs will behave into people’s minds, even without any intent from the person making the analogy. It would be a lot better if, overall, as a community, we reduced our dependence on AI risk analogies and substituted detailed object-level arguments in their place.
Yes you have!—including just two paragraphs earlier in that very comment, i.e. you are using the analogy “future AI is very much like today’s LLMs but better”. :)
Cf. what I called “left-column thinking” in the diagram here.
For all we know, future AIs could be trained in an entirely different way from LLMs, in which case the way that “LLMs are already being trained” would be pretty irrelevant in a discussion of AI risk. That’s actually my own guess, but obviously nobody knows for sure either way. :)
I read your first paragraph and was like “disagree”, but when I got to the examples, I was like “well, of course I agree here, but that’s only because those analogies are stupid”.
At least one analogy I’d defend is the Sorcerer’s Apprentice one. (Some have argued that the underlying model has aged poorly, but I think that’s a red herring since it’s not the analogy’s fault.) I think it does share important features with the classical x-risk model.
In my latest post I talked about whether unaligned AIs would produce more or less utilitarian value than aligned AIs. To be honest, I’m still quite confused about why many people seem to disagree with the view I expressed, and I’m interested in engaging more to get a better understanding of their perspective.
At the least, I thought I’d write a bit more about my thoughts here, and clarify my own views on the matter, in case anyone is interested in trying to understand my perspective.
The core thesis that I was trying to defend is the following view:
My view: It is likely that by default, unaligned AIs—AIs that humans are likely to actually build if we do not completely solve key technical alignment problems—will produce utilitarian value comparable to that produced by humans, both directly (by being conscious themselves) and indirectly (via their impacts on the world). This is because unaligned AIs will likely be conscious in a morally relevant sense, and they will likely share human moral concepts, since they will be trained on human data.
Some people seem to merely disagree with my view that unaligned AIs are likely to be conscious in a morally relevant sense. And a few others have a semantic disagreement with me in which they define AI alignment in moral terms, rather than the ability to make an AI share the preferences of the AI’s operator.
But beyond these two objections, which I feel I understand fairly well, there’s also significant disagreement about other questions. Based on my discussions, I’ve attempted to distill the following counterargument to my thesis, which I fully acknowledge does not capture everyone’s views on this subject:
Perceived counter-argument: The vast majority of utilitarian value in the future will come from agents with explicitly utilitarian preferences, rather than those who incidentally achieve utilitarian objectives. At present, only a small proportion of humanity holds partly utilitarian views. However, as unaligned AIs will differ from humans across numerous dimensions, it is plausible that they will possess negligible utilitarian impulses, in stark contrast to humanity’s modest (but non-negligible) utilitarian tendencies. As a result, it is plausible that almost all value would be lost, from a utilitarian perspective, if AIs were unaligned with human preferences.
Again, I’m not sure if this summary accurately represents what people believe. However, it’s what some seem to be saying. I personally think this argument is weak. But I feel I’ve had trouble making my views very clear on this subject, so I thought I’d try one more time to explain where I’m coming from here. Let me respond to the two main parts of the argument in some amount of detail:
(i) “The vast majority of utilitarian value in the future will come from agents with explicitly utilitarian preferences, rather than those who incidentally achieve utilitarian objectives.”
My response:
I am skeptical of the notion that the bulk of future utilitarian value will originate from agents with explicitly utilitarian preferences. This clearly does not reflect our current world, where the primary sources of happiness and suffering are not the result of deliberate utilitarian planning. Moreover, I do not see compelling theoretical grounds to anticipate a major shift in this regard.
I think the intuition behind the argument here is something like this:
In the future, it will become possible to create “hedonium”—matter that is optimized to generate the maximum amount of utility or well-being. If hedonium can be created, it would likely be vastly more important than anything else in the universe in terms of its capacity to generate positive utilitarian value.
The key assumption is that hedonium would primarily be created by agents who have at least some explicit utilitarian goals, even if those goals are fairly weak. Given the astronomical value that hedonium could potentially generate, even a tiny fraction of the universe’s resources being dedicated to hedonium production could outweigh all other sources of happiness and suffering.
Therefore, if unaligned AIs would be less likely to produce hedonium than aligned AIs (due to not having explicitly utilitarian goals), this would be a major reason to prefer aligned AI, even if unaligned AIs would otherwise generate comparable levels of value to aligned AIs in all other respects.
If this is indeed the intuition driving the argument, I think it falls short for a straightforward reason. The creation of matter-optimized-for-happiness is more likely to be driven by the far more common motives of self-interest and concern for one’s inner circle (friends, family, tribe, etc.) than by explicit utilitarian goals. If unaligned AIs are conscious, they would presumably have ample motives to optimize for positive states of consciousness, even if not for explicitly utilitarian reasons.
In other words, agents optimizing for their own happiness, or the happiness of those they care about, seem likely to be the primary force behind the creation of hedonium-like structures. They may not frame it in utilitarian terms, but they will still be striving to maximize happiness and well-being for themselves and others they care about regardless. And it seems natural to assume that, with advanced technology, they would optimize pretty hard for their own happiness and well-being, just as a utilitarian might optimize hard for happiness when creating hedonium.
In contrast to the number of agents optimizing for their own happiness, the number of agents explicitly motivated by utilitarian concerns is likely to be much smaller. Yet both forms of happiness will presumably be heavily optimized. So even if explicit utilitarians are more likely to pursue hedonium per se, their impact would likely be dwarfed by the efforts of the much larger group of agents driven by more personal motives for happiness-optimization. Since both groups would be optimizing for happiness, the fact that hedonium is similarly optimized for happiness doesn’t seem to provide much reason to think that it would outweigh the utilitarian value of more mundane, and far more common, forms of utility-optimization.
To be clear, I think it’s totally possible that there’s something about this argument that I’m missing here. And there are a lot of potential objections I’m skipping over. But on a basic level, I mostly just lack the intuition that the thing we should care about, from a utilitarian perspective, is the existence of explicit utilitarians in the future, for the aforementioned reasons. The fact that our current world isn’t well described by the idea that what matters most is the number of explicit utilitarians strengthens my point here.
(ii) “At present, only a small proportion of humanity holds partly utilitarian views. However, as unaligned AIs will differ from humans across numerous dimensions, it is plausible that they will possess negligible utilitarian impulses, in stark contrast to humanity’s modest (but non-negligible) utilitarian tendencies.”
My response:
Since only a small portion of humanity is explicitly utilitarian, the argument’s own logic suggests that there is significant potential for AIs to be even more utilitarian than humans, given the relatively low bar set by humanity’s limited utilitarian impulses. While I agree we shouldn’t assume AIs will be more utilitarian than humans without specific reasons to believe so, it seems entirely plausible that factors like selection pressures for altruism could lead to this outcome. Indeed, commercial AIs seem to be selected to be nice and helpful to users, which (at least superficially) seems “more utilitarian” than the default (primarily self-oriented) impulses of most humans. The fact that humans are only slightly utilitarian should mean that even small forces could cause AIs to exceed human levels of utilitarianism.
Moreover, as I’ve said previously, it’s probable that unaligned AIs will possess morally relevant consciousness, at least in part due to the sophistication of their cognitive processes. They are also likely to absorb and reflect human moral concepts as a result of being trained on human-generated data. Crucially, I expect these traits to emerge even if the AIs do not share human preferences.
To see where I’m coming from, consider how humans routinely are “misaligned” with each other, in the sense of not sharing each other’s preferences, and yet still share moral concepts and a common culture. For example, an employee can share moral concepts with their employer while having very different consumption preferences from them. This picture is pretty much how I think we should primarily think about unaligned AIs that are trained on human data, and shaped heavily by techniques like RLHF or DPO.
Given these considerations, I find it unlikely that unaligned AIs would completely lack any utilitarian impulses whatsoever. However, I do agree that even a small risk of this outcome is worth taking seriously. I’m simply skeptical that such low-probability scenarios should be the primary factor in assessing the value of AI alignment research.
Intuitively, I would expect the arguments for prioritizing alignment to be more clear-cut and compelling than “if we fail to align AIs, then there’s a small chance that these unaligned AIs might have zero utilitarian value, so we should make sure AIs are aligned instead”. If low probability scenarios are the strongest considerations in favor of alignment, that seems to undermine the robustness of the case for prioritizing this work.
While it’s appropriate to consider even low-probability risks when the stakes are high, I’m doubtful that small probabilities should be the dominant consideration in this context. I think the core reasons for focusing on alignment should probably be more straightforward and less reliant on complicated chains of logic than this type of argument suggests. In particular, as I’ve said before, I think it’s quite reasonable to think that we should align AIs to humans for the sake of humans. In other words, I think it’s perfectly reasonable to admit that solving AI alignment might be a great thing to ensure human flourishing in particular.
But if you’re a utilitarian, and not particularly attached to human preferences per se (i.e., you’re non-speciesist), I don’t think you should be highly confident that an unaligned AI-driven future would be much worse than an aligned one, from that perspective.
Here’s my proposed counter-argument, loosely based on the structure of yours.
Summary of claims
A reasonable fraction of computational resources will be spent based on the result of careful reflection.
I expect to be reasonably aligned with the result of careful reflection from other humans.
I expect to be much less aligned with the result of AIs-that-seize-control reflecting, due to less similarity and the potential for AIs to pursue relatively specific objectives from training (things like reward seeking).
Many arguments that human resource usage won’t be that good seem to apply equally well to AIs and thus aren’t differential.
Full argument
The vast majority of value from my perspective on reflection (where my perspective on reflection is probably somewhat utilitarian, but this is somewhat unclear) in the future will come from agents who are trying to optimize explicitly for doing “good” things and are being at least somewhat thoughtful about it, rather than those who incidentally achieve utilitarian objectives. (By “good”, I just mean what seems to them to be good.)
At present, the moral views of humanity are a hot mess. However, it seems likely to me that a reasonable fraction of the total computational resources of our lightcone (perhaps 50%) will in expectation be spent based on the result of a process in which an agent or some agents think carefully about what would be best, in a pretty deliberate and relatively wise way. This could involve eventually deferring to other smarter/wiser agents or massive amounts of self-enhancement. Let’s call this a “reasonably-good-reflection” process.
Why think a reasonable fraction of resources will be spent like this?
If you self-enhance and get smarter, this sort of reflection on your values seems very natural. The same for deferring to other smarter entities. Further, entities in control might live for an extremely long time, so if they don’t lock in something, as long as they eventually get around to being thoughtful it should be fine.
People who don’t reflect like this probably won’t care much about having vast amounts of resources and thus the resources will go to those who reflect.
The argument for “you should be at least somewhat thoughtful about how you spend vast amounts of resources” is pretty compelling at an absolute level and will be more compelling as people get smarter.
Currently a variety of moderately powerful groups are pretty sympathetic to this sort of view and the power of these groups will be higher in the singularity.
I expect that I am pretty aligned (on reasonably-good-reflection) with the result of random humans doing reasonably-good-reflection, as I am also a human and many of the underlying arguments/intuitions I think are important seem likely to seem important to many other humans (given various common human intuitions) upon those humans becoming wiser. Further, I really just care about the preferences of (post-)humans who end up caring most about using vast, vast amounts of computational resources (assuming I end up caring about these things on reflection), because the humans who care about other things won’t use most of the resources. Additionally, I care “most” about the on-reflection preferences I have which are relatively less contingent and more common among at least humans, for a variety of reasons. (One way to put this is that I care less about worlds in which my preferences on reflection seem highly contingent.)
So, I’ve claimed that reasonably-good-reflection resource usage will be non-trivial (perhaps 50%) and that I’m pretty aligned with humans on reasonably-good-reflection. Supposing these, why think that most of the value comes from something like reasonably-good-reflection preferences rather than from other things, e.g. not-very-thoughtful indexical (selfish) consumption preferences? Broadly, three reasons:
I expect huge returns to heavy optimization of resource usage (similar to spending altruistic resources today, IMO, and in the future we’ll be smarter, which will make this effect stronger).
I don’t think that (even heavily optimized) not-very-thoughtful indexical preferences directly result in things I care that much about relative to things optimized for what I care about on reflection (e.g. it probably doesn’t result in vast, vast, vast amounts of experience which is optimized heavily for goodness/$).
Consider how billionaires currently spend money, which doesn’t seem to have much direct value, certainly not relative to their altruistic expenditures.
I find it hard to imagine that indexical self-ish consumption results in things like simulating 10^50 happy minds. See also my other comment. It seems more likely IMO that people with self-ish preferences mostly just buy positional goods that involve little to no experience (separately, I expect this means that people without self-ish preferences get more of the compute, but this is counted in my earlier argument, so we shouldn’t double count it.)
I expect that indirect value “in the minds of the laborers producing the goods for consumption” is also small relative to things optimized for what I care about on reflection. (It seems pretty small or maybe net-negative (due to factory farming) today (relative to optimized altruism) and I expect the share will go down going forward.)
(Aside: I was talking about not-very-thoughtful indexical preferences. It seems likely to me that doing a reasonably good job reflecting on selfish preferences gets you back to something like de facto utilitarianism (at least as far as how you spend the vast majority of computational resources), because personal identity and indexical preferences don’t make much sense and the thing you end up thinking is more like “I guess I just care about experiences in general”.)
What about AIs? I think there are broadly two main reasons to expect what AIs do on reasonably-good-reflection to be worse from my perspective than what humans do:
As discussed above, I am more similar to other humans and when I inspect the object level of how other humans think or act, I feel reasonably optimistic about the results of reasonably-good-reflection for humans. (It seems to me like the main thing holding me back from agreement with other humans is mostly biases/communication/lack of smarts/wisdom given many shared intuitions.) However, AIs might be more different and thus result in less value. Further, the values of humans after reasonably-good-reflection seem close to saturating in goodness from my perspective (perhaps 1⁄3 or 1⁄2 of the value of purely my values), so it seems hard for AI to do better.
To better understand this argument, imagine that instead of humanity the question was between identical clones of myself and AIs. It’s pretty clear I share the same values as the clones, so the clones do pretty much strictly better than AIs (up to self-defeating moral views).
I’m uncertain about the degree of similarity between myself and other humans. But mostly the underlying similarity uncertainties also apply to AIs. So, e.g., maybe I currently think on reasonably-good-reflection humans spend resources 1⁄3 as well as I would and AIs spend resources 1⁄9 as well. If I updated to think that other humans after reasonably-good-reflection only spend resources 1⁄10 as well as I do, I might also update to thinking AIs spend resources 1⁄100 as well.
In many of the stories I imagine for AIs seizing control, very powerful AIs end up directly pursuing close correlates of what was reinforced in training (sometimes called reward-seeking, though I’m trying to point at a more general notion). Such AIs are reasonably likely to pursue relatively obviously valueless-from-my-perspective things on reflection. Overall, they might act more like an ultra-powerful corporation that just optimizes for power/money rather than like our children (see also here). More generally, AIs might in some sense be subjected to wildly higher levels of optimization pressure than humans while being able to better internalize these values (lacking a genetic bottleneck), which can plausibly result in “worse” values from my perspective.
Note that we’re conditioning on safety/alignment technology failing to retain human control, so we should imagine correspondingly less human control over AI values.
I think that the fraction of computation resources of our lightcone used based on the result of a reasonably-good-reflection process seems similar between human control and AI control (perhaps 50%). It’s possible to mess this up of course and either mess up the reflection or to lock-in bad values too early. But, when I look at the balance of arguments, humans messing this up seems pretty similar to AIs messing this up to me. So, the main question is what the result of such a process would be. One way to put this is that I don’t expect humans to differ substantially from AIs in terms of how “thoughtful” they are.
I interpret one of your arguments as being “Humans won’t be very thoughtful about how they spend vast, vast amounts of computational resources. After all, they aren’t thoughtful right now.” To the extent I buy this argument, I think it applies roughly equally well to AIs. So naively, it just divides both sides by the same factor rather than making AI look more favorable. (At least, if you accept that almost all of the value comes from being at least a bit thoughtful, which you also contest. See my arguments for that.)
Suppose that a single misaligned AI takes control and it happens to care somewhat about its own happiness while not having any more “altruistic” tendencies that I would care about or you would care about. (I think misaligned AIs which seize control caring about their own happiness substantially seems less likely than not, but let’s suppose this for now.) (I’m saying “single misaligned AI” for simplicity, I get that a messier coalition might be in control.) It now has access to vast amounts of computation after sending out huge numbers of probes to take control over all available energy. This is enough computation to run absolutely absurd amounts of stuff.
What are you imagining it spends these resources on which is competitive with optimized goodness? Running >10^50 copies of itself which are heavily optimized for being as happy as possible while spending?
If a small number of agents have a vast amount of power, and these agents don’t (eventually, possibly after a large amount of thinking) want to do something which is de facto like the values I end up caring about upon reflection (which is probably, though not certainly, vaguely like utilitarianism in some sense), then from my perspective it seems very likely that the resources will be squandered.
If you’re imagining something like:
It thinks carefully about what would make “it” happy.
It realizes it cares about having as many diverse good experience moments as possible in a non-indexical way.
It realizes that heavy self-modification would result in these experience moments being better and more efficient, so it creates new versions of “itself” which are radically different and produce more efficiently good experiences.
It realizes it doesn’t care much about the notion of “itself” here and mostly just focuses on good experiences.
It runs vast numbers of such copies with diverse experiences.
Then this is just something like utilitarianism by another name, arrived at via a different line of reasoning.
I thought your view was that step (2) in this process won’t go like this. E.g., currently self-ish entities will retain indexical preferences. If so, then I don’t see where the goodness can plausibly come from.
When I look at very rich people (people with >$1 billion), it seems like the dominant way they make the world better via spending money (not via making money!) is via thoughtful altruistic giving, not via consumption.
Perhaps your view is that with the potential for digital minds this situation will change?
(Also, it seems very plausible to me that the dominant effect on current welfare is driven mostly by the effect on factory farming and other animal welfare.)
I expect this trend to further increase as people get much, much wealthier and some fraction (probably most) of them get much, much smarter and wiser with intelligence augmentation.
I want to challenge an argument that I think drives a lot of AI risk intuitions. I think the argument goes something like this:
There is something called “human values”.
Humans broadly share “human values” with each other.
It would be catastrophic if AIs lacked “human values”.
“Human values” are an extremely narrow target, meaning that we need to put in exceptional effort in order to get AIs to be aligned with human values.
My problem with this argument is that “human values” can refer to (at least) three different things, and under every plausible interpretation, the argument appears internally inconsistent.
Broadly speaking, I think “human values” usually refers to one of three concepts:
The individual objectives that people pursue in their own lives (i.e. the individual human desire for wealth, status, and happiness, usually for themselves or their family and friends)
The set of rules we use to socially coordinate (i.e. our laws, institutions, and social norms)
Our cultural values (i.e. the ways that human societies have broadly differed from each other, in their languages, tastes, styles, etc.)
Under the first interpretation, I think premise (2) of the original argument is undermined. In the second interpretation, premise (4) is undermined. In the third interpretation, premise (3) is undermined.
Let me elaborate.
In the first interpretation, “human values” is not a coherent target that we share with one another, since each person has their own separate, generally selfish objectives that they pursue in their own life. In other words, there isn’t one thing called human values. There are just separate, individually varying preferences for 8 billion humans. When a new human is born, a new version of “human values” comes into existence.
In this view, the set of “human values” from humans 100 years ago is almost completely different from the set of “human values” that exists now, since almost everyone alive 100 years ago is now dead. In effect, the passage of time is itself a catastrophe. This implies that “human values” isn’t a shared property of the human species, but rather depends on the exact set of individuals who happen to exist at any moment in time. This is, loosely speaking, a person-affecting perspective.
In the second interpretation, “human values” simply refer to a set of coordination mechanisms that we use to get along with each other, to facilitate our separate individual ends. In this interpretation I do not think “human values” are well-modeled as an extremely narrow target inside a high dimensional space.
Consider our most basic laws: do not murder, do not steal, do not physically assault another person. These seem like very natural ideas that could be stumbled upon by a large set of civilizations, even given wildly varying individual and cultural values between them. For example, the idea that it is wrong to steal from another person seems like a pretty natural idea that even aliens could converge on. Not all aliens would converge on such a value, but it seems plausible that enough of them would that we should not call it an “extremely narrow target”.
In the third interpretation, “human values” are simply cultural values, and it is not clear to me why we would consider changes to this status quo to be literally catastrophic. It seems the most plausible way that cultural changes could be catastrophic is if they changed in a way that dramatically affected our institutions, laws, and norms. But in that case, it starts sounding more like “human values” is being used according to the second interpretation, and not the third.
When I think of values I think of interpretation #2, and I don’t think you prove that P4 is untrue under that interpretation. The idea is that humans are both a) constrained and b) generally inclined to follow some set of rules. An AI would be neither constrained nor necessarily inclined to follow these rules.
Virtually all historical and present atrocities are framed in terms of determining who is a person and who is not. Why would AIs see us as having moral personhood?
P4 is about whether human values are an extremely narrow target, not about whether AIs will necessarily be inclined to follow them, or necessarily be constrained by them. I agree it is logically possible for AIs to exist who would try to murder humans; indeed, there are already humans who try to do that to others. The primary question is instead about how narrow a target the value “don’t murder” or “don’t steal” is, and whether we need to put in exceptional effort in order to hit these targets.
Among humans, it seems the specific target here is not very narrow, despite our greatly varying individual objectives. This fact provides a hint that our basic social mechanisms are not, in fact, an especially narrow target, in my opinion.
Here again I would say the question is more about whether thinking that humans have relevant personhood is an extremely narrow target, not about whether AIs will necessarily see us as persons. Maybe they will see us as persons, and maybe they won’t. But the idea that they would doesn’t seem very unnatural. For one, if AIs are created in something like our current legal system, the concept of legal personhood will already be extended to humans by default. It seems pretty natural for future people to inherit legal concepts from the past. And all I’m really arguing here is that this isn’t an extremely narrow target to hit, not that it must happen by necessity.
I guess “narrow target” is just an underspecified part of your argument then, because I don’t know what it’s meant to capture if not “in most plausible scenarios, AI doesn’t follow the same set of rules as humans”.
Can you outline the case for thinking that “in most plausible scenarios, AI doesn’t follow the same set of rules as humans”? To clarify, by “same set of rules” here I’m imagining basic legal rules: do not murder, do not steal etc. I’m not making a claim that specific legal statutes will persist over time.
It seems to me both that:
To the extent that AIs are our descendants, they should inherit our legal system, legal principles, and legal concepts, similar to how e.g. the United States inherited legal principles from the United Kingdom. We should certainly expect our legal system to change over time as our institutions adapt to technological change. But, absent a compelling reason otherwise, it seems wrong to think that “do not murder a human” will go out the window in “most plausible scenarios”.
Our basic legal rules seem pretty natural, rather than being highly contingent. It’s easy to imagine plenty of alien cultures stumbling upon the idea of property rights, and implementing the rule “do not steal from another legal person”.
My point is that AI could plausibly have rules for interacting with other “persons”, and those rules could look much like ours, but that we will not be “persons” under their code. Consider how “do not murder” has never applied to animals.
If AIs treat us like we treat animals then the fact that they have “values” will not be very helpful to us.
I think AIs will be trained on our data, and will be integrated into our culture, having been deliberately designed for the purpose of filling human-shaped holes in our economy, to automate labor. This means they’ll probably inherit our social concepts, in addition to most other concepts we have about the physical world. This situation seems disanalogous to the way humans interact with animals in many ways. Animals can’t even speak language.
Anyway, even the framing you have given seems like a partial concession towards my original point. A rejection of premise 4 is not equivalent to the idea that AIs will automatically follow our legal norms. Instead, it was about whether “human values” are an extremely narrow target, in the sense of being a natural vs. contingent set of values that are very hard to replicate in other circumstances.
If the way AIs relate to human values is similar to how humans relate to animals, then I’ll point out that many existing humans already find the idea of caring about animals to be quite natural, even if most ultimately decide not to take the idea very far. Compare the concept of “caring about animals” to “caring about paperclip maximization”. In the first instance, we have robust examples of people actually doing that, but hardly any examples of people in the second instance. This is after all because caring about paperclip maximization is an unnatural and arbitrary thing to care about relative to how most people conceptualize the world.
Again, I’m not saying AIs will necessarily care about human values. That was never the claim. The entire question was about whether human values are an “extremely narrow target”. And I think, within this context, given the second interpretation of human values in my original comment, the original thesis seems to have held up fine.
Here’s a fictional dialogue with a generic EA that I think can perhaps help explain some of my thoughts about AI risk compared to most EAs:
EA: “Future AIs could be unaligned with human values. If this happened, it would likely be catastrophic. If AIs are unaligned, they’ll hide their intentions until they’re in a position to strike in a violent coup, and then the world will end (for us at least).”
Me: “I agree that sounds like it would be very bad. But maybe let’s examine why this scenario seems plausible to you. What do you mean when you say AIs might be unaligned with human values?”
EA: “Their utility functions would not overlap with our utility functions.”
Me: “By that definition, humans are already unaligned with each other. Any given person has almost a completely non-overlapping utility function with a random stranger. People—through their actions rather than words—routinely value their own life and welfare thousands of times higher than that of strangers. Yet even with this misalignment, the world does not end in any strong sense. Nor does this fact automatically imply the world will end for a given group within humanity.”
EA: “Sure, but that’s because humans mostly all have similar intelligence and/or capability. Future AIs will be way smarter and more capable than humans.”
Me: “Why does that change anything?”
EA: “Because, unlike individual humans, misaligned AIs will be able to take over the world, in the sense of being able to exert vast amounts of hard power over the rest of the inhabitants on the planet. Currently, no human, or faction of humans, can kill everyone else. AIs will be different along this axis.”
Me: “That isn’t different from groups of humans. Various groups have ‘taken over the world’ in that sense. For example, adults currently control the world, and took over from the previous adults. More arguably, smart people currently control the world. In these cases, both groups have considerable hard power relative to other people.
Consider a random retirement home. Compared to the rest of the world, it has basically no power. If the rest of humanity decided to destroy or loot the retirement home, there would be no serious chance the retirement home could stop it. And yet things like that don’t happen very often even though humans have mostly non-overlapping utility functions and considerable hard power.”
EA: “Sure, but that’s because humans respect long-standing norms and laws, and people could never coordinate to do something like that anyway, nor would they want to. AIs won’t necessarily be similar. If unaligned AIs take over the world, they will likely kill us and then replace us, since there’s no reason for them to leave us alive.”
Me: “That seems dubious. Why won’t unaligned AIs respect moral norms? What benefit would they get from killing us? Why are we assuming they all coordinate as a unified group, leaving us out of their coalition?
I’m not convinced, and I think you’re pretty overconfident about these things. But for the sake of argument, let’s just assume for now that you’re right about this particular point. In other words, let’s assume that when AIs take over the world in the future, they’ll kill us and take our place. I certainly agree it would be bad if we all died from AI. But here’s a more fundamental objection: how exactly is that morally different from the fact that young humans already ‘take over’ the world from older people during every generation, letting the old people die, and then take their place?”
EA: “In the case of generational replacement, humans are replaced with other humans. Whereas in this case, humans are replaced by AIs.”
Me: “I’m asking why it matters morally. Why should I care if a human takes my place after I die compared to an AI?”
EA: “This is just a bedrock principle to me. I care more about humans than I care about AIs.”
Me: “Fair enough, but I don’t personally intrinsically care more about humans than AIs. I think what matters is plausibly something like sentience, and maybe sapience, and so I don’t have an intrinsic preference for ordinary human generational replacement compared to AIs replacing us.”
EA: “Well, if you care about sentience, unaligned AIs aren’t going to care about that. They’re going to care about other things, like paperclips.”
Me: “Why do you think that?”
EA: “Because (channeling Yudkowsky) sentience is a very narrow target. It’s extremely hard to get an AI to care about something like that. Almost all goals are like paperclip maximization from our perspective, rather than things like welfare maximization.”
Me: “Again, that seems dubious. AIs will be trained on our data, and share similar concepts with us. They’ll be shaped by rewards from human evaluators, and be consciously designed with benevolence in mind. For these reasons, it seems pretty plausible that AIs will come to value natural categories of things, including potentially sentience. I don’t think it makes sense to model their values as being plucked from a uniform distribution over all possible physical configurations.”
EA: “Maybe, but this barely changes the argument. AI values don’t need to be drawn from a uniform distribution over all possible physical configurations to be very different from our own. The point is that they’ll learn values in a way that is completely distinct from how we form our values.”
Me: “That doesn’t seem true. Humans seem to get their moral values from cultural learning and social emulation, which seems broadly similar to the way that AIs will get their moral values. Yes, there are innate human preferences that AIs aren’t likely to share with us—but they are mostly things like preferring being in a room that’s 20°C rather than 10°C. Our moral values—such as the ideology we subscribe to—are more of a product of our cultural environment than anything else. I don’t see why AIs will be very different.”
EA: “That’s not exactly right. Our moral values are also the result of reflection and intrinsic preferences for certain things. For example, humans have empathy, whereas AIs won’t necessarily have that.”
Me: “I agree AIs might not share some fundamental traits with humans, like the capacity for empathy. But ultimately, so what? Is that really the type of thing that makes you more optimistic about a future with humans than with AI? There exist some people who say they don’t feel empathy for others, and yet I would still feel comfortable giving them more power despite that. On the other hand, some people have told me that they feel empathy, but their compassion seems to turn off completely when they watch videos of animals in factory farms. These things seem flimsy as a reason to think that a human-dominated future will have much more value than an AI-dominated future.”
EA: “OK but I think you’re neglecting something pretty obvious here. Even if you think it wouldn’t be morally worse for AIs to replace us compared to young humans replacing us, the latter won’t happen for another several decades, whereas AI could kill everyone and replace us within 10 years. That fact is selfishly concerning—even if we put aside the broader moral argument. And it probably merits pausing AI for at least a decade until we are sure that it won’t kill us.”
Me: “I definitely agree that these things are concerning, and that we should invest heavily into making sure AI goes well. However, I’m not sure why the fact that AI could arrive soon makes much of a difference to you here.
AI could also make our lives much better. It could help us invent cures to aging, and dramatically lift our well-being. If the alternative is certain death, only later in time, then the gamble you’re proposing doesn’t seem clear to me from a selfish perspective.
Whether it’s selfishly worth it to delay AI depends quite a lot on how much safety benefit we’re getting from the delay. Actuarial life tables reveal that we accumulate a lot of risk just by living our lives normally. For example, a 30 year old male in the United States has nearly a 3% chance of dying before they turn 40. I’m not fully convinced that pausing AI for a decade reduces the chance of catastrophe by more than that. And of course, for older people, the gamble is worse still.”
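To make the arithmetic behind that actuarial point concrete, here is a minimal sketch in Python. The 0.27% annual mortality figure is an illustrative round number for US males in their 30s, not an exact life-table value:

```python
# Illustrative check of how modest annual mortality compounds over a decade.
# The 0.27% annual rate is an assumed round figure, not an exact actuarial value.
annual_mortality = 0.0027   # assumed average yearly death probability, ages 30-39
years = 10

cumulative_risk = 1 - (1 - annual_mortality) ** years
print(f"Chance of dying before 40: {cumulative_risk:.1%}")  # ~2.7%, i.e. "nearly 3%"
```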
I have so many axes of disagreement that it is hard to figure out which one is most relevant. I guess let’s go one by one.
I would say that pretty much every agent other than me (and probably me at different times and in different moods) is “misaligned” with me, in the sense that I would not like a world where they get to dictate everything that happens without consulting me in any way.
This is a quibble because in fact I think if many people were put in such a position they would try asking others what they want and try to make it happen.
This hypothetical assumes too much, because people outside care about the lovely people in the retirement home, and they represent their interests. The question is, will some future AIs with relevance and power care for humans, as humans become obsolete?
I think this is relevant, because in the current world there is a lot of variety. There are people who care about retirement homes and people who don’t. The people who care about retirement homes work hard to make sure retirement homes are well cared for.
But we could imagine a future world where the AI that pulls ahead of the pack is very indifferent about humans, while the AI that cares about humans falls behind; perhaps this is because caring about humans puts you at a disadvantage (if you are not willing to squish humans in your territory, your space to build servers gets reduced or something; I think this is unlikely but possible) and/or because there is a winner-take-all mechanism and the first AI systems that get there coincidentally don’t care about humans (unlikely but possible). Then we would be without representation, and possibly in quite a sucky situation.
Stop that train: I do not want to be replaced by either humans or AI. I want to be in the future and have relevance, or at least be empowered through agents that represent my interests.
I also want my fellow humans to be there, if they want to, and have their own interests be represented.
I don’t think AIs learn in a similar way to humans, and future AI might learn in an even more dissimilar way. The argument I would find more persuasive is pointing out that humans learn in different ways from one another, from very different data and situations, and yet end up with similar values that include caring for one another. That I find suggestive, though it’s hard to be confident.
Just for the record, this is when I got off the train for this dialogue. I don’t think humans are misaligned with each other in the relevant ways, and if I could press a button to have the universe be optimized by a random human’s coherent extrapolated volition, then that seems great and thousands of times better than what I expect to happen with AI-descendants. I believe this for a mixture of game-theoretic reasons and genuinely thinking that other human’s values do really actually capture most of what I care about.
In this part of the dialogue, when I talk about a utility function of a human, I mean roughly their revealed preferences, rather than their coherent extrapolated volition (which I also think is underspecified). This is important because it is our revealed preferences that better predict our actual behavior, and the point I’m making is simply that behavioral misalignment is common in this sense among humans. And also this fact does not automatically imply the world will end for a given group of humans within humanity.
This is missing a very important point, which is that I think humans have morally relevant experience and I’m not confident that misaligned AIs would. When the next generation replaces the current one this is somewhat ok because those new humans can experience joy, wonder, adventure etc. My best guess is that AIs that take over and replace humans would not have any morally relevant experience, and basically just leave the universe morally empty. (Note that this might be an ok outcome if by default you expect things to be net negative)
I also think that there is way more overlap in the “utility functions” between humans, than between humans and misaligned AIs. Most humans feel empathy and don’t want to cause others harm. I think humans would generally accept small costs to improve the lives of others, and a large part of why people don’t do this is because people have cognitive biases or aren’t thinking clearly. This isn’t to say that any random human would reflectively become a perfectly selfless total utilitarian, but rather that most humans do care about the wellbeing of other humans. By default, I don’t think misaligned AIs will really care about the wellbeing of humans at all.
I don’t think that’s particularly likely, but I can understand if you think this is an important crux.
For what it’s worth, I don’t think it matters as much whether the AIs themselves are sentient, but rather whether they care about sentience. For example, from the perspective of sentience, humans weren’t necessarily a great addition to the world, because of their contribution to suffering in animal agriculture (although I’m not giving a confident take here).
Even if AIs are not sentient, they’ll still be responsible for managing the world, and creating structures in the universe. When this happens, there’s a lot of ways for sentience to come about, and I care more about the lower level sentience that the AI manages than the actual AIs at the top who may or may not be sentient.
I think this is a big moral difference: We do not actively kill the older humans so that we can take over. We care about older people, and societies that are rich enough spend some resources to keep older people alive longer.
The entirety of humanity being killed and replaced by the kind of AI that places so little moral value on us humans would be catastrophically bad, compared to things that are currently occurring.
I find it slightly strange that EAs aren’t emphasizing semiconductor investments more given our views about AI.
(Maybe this is because of a norm against giving investment advice? This would make sense to me, except that there’s also a cultural norm about criticizing charities that people donate to, and EAs seemed to blow right through that one.)
I commented on this topic last year. Later, I was informed that some people have been thinking about this and acting on it to some extent, but overall my impression is that there’s still a lot of potential value left on the table. I’m really not sure though.
Since I might be wrong and I don’t really know what the situation is with EAs and semiconductor investments, I thought I’d just spell out the basic argument, and see what people say:
Credible models of economic growth predict that, if AI can substitute for human labor, then we should expect the year-over-year world economic growth rate to dramatically accelerate, probably to at least 30% and maybe to rates as high as 300% or 3000%.
This rate of growth should be sustainable for a while before crashing, since physical limits appear to permit far more economic value than we’re currently generating. For example, at our current rate of approximately 5.6 megajoules per dollar, capturing the yearly energy output of the sun would allow us to generate an economy worth roughly $6.8×10^25, more than 100 billion times the size of our current economy (a rough arithmetic sketch of these figures appears at the end of this list).
If AI drives this economic productivity explosion, it seems likely that the companies manufacturing computer hardware (i.e. semiconductor companies) will benefit greatly in the midst of all of this. Very little of this seems priced in right now, although I admit I haven’t done any rigorous calculations to prove that.
I agree it’s hard to know who will capture most of the value from the AI revolution, but semiconductor companies, and in particular the companies responsible for designing and manufacturing GPUs, seem like a safer bet than almost anyone else.
I agree it’s possible that the existing public companies will be unseated by private competitors and so investing in the public companies risks losing everything, but my understanding is that semiconductor companies have a large moat and are hard to unseat.
I agree it’s possible that the government will nationalize semiconductor production, but they won’t necessarily steal all the profits from investors before doing so.
I agree that EAs should avoid being too heavily invested in one single asset (e.g. crypto) but how much is EA actually invested in semiconductor stocks? Is this actually a concern right now, or is it just a hypothetical concern? Also, investing in Anthropic seems like a riskier bet since it’s less diversified than a broad semiconductor portfolio, and could easily go down in flames.
I agree that AI might hasten the arrival of some sort of post-property-rights system of governance in which investments don’t have any meaning anymore, but I haven’t seen any strong arguments for this. It seems more likely that e.g. tax rates will go way up, but people still own property.
In general, I agree that there are many uncertainties that this question is riding on, but that’s the same thing with any other thing EA does. Any particular donation to AI safety research, for example, is always uncertain and might be a waste of time.
Investing in semiconductor companies plausibly accelerates AI a little bit which is bad to the extent you think acceleration increases x-risk, but if EA gets a huge payout by investing in these companies, then that might cancel out the downsides from accelerating AI?
Another thing I just thought of is that maybe there are good tax reasons to not switch EA investments to semiconductor stocks, which I think would be fair, and I’m not an expert in any of that stuff.
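As flagged in the growth point above, here is a minimal back-of-the-envelope sketch of the energy arithmetic. It simply restates the figures already quoted (5.6 MJ per dollar and the ~$6.8×10^25 estimate), together with a round ~$100 trillion figure for current gross world product:

```python
# Back-of-the-envelope restatement of the figures quoted in the argument above.
# All inputs are the post's own estimates or round approximations, not precise data.
energy_per_dollar = 5.6e6    # joules of energy used per dollar of output (figure quoted above)
implied_economy   = 6.8e25   # dollars; the estimate quoted above for a solar-scale economy
current_gwp       = 1.0e14   # dollars; gross world product is roughly $100 trillion

energy_implied = implied_economy * energy_per_dollar  # joules/year at today's energy intensity
ratio = implied_economy / current_gwp

print(f"Energy implied at today's intensity: {energy_implied:.1e} J/yr")  # ~3.8e32
print(f"Ratio to current world economy: {ratio:.1e}")                     # ~6.8e11, i.e. >100 billion times
```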
I mostly agree with this (and did also buy some semiconductor stock last winter).
Besides plausibly accelerating AI a bit (which I think is a tiny effect at most unless one plans to invest millions), a possible drawback is motivated reasoning (e.g., one may feel less inclined to think critically of the semi industry, and/or less inclined to favor approaches to AI governance that reduce these companies’ revenue). This may only matter for people who work in AI governance, and especially compute governance.
I’m considering writing a post that critically evaluates the concept of a decisive strategic advantage, i.e. the idea that in the future an AI (or set of AIs) will take over the world in a catastrophic way. I think this concept is central to many arguments about AI risk. I’m eliciting feedback on an outline of this post here in order to determine what’s currently unclear or weak about my argument.
The central thesis would be that it is unlikely that an AI, or a unified set of AIs, will violently take over the world in the future, especially at a time when humans are still widely seen as in charge (if it happened later, I don’t think it’s “our” problem to solve, but instead a problem we can leave to our smarter descendants). Here’s how I envision structuring my argument:
First, I’ll define what is meant by a decisive strategic advantage (DSA). The DSA model has 4 essential steps:
At some point in time an AI agent, or an agentic collective of AIs, will be developed that has values that differ from our own, in the sense that the ~optimum of its utility function ranks very low according to our own utility function
When this agent is weak, it will have a convergent instrumental incentive to lie about its values, in order to avoid getting shut down (e.g. “I’m not a paperclip maximizer, I just want to help everyone”)
However, when the agent becomes powerful enough, it will suddenly strike and take over the world
Then, being now able to act without constraint, this AI agent will optimize the universe ruthlessly, which will be very bad for us
We can compare the DSA model to an alternative model of future AI development:
Premise (1)-(2) above in the DSA story are still assumed true, but
There will never be a point (3) and (4), in which a unified AI agent will take over the world, and then optimize the universe ruthlessly
Instead, AI agents will compromise, trade, and act within a system of laws indefinitely, in order to achieve their objectives, similar to what humans do now
Because this system of laws will descend from our current institutions and legal tradition, it is likely that humans will keep substantial legal rights, potentially retaining lots of wealth from our capital investments and property, even if we become relatively powerless compared to other AI agents in the system
I have two main objections to the DSA model.
Objection 1: It is unlikely that there will be a point at which a unified agent will be able to take over the world, given the existence of competing AIs with comparable power
Prima facie, it seems intuitive that no single AI agent will be able to take over the world if there are other competing AI agents in the world. More generally, we can try to predict the distribution of power between AI agents using reference class forecasting.
This could involve looking at:
Distribution of wealth among individuals in the world
Distribution of power among nations
Distribution of revenue among businesses
etc.
In most of these cases, the function that describes the distribution of power is something like a Pareto distribution, and in particular, it seems rare for one single agent to hold something like >80% of the power.
Therefore, a priori we should assign a low probability to the claim that a unified agent will be able to easily take over the whole world in the future (a toy simulation illustrating this is sketched below).
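Here is a toy simulation of that reference-class intuition. The Pareto shape parameter and agent count are arbitrary illustrative choices, not calibrated to any of the reference classes listed above:

```python
import numpy as np

# Toy illustration: sample "power" for many agents from a heavy-tailed Pareto
# distribution and check how often the single largest agent holds >80% of the total.
rng = np.random.default_rng(0)
alpha = 1.2                        # assumed tail index; smaller means heavier concentration
n_agents, n_trials = 1_000, 2_000  # arbitrary illustrative choices

exceed_count = 0
for _ in range(n_trials):
    power = rng.pareto(alpha, size=n_agents) + 1.0  # standard Pareto samples (minimum value 1)
    if power.max() / power.sum() > 0.8:
        exceed_count += 1

# The printed fraction gives a rough feel for how unusual >80% concentration is in this toy model.
print(f"Trials where one agent holds >80% of total power: {exceed_count / n_trials:.2%}")
```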
To the extent people disagree about the argument I just stated, I expect it’s mostly because they think these reference classes are weak evidence, and they think there are stronger specific object-level points that I need to address. In particular, it seems many people think that AIs will not compete with each other, but instead collude against humans. Their reasons for thinking this include:
The fact that AIs will be able to coordinate well with each other, and thereby choose to “merge” into a single agent
My response: I agree AIs will be able to coordinate with each other, but “ability to coordinate” seems like a continuous variable that we will apply pressure to incrementally, not something that we should expect to be roughly infinite right at the start. Current AIs are not able to “merge” with each other.
If coordination ability increases incrementally over time, then we should see a gradual increase in the concentration of AI agency over time, rather than the sudden emergence of a single unified agent. To the extent this concentration happens incrementally, it will be predictable, the potential harms will be noticeable before getting too extreme, and we can take measures to pull back if we realize that the costs of continually increasing coordination abilities are too high. In my opinion, this makes the challenge here dramatically easier.
In any case, the moment during which we hand over control of the world to AIs will likely occur at a point when the ability for AIs to coordinate is somewhere only modestly above human-level (and very far below perfect).
As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge. We can leave this problem to be solved by our smarter descendants.
The idea that AIs will all be copies of each other, and thus all basically be “a unified agent”
My response: I have two objections.
First, I deny the premise. It seems likely that there will be multiple competing AI projects with different training runs. More importantly, for each pre-training run, it seems likely that there will be differences among deployed AIs due to fine-tuning and post-training enhancements, yielding diversity among AIs in general.
Second, it is unclear why AIs would automatically unify with their copies. I think this idea is somewhat plausible on its face but I have yet to see any strong arguments for it. Moreover, it seems plausible that AIs will have indexical preferences, making them have different values even if they are copies of each other.
The idea that AIs will use logical decision theory
My response: This argument appears to misunderstand what makes coordination difficult. Coordination is not mainly about what decision theory you use. It’s more about being able to synchronize your communication and efforts without waste. See also: the literature on diseconomies of scale.
The idea that a single agent AI will recursively self-improve to become vastly more powerful than everything else in the world
My response: I think this argument, and others like it, suffer from the arguments against fast takeoff given by Paul Christiano, Katja Grace, and Robin Hanson, and I largely agree with what they’ve written about it. For example, here’s Paul Christiano’s take.
Maybe AIs will share collective grievances with each other, prompting a natural alliance among them against humans
My response: if true, we can take steps to mitigate this issue. For example, we can give AIs legal rights, lessening their motives to revolt. While I think this is a significant issue, I also think it’s tractable to solve.
Objection 2: Even if a unified agent can take over the world, it is unlikely to be in their best interest to try to do so
The central argument here would be premised on a model of rational agency, in which an agent tries to maximize benefits minus costs, subject to constraints
The agent would be faced with a choice:
(1) Attempt to take over the world, and steal everyone’s stuff, or
(2) Work within a system of compromise, trade, and law, and get very rich within that system, in order to e.g. buy lots of paperclips
The question of whether (1) is a better choice than (2) is not simply a question of whether taking over the world is “easy” or whether it could be done by the agent. Instead it is a question of whether the benefits of (1) outweigh the costs, relative to choice (2).
It seems likely that working within a system of compromise, trade, and law is more efficient than trying to take over the world, even if you can take over the world. The reason is that subverting the system basically means “going to war” with other parties, which is not usually very efficient, even against weak opponents.
Most literature on the economics of war generally predicts that going to war is worse than trying to compromise, assuming both parties are rational and open to compromise (a toy version of the argument is sketched after this list). This is mostly because:
War is wasteful. You need to spend resources fighting it, which could be productively spent doing other things.
War is risky. Unless you can win a war with certainty, you might lose the war after launching it, which is a very bad outcome if you have some degree of risk-aversion.
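To spell out the standard result slightly more formally, here is a textbook-style toy model of the bargaining range from the economics-of-war literature (the notation is illustrative, not taken from any particular source):

```latex
% Two parties contest a prize worth V. If they fight, A wins with probability p,
% and fighting destroys c_A > 0 and c_B > 0 of value for A and B respectively.
\[
\text{A's expected payoff from war} = pV - c_A, \qquad
\text{B's expected payoff from war} = (1-p)V - c_B .
\]
% Any peaceful division (x, V - x) of the prize with
\[
pV - c_A \;\le\; x \;\le\; pV + c_B
\]
% leaves both sides at least as well off as fighting, and such an x exists
% whenever c_A + c_B > 0; risk aversion only widens this bargaining range.
```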
The fact that “humans are weak and can be easily beaten” cuts both ways:
Yes, it means that a very powerful AI agent could “defeat all of us combined” (as Holden Karnofsky said)
But it also means that there would be little benefit to defeating all of us, because we aren’t really a threat to its power
Conclusion: An AI decisive strategic advantage is still somewhat plausible because revolutions have happened in history, and revolutions seem like a reasonable reference class to draw from. That said, it seems the probability of a catastrophic AI takeover in humanity’s relatively near-term future (say, the next 50 years) is low (maybe a 10% chance of happening). However, it’s perhaps significantly more likely in the very long run.
Your argument in objection 1 doesn’t address the position of people who are worried about an absurd offense-defense imbalance.
Additionally: It may be that no agent can take over the world, but that an agent can destroy the world. Would someone build something like that? Sadly, I think the answer is yes.
I’m having trouble parsing this sentence. Can you clarify what you meant?
What incentive is there to destroy the world, as opposed to take it over? If you destroy the world, aren’t you sacrificing yourself at the same time?
Oh, I can see why it is ambiguous. I meant whether it is easier to attack or defend, which is separate from the “power” attackers have and defenders have.
“What incentive is there to destroy the world, as opposed to take it over? If you destroy the world, aren’t you sacrificing yourself at the same time?”
Some would be willing to do that if they can’t take it over.
What reason is there to think that AI will shift the offense-defense balance absurdly towards offense? I admit such a thing is possible, but it doesn’t seem like AI is really the issue here. Can you elaborate?
I think the main abstract argument for why this is plausible is that AI will change many things very quickly and in a high-variance way. And some human processes will lag behind heavily.
This could plausibly (though not obviously) lead to offense dominance.
I’m not going to fully answer this question, because I have other work I should be doing, but I’ll toss in one argument. If different domains (cyber, bio, manipulation, etc.) have different offense-defense balances, a sufficiently smart attacker will pick the domain with the worst balance. This recurses down further for at least some of these domains, since they aren’t just a single thing but a broad collection of vaguely related things.
I sympathise with/agree with many of your points here (and in general regarding AI x-risk), but something about this recent sequence of quick takes isn’t landing with me in the way some of your other work has. I’ll try to articulate why in some cases, though I apologise if I misread or misunderstand you.
On this post, these two premises/statements raised an eyebrow:
To me, this is just as unsupported as the position of people who are incredibly certain that there will be a ‘treacherous turn’. I get that this is a supposition/alternative hypothesis, but how can you possibly hold a premise that a system of laws will persist indefinitely? This sort of reminds me of the Leahy/Bach discussion where Bach just says ‘it’s going to align itself with us if it wants to, if it likes us, if it loves us’. I kinda want more than that: if we’re going to build these powerful systems, saying ‘trust me bro, it’ll follow our laws and norms and love us back’ doesn’t sound very convincing to me. (For clarity, I don’t think this is your position or framing, and I’m not a fan of the classic/Yudkowskian risk position; I want to say I find both perspectives unconvincing.)
Secondly, people abide by systems of laws and norms, but we also have many cases where individuals/parties/groups overturned those norms once they had accumulated enough power and no longer felt the need to abide by the existing regime. This doesn’t have to look like the traditional DSA model where humanity gets instantly wiped out, but I don’t see why there couldn’t be a future where an AI makes a move like Sulla, using force to overthrow and depower the opposing factions, or like the 18 Brumaire.
For what it’s worth, the Metaculus crowd forecast for the question “Will transformative AI result in a singleton (as opposed to a multipolar world)?” is currently 60%. That is, forecasters believe it’s more likely than not that there won’t be competing AIs with comparable power, which runs counter to your claim.
(I bring this up seeing as you make a forecasting-based argument for your claim.)
I hold a few core ethical ideas that are extremely unpopular: the idea that we should treat the natural suffering of animals as a grave moral catastrophe, the idea that old age and involuntary death are the number one enemy of humanity, and the idea that we should treat so-called farm animals with a very high level of compassion.
Given the unpopularity of these ideas, you might be tempted to think that the reason they are unpopular is that they are exceptionally counterintuitive. But is that the case? Do you really need a modern education and philosophical training to understand them? Perhaps I shouldn’t blame people for not taking seriously things they lack the background to understand.
Yet I claim that these ideas are not actually counterintuitive: they are the type of things you would come up with on your own if you had not been conditioned by society to treat them as abnormal. A thoughtful 15-year-old who was somehow educated without human culture would have no trouble taking them seriously. Do you disagree? Let’s put my theory to the test.
To test my theory that caring about wild animal suffering, aging, and animal mistreatment is exactly what you would care about if you were uncorrupted by our culture, we need look no further than the Bible.
It is known that the book of Genesis was written in ancient times, before anyone knew anything of modern philosophy, contemporary norms of debate, science, or advanced mathematics. The writers of Genesis wrote of a perfect paradise, the one we fell from after we were corrupted. They didn’t know what really happened, of course, so they made stuff up. What is the perfect paradise that they made up?
From Answers in Genesis, a creationist website,
Since creationists believe that humans are responsible for all the evil in the world, they do not make the usual excuse for evil, that it is natural and therefore necessary. They openly call death an enemy, something to be destroyed.
Later,
So in the garden, animals did not hurt one another, and humans did not hurt animals. But the article goes even further and debunks the infamous “plants tho” objection to vegetarianism,
In God’s perfect creation, the one invented by uneducated folks thousands of years ago, we can see that wild animal suffering did not exist, nor did death from old age or the mistreatment of animals.
In this article, I find something so close to my own morality that it strikes me that a creationist, of all people, would write something so elegant,
Unfortunately, it continues,
I contend it doesn’t really take a modern education to invent these ethical notions. The truly hard step is accepting that evil is bad even if you aren’t personally responsible for it.
Some people seem to think the risk from AI comes from AIs gaining dangerous capabilities, like situational awareness. I don’t really agree. I view the main risk as simply arising from the fact that AIs will be increasingly integrated into our world, diminishing human control.
Under my view, the most important thing is whether AIs will be capable of automating economically valuable tasks, since this will prompt people to adopt AIs widely to automate labor. If AIs have situational awareness, but aren’t economically important, that’s not as concerning.
The risk is not so much that AIs will suddenly and unexpectedly take control of the world. It’s that we will voluntarily hand over control to them anyway, and we want to make sure this handoff is handled responsibly.
A sudden coup, while possible, is not necessary for this loss of control.
I have now posted, as a comment on LessWrong, my summary of some recent economic forecasts and whether they are underestimating the impact of the coronavirus. You can help me by critiquing my analysis.
A trip to Mars that brings back human passengers also has a chance of bringing back microbial Martian passengers. This could be an existential risk if microbes from Mars harm our biosphere in a severe and irreparable manner.
From Carl Sagan in 1973, “Precisely because Mars is an environment of great potential biological interest, it is possible that on Mars there are pathogens, organisms which, if transported to the terrestrial environment, might do enormous biological damage—a Martian plague, the twist in the plot of H. G. Wells’ War of the Worlds, but in reverse.”
Note that the microbes would not need to have independently arisen on Mars. It could be that they were transported to Mars from Earth billions of years ago (or the reverse occurred). While some researchers have studied this issue, my impression is that effective altruists have not examined it as a potential source of existential risk.
A line of inquiry to launch could be to determine whether there are any historical parallels on Earth that could give us insight into whether a Mars-to-Earth contamination would be harmful. The introduction of an invasive species into some region loosely mirrors this scenario, but much tighter parallels might still exist.
Since Mars missions are planned for the 2030s, this risk could arrive earlier than essentially all the other existential risks that EAs normally talk about.
See this Wikipedia page for more information: https://en.wikipedia.org/wiki/Planetary_protection
In response to human labor being automated, a lot of people support a UBI funded by a tax on capital. I don’t think this policy is necessarily unreasonable, but if the UBI is later extended to AIs, it would be pretty bad for humans, whose only real assets will be capital.
As a result, the unintended consequence of such a policy may be to set a precedent for a massive wealth transfer from humans to AIs. This could be good if you are a utilitarian and think the marginal utility of wealth is higher for AIs than for humans. But selfishly speaking, it’s a big cost.
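As a rough illustration of the scale of that transfer (a toy calculation with notation I’m introducing here, not a forecast): suppose humans hold essentially all capital K, it is taxed at rate t per period to fund the UBI, and the payments are split evenly among N_h humans and N_a AIs.

```latex
% Net outflow from humans per period (toy model):
\underbrace{tK}_{\text{taxes paid by humans}}
  \;-\; \underbrace{N_h \cdot \frac{tK}{N_h + N_a}}_{\text{UBI received by humans}}
  \;=\; tK \cdot \frac{N_a}{N_h + N_a}
```

If the AI population N_a vastly outnumbers the human population N_h, this approaches the entire tax take tK: nearly everything humans pay in flows to AIs, which is the precedent I’m pointing at.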