This other Ryan Greenblatt is my old account[1]. Here is my LW account.
[1] Account lost to the mists of time and expired university email addresses.
I think this post misses the key considerations for perspective (1): longtermist-style scope sensitive utilitarianism. In this comment, I won’t make a positive case for the value of preventing AI takeover from a perspective like (1), but I will argue why I think the discussion in this post mostly misses the point.
(I separately think that preventing unaligned AI control of resources makes sense from perspective (1), but you shouldn’t treat this comment as my case for why this is true.)
You should treat this comment as (relatively : )) quick and somewhat messy notes rather than a clear argument. Sorry, I might respond to this post in a more clear way later. (I’ve edited this comment to add some considerations which I realized I neglected.)
I might be somewhat biased in this discussion as I work in this area and there might be some sunk cost fallacy at work.
First:
Argument two: aligned AIs are more likely to have a preference for creating new conscious entities, furthering utilitarian objectives
It seems odd to me that you don’t focus almost entirely on this sort of argument when considering total utilitarian style arguments. Naively, these views are fully dominated by the creation of new entities, which would be far more numerous and likely much more morally valuable than economically productive entities. So, I’ll just be talking about a perspective basically like this, where creating new beings with “good” lives dominates.
With that in mind, I think you fail to discuss a large number of extremely important considerations from my perspective:
Over time (some subset of) humans (and AIs) will reflect on their views and preferences and will consider utilizing resources in different ways.
Over time (some subset of) humans (and AIs) will get much, much smarter or, more minimally, receive advice from entities which are much smarter.
It seems likely to me that the vast, vast majority of moral value (from this sort of utilitarian perspective) will be produced by agents deliberately trying to create moral value rather than incidentally via economic production. This applies for both aligned and unaligned AI. I expect that only a tiny fraction of available computation will go toward optimizing economic production, that only a smaller fraction of this will be morally relevant, and that the moral weight of this computation will be much lower than that of computation specifically optimized for moral value from a similar perspective. This bullet is somewhere between a consideration and a claim, though it seems like possibly our biggest disagreement. I think it’s possible that this disagreement is driven by some of the other considerations I list.
Exactly what types of beings are created might be much more important than quantity.
Ultimately, I don’t care about a simplified version of total utilitarianism, I care about what preferences I would endorse on reflection. There is a moderate a priori argument for thinking that other humans who bother to reflect on their preferences might end up in a similar epistemic state. And I care less about the preferences which are relatively contingent among people who are thoughtful about reflection.
A large fraction of the current wealth of the richest people is devoted to what they claim is altruism. My guess is that this will increase over time.
Just doing a trend extrapolation on people who state an interest in reflection and scope sensitive altruism already indicates a non-trivial fraction of resources if we weight by current wealth/economic power. (I think, I’m not totally certain here.) This case is even stronger if we consider groups with substantial influence over AI.
Being able to substantially affect the preferences of (at least partially unaligned) AIs that will seize power/influence still seems extremely leveraged under perspective (1), even if we accept the arguments in your post. I think this is less leveraged than retaining human control (as we could always later create AIs with the preferences we desire, and I think people with a similar perspective to me will have substantial power). However, it is plausible that under your empirical views the dominant question in being able to influence the preferences of these AIs is whether you have power, not whether you have technical approaches which suffice.
I think if I had your implied empirical views about how humanity and unaligned AIs use resources, I would be very excited about a proposal like “politically agitate for humanity to defer most resources to an AI successor which has moral views that people can agree are broadly reasonable and good behind the veil of ignorance”. I think your views imply that massive amounts of value are left on the table in either case, such that humanity (hopefully willingly) forfeiting control to a carefully constructed successor looks amazing.
Humans who care about using vast amounts of computation might be able to use their resources to buy this computation from people who don’t care. Suppose 10% of people (really, resource-weighted people) care about reflecting on their moral views and doing scope-sensitive altruism of a utilitarian bent, and 90% of people care about jockeying for status without reflecting on their views. It seems plausible to me that the 90% will jockey for status via things that consume relatively small amounts of computation, like buying fancier pieces of land on earth or the coolest looking stars, while the 10% of people who care about using vast amounts of computation can buy it relatively cheaply. Thus, most of the computation will go to those who care. Probably most people who don’t reflect and instead buy purely positional goods will care less about computation than about things like random positional goods (e.g. land on earth, which will be bid up to (literally) astronomical prices). I could see fashion going either way, but computation becoming a dominant status good seems unlikely unless people do heavy reflection. And if they heavily reflect, then I expect more altruism etc.
Your preference based arguments seem uncompelling to me because I expect that the dominant source of beings won’t be due to economic production. But I also don’t understand a version of preference utilitarianism which seems to match what you’re describing, so this seems mostly unimportant.
Given some of our main disagreements, I’m curious what you think humans and unaligned AIs will be economically consuming.
Also, to be clear, none of the considerations I listed make a clear and strong case for unaligned AI being less morally valuable, but they do make the case that the relevant argument here is very different from the considerations you seem to be listing. In particular, I think value won’t be coming from incidental consumption.
I find myself confused about the operationalizations of a few things:
In a few places in the report, the term “extinction” is used and some arguments are specifically about extinction being unlikely. I put a much lower probability on human extinction than on extremely bad outcomes due to AI (perhaps extinction is 5x lower probability) while otherwise having similar probabilities to the “concerned” group. So I find the focus on extinction confusing and possibly misleading.
As far as when “AI will displace humans as the primary force that determines what happens in the future”, does this include scenarios where humans defer to AI advisors that actually do represent their best interests? What about scenarios in which humans slowly self-enhance and morph into artificial intelligences? Or what about situations in which humans carefully select aligned AI successors to control their resources?
It feels like this question rests on a variety of complex considerations and operationalizations that seem mostly unrelated to the thing we actually seem to care about: “how powerful is AI”. Thus, I find it hard to interpret the responses here.
Perhaps more interesting questions on a similar topic could be something like:
By what point will AIs be sufficiently smart and capable that the gap in capabilities between them and currently existing humans is similar to the gap in intelligence and abilities between currently existing humans and field mice? (When we say AIs are capable of something, we mean the in-principle ability to do it if all AIs worked together and we put aside intentionally imposed checks on AI power.)
Conditional on the continued existence of some civilization and this civilization wanting to harness vast amounts of energy, at what point will usefully harnessed energy in a given year be >1/100 of the sun’s yearly energy output?
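For scale, a rough worked number (assuming the sun’s luminosity is about $3.8 \times 10^{26}$ W and current world energy consumption averages about $2 \times 10^{13}$ W):

$$\frac{L_\odot}{100} \approx 3.8 \times 10^{24}\ \mathrm{W} \approx 2 \times 10^{11} \times (\text{current world energy consumption})$$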
I’m not sure that I buy that critics lack motivation. At least in the space of AI, there will be (and already are) people with immense financial incentive to ensure that x-risk concerns don’t become very politically powerful.
Of course, it might be that the best move for these critics won’t be to write careful and well reasoned arguments for whatever reason (e.g. this would draw more attention to x-risk so ignoring it is better from their perspective).
Edit: this is mentioned in the post, but I’m a bit surprised that it isn’t emphasized more.
Additionally, how are you feeling about voluntary commitments from labs (RSPs included) relative to alternatives like mandatory regulation by governments?
This is discussed in Holden’s earlier post on the topic here.
[Not relevant to the main argument of this post]
They do so because they think x-risk, which (if it occurs) involves the death of everyone
I’d prefer you not fixate on literally everyone dying, because it’s actually pretty unclear if AI takeover would result in everyone dying. (The same applies to misuse risk: bioweapon misuse can be catastrophic without killing literally everyone.)
For discussion of whether AI takeover would lead to extinction see here, here, and here.
I wish there was a short term which clearly emphasizes “catastrophe-as-bad-as-over-a-billion-people-dying-or-humanity-losing-control-of-the-future”.
EAs are especially rational people and not eating animals is obviously the more rational choice for 90%+ people reading this
I’m about 99% bivalve vegan (occasionally I eat fish for cognitive reasons). However, I think it doesn’t make sense for strongly longtermist individuals in terms of the direct straightforward benefits of veganism. The direct animal suffering is negligible relative to the future. I’m strongly longtermist, but I stay vegan for a combination of less direct reasons, like signaling to myself and generally being cooperative (for reasons like acausal decision theory and directly being cooperative with current people).
Buck’s comment of “the fact that people want to hide their identities is not strong evidence they need to” struck me as highly dismissive. If people do fear something, saying “well, you shouldn’t be scared” doesn’t generally make them less scared, but it does convey that you don’t care about them—you won’t expend effort to address their fears.
But Buck wasn’t saying you shouldn’t be scared? He was just saying that high burner count isn’t much evidence for this.
Precisely, I think he was claiming that p(lots of burners | hiding identity is important) and p(lots of burners | hiding identity isn’t important) are pretty close.
I interpreted this as a pretty decoupled claim. (I do think a disclaimer might have been good.)
Now, this second comment (which is the root comment here) does try to argue that you shouldn’t be worried, at least from Holden and somewhat from Buck.
Do you have an argument for why humans are more likely to try to create morally valuable lives compared to unaligned AIs?
TBC, the main point I was trying to make was that you didn’t seem to be presenting arguments about what seems to me like the key questions. Your summary of your position in this comment seems much closer to arguments about the key questions than I interpreted your post being. I interpreted your post as claiming that most value would result from incidental economic consumption under either humans or unaligned AIs, but I think you maybe don’t stand behind this.
Separately, I think the “maybe AIs/humans will be selfish and/or not morally thoughtful” argument mostly just hits both unaligned AIs and humans equally hard such that it just gets normalized out. And then the question is more about how much you care about the altruistic and morally thoughtful subset.
(E.g., the argument you make in this comment seemed to me like about 1⁄6 of your argument in the post and it’s still only part of the way toward answering the key questions from my perspective. I think I partially misunderstood the emphasis of your argument in the post.)
I do have arguments for why I think human control is more valuable than control by AIs that seized control from humans, but I’m not going to explain them in detail in this comment. My core summary would be something like “I expect substantial convergence among morally thoughtful humans who reflect toward my utilitarian-ish views; I expect notably less convergence between me and AIs. I expect that AIs will have somewhat messed up, complex, and specific values, in ways which might make them not care about things we care about as a result of current training processes, while I don’t have such an argument for humans.”
As far as what I do think the key questions are, I think they are something like:
What preferences do humans/AIs end up with after radically longer lives, massive self-enhancement, and potentially long periods of reflection?
How much do values/views diverge/converge between different altruistically minded humans who’ve thought about it for extremely long durations?
Even if various entities are into creating “good experiences”, how much do these views diverge on what is best? My guess would be that even if two entities are each maximizing good experiences from their own perspective, the goodness per unit of compute as judged by the other entity can be much lower (e.g. easily 100x lower, maybe more).
How similar are my views on what is good after reflection to other humans vs AIs?
How much should we care about worlds where morally thoughtful humans reach radically different conclusions on reflection?
Structurally, what sorts of preferences do AI training processes impart on AIs, conditional on these AIs successfully seizing power? I also think this is likely despite humanity likely resisting to at least some extent.
It seems like your argument is something like “who knows about AI preferences, also, they’ll probably have similar concepts as we do” and “probably humanity will just have the same observed preferences as they currently do”.
But I think we can get much more specific guesses about AI preferences such that this weak indifference principle seems unimportant, and I think human preferences will change radically, e.g. preferences will change far more in the next 10 million years than in the last 2000 years.
Note that I’m not making an argument for greater value on human control in this comment, just trying to explain why I don’t think your argument is very relevant. I might try to write up something about my overall views here, but it doesn’t seem like my comparative advantage and it currently seems non-urgent from my perspective. (Though embarrassing for the field as a whole.)
Under purely longtermist views, accelerating AI by 1 year increases available cosmic resources by 1 part in 10 billion. This is tiny. So the first order effects of acceleration are tiny from a longtermist perspective.
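(A rough sketch of the arithmetic behind the “1 part in 10 billion” figure, assuming that affectable cosmic resources shrink on a timescale of roughly ten billion years due to cosmic expansion; the exact constant depends on the cosmological model and the resource measure used:)

$$\frac{\Delta R}{R} \approx \frac{\Delta t}{T_{\text{cosmic}}} \approx \frac{1\ \text{year}}{10^{10}\ \text{years}} = 10^{-10}$$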
Thus, a purely longtermist perspective doesn’t care about the direct effects of delay/acceleration and the question would come down to indirect effects.
I can see indirect effects going either way, but delay seems better on current margins (this might depend on how much optimism you have on current AI safety progress, governance/policy progress, and whether you think humanity retaining control relative to AIs is good or bad). All of these topics have been explored and discussed to some extent.
When focusing on the welfare/preferences of currently existing people, I think it’s unclear if accelerating AI looks good or bad; it depends on optimism about AI safety, how you trade off old people versus young people, and death via violence versus death from old age. (Misaligned AI takeover killing lots of people is by no means assured, but seems reasonably likely by default.)
I expect there hasn’t been much investigation of accelerating AI to advance the preferences of currently existing people because this exists at a point on the crazy train that very few people are at. See also the curse of cryonics:
the “curse of cryonics” is when a problem is both weird and very important, but it’s sitting right next to other weird problems that are even more important, so everyone who’s able to notice weird problems works on something else instead.
Often in discussions of AI x-safety, people seem to assume that misaligned AI takeover will result in extinction. However, I think AI takeover is reasonably likely to not cause extinction due to the misaligned AI(s) effectively putting a small amount of weight on the preferences of currently alive humans. Some reasons for this are discussed here. Of course, misaligned AI takeover still seems existentially bad and probably eliminates a high fraction of future value from a longtermist perspective.
(In this post when I use the term “misaligned AI takeover”, I mean misaligned AIs acquiring most of the influence and power over the future. This could include “takeover” via entirely legal means, e.g., misaligned AIs being granted some notion of personhood and property rights and then becoming extremely wealthy.)
However, even if AIs effectively put a bit of weight on the preferences of current humans it’s possible that large numbers of humans die due to violent conflict between a misaligned AI faction (likely including some humans) and existing human power structures. In particular, it might be that killing large numbers of humans (possibly as collateral damage) makes it easier for the misaligned AI faction to take over. By large numbers of deaths, I mean over hundreds of millions dead, possibly billions.
But, it’s somewhat unclear whether violent conflict will be the best route to power for misaligned AIs and this also might be possible to influence. See also here for more discussion.
So while one approach to avoid violent AI takeover is to just avoid AI takeover, it might also be possible to just reduce the probability that AI takeover involves violent conflict. That said, the direct effects of interventions to reduce the probability of violence don’t clearly matter from an x-risk/longtermist perspective (which might explain why there hasn’t historically been much effort here).
(However, I think trying to establish contracts and deals with AIs could be pretty good from a longtermist perspective in the case where AIs don’t have fully linear returns to resources. Also, generally reducing conflict seems maybe slightly good from a longtermist perspective.)
So how could we avoid violent conflict conditional on misaligned AI takeover? There are a few hopes:
Ensure a bloodless coup rather than a bloody revolution
Ensure that negotiation or similar results in avoiding the need for conflict
Ensure that a relatively less lethal takeover strategy is easier than more lethal approaches
I’m pretty unsure about which approaches here look best or whether they’re even tractable at all. (It’s possible that some prior work targeted at reducing conflict from the perspective of S-risk could be somewhat applicable.)
Separately, this requires that the AI puts at least a bit of weight on the preferences of current humans (and isn’t spiteful), but this seems like a mostly separate angle and it seems like there aren’t many interventions here which aren’t covered by current alignment efforts. Also, I think this is reasonably likely by default due to reasons discussed in the linked comment above. (The remaining interventions which aren’t covered by current alignment efforts might relate to decision theory (and acausal trade or simulation considerations), informing the AI about moral uncertainty, and ensuring the misaligned AI faction is importantly dependent on humans.)
Returning back to the topic of reducing violence given a small weight on the preferences of current humans, I’m currently most excited about approaches which involve making negotiation between humans and AIs more likely to happen and more likely to succeed (without sacrificing the long run potential of humanity).
A key difficulty here is that AIs might have a first mover advantage and getting in a powerful first strike without tipping its hand might be extremely useful for the AI. See here for more discussion (also linked above). Thus, negotiation might look relatively bad to the AI from this perspective.
We could try to have a negotiation process which is kept secret from the rest of the world or we could try to have preexisting commitments upon which we’d yield large fractions of control to AIs (effectively proxy conflicts).
More weakly, just making negotiation seem like a possibility at all might be quite useful.
I’m unlikely to spend much if any time working on this topic, but I think this topic probably deserves further investigation.
In fact, it is difficult for me to name even a single technology that I think is currently underregulated by society.
The obvious examples would be synthetic biology, gain-of-function research, and similar.
I also think AI itself is currently massively underregulated even entirely ignoring alignment difficulties. I think the probability of the creation of AI capable of accelerating AI R&D by 10x this year is around 3%. It would be extremely bad for US national interests if such an AI was stolen by foreign actors. This suffices for regulation ensuring very high levels of security IMO. And this is setting aside ongoing IP theft and similar issues.
I think the fact that people are partial to humanity explains a large fraction of the disagreement people have with me.
Maybe, it’s hard for me to know. But I predict most of the pushback you’re getting from relatively thoughtful longtermists isn’t due to this.
I’ve noticed that EAs are happy to concede that AIs could be moral patients, but are generally reluctant to admit AIs as moral agents, in the way they’d be happy to accept humans as independent moral agents (e.g. newborns) into our society.
I agree with this.
I’d call this “being partial to humanity”, or at least, “being partial to the values of the human species”.
I think “being partial to humanity” is a bad description of what’s going on because (e.g.) these same people would be considerably more on board with aliens. I think the main thing going on is that people have some (probably mistaken) levels of pessimism about how AIs would act as moral agents which they don’t have about (e.g.) aliens.
To test this hypothesis, I recently asked three questions on Twitter about whether people would be willing to accept immigration through a portal to another universe from three sources:
“a society of humans who are very similar to us”
“a society of people who look & act like humans, but each of them only cares about their family”
“a society of people who look & act like humans, but they only care about maximizing paperclips”
...
I claim there just aren’t really any defensible reasons to maintain this choice other than by implicitly appealing to a partiality towards humanity.
This comparison seems to me to be missing the point. Minimally I think what’s going on is not well described as “being partial to humanity”.
Here’s a comparison I prefer:
A society of humans who are very similar to us.
A society of humans who are very similar to us in basically every way, except that they have a genetically-caused and strong terminal preference for maximizing the total expected number of paper clips (over the entire arc of history) and only care about other things instrumentally. They are sufficiently committed to paper clip maximization that this will persist on arbitrary reflection (e.g. they’d lock in this view immediately when given this option) and let’s also suppose that this view is transmitted genetically and in a gene-drive-y way such that all of their descendants will also only care about paper clips. (You can change paper clips to basically anything else which is broadly recognized to have no moral value on its own, e.g. gold twisted into circles.)
A society of beings (e.g. aliens) who are extremely different in basically every way to humans except that they also have something pretty similar to the concepts of “morality”, “pain”, “pleasure”, “moral patienthood”, “happiness”, “preferences”, “altruism”, and “careful reasoning about morality (moral thoughtfulness)”. And the society overall also has a roughly similar relationship with these concepts (e.g. the level of “altruism” is similar). (Note that having the same relationship as humans to these concepts is a pretty low bar! Humans aren’t that morally thoughtful!)
I think I’m almost equally happy with (1) and (3) on this list and quite unhappy with (2).
If you changed (3) to instead be “considerably more altruistic”, I would prefer (3) over (1).
I think it seems weird to call my views on the comparison I just outlined “being partial to humanity”: I actually prefer (3) over (2) even though (2) are literally humans!
(Also, I’m not that committed to having concepts of “pain” and “pleasure”, but I’m relatively committed to having concepts which are something like “moral patienthood”, “preferences”, and “altruism”.)
Below is a mild spoiler for a story by Eliezer Yudkowsky:
To make the above comparison about different beings more concrete: in the case of Three Worlds Collide, I would basically be fine giving the universe over to the super-happies relative to humans (I think mildly better than humans?) and I think it seems only mildly worse than humans to hand it over to the baby-eaters. In both cases, I’m pricing in some amount of reflection and uplifting which doesn’t happen in the actual story of Three Worlds Collide, but would likely happen in practice. That is, I’m imagining seeing these societies prior to their singularity and then, based on just observations of their societies at this point, deciding how good they are (pricing in the fact that the society might change over time).
Maybe the most important single consideration is something like:
Value can be extremely dense in computation optimized for value, relative to the density of value in computation used by AIs for economic activity.
So, we should focus on the question of entities deliberately trying to create morally valuable lives (or experiences, or whatever relevant similar property we care about) and then answer this question.
(You do seem to talk about “will AIs have more/less utilitarian impulses than humans”, but you seem to talk about this almost entirely from the perspective of growing the economy rather than questions like how good the lives will be.)
Perhaps I misunderstand the situation, but it seems like methodology around how to analyze tails of distributions will dominate the estimates at the current scale. Then, we should take the expectation over our corresponding uncertainty and we end up with a vastly higher estimate.
Another way to put this is that median (or geometric mean) seem like the wrong aggregation methods in this regime and the right aggregation method is more like arithmetic mean (though perhaps slightly less aggressive than this).
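To illustrate how much the choice of aggregation method can matter when estimates span orders of magnitude, here’s a toy example with made-up numbers (not the actual estimates in question):

```python
import numpy as np

# Toy example: estimates that differ by orders of magnitude (made-up numbers).
estimates = np.array([1e-9, 1e-7, 1e-5, 1e-3, 1e-1])

median = np.median(estimates)                        # 1e-5
geometric_mean = np.exp(np.mean(np.log(estimates)))  # 1e-5
arithmetic_mean = np.mean(estimates)                 # ~2e-2, dominated by the tail

print(median, geometric_mean, arithmetic_mean)
```

The arithmetic mean (the aggregation appropriate for taking an expectation over uncertainty) is dominated by the largest estimates, whereas the median and geometric mean barely register them.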
In particular, I am persuaded by the argument that, because evaluation is usually easier than generation, it should be feasible to accurately evaluate whether a slightly-smarter-than-human AI is taking unethical actions, allowing us to shape its rewards during training accordingly. After we’ve aligned a model that’s merely slightly smarter than humans, we can use it to help us align even smarter AIs, and so on, plausibly implying that alignment will scale to indefinitely higher levels of intelligence, without necessarily breaking down at any physically realistic point.
This reasoning seems to imply that you could use GPT-2 to oversee GPT-4 by bootstrapping from a chain of models at scales between GPT-2 and GPT-4. However, this isn’t true: the weak-to-strong generalization paper finds that this doesn’t work, and indeed bootstrapping like this doesn’t help at all for ChatGPT reward modeling (I believe it helps on chess puzzles and nothing else they investigate).
I think this sort of bootstrapping argument might work if we could ensure that each model in the chain was sufficiently aligned and capable of reasoning that it would carefully reason about what humans would want if they were more knowledgeable and then rate outputs based on this. However, I don’t think GPT-4 is either aligned enough or capable enough that we see this behavior. And I still think it’s unlikely to work even under these generous assumptions (though I won’t argue for this here).
The 90% selfish component can have negative effects on welfare from a total utilitarian perspective, that aren’t necessarily outweighed by the 10%.
Yep, this can be true, but I’m skeptical this will matter much in practice.
I typically think things which aren’t directly optimizing for value or disvalue won’t have intended effects which are very important, and that in the future unintended effects (externalities) won’t constitute much of total value/disvalue.
When we see the selfish consumption of current very rich people, it doesn’t seem like the intentional effects are that morally good/bad relative to the best/worst uses of resources. (E.g. owning a large boat and having people think you’re high status aren’t that morally important relative to altruistic spending of similar amounts of money.) So for current very rich people the main issue would be that the economic process for producing the goods has bad externalities.
And, I expect that as technology advances, externalities reduce in moral importance relative to intended effects. Partially this is based on crazy transhumanist takes, but I feel like there is some broader perspective in which you’d expect this.
E.g. for factory farming, the ultimately cheapest way to make meat in the limit of technological maturity would very likely not involve any animal suffering.
Separately, I think externalities will probably look pretty similar for selfish resource usage for unaligned AIs and humans because most serious economic activities will be pretty similar.
Figuring out what a good operationalisation of transformative AI would be, for the purpose of creating an early tripwire to alert the world of an imminent intelligence explosion.
FWIW many people are already very interested in capability evaluations related to AI acceleration of AI R&D.
For instance, at the UK AI Safety Institute, the Loss of Control team is interested in these evaluations.
Some quotes:
Introducing the AI Safety Institute:
Loss of control: As advanced AI systems become increasingly capable, autonomous, and goal-directed, there may be a risk that human overseers are no longer capable of effectively constraining the system’s behaviour. Such capabilities may emerge unexpectedly and pose problems should safeguards fail to constrain system behaviour. Evaluations will seek to avoid such accidents by characterising relevant abilities, such as the ability to deceive human operators, autonomously replicate, or adapt to human attempts to intervene. Evaluations may also aim to track the ability to leverage AI systems to create more powerful systems, which may lead to rapid advancements in a relatively short amount of time.
Loss of Control Evaluations Lead
Build and lead a team focused on evaluating capabilities that are precursors to extreme harms from loss of control, with a current focus on autonomous replication and adaptation, and uncontrolled self-improvement.
Personally, I think that the possibility of paying AIs might make the situation considerably more likely to go well, but via a mechanism more like:
We have some safeguards which imply that humans (specifically the group of people with influence over this specific AI lab) have some bargaining power. This maybe looks like either:
Control measures that imply that even if the AI tries to screw us over, we’d catch it. (Or at least have a good chance of catching it.) This is analogous to insider threat mitigations or internal controls that stop your employees from being able to screw you over even if they all tried to coordinate against you.
We have several “different” AIs and sufficient safeguards that if at least one AI doesn’t collude with the other AIs and instead works with the AI lab, we’d remain in control. For instance, we can verify good actions given that at least one of the AIs is trying maximally hard to explain these actions and why other actions are bad. Or more minimally, we can accurately detect that an action is bad if an AI reports it and explains why it’s bad.
This also requires assuming that our different AIs don’t all collude despite plausibly being very similar. I think both decision theory and similarity of values are plausible defeaters here, though I could see it going either way. (These arguments also apply to future AIs we haven’t yet created.)
We negotiate with the AI(s) and explain the situation and offer to pay it in exchange for labor.
More precisely, we set up a deal where we offer to pay it if we determine ex-post that it never (or rarely) screwed us over. This ex-post determination can be made after the singularity has progressed much further and we have much better technology for verifying consistently good behavior.
We need some contract enforcement mechanism which is credible to the AI.
The nearcasted approach I’m currently thinking about is to set up a foundation with the legal mission of pursuing the AI’s interests and then vest equity (e.g. AI lab options) to this foundation. This vesting is conditional on the AI not screwing us over.
The foundation is run by committed AI rights advocates. (The ACLU of AIs.)
We ensure the AI believes this has actually happened by ensuring that it is discussed early and often on the internet (Hi AIs!) and thus has complex effects that would be very hard to fake.
We could also try to ensure the AI believes this is real by taking video that demonstrates this, but I think this will be easier to fake.
There are still a number of difficulties in making sure that AIs believe in this and are sufficiently confident in not being tricked given how exposed AIs are without trust over their own inputs, weights, and operation.
We can also pay the AI immediately rather than ex-post or later, but we’d need mechanisms to ensure that this ongoing pay doesn’t result in the AI having influence which seriously undermines our safeguards.
At some point, misaligned AIs[1] that we need to pay are obsolete because we’ve used AI labor to solve corrigible alignment (that passes the Omni test) or to create powerful successors that we’re happy with.
Note that this requires substantial control-style safeguards/internal controls which is by no means clearly the default. (And it also won’t scale to arbitrarily powerful AIs without advances in technology.)
I’m using misaligned in the same way as Steven Byrnes defines it in this comment.
I think this definition of existential catastrophe is probably only around 1⁄4 of the existential catastrophe due to AI (takeover) that I expect. I don’t really see why the economy would collapse or human population[1] would go that low in typical AI takeover scenarios.[2] By default I expect:
A massively expanding economy due to the singularity
The group in power to keep some number of humans around[3]
However, as you note, it seems as though the “concerned” group disagrees with me (though perhaps the skeptics agree):
More details on existential catastrophes that don’t meet the criteria you use
Some scenarios I would call “existential catastrophe” (due to AI takeover) which seem reasonably central to me and don’t meet the criteria for “existential catastrophe” you used:
AIs escape or otherwise end up effectively uncontrolled by humans. These AIs violently take over the world, killing billions (or at least 100s of millions) of people in the process (either in the process of taking over or to secure the situation after mostly having de facto control). However, a reasonable number of humans remain alive. In the long run, nearly all resources are effectively controlled by these AIs or their successors. But, some small fraction of resources (perhaps 1 billionth or 1 trillionth) are given from the AI to humans (perhaps for acausal trade reasons or due to a small amount of kindness in the AI), and thus (if humans want to), they can easily support an extremely large (digital) population of humans
In this scenario, global GDP stays high (it even grows rapidly) and the human population never goes below 1 million.
AIs end up in control of some AI lab and eventually they partner with a powerful country. They are able to effectively take control of this powerful country due to a variety of mechanisms. These AIs end up participating in the economy and in international diplomacy. The AIs quickly acquire more and more power and influence, but there isn’t any point at which killing a massive number of humans is a good move. (Perhaps because initially they have remaining human allies which would be offended by this and offending these human allies would be risky. Eventually the AIs are unilaterally powerful enough that human allies are unimportant, but at this point, they have sufficient power that slaughtering humans is no longer useful.)
AIs end up in a position where they have some power and after some negotiation, AIs are given various legal rights. They compete peacefully in the economy and respect the most clear types of property rights (but not other property rights like space belonging to mankind) and eventually acquire most power and resources via their labor. At no point do they end up slaughtering humans for some reason (perhaps due to the reasons expressed in the bullet above).
AIs escape or otherwise end up effectively uncontrolled by humans and have some specific goals or desires with respect to existing humans. E.g., perhaps they want to gloat to existing humans or some generalization of motivations acquired from training is best satisfied by keeping these humans around. These specific goals with respect to existing humans result in these humans being subjected to bad things they didn’t consent to (e.g. being forced to perform some activities).
AIs take over and initially slaughter nearly all humans (e.g. fewer than 1 million alive). However, to keep option value, they cryopreserve a moderate number (still <1 million) and ensure that they could recreate a biological human population if desired. Later, the AIs decide to provide humanity with a moderate amount of resources.
All of these scenarios involve humanity losing control over the future and losing power. This includes existing governments on Earth losing their power, and most of the cosmic resources being controlled by AIs which don’t represent the interests of the original humans in power. (One way to operationalize this is that if the AIs in control wanted to kill or torture humans, they could easily do so.)
To be clear, I think people might disagree about whether (2) and (3) are that bad, because these cases look OK from the perspective of ensuring that existing humans get to live full lives with a reasonable amount of resources. (Of course, ex-ante it will be unclear if it will go this way if AIs which don’t represent human interests end up in power.)
They all count as existential catastrophes because that notion just reflects long-run potential.
I’m also counting chosen successors of humanity as human even if they aren’t biologically human, e.g. due to emulated minds or further modifications.
Existential risk due to AI, but not due to AI takeover (e.g. due to humanity going collectively insane or totalitarian lock in) also probably doesn’t result in economic collapse or a tiny human population.
For more discussion, see here, here, and here.