I don’t know! It’s possible that you can just solve a bargaining problem and then align AI to the solution, like you can align AI to citizens’ assemblies. I want to be pitched.
tylermjohn
I don’t understand why that matters. Whatever discount rate you have, if you’re prioritizing between extinction risk and trajectory change you will have some parameters that tell you something about what is going to happen over N years. It doesn’t matter how long this time horizon is. I think you’re not thinking about whether your claims have bearing on the actual matter at hand.
It would probably be most useful for you to try to articulate a view that avoids the dilemma I mentioned in the first comment of this thread.
You’re not going to be prioritizing between extinction risk and long-term trajectory changes based on tractability if you don’t care about the far future. And for any moral theory you can ask “why do you think this will be a good outcome?”, and as long as you don’t value life intrinsically you’ll have to state some empirical hypotheses about the far future.
Any disagreement about longtermist prioritization should presuppose longtermism
tylermjohn’s Quick takes
I want to see a bargain solver for aligning AI to groups: a technical solution that would let AI systems solve the pie-cutting problem for groups and get them the most of what they want. The best solutions I’ve seen for maximizing long-run value involve using a bargain solver to decide what ASI does, which preserves the richness and cardinality of people’s value functions and gives everyone as much of what they want as possible, weighted by importance. (See the WWOTF Afterwards and the small literature on bargaining-theoretic approaches to moral uncertainty.) But existing democratic approaches to AI alignment seem not to be fully leveraging AI tools, and instead align AI systems to democratic processes that aren’t empowered with AI tools (e.g. CIP’s and CAIS’s alignment of AI to the written output of citizens’ assemblies). Moreover, in my experience the best way to make something happen is just to build the solution. If you might be interested in building this tool and have the background, I would love to try to connect you to funding for it.
For deeper motivation see here.
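To make “bargain solver” concrete, here is a minimal sketch of the kind of thing I have in mind, assuming cardinal utilities over a discrete option set and using the Nash bargaining solution (maximize the product of each party’s gain over their disagreement point). Everything here, the function name, the toy options, the fallback utilities, is illustrative; a real solver would need preference elicitation, weighting, and a far richer option space.

```python
import numpy as np

def nash_bargain(utilities: np.ndarray, disagreement: np.ndarray) -> int:
    """Return the index of the option that maximizes the Nash product of gains.

    utilities:    (n_options, n_agents) matrix of cardinal utilities.
    disagreement: (n_agents,) utility each agent gets if no bargain is struck.
    """
    gains = utilities - disagreement                  # each agent's gain per option
    feasible = (gains > 0).all(axis=1)                # every agent must strictly gain
    products = np.prod(np.clip(gains, 0, None), axis=1)
    scores = np.where(feasible, products, -np.inf)    # rule out infeasible options
    return int(np.argmax(scores))

# Toy example: three options, two agents with different cardinal values.
options = np.array([
    [3.0, 1.0],   # great for agent A, okay for agent B
    [2.0, 2.0],   # balanced
    [0.5, 3.0],   # great for B, but below A's fallback
])
fallback = np.array([1.0, 0.5])   # what each agent gets with no agreement
print(nash_bargain(options, fallback))  # -> 1 (the balanced option) on these numbers
```

The point of the Nash product rather than, say, summing utilities is that it uses the cardinal structure of each person’s value function while being invariant to how each person’s utilities are scaled, which is the property the bargaining-theoretic literature leans on.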
Here’s a shower thought:
If you think extinction risk reduction is highly valuable, then you need some kind of a model of what Earth-originating life will do with its cosmic endowment
Some of the parameters in your model must be related to things other than mere survival, like what this life is motivated by or will attempt to do
Plausibly, there are things you can do to change the values of those parameters and not just the extinction parameter
It won’t work for every model (maybe the other parameters just won’t budge), but for some of them it should.
(Low effort comment as I run out the door, but hope it adds value) To me the most compelling argument in favour of tractability is:
We could make powerful AI agents whose goals are well understood and either do not change or update only in ex ante predictable ways.
These agents would be effectively immortal and the most powerful things in the affectable universe, with no natural competition. They would be able to overcome potentially all natural obstacles, so they would determine what happens in the lightcone.
So, we can make powerful AI agents that determine what happens in the lightcone, whose goals are well understood and either do not change or update only in ex ante predictable ways.
So, we can take actions that determine what happens in the lightcone in an ex ante predictable way.
A cynical and oversimplified (but hopefully illuminating) view, and roughly my view, is that trajectory changes are just longterm power grabs by people with a certain set of values (moral, epistemic, or otherwise). One argument in the other direction is that lots of people are trying to grab power; that’s all powerful people do! And conflict with powerful people over resources is a significant kind of non-neglectedness. But very few people are trying to control the longterm future, due to (e.g.) hyperbolic discounting. So on this view, neglectedness provisionally favours trajectory changes that don’t reallocate power until the future, so that they are not in competition with people seeking power today. A similar argument would apply to other domains where power can be accrued but where competitors are not seeking it.
Some nice and insightful comments from Anders Sandberg on X:
This is very interesting, I liked reading it. I am not sure I entirely agree with the analysis, but I think MPL may well be true, and if it is, it leads to very different long term strategies (e.g. a need for more moral hedging).
I am less convinced that we cannot find high-value states, or that they have to be human-parochial.
A key assumption seems to be that not getting maximal value is a disaster, but I think one can equally have a glass-half-full positive view that the search will find greater and greater values.
This essay also fits with my thinking that there might be new values out there, as yet unrealized. Once there was no life, and hence none of the values linked to living beings. Then consciousness, thinking and culture emerged, adding new kinds of value.
I suspect this might keep on going. Not obvious that new levels add fundamentally greater values, but potentially fundamentally different values. And it is not implausible that some are lexically better than others.
… if we sample new values at a constant rate and the values turn out to have a power law distribution, then the highest value found will in expectation grow (trying to work out the formula, but roughly linearly).
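One way to make the “roughly linearly” bit precise (this formalization is mine, not Sandberg’s, and assumes the sampled values are i.i.d. Pareto draws with tail exponent α > 1):

$$P(X > x) = x^{-\alpha} \quad (x \ge 1), \qquad M_n = \max(X_1, \dots, X_n),$$

$$\mathbb{E}[M_n] = \frac{\Gamma(n+1)\,\Gamma(1 - 1/\alpha)}{\Gamma(n + 1 - 1/\alpha)} \sim \Gamma(1 - 1/\alpha)\, n^{1/\alpha} \quad \text{as } n \to \infty.$$

So if samples accumulate at a constant rate, the expected best value found so far grows like n^(1/α): roughly linearly when α is close to 1, and superlinearly when α < 1 (for α ≤ 1 the expectation diverges, though the median maximum still scales like n^(1/α)).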
That is an excellent question. I think ethical theory matters a lot — see Power Laws of Value. But I also just think our superintelligent descendants are going to be pretty derpy and act on enlightened self-interest as they turn the stars into computers, not pursue very good things. And that might be somewhere where, e.g., @William_MacAskill and I disagree.
I think there is a large minority chance that we will successfully align ASI this century, so I definitely think it is possible.
Thank you! IMO the best argument for subjectivists not having these views would be thinking that (1) humans generally value reasoning processes, (2) there are not that many different reasoning processes you could adopt, or as a matter of biological or social fact we all value roughly the same reasoning processes, and (3) these processes have clear and determinate implications. Or, in short, Kant was right: if we reason from the standpoint of “reason”, which is some well-defined and unified thing that we all care about, we all end up in the same place. But I reject all of these premises.
The other argument is that our values are only determinate over Earthly things we are familiar with in our ancestral environment, and among Earthly things we empirically all kinda care about the same things. (I discuss this a bit here.)
These are excellent comments, and unfortunately they all have the virtue of being perspicuous and true so I don’t have that much to say about them.
I doubt that how rare near-best futures are among desired futures is a strong guide to the expected value of the future. At least, you need to know more about, e.g., the feasibility of near-best futures; whether deliberative processes and scientific progress converge on an understanding of which futures are near-best; etc.
Is the core idea here that human desires and the values people reach on deliberation come apart? That makes sense, though it also leaves open how much deliberation our descendants will actually do / how much their values will be based on a deliberative process. I guess I’ll just state my view without defending it: after a decade in philosophy, I have become pretty pessimistic about deliberation producing convergence rather than more divergence, as more choice points are uncovered and reasoners either think they have a good loss function or just choose not to do backpropagation.
I hope my position statement makes my view at least sort of clear. Though as I said to you, my moral values and my practices do come apart!
Yeah, do you have other proposed reconceptualisations of the debate?
One shower thought I’ve had is that maybe we should think of the debate as about whether to focus on ensuring that humans have final control over AI systems or ensuring that humans do good things with that control. But this is far from perfect.
Have you thought about whether there are any interventions that could transmit human values to this technologically capable intelligence? The complete works of Bentham and an LLM on a ruggedised solar-powered laptop that helps them translate English into their language...
Not very leveraged given the fraction within a fraction within a fraction of success, but maybe worth one marginal person.
Thank you, Will, excellent questions. And thanks for drawing out all of the implications here. Yeah I’m a super duper bullet biter. Age hasn’t dulled my moral senses like it has yours! xP
2. But maybe you think it’s just you who has your values and everyone else would converge on something subtly different—different enough to result in the loss of essentially all value. Then the 1-in-1-million would no longer seem so pessimistic.
Yes, I take (2) on the 1 vs 2 horn. I think I’m the only person who has my exact values. Maybe there’s someone else in the world, but not more than a handful at most. This is because I think our descendants will have to make razor-thin choices in computational space about what matters and how much, and these choices will amount to Power Laws of Value.
But if so, then suppose I’m Galactic Emperor and about to turn everything into X, best by my lights… do you really take a 99.9% chance of extinction, and a 0.1% chance of stuff optimised by you, instead?
I generally like your values quite a bit, but you’ve just admitted that you’re highly scope insensitive. So even if we valued the same matter equally, depending on the empirical facts it looks like I should value my own judgment potentially nonillions of times as much as yours, on scope-sensitivity grounds alone!
3. And if so, do you think that Tyler-now has different values than Tyler-2026? Or are you worried that he might have slightly different values, such that you should be trying to bind yourself to the mast in various ways?
Yup, I am worried about this and I am not doing much about it. I’m worried that the best thing that I could do would simply be to go into cryopreservation right now and hope that my brain is uploaded as a logically omniscient emulation with its values fully locked in and extrapolated. But I’m not super excited about making that sacrifice. Any tips on ways to tie myself to the mast?
what’s the probability you have that:

i. People in general just converge on what’s right?

It would be something like: P(people converge on my exact tastes without me forcing them to) + [P(kind of moral or theistic realism I don’t understand) * P(the initial conditions are such that this convergence happens) * P(it happens quickly enough before other values are locked in) * P(people are very motivated by these values)]. To hazard an off-the-cuff guess, maybe 10^-8 + 10^-4 * 0.2 * 0.3 * 0.4, or about 2.4 * 10^-6.

ii. People don’t converge, but a significant enough fraction converge with you that you and others end up with more than one-millionth of resources?

I should be more humble about this. Maybe it turns out there just aren’t that many free parameters on moral value once you’re a certain kind of hedonistic consequentialist who knows the empirical facts, and those people kind of converge to the same things. Suppose that’s 1/30 odds vs my “it could be anything” modal view. Then suppose 1/20 elites become that kind of hedonistic consequentialist upon deliberation. Then it looks like we control 1/600th of the resources. I’m just making these numbers up, but hopefully they illustrate that this is a useful push that makes me a bit less pessimistic. (I spell out the back-of-envelope arithmetic for i and ii in the snippet below these answers.)

iii. You are able to get most of what you want via trade with others?

Maybe 1/20 that we do get to a suitably ideal kind of trade. I believe what I want is a pretty rivalrous good, i.e. stars, so at the advent of ideal trade I still won’t get very much of what I want. But it’s worth thinking about whether I could get most of what I want in other ways, such as by trading with digital slave-owners to make their slaves extremely happy, in a relatively non-rivalrous way.
I don’t have a clear view on this and think further reflection on this could change my views a lot.
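For concreteness, here is the back-of-envelope arithmetic from i and ii above as a runnable snippet. Every number is one of the made-up inputs from those answers; nothing new is being estimated.

```python
# Back-of-envelope numbers from the answers above (all made up, as noted).
p_converge_on_my_tastes = 1e-8
p_realism_i_dont_understand = 1e-4
p_right_initial_conditions = 0.2
p_fast_enough = 0.3
p_people_motivated = 0.4

p_convergence_on_whats_right = (
    p_converge_on_my_tastes
    + p_realism_i_dont_understand
    * p_right_initial_conditions
    * p_fast_enough
    * p_people_motivated
)
print(p_convergence_on_whats_right)   # ~2.4e-06

# Partial convergence: few free parameters on value * elites who become
# that kind of hedonistic consequentialist on deliberation.
p_few_free_parameters = 1 / 30
p_elites_converge = 1 / 20
print(p_few_free_parameters * p_elites_converge)  # ~1/600 of resources
```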
Yes. Which, at least on optimistic assumptions, means sacrificing lots of lives.
Appreciate this comment, and very much agree. I generally think that humanity’s descendants are going to saturate the stars with Dyson swarms making stuff (there are good incentives to achieve explosive growth), but I think we’re (1) too quick to assume that, (2) too quick to assume we will stop being attached to inefficient Earth stuff, and (3) too quick to assume the Dyson swarms will be implementing great stuff rather than, say, insentient digital slaves used to amass power or solve scientific problems.
Let’s say there are three threat models here: (a) Weird Stuff Matters A Lot, (b) Attachment to Biological Organisms, (c) Disneyland With No Children (the machines aren’t conscious).
I focused mainly on Weird Stuff Matters A Lot. The main reason I focused on this rather than Attachment to Biological Organisms is that I still think computers are going to be so much more economically efficient than biology that in expectation ~75% of everything is computer. Computers are just much more useful than animals for most purposes, and it would be super crazy from most perspectives not to turn most of the stars into computers. (I wouldn’t totally rule out us failing to do that, but incentives push towards it strongly.) If, in expectation, ~75% of everything is already computer, then maximizing computer only makes the world better by about 1/3.
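(To spell out where the 1/3 comes from, on the simplifying assumption, mine, that value scales linearly with the fraction of resources turned into computer: pushing the expected ~75% up to 100% is a relative gain of

$$\frac{1 - 0.75}{0.75} = \frac{1}{3},$$

i.e. about a one-third improvement over the default expectation.)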
I think the Disneyland With No Children threat model is much scarier. I focused on it less here because I wanted to shore up broadly appealing theoretical reasons for trajectory change, and this argument feels much more partisan. But on my partisan worldview:
Consciousness is an indeterminate folk notion that we will reduce to computational properties. (Or something very close to this.)
These computational properties are going to be much, much more precise and gradable than our folk notion and they won’t wear folk psychological properties on their sleeves.
As a result we’re just going to have to make some choices about what stuff we think is conscious and what isn’t. There’s going to be a sharp borderline we’re going to have to pick arbitrarily, probably based on nothing more than whimsical values.
People will disagree about where the borderline is.
Even if people don’t disagree about the borderline, they’ll disagree substantially about cardinality, i.e. how much to value different computational properties relative to others.
Given the Power Laws of Value point, some people’s choices will be a mere shadow of value from the perspective of other people’s choices.
If this “irrealist” view is right, it’s extremely easy to lose out on almost all value.
Separately, I just don’t think our descendants are going to care very much about whether the computers are actually conscious, and so AI design choices are going to be orthogonal to moral value. On this different sort of orthogonality thesis, we’ll lose out on most value just because our descendants will use AI for practical reasons other than moral reasons, and so their intrinsic value will be unoptimized.
So Disneyland With No Children type threat models look very credible to me.
(I do think humans will make a lot of copies of themselves, which is decently valuable, but not if you’re comparing it to the most valuable world or if you value diversity.)
You could have a more realist view where we just make a big breakthrough in cognitive science and realize that a very glowy, distinctive set of computational properties was what we were talking about all along when we talked about consciousness, and everyone would agree to that. I don’t really think that’s how science works, but even if you did have that view it’s hard to see how the computational properties would just wear their cardinality on their sleeves. Whatever computational properties you find you can always value them differently. If you find some really natural measure of hedons in computational space you can always map hedons to moral value with different functions. (E.g. map 1 hedon to 1 value, 2 hedons to 10 value, 3 hedons to 100 value...)
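A toy illustration of that last point (the worlds and numbers are made up; the exponential map is just the 1 → 1, 2 → 10, 3 → 100 mapping from the parenthetical above):

```python
# Illustrative only: two candidate maps from a "hedon" measure to moral value.
# Both agree on the ordering of worlds but disagree wildly on cardinality.
def linear_value(hedons: float) -> float:
    return hedons                      # 1 hedon -> 1 value, 2 -> 2, 3 -> 3, ...

def exponential_value(hedons: float) -> float:
    return 10 ** (hedons - 1)          # 1 hedon -> 1 value, 2 -> 10, 3 -> 100, ...

modest_world, intense_world = 2.0, 6.0  # total hedons in two toy worlds

for value in (linear_value, exponential_value):
    ratio = value(intense_world) / value(modest_world)
    print(value.__name__, ratio)
# linear_value: the intense world is 3x better.
# exponential_value: it is 10,000x better, and the modest world is a rounding error.
```

Both maps agree on which world is better; they disagree by orders of magnitude on how much better, which is the cardinality choice doing all the work.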
So I didn’t focus on it here, but I think it’s definitely good to think about the Disneyland concern and it’s closely related to what I was thinking about when writing the OP.
I really liked @Joe_Carlsmith’s articulation of your 23-word summary: what if all people are paperclippers relative to one another? Though it does make stronger assumptions than we are making here.