This other Ryan Greenblatt is my old account[1]. Here is my LW account.
[1] Account lost to the mists of time and expired university email addresses.
In other words, agents optimizing for their own happiness, or the happiness of those they care about, seem likely to be the primary force behind the creation of hedonium-like structures. They may not frame it in utilitarian terms, but they will still be striving to maximize happiness and well-being for themselves and others they care about regardless. And it seems natural to assume that, with advanced technology, they would optimize pretty hard for their own happiness and well-being, just as a utilitarian might optimize hard for happiness when creating hedonium.
Suppose that a single misaligned AI takes control and it happens to care somewhat about its own happiness while not having any more “altruistic” tendencies that I would care about or you would care about. (I think it’s less likely than not that misaligned AIs which seize control care substantially about their own happiness, but let’s suppose this for now.) (I’m saying “single misaligned AI” for simplicity; I get that a messier coalition might be in control.) It now has access to vast amounts of computation after sending out huge numbers of probes to take control over all available energy. This is enough computation to run absolutely absurd amounts of stuff.
What are you imagining it spends these resources on which is competitive with optimized goodness? Running >10^50 copies of itself which are heavily optimized for being as happy as possible while spending these resources?
If a small number of agents have a vast amount of power, and these agents don’t (eventually, possibly after a large amount of thinking) want to do something which is de facto in line with the values I end up caring about upon reflection (which is probably, though not certainly, vaguely like utilitarianism in some sense), then from my perspective it seems very likely that the resources will be squandered.
If you’re imagining something like:
1. It thinks carefully about what would make “it” happy.
2. It realizes it cares about having as many diverse good experience moments as possible in a non-indexical way.
3. It realizes that heavy self-modification would result in these experience moments being better and more efficient, so it creates new versions of “itself” which are radically different and produce good experiences more efficiently.
4. It realizes it doesn’t care much about the notion of “itself” here and mostly just focuses on good experiences.
5. It runs vast numbers of such copies with diverse experiences.
Then this is just something like utilitarianism by another name via a different line of reasoning.
I thought your view was that step (2) in this process won’t go like this. E.g., currently self-ish entities will retain indexical preferences. If so, then I don’t see where the goodness can plausibly come from.
The fact that our current world isn’t well described by the idea that what matters most is the number of explicit utilitarians strengthens my point here.
When I look at very rich people (people with >$1 billion), it seems like the dominant way they make the world better via spending money (not via making money!) is via thoughtful altruistic giving, not via consumption.
Perhaps your view is that with the potential for digital minds this situation will change?
(Also, it seems very plausible to me that the dominant effect on current welfare is driven mostly by the effect on factory farming and other animal welfare.)
I expect this trend to further increase as people get much, much wealthier and some fraction (probably most) of them get much, much smarter and wiser with intelligence augmentation.
Additionally, how are you feeling about voluntary commitments from labs (RSPs included) relative to alternatives like mandatory regulation by governments?
This is discussed in Holden’s earlier post on the topic here.
Explicit +1 to what Owen is saying here.
(Given that I commented with some counterarguments, I thought I would explicitly note my +1 here.)
In particular, I am persuaded by the argument that, because evaluation is usually easier than generation, it should be feasible to accurately evaluate whether a slightly-smarter-than-human AI is taking unethical actions, allowing us to shape its rewards during training accordingly. After we’ve aligned a model that’s merely slightly smarter than humans, we can use it to help us align even smarter AIs, and so on, plausibly implying that alignment will scale to indefinitely higher levels of intelligence, without necessarily breaking down at any physically realistic point.
This reasoning seems to imply that you could use GPT-2 to oversee GPT-4 by bootstrapping from a chain of models of scales between GPT-2 and GPT-4. However, this isn’t true: the weak-to-strong generalization paper finds that this doesn’t work, and indeed bootstrapping like this doesn’t help at all for ChatGPT reward modeling (I believe it helps on chess puzzles and on nothing else they investigate).
I think this sort of bootstrapping argument might work if we could ensure that each model in the chain was sufficiently aligned and capable of reasoning that it would carefully reason about what humans would want if they were more knowledgeable and then rate outputs based on this. However, I don’t think GPT-4 is either aligned enough or capable enough that we see this behavior. And I still think it’s unlikely it works under these generous assumptions (though I won’t argue for this here).
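To make the bootstrapping setup being discussed concrete, here is a minimal toy sketch (my illustration, not the paper’s actual experiments): model “strength” is stood in for by decision-tree depth on synthetic data, and each successively stronger student is trained only on the labels produced by the previous model in the chain.

```python
# Toy sketch of bootstrapping a chain of increasingly capable models from a
# weak supervisor. Model "strength" is mocked as decision-tree depth on
# synthetic data; every name and number here is an illustrative assumption,
# not the setup from the weak-to-strong generalization paper.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y_true = make_classification(n_samples=20000, n_features=30,
                                n_informative=15, random_state=0)
X_train, y_train = X[:10000], y_true[:10000]
X_test, y_test = X[10000:], y_true[10000:]

# Weak supervisor: a shallow model trained on ground-truth labels.
weak = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# Chain of successively "stronger" students, each trained only on the labels
# produced by the previous model in the chain (never on y_train directly).
labels = weak.predict(X_train)
chain_scores = []
for depth in (4, 8, 16):
    student = DecisionTreeClassifier(max_depth=depth, random_state=0)
    student.fit(X_train, labels)
    chain_scores.append(student.score(X_test, y_test))
    labels = student.predict(X_train)  # this student supervises the next one

# Ceiling: the strongest model trained directly on ground truth.
ceiling = DecisionTreeClassifier(max_depth=16, random_state=0).fit(X_train, y_train)

print("weak supervisor accuracy:", round(weak.score(X_test, y_test), 3))
print("chain accuracies (depths 4, 8, 16):", [round(s, 3) for s in chain_scores])
print("ceiling accuracy (trained on true labels):", round(ceiling.score(X_test, y_test), 3))
```

The question raised above is whether the chain’s accuracy recovers much of the gap between the weak supervisor and the ceiling; per the comment, in the paper this mostly doesn’t happen for reward modeling.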
In fact, it is difficult for me to name even a single technology that I think is currently underregulated by society.
The obvious examples would be synthetic biology, gain-of-function research, and similar.
I also think AI itself is currently massively underregulated even entirely ignoring alignment difficulties. I think the probability of the creation of AI capable of accelerating AI R&D by 10x this year is around 3%. It would be extremely bad for US national interests if such an AI was stolen by foreign actors. This suffices for regulation ensuring very high levels of security IMO. And this is setting aside ongoing IP theft and similar issues.
Sure, but there are many alternative explanations:
There is internal and external pressure to avoid downplaying AI safety.
Regulation is inevitable, so it would be better to ensure that you can at least influence it somewhat. Purely fighting against regulation might go poorly for you.
The leaders care at least a bit about AI safety either out of a bit of altruism or self interest. (Or at least aren’t constantly manipulative to such an extent that they choose all words to maximize their power.)
Not to mention that Big Tech companies whose business plans might be most threatened by “AI pause” advocacy are currently seeing more general “AI safety” arguments as an opportunity to achieve regulatory capture...
Why do you think this? It seems very unclear if this is true to me.
I’m not sure that I buy that critics lack motivation. At least in the space of AI, there will be (and already are) people with immense financial incentive to ensure that x-risk concerns don’t become very politically powerful.
Of course, it might be that the best move for these critics won’t be to write careful and well reasoned arguments for whatever reason (e.g. this would draw more attention to x-risk so ignoring it is better from their perspective).
Edit: this is mentioned in the post, but I’m a bit surprised it isn’t emphasized more.
because it feels very differently about “99% of humanity is destroyed, but the remaining 1% are able to rebuild civilisation” and “100% of humanity is destroyed, civilisation ends”
Maybe? This depends on what you think about the probability that intelligent life re-evolves on earth (it seems likely to me) and how good you feel about the next intelligent species on earth vs humans.
the particular focus on extinction increases the threat from AI and engineered biorisks
IMO, most x-risk from AI probably doesn’t come from literal human extinction but instead from AI systems acquiring most of the control over long-run resources while some/most/all humans survive, but fair enough.
Where the main counterargument is that now the groups in power can be immortal and digital minds will be possible.
See also: AGI and Lock-in
What about “Is Power-Seeking AI an Existential Risk?”?
I don’t know if you’d count it as quantitative, but it is detailed.
My views are reasonably messy, complicated, hard to articulate, and based on a relatively diffuse set of intuitions. I also think I reason about the situation in a pretty different way than you seem to (3). I think it wouldn’t be impossible to try to write up a post on my views, but I would need to consolidate and think about how exactly to express where I’m at. (Maybe 2-5 person days of work.) I haven’t really consolidated my views or reached something close to reflective equilibrium.
I also just think that arguing about pure philosophy very rarely gets anywhere and is very hard to make convincing in general.
I’m somewhat uncertain on the “inside view/mechanistic” level. (But my all-considered view is partially deferring to some people, which makes me overall less worried that I should immediately reconsider my life choices.)
I think my views are compelling, but I’m not sure if I’d say “very compelling”.
I’m in agreement that this consideration makes it hard to do a direct comparison. But I think this consideration should mostly make us more uncertain, rather than making us think that humans are better than the alternative.
Actually, I was just trying to say “I can see what humans are like, and it seems pretty good relative to my current guesses about AIs in ways that don’t just update me up about AIs”; sorry about the confusion.
Currently, humans seem much closer to me on a values level than GPT-4 base. I think this is also likely to be true of future AIs, though I understand why you might not find this convincing.
I think the architecture (learning algorithm, etc.) and training environment between me and other humans seems vastly more similar than between me and likely AIs.
I don’t think I’m going to flesh this argument out to an extent to which you’d find it sufficiently rigorous or convincing, sorry.
You should compare against human nature, which was optimized for something quite different from utilitarianism. If I add up the pros and cons of the thing humans were optimized for and compare it against the thing AIs will be optimized for, I don’t see why it comes out with humans on top, from a utilitarian perspective. Can you elaborate on your reasoning here?
I can’t quickly elaborate in a clear way, but some messy combination of:
I can currently observe humans, which screens off a bunch of the comparison and lets me do direct analysis.
I can directly observe AIs and make predictions of future training methods, and their values seem to result from a much more heavily optimized and precise thing with less “slack” in some sense. (Perhaps this is related to the genetic bottleneck; I’m unsure.)
AIs will be primarily trained on things which look extremely different from “cooperatively achieving high genetic fitness”.
Current AIs seem to use the vast, vast majority of their reasoning power for purposes which aren’t directly related to their final applications. I predict this will also apply to the internal high-level reasoning of AIs. This doesn’t seem true for humans.
Humans seem optimized for something which isn’t that far off from utilitarianism from some perspective? Make yourself survive, make your kin group survive, make your tribe survive, etc? I think utilitarianism is often a natural generalization of “I care about the experience of XYZ, it seems arbitrary/dumb/bad to draw the boundary narrowly, so I should extend this further” (This is how I get to utilitarianism.) I think the AI optimization looks considerably worse than this by default.
(Again, note that I said in my comment above: “Some of these can be defeated relatively easily if we train AIs specifically to be good successors, but the default AIs which end up with power over the future will not have this property.” I edited this in to my prior comment, so you might have missed it, sorry.)
What are these a priori reasons and why don’t they similarly apply to AI?
I am a human. Other humans might end up in a similar spot on reflection.
(Also I care less about values of mine which are highly contingent wrt humans.)
The ones I would say are something like (approximately in priority order):
AIs’ values could result mostly from playing the training game or from other relatively specific optimizations they performed in training, which might result in extremely bizarre values from our perspective.
More generally, AI values might be highly alien in a way where caring about experience seems very strange to them.
AIs by default will be optimized for very specific commercial purposes with narrow specializations and a variety of hyperspecific heuristics, and the resulting values and generalizations of these will be problematic.
I care ultimately about what I would think is good upon (vast amounts of) reflection and there are good a priori reasons to think this is similar to what other humans (who care about using vast amounts of compute) will end up thinking is good.
As a sub argument, I might care specifically about things which are much more specific than “lots of good diverse experience”. And, divergences from what I care about (even conditioning on something roughly utilitarian) might result in massive discounts from my perspective.
I care less about my values and preferences in worlds where they seem relatively contingent, e.g. they aren’t broadly shared on reflection by reasonable fractions of humanity.
AIs don’t have a genetic bottleneck and thus can learn much more specific drives that perform well while evolution had to make values more discoverable and adaptable.
E.g. various things about empathy.
AIs might have extremely low levels of cognitive diversity in their training environments as far as co-workers go which might result in very different attitudes as far as caring about diverse experience.
Some of these can be defeated relatively easily if we train AIs specifically to be good successors, but the default AIs which end up with power over the future will not have this property.
Also, I should note that this isn’t a very strong list, though in aggregate it’s sufficient to make me think that human control is perhaps 4x better than AI control. (For reference, I’d say that me personally being in control is maybe 3x better than human control.) I disagree with a MIRI-style view about the disvalue of AI and the extent of fragility of value that seems implicit.
Another relevant consideration along these lines is that people who selfishly desire high wealth might mostly care about positional goods which are similar to current positional goods. Usage of these positional goods won’t burn much of any compute (resources for potential minds) even if these positional goods become insanely valuable in terms of compute. E.g., land values of interesting places on earth might be insanely high and people might trade vast amounts of computation for this land, but ultimately, the computation will be spent on something else.
why you care about the small fraction of resources spent on altruism
I’m also not sold it’s that small.
Regardless, doesn’t seem like we’re making progress here.
My proposed counter-argument, loosely based on the structure of yours.
Summary of claims
A reasonable fraction of computational resources will be spent based on the result of careful reflection.
I expect to be reasonably aligned with the result of careful reflection from other humans.
I expect to be much less aligned with the result of AIs-that-seize-control reflecting, due to less similarity and the potential for AIs to pursue relatively specific objectives from training (things like reward seeking).
Many arguments that human resource usage won’t be that good seem to apply equally well to AIs and thus aren’t differential.
Full argument
The vast majority of value from my perspective on reflection (where my perspective on reflection is probably somewhat utilitarian, but this is somewhat unclear) in the future will come from agents who are trying to optimize explicitly for doing “good” things and are being at least somewhat thoughtful about it, rather than those who incidentally achieve utilitarian objectives. (By “good”, I just mean what seems to them to be good.)
At present, the moral views of humanity are a hot mess. However, it seems likely to me that a reasonable fraction of the total computational resources of our lightcone (perhaps 50%) will in expectation be spent based on the result of a process in which an agent or some agents think carefully about what would be best in a pretty deliberate and relatively wise way. This could involve eventually deferring to other smarter/wiser agents or massive amounts of self-enhancement. Let’s call this a “reasonably-good-reflection” process.
Why think a reasonable fraction of resources will be spent like this?
If you self-enhance and get smarter, this sort of reflection on your values seems very natural. The same for deferring to other smarter entities. Further, entities in control might live for an extremely long time, so if they don’t lock in something, as long as they eventually get around to being thoughtful it should be fine.
People who don’t reflect like this probably won’t care much about having vast amounts of resources and thus the resources will go to those who reflect.
The argument for “you should be at least somewhat thoughtful about how you spend vast amounts of resources” is pretty compelling at an absolute level and will be more compelling as people get smarter.
Currently a variety of moderately powerful groups are pretty sympathetic to this sort of view and the power of these groups will be higher in the singularity.
I expect that I am pretty aligned (on reasonably-good-reflection) with the result of random humans doing reasonably-good-reflection, as I am also a human and many of the underlying arguments/intuitions I think seem important seem likely to seem important to many other humans (given various common human intuitions) upon those humans becoming wiser. Further, I really just care about the preferences of (post-)humans who end up caring most about using vast, vast amounts of computational resources (assuming I end up caring about these things on reflection), because the humans who care about other things won’t use most of the resources. Additionally, I care “most” about the on-reflection preferences I have which are relatively less contingent and more common among at least humans, for a variety of reasons. (One way to put this is that I care less about worlds in which my preferences on reflection seem highly contingent.)
So, I’ve claimed that reasonably-good-reflection resource usage will be non-trivial (perhaps 50%) and that I’m pretty aligned with humans on reasonably-good-reflection. Supposing these, why think that most of the value is coming from something like reasonably-good-reflection preferences rather than other things, e.g. not-very-thoughtful indexical-preference (selfish) consumption? Broadly three reasons:
I expect huge returns to heavy optimization of resource usage (similar to spending altruistic resources today IMO, and in the future we’ll be smarter, which will make this effect stronger).
I don’t think that (even heavily optimized) not-very-thoughtful indexical preferences directly result in things I care that much about relative to things optimized for what I care about on reflection (e.g. it probably doesn’t result in vast, vast, vast amounts of experience which is optimized heavily for goodness/$).
Consider how billionaires currently spend money, which doesn’t seem to have much direct value, certainly not relative to their altruistic expenditures.
I find it hard to imagine that indexical self-ish consumption results in things like simulating 10^50 happy minds. See also my other comment. It seems more likely IMO that people with self-ish preferences mostly just buy positional goods that involve little to no experience. (Separately, I expect this means that people without self-ish preferences get more of the compute, but this is counted in my earlier argument, so we shouldn’t double count it.)
I expect that indirect value “in the minds of the laborers producing the goods for consumption” is also small relative to things optimized for what I care about on reflection. (It seems pretty small or maybe net-negative (due to factory farming) today (relative to optimized altruism) and I expect the share will go down going forward.)
(Aside: I was talking about not-very-thoughtful indexical-preferences. It’s likely to me that doing a reasonably good job reflecting on selfish preferences gets back to something like de facto utilitarianism (at least as far as how you spend the vast majority of computational resources) because personal identity and indexical preferences don’t make much sense and the thing you end up thinking is more like “I guess I just care about experiences in general”.)
What about AIs? I think there are broadly two main reasons to expect that what AIs do on reasonably-good-reflection will be worse from my perspective than what humans do:
As discussed above, I am more similar to other humans and when I inspect the object level of how other humans think or act, I feel reasonably optimistic about the results of reasonably-good-reflection for humans. (It seems to me like the main thing holding me back from agreement with other humans is mostly biases/communication/lack of smarts/wisdom given many shared intuitions.) However, AIs might be more different and thus result in less value. Further, the values of humans after reasonably-good-reflection seem close to saturating in goodness from my perspective (perhaps 1⁄3 or 1⁄2 of the value of purely my values), so it seems hard for AI to do better.
To better understand this argument, imagine that instead of humanity the question was between identical clones of myself and AIs. It’s pretty clear I share the same values as the clones, so the clones do pretty much strictly better than AIs (up to self-defeating moral views).
I’m uncertain about the degree of similarity between myself and other humans. But, mostly the underlying similarity uncertainties also apply to AIs. So, e.g., maybe I currently think on reasonably-good-reflection humans spend resources 1⁄3 as well as I would and AIs spend resources 1⁄9 as well. If I updated to think that other humans after reasonably-good-reflection only spend resources 1⁄10 as well as I do, I might also update to thinking AIs spend resources 1⁄100 as well. (See the small numeric sketch below, after this list.)
In many of the stories I imagine for AIs seizing control, very powerful AIs end up directly pursuing close correlates of what was reinforced in training (sometimes called reward-seeking, though I’m trying to point at a more general notion). Such AIs are reasonably likely to pursue relatively obviously valueless-from-my-perspective things on reflection. Overall, they might act more like an ultra-powerful corporation that just optimizes for power/money rather than our children (see also here). More generally, AIs might in some sense be subjected to wildly higher levels of optimization pressure than humans while being able to better internalize these values (lack of genetic bottleneck), which can plausibly result in “worse” values from my perspective.
Note that we’re conditioning on safety/alignment technology failing to retain human control, so we should imagine correspondingly less human control over AI values.
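To make the correlated-update point in (1) concrete, here is a tiny numeric sketch (my framing; the “one further similarity step” assumption is only chosen to reproduce the 1⁄3 → 1⁄9 and 1⁄10 → 1⁄100 numbers above): if human reflection captures a fraction r of the value of my own reflection, and AI reflection sits roughly one similar-sized step further away, then AI reflection captures roughly r², so updates about r move both estimates together.

```python
# Illustrative only: assumes AIs are "one further similarity step" away, so
# their value scales like r**2; this is chosen to match the 1/3 -> 1/9 and
# 1/10 -> 1/100 numbers in the comment above.
for r in (1/3, 1/10):
    human_value = r       # value of human reasonably-good-reflection vs. my own
    ai_value = r ** 2     # AIs assumed one similar-sized step further away
    print(f"r = {r:.2f}: humans {human_value:.2f}, AIs {ai_value:.3f}, "
          f"AI/human ratio {ai_value / human_value:.2f}")
```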
I think that the fraction of computational resources of our lightcone used based on the result of a reasonably-good-reflection process seems similar between human control and AI control (perhaps 50%). It’s possible to mess this up of course, either by messing up the reflection or by locking in bad values too early. But, when I look at the balance of arguments, humans messing this up seems pretty similar to AIs messing this up to me. So, the main question is what the result of such a process would be. One way to put this is that I don’t expect humans to differ substantially from AIs in terms of how “thoughtful” they are.
I interpret one of your arguments as being “Humans won’t be very thoughtful about how they spend vast, vast amounts of computational resources. After all, they aren’t thoughtful right now.” To the extent I buy this argument, I think it applies roughly equally well to AIs. So naively, it just divides both sides rather than making AI look more favorable. (At least, if you accept that almost all of the value comes from being at least a bit thoughtful, which you also contest. See my arguments for that.)
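As a minimal sketch of the “divides both sides” point (my illustration; the numbers are made up, with the AI figure picked only so the ratio matches the rough 4x mentioned earlier): discounting how thoughtful future agents will be scales the expected value of human control and AI control by the same factor, leaving their ratio unchanged.

```python
# Hypothetical numbers for illustration; not estimates from the comment above.
align_human, align_ai = 1/3, 1/12   # value of reflection outcomes vs. my own values

for p_thoughtful in (0.5, 0.1):     # fraction of resources spent after good reflection
    value_human = p_thoughtful * align_human
    value_ai = p_thoughtful * align_ai
    # Lowering p_thoughtful shrinks both expected values, but the ratio stays 4x.
    print(f"p_thoughtful={p_thoughtful}: human control {value_human:.3f}, "
          f"AI control {value_ai:.3f}, ratio {value_human / value_ai:.1f}x")
```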