I ran the EA Berkeley group and later the UWashington group, and even this estimate seems high to me (but it would be within my 90% confidence interval, whereas 2000 definitely is not).
Therefore, it is a straw man to argue that NUs don’t value life or positive states: NUs value them instrumentally, which may translate into substantial practical efforts to protect them (even compared with someone who claims to be terminally motivated by them).
By my understanding, a universe with no conscious experiences is the best possible universe by ANU (though there are other equally good universes as well). Would you agree with that?
If so, that’s a strong reason for me to reject it. I want my ethical theory to say that a universe with positive conscious experiences is strictly better than one with no conscious experiences.
I was going to post a few lists that hadn’t already been posted, but this one had all of them already :)
I think 4, 5 and 6 are all valid even if you take the CAIS view. Could you explain how you think those depend on the AGI being an independent agent?
Plausibly 2 and 3 also apply to CAIS, though those are more ambiguous.
Actually, my summary of that post initially dropped the obligation frame because of these reasons :P (Not intentionally, since I try to have objective summaries, but I basically ignored the obligation point while reading and so forgot to put it in the summary.)
I do think the opportunity frame is much more reasonable in that setting, because “human safety problems” are something that you might have been resigned to in the past, and AI design is a surprising option that might let us fix them, so it really does sound like good news. On the other hand, the surprising part about effective altruism is “people are dying for such preventable reasons that we can stop it for thousands of dollars”, which is bad news that it’s really hard to be excited by.
Not sure. A few hypotheses:
Arxiv Sanity has become better at predicting what I care about as I’ve given it more data. I don’t think this is the whole story because the absolute number of papers I see on Twitter has gone down.
I did create my Twitter account primarily for academic stuff, but it’s possible that over time Twitter has learned to show me non-academic stuff that is more attention-grabbing or controversial, despite me trying not to click on those sorts of things.
Academics are promoting their papers less on Twitter.
Not the OP, but the Alignment Newsletter (which I write) should help for technical AI safety. I source from newsletters, blogs, Arxiv Sanity and Twitter (though Twitter is becoming less useful over time). I’d imagine you could do the same for other fields as well.
these sorts of techniques have been applied for decades and have never achieved anything close to human level AI
We also didn’t have the vast amounts of compute that we have today.
other parts of Bostrom’s argument rely upon much broader conceptions of intelligence that would entail the AI having common sense.
My claim is that you can write a program that “knows” about common sense, but still chooses actions by maximizing a function, in which case it’s going to interpret that function literally and not through the lens of common sense. There is currently no way that the “choose actions” part gets routed through the “common sense” part the way it does in humans. I definitely agree that we should try to build an AI system which does interpret goals using common sense—but we don’t know how to do that yet, and that is one of the approaches that AI safety is considering.
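To make this concrete, here is a minimal sketch (all names hypothetical, not a description of any real system) of the kind of program I mean: a common-sense model exists and could be queried, but action selection only ever maximizes the literal objective, so the common sense never affects what the agent does.

```python
# Minimal illustrative sketch: the agent "knows" common sense in the sense
# that common_sense_judgment is available, but choose_action never consults
# it -- actions are picked purely by maximizing the literal objective.

def common_sense_judgment(action):
    """Stand-in for a common-sense model; defined but never used below."""
    return "a human would object to this" if action == "drastic" else "seems fine"

def literal_objective(action):
    """The hand-specified function the agent maximizes, taken literally."""
    return {"mild": 1.0, "drastic": 100.0}[action]

def choose_action(actions):
    # Action selection = argmax of the literal objective; the common-sense
    # module above plays no role in this choice.
    return max(actions, key=literal_objective)

print(choose_action(["mild", "drastic"]))  # -> "drastic", whatever common sense says
```

The gap the paragraph above points at is precisely that nothing routes choose_action through common_sense_judgment; figuring out how to do that robustly is part of the work AI safety is trying to do.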
I agree with the prediction that AGI systems will interpret goals with common sense, but that’s because I expect that we humans will put in the work to figure out how to build such systems, not because any AGI system that has the ability to use common sense will necessarily apply that ability to interpreting its goals.
If we found out today that someone created our world + evolution in order to create organisms that maximize reproductive fitness, I don’t think we’d start interpreting our sex drive using “common sense” and stop using birth control so that we more effectively achieved the original goal we were meant to perform.
I’m not really arguing for Bostrom’s position here, but I think there is a sensible interpretation of it.
Goals/motivation = whatever process the AI uses to select actions.
There is an implicit assumption that this process will be simple and of the form “maximize this function over here”. I don’t like this assumption as an assumption about any superintelligent AI system, but it’s certainly true that our current methods of building AI systems (specifically reinforcement learning) are trying to do this, so at minimum you need to make sure that we don’t build AI using reinforcement learning, or that we get its reward function right, or that we change how reinforcement learning is done somehow.
If you are literally just taking actions that maximize a particular function, you aren’t going to interpret that function using common sense, even if you have the ability to use common sense. Again, I think we could build AI systems that used common sense to interpret human goals—but this is not what current systems do, so there’s some work to be done here.
The arguments you present here are broadly similar to ones that make me optimistic that AI will be good for humanity, but there is work to be done to get there from where we are today.
my impression was that progress was quite jumpy at times, instead of slow and steady.
So let’s say you have an Artificial Intelligence that thinks enormously faster than a human.
But why didn’t you have an AI that thinks only somewhat faster than a human before that?
My math-intuition says “that’s still not well-defined, such reasons may not exist”.
To which you might say “Well, there’s some probability they exist, and if they do exist, they trump everything else, so we should act as though they exist.”
My intuition says “But the rule of letting things that could exist be the dominant consideration seems really bad! I could invent all sorts of categories of things that could exist, that would trump everything I’ve considered so far. They’d all have some small probability of existing, and I could direct my actions any which way in this manner!” (This is what I was getting at with the “meta-oughtness” rule I was talking about earlier.)
To which you might say “But moral reasons aren’t some hypothesis I pulled out of the sky, they are commonly discussed and have been around in human discourse for millennia. I agree that we shouldn’t just invent new categories and put stock into them, but moral reasons hardly seem like a new category.”
And my response would be “I think moral reasons of the type you are talking about mostly came from the human tendency to anthropomorphize, combined with the fact that we needed some way to get humans to coordinate. Humans weren’t likely to just listen to rules that some other human made up, so the rules had to come from some external source. And in order to get good coordination, the rules needed to be followed, and so they had to have the property that they trumped any prudential reasons. This led us to develop the concept of rules that come from some external source and trump everything else, giving us our concept of moral reasons today. Given that our concept of “moral reasons” probably arose from this sort of process, I don’t think that “moral reasons” is a particularly likely thing to actually exist, and it seems wrong to base your actions primarily on moral reasons. Also, as a corollary, even if there do exist reasons that trump all other reasons, I’m more likely to reject the intuition that they must come from some external source independent of humans, since I think that intuition was created by this non-truth-seeking process I just described.”
Okay, cool, I think I at least understand your position now. Not sure how to make progress though. I guess I’ll just try to clarify how I respond to imagining that I held the position you do.
From my perspective, the phrase “moral reason” has both the connotation that it is external to humans and that it trumps all other reasons, and that’s why the intuition is so strong. But if it is decomposed into those two properties, it no longer seems (to me) that they must go together. So from my perspective, when I imagine how I would justify the position you take, it seems to be a consequence of how we use language.
What I have most moral reason to do is what there is most reason to do impartially considered (i.e. from the point of view of the universe)
My intuitive response is that that is an incomplete definition and we would also need to say what impartial reasons are, otherwise I don’t know how to identify the impartial reasons.
4. I don’t think I understand the set up of this question—it doesn’t seem to make a coherent sentence to replace X with a number in the way you have written it.
I did mean for you to replace X with a phrase, not a number.
If my intuition here is right then moral reasons must always trump prudential reasons. Note I don’t have anything more to offer than this intuition, sorry if I made it seem like I did!
Your intuition involves the complex phrase “moral reason” for which I could imagine multiple different interpretations. I’m trying to figure out which interpretation is correct.
Here are some different properties that “moral reason” could have:
1. It is independent of human desires and goals.
2. It trumps all other reasons for action.
3. It is an empirical fact about either the universe or math that can be derived by observation of the universe and pure reasoning.
My main claim is that properties 1 and 2 need not be correlated, whereas you seem to have the intuition that they are, and I’m pushing on that.
A secondary claim is that if it does not satisfy property 3, then you can never infer it and so you might as well ignore it, but “irreducibly normative” sounds to me like it does not satisfy property 3.
Here are some models of how you might be thinking about moral reasons:
a) Moral reasons are defined as the reasons that satisfy property 1. If I think about those reasons, it seems to me that they also satisfy property 2.
b) Moral reasons are defined as the reasons that satisfy property 2. If I think about those reasons, it seems to me that they also satisfy property 1.
c) Moral reasons are defined as the reasons that satisfy both property 1 and property 2.
My response to a) and b) is of the form “That inference seems wrong to me and I want to delve further.”
My response to c) is “Define prudential reasons as the reasons that satisfy property 2 and not-property 1, then prudential reasons and moral reasons both trump all other reasons for action, which seems silly/strange.”
Not if the best thing to do is actually what the supreme being said, and not what you think is right, which is (a natural consequence of) the argument in this post.
(Tbc, I do not agree with the argument in the post.)
There seems to be something that makes you think that moral reasons should trump prudential reasons. The overall thing I’m trying to do is narrow down on what that is. In most of my comments, I’ve thought I’ve identified it, and so I argued against it, but it seems I’m constantly wrong about that. So let me try and explicitly figure it out:
How much would you agree with each of these statements:
If there is a conflict between moral reasons and prudential reasons, you ought to do what the moral reasons say.
If it is an empirical fact about the universe that, independent of humans, there is a process for determining what actions one ought to take, then you ought to do what that process prescribes, regardless of what you desire.
If it is an empirical fact about the universe that, independent of humans, there is a process for determining what actions to take to maximize utility, then you ought to do what that process prescribes, regardless of what you desire.
If there is an external-to-you entity satisfying property X that prescribes actions you should take, then you ought to do what it says, regardless of what you desire. (For what value of X would you agree with this statement?)
I have a very low credence that your proposed meta-normative rule would be true?
I also have a very low credence of that meta-normative rule. I meant to contrast it to the meta-normative rule “binding oughtness trumps regular oughtness”, which I interpreted as “moral reasons trump prudential reasons”, but it seems I misunderstood what you meant there, since you mean “binding oughtness” to apply both to moral and prudential reasons, so ignore that argument.
I agree, my view stems from a bedrock of intuition, that just as the descriptive fact that ‘my table has four legs’ won’t create normative reasons for action, neither will the descriptive fact that ‘Harry desires chocolate ice-cream’ create them either.
This makes me mildly worried that you aren’t able to imagine the worldview where prudential reasons exist. Though I have to admit I’m confused why under this view there are any normative reasons for action—surely all such reasons depend on descriptive facts? Even with religions, you are basing your normative reasons for action upon descriptive facts about the religion.
(Btw, random note, I suspect that Ben Pace above and I have very similar views, so you can probably take your understanding of his view and apply it to mine.)
I see, that makes sense, and I agree with it.
I and most other people (I’m pretty sure) wouldn’t chase the highest probability of infinite utility, since most of those scenarios are also highly implausible and feel very similar to Pascal’s mugging.
However these just wouldn’t constitute normative reason for action and that’s just what you need for an action to be choice-worthy.
As I don’t think that mere desires create reasons for action I think we can ignore them unless they are actually prudential reasons.
I don’t know how to argue against this; you seem to be taking it as axiomatic. The one thing I can say is that it seems clearly obvious to me that your desires and goals can make some actions better to choose than others. It only becomes non-obvious if you expect there to be some external-to-you force telling you how to choose actions, but I see no reason to assume that. It really is fine if your actions aren’t guided by some overarching rule granted authority by virtue of being morality.
But I suspect this isn’t going to convince you. Can we simply assume that prudential reasons exist and figure out the implications?
The distinction between normative/prudential is one developed in the relevant literature, see this abstract for a paper by Roger Crisp to get a sense for it.
Thanks, I think I’ve got it now. (Also it seems to be in your appendix, not sure how I missed that before.)
The issue is that we’re trying to work out how to act with uncertainty about what sort of world we’re in?
I know, and I think in the very next paragraph I try to capture your view, and I’m fairly confident I got it right based on your comment.
However, it seems jarring to think that a person who does what there is most moral reason to do could have failed to do what there was most, all things considered, reason for them to do.
This seems tautological when you define morality as “binding oughtness” and compare against regular oughtness (which presumably applies to prudential reasons). But why stop there? Why not go to metamorality, or “binding meta-oughtness” that trumps “binding oughtness”? For example, “when faced with uncertainty over ought statements, choose the one that most aligns with prudential reasons”.
It is again tautologically true that a person who does what there is most metamoral reason to do could not have failed to do what there was most, all things considered, reason for them to do. It doesn’t sound as compelling, but I claim that is because we don’t have metamorality as an intuitive concept, whereas we do have morality as an intuitive concept.
With that terminology, I think your argument is that we should ignore worlds without a binding oughtness. But in worlds without a binding oughtness, you still have your own desires and goals to guide your actions. This might be what you call ‘prudential’ reasons, but I don’t really understand that term—I thought it was synonymous with ‘instrumental’ reasons, but taking actions for your own desires and goals is certainly not ‘instrumental’.
So it seems to me that in worlds with a binding oughtness that you know about, you should take actions according to that binding oughtness, and otherwise you should take actions according to your own desires and goals.
You could argue that binding oughtness always trumps desires and goals, so that your action should always follow the binding oughtness that is most likely, and you can put no weight on desires and goals. But I would want to know why that’s true.
Like, I could also argue that actually, you should follow the binding meta-oughtness rule, which tells you how to derive ought statements from is statements, and that should always trump any particular oughtness rule, so you should ignore all of those and follow the most likely meta-oughtness rule. But this seems pretty fallacious. What’s the difference?