A minor detail: It’s a bit inaccurate to say that the Foundational Research Institute works on general x-risks. This text explains that FRI focuses on reducing risks of astronomical suffering, which is related to, but not the same as, x-risk reduction.
Tobias_Baumann
Thanks a lot, Peter, for taking the time to evaluate SHIC! I agree that their work seems to be very promising.
In particular, it seems that students and future leaders are one of the most important target groups of effective altruism.
Thanks for your post! I agree that work on preventing risks of future suffering is highly valuable.
It’s tempting to say that it implies that the expected value of a minuscule increase in existential risk to all sentient life is astronomical.
Even if the future is negative according to your values, there are strong reasons not to increase existential risk. This would be extremely uncooperative towards other value systems, and there are many good reasons to be nice to other value systems. It is better to pull the rope sideways by working to improve the future (i.e. reducing risks of astronomical suffering) conditional on there being a future.
In addition, I think it makes sense for utilitarians to adopt a quasi-deontological rule against using violence, regardless of whether one is a classical utilitarian or suffering-focused. This obviously prohibits something like increasing risks of extinction.
Great post! I agree with your overall assessment that other approaches may be more promising than HRAD.
I’d like to add that this may (in part) depend on our outlook on which AI scenarios are likely. Conditional on MIRI’s view that a hard or unexpected takeoff is likely, HRAD may be more promising (though it’s still unclear). If the takeoff is soft or AI will be more like the economy, then I personally think HRAD is unlikely to be the best way to shape advanced AI.
(I wrote a related piece on strategic implications of AI scenarios.)
Do you mean more promising than other technical safety research (e.g. concrete problems, Paul’s directions, MIRI’s non-HRAD research)?
Yeah, and also (differentially) more promising than AI strategy or AI policy work. But I’m not sure how strong the effect is.
If so, I’d be interested in hearing why you think hard / unexpected takeoff differentially favors HRAD.
In a hard / unexpected takeoff scenario, it’s more plausible that we need to get everything more or less exactly right to ensure alignment, and that we have only one shot at it. This might favor HRAD because a less principled approach makes it comparatively unlikely that we get all the fundamentals right when we build the first advanced AI system.
In contrast, if we think there’s no such discontinuity and AI development will be gradual, then AI control may be at least somewhat more similar (but surely not entirely comparable) to how we “align” contemporary software systems. That is, it would be more plausible that we could test advanced AI systems extensively without risking catastrophic failure or that we could iteratively try a variety of safety approaches to see what works best.
It would also be more likely that we’d get warning signs of potential failure modes, so that it’s comparatively more viable to work on concrete problems whenever they arise, or to focus on making the solutions to such problems scalable – which, to my understanding, is a key component of Paul’s approach. In this picture, successful alignment without understanding the theoretical fundamentals is more likely, which makes non-HRAD approaches more promising.
My personal view is that I find a hard and unexpected takeoff unlikely, and accordingly favor other approaches than HRAD, but of course I can’t justify high confidence in this given expert disagreement. Similarly, I’m not highly confident that the above distinction is actually meaningful.
I’d be interested in hearing your thoughts on this!
Thanks for writing this up! I agree that this is a relevant argument, even though many steps of the argument are (as you say yourself) not airtight. For example, consciousness or suffering may be related to learning, in which case point 3) is much less clear.
Also, the future may contain vastly larger populations (e.g. because of space colonization), which, all else being equal, may imply (vastly) more suffering. Even if your argument is valid and the fraction of suffering decreases, it’s not clear whether the absolute amount will be higher or lower (as you claim in 7.).
Finally, I would argue we should focus on the bad scenarios anyway – given sufficient uncertainty – because there’s not much to do if the future will “automatically” be good. If s-risks are likely, my actions matter much more.
(This is from a suffering-focused perspective. Other value systems may arrive at different conclusions.)
Thanks for writing this up!
I think the idea is intriguing, and I agree that this is possible in principle, but I’m not convinced of your take on its practical implications. Apart from heuristic reasons to be sceptical of a new idea on this level of abstractness and speculativeness, my main objection is that a high degree of similarity with respect to reasoning (which is required for the decisions to be entangled) probably goes along with at least some degree of similarity with respect to values. (And if the values of the agents that correlate with me are similar to mine, then the result of taking them into account is also closer to my own values than the compromise value system of all agents.)
You write:
Superrationality only motivates cooperation if one has good reason to believe that another party’s decision algorithm is indeed extremely similar to one’s own. Human reasoning processes differ in many ways, and sympathy towards superrationality represents only one small dimension of one’s reasoning process. It may very well be extremely rare that two people’s reasoning is sufficiently similar that, having common knowledge of this similarity, they should rationally cooperate in a prisoner’s dilemma.
Conditional on this extremely high degree of similarity to me, isn’t it also more likely that their values are similar to mine? For instance, if my reasoning is shaped by the experiences I’ve had, my genetic makeup, or the set of all ideas I’ve read about over the course of my life, then an agent with identical or highly similar reasoning would also share a lot of these characteristics. But of course, my experiences, genes, etc. also determine my values, so similarity with respect to these factors implies similarity with respect to values.
This is not the same as claiming that a given characteristic X that’s relevant to decision-making is generally linked to values, in the sense that people with X have systematically different values. It’s a subtle difference: I’m not saying that certain aspects of reasoning generally go along with certain values across the entire population; I’m saying that a high degree of similarity regarding reasoning goes along with similarity regarding values.
Agreed. As someone who prioritises s-risk reduction, I find it odd that long-termism is sometimes considered equivalent to x-risk reduction. It is legitimate if people think that x-risk reduction is the best way to improve the long-term, but it should be made clear that this is based on additional beliefs about ethics (rejecting suffering-focused views and not being very concerned about value drift), about how likely x-risks in this century are, and about how tractable it is to reduce them, relative to other ways of improving the long-term. I for one think that none of these points is obvious.
So I feel that there is a representativeness problem between x-risk reduction and other ways of improving the long-term future (not necessarily only s-risk reduction), in addition to an underrepresentation of near-term causes.
Great point – I agree that it would be valuable to have a common scale.
I’m a bit surprised by the 1-10% estimate. This seems very low, especially given that “serious catastrophe caused by machine intelligence” is broader than narrow alignment failure. If we include possibilities like serious value drift as new technologies emerge, or difficult AI-related cooperation and security problems, or economic dynamics riding roughshod over human values, then I’d put much more than 10% (plausibly more than 50%) on something not going well.
Regarding the “other thoughtful people” in my 80% estimate: I think it’s very unclear who exactly one should update towards. What I had in mind is that many EAs who have thought about this appear not to have high confidence in successful narrow alignment (it’s not clear whether the median is above 50%), judging from my impressions from interacting with people (which are obviously not representative). My opinion felt quite contrarian relative to this, which is why I thought I should be less confident than the inside view suggests, although, as you say, it’s quite hard to grasp what people’s opinions actually are.
On the other hand, one possible interpretation (but not the only one) of the relatively low level of concern for AI risk among the larger AI community and societal elites is that people are quite optimistic that “we’ll know how to cross that bridge once we get to it”.
Working on these problems makes a lot of sense, and I’m not saying that the philosophical issues around what “human values” means will likely be solved by default.
I think increasing philosophical sophistication (or “moral uncertainty expansion”) is a very good idea from many perspectives. (A direct comparison to moral circle expansion would also need to take relative tractability and importance into account, which seems unclear to me.)
Thanks for the detailed comments!
(Also, BTW, I would have preferred the word “narrow” or something like it in the post title, because some people use “alignment” in a broad sense and as a result may misinterpret you as being more optimistic than you actually are.)
Good point – changed the title.
Also, distributed emergence of AI is likely not safer than centralized AI, because an “economy” of AIs would be even harder to control and harness towards human values than a single or small number of AI agents.
As long as we consider only narrow alignment, it does seem safer to me in that local misalignment or safety issues in individual systems would not immediately cause everything to break down, because such a system would (arguably) not be able to obtain a decisive strategic advantage and take over the world. So there’d be time to react.
But I agree with you that an economy-like scenario entails other safety issues, and aligning the entire “economy” with human (compromise) values might be very difficult. So I don’t think this is safer overall, or at least it’s not obvious. (From my suffering-focused perspective, distributed emergence of AI actually seems worse than a scenario of the form “a single system quickly takes over and forms a singleton”, as the latter seems less likely to lead to conflict-related disvalue.)
This assumes that alignment work is highly parallelizable. If it’s not, then doing more alignment work now can shift the whole alignment timeline forward, instead of just adding to the total amount of alignment work in a marginal way.
Yeah, I do think that alignment work is fairly parallelizable, and future work also has a (potentially very big) information advantage over current work because they will know more about what AI techniques look like. Is there any precedent of a new technology where work on safety issues was highly serial and where it was therefore crucial to start working on safety a long time in advance?
This only applies to short-term “alignment” and not to long-term / scalable alignment. That is, I have an economic incentive to build an AI that I can harness to give me short-term profits, even if that’s at the expense of the long term value of the universe to humanity or human values. This could be done for example by creating an AI that is not at all aligned with my values and just giving it rewards/punishments so that it has a near-term instrumental reason to help me (similar to how other humans are useful to us even if they are not value aligned to us).
I think there are two different cases:
If the human actually cares only about short-term selfish gain, possibly at the expense of others, then this isn’t a narrow alignment failure, it’s a cooperation problem. (But I agree that it could be a serious issue).
If the human actually cares about the long term, then it appears that she’s making a mistake by buying an AI system that is only aligned in the short term. So it comes down to human inadequacy – given sufficient information she’d buy a long-term aligned AI system instead, and AI companies would have incentive to provide long-term aligned AI systems. Though of course the “sufficient information” part is crucial, and is a fairly strong assumption as it may be hard to distinguish between “short-term alignment” and “real” alignment. I agree that this is another potentially serious problem.
I think we ourselves don’t know how to reliably distinguish between “attempts to manipulate” and “attempts to help”, so it would be hard for AIs to learn this. One problem is that our own manipulate/help classifier was trained on a narrow set of inputs (i.e., of other humans manipulating/helping) and will likely fail when applied to AIs due to distributional shift.
Interesting point. I think I still have an intuition that there’s a fairly simple core to it, but I’m not sure how to best articulate this intuition.
Thank you – great to hear that you’ve found it useful!
Great post – thanks for writing this up!
Excellent work!
Re: entomophagy, I think the problem isn’t just direct consumption, but also the use of insects as animal feed – see e.g. this article. Unlike directly eating insects, this doesn’t evoke a strong disgust reaction.
Thanks Jason – I’m excited to see more research on this!
What do you make of the possibility of flow-through effects on long-term attitudes towards insects / invertebrates? For instance, one could argue that entomophagy is particularly relevant because it involves a lot of people directly harming insects – which might, similar to meat consumption, bias people against giving moral weight to insects. (On the other hand, we already engage in many other everyday practices that harm insects or invertebrates – even just walking around outside will squash some bugs.)
Perhaps it would be interesting to study how the saliency of causing direct harm to insects / invertebrates affects people’s attitude?
Very interesting points! I largely agree with your (new) views. Some thoughts:
If you think that extinction risk this century is less than 1%, then in particular, you think that extinction risk from transformative AI is less than 1%. So, for this to be consistent, you have to believe either
a) that it’s unlikely that transformative AI will be developed at all this century,
b) that transformative AI is unlikely to lead to extinction when it is developed, e.g. because it will very likely be aligned in at least a narrow sense. (I wrote up some arguments for this a while ago.)
Which of the two do you believe to what extent? For instance, if you put 10% on transformative AI this century – which is significantly more conservative than “median EA beliefs” – then you’d have to believe that the conditional probability of extinction is less than 10%. (I’m not saying I disagree – in fact, I believe something along these lines myself.)
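The consistency check above is simple arithmetic: the unconditional extinction risk is the product of the probability of transformative AI arriving and the conditional probability of extinction given that it arrives. A minimal sketch (the 1% cap and 10% figure are the ones from the discussion above; the function name is mine, chosen for illustration):

```python
# Decomposition: P(extinction via TAI) = P(TAI this century) * P(extinction | TAI)

def max_conditional_extinction_risk(p_extinction_cap: float, p_tai: float) -> float:
    """Largest P(extinction | TAI) consistent with an overall risk cap.

    p_extinction_cap: upper bound on unconditional extinction risk this century
    p_tai: probability that transformative AI is developed this century
    """
    return p_extinction_cap / p_tai

# If overall extinction risk is believed to be under 1%,
# and P(TAI this century) is 10%, then P(extinction | TAI)
# must be below 0.01 / 0.10 ≈ 10%.
bound = max_conditional_extinction_risk(0.01, 0.10)
print(f"P(extinction | TAI) must be below {bound:.0%}")
```

The same calculation shows why more aggressive AI timelines force lower conditional risk: at 50% on transformative AI this century, the same 1% cap requires a conditional extinction probability below 2%.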
What do you think about the possibility of a growth mode change (i.e. much faster pace of economic growth and probably also social change, comparable to the industrial revolution) for reasons other than AI? I feel that this is somewhat neglected in EA – would you agree with that?
--
I’d also be interested in more details on what these beliefs imply in terms of how we can improve the long-term future. I suppose you are now more sceptical about work on AI safety as the “default” long-termist intervention. But what is the alternative? Do you think we should focus on broad improvements to civilisation, such as better governance, working towards compromise and cooperation rather than conflict / war, or generally trying to make humanity more thoughtful and cautious about new technologies and the long-term future? These are uncontroversially good but not very neglected, and it seems hard to get a lot of leverage in this way. (Then again, maybe there is no way to get extraordinary leverage over the long-term future.)
Also, if we aren’t at a particularly influential point in time regarding AI, then I think that expanding the moral circle, or otherwise advocating for “better” values, may be among the best things we can do. What are your thoughts on that?
I disagree with your implicit claim that Will’s views (which I mostly agree with) constitute an extreme degree of confidence. I think it’s a mistake to approach these questions with a 50-50 prior. Instead, we should consider the base rate for “events that are at least as transformative as the industrial revolution”.
That base rate seems pretty low. And that’s not actually what we’re talking about—we’re talking about AGI, a specific future technology. In the absence of further evidence, a prior of <10% on “AGI takeoff this century” seems not unreasonable to me. (You could, of course, believe that there is concrete evidence on AGI to justify different credences.)
On a different note, I sometimes find the terminology of “no x-risk”, “going well” etc. unhelpful. It seems more useful to me to talk about concrete outcomes and separate this from normative judgments. For instance, I believe that extinction through AI misalignment is very unlikely. However, I’m quite uncertain about whether people in 2019, if you handed them a crystal ball that shows what will happen (regarding AI), would generally think that things are “going well”, e.g. because people might disapprove of value drift or influence drift. (The future will plausibly be quite alien to us in many ways.) And finally, in terms of my personal values, the top priority is to avoid risks of astronomical suffering (s-risks), which is another matter altogether. But I wouldn’t equate this with things “going well”, as that’s a normative judgment and I think EA should be as inclusive as possible towards different moral perspectives.
There’s a lot of debate about the causes of the industrial revolution. Very few commentators point to some technological breakthrough as the cause, so it’s striking that people are inclined to point to a technological breakthrough in AI as the cause of the next growth mode transition. Instead, leading theories point to some resource overhang (‘colonies and coal’), or some innovation or change in institutions (more liberal laws and norms in England, or higher wages incentivising automation) or in culture. So perhaps there’s some novel governance system that could drive a higher growth mode, and that’ll be the decisive thing.
Strongly agree. I think it’s helpful to think about it in terms of the degree to which social and economic structures optimise for growth and innovation. Our modern systems (capitalism, liberal democracy) do reward innovation—and maybe that’s what caused the growth mode change—but we’re far away from strongly optimising for it. We care about lots of other things, and whenever there are constraints, we don’t sacrifice everything on the altar of productivity / growth / innovation. And, while you can make money by innovating, the incentive is more about innovations that are marketable in the near term, rather than maximising long-term technological progress. (Compare e.g. an app that lets you book taxis in a more convenient way vs. foundational neuroscience research.)
So, a growth mode could be triggered by any social change (culture, governance, or something else) resulting in significantly stronger optimisation pressures for long-term innovation.
That said, I don’t really see concrete ways in which this could happen and current trends do not seem to point in this direction. (I’m also not saying this would necessarily be a good thing.)
Great post! It’s great to see more thought going into these issues. Personally, I’m quite sceptical about claims that our time is especially influential, and I don’t have a strong view on whether our time is more or less hingy than other times. Some additional thoughts:
I got the impression that you assume that some time (or times) are particularly hingy (and then go on to ask whether it’s our time). But it is also perfectly possible that no time is hingy, so I feel that this assumption needs to be justified. Of course, there is some variation and therefore there is inevitably a most influential time, but the crux of the matter is whether there are differences by a large factor (not just 1.5x). And that is not obvious; for instance, if we look at how people in the past could have shaped 21st century societies, it is not clear to me whether any time was especially important.
I think a key question for longtermism is whether the evolution of values and power will eventually settle in some steady state (i.e. the end of history). It is plausible that hinginess increases as one gets closer to this point. (But it’s not obvious, e.g. there could just be a slow convergence to a world government without any pivotal events.) By contrast, if values and influence drift indefinitely, as they did so far in human history, then I don’t see strong reasons to expect certain times to be particularly hingy. So it is crucial to ask whether a (non-extinction) steady state will happen, and how far away we are from it. (See also this related post of mine.)
”I suggest that in the past, we have seen hinginess increase. I think that most longtermists I know would prefer that someone living in 1600 passed resources onto us, today, rather than attempting direct longtermist influence.”
Does this take into account that there have been fewer people around in 1600, and many ways to have an influence were far less competitive? I feel that a person in 1600 could have had a significant impact, e.g. via advocacy for the “right” moral views (e.g. publishing good arguments for consequentialism, antispeciesism, etc.) or by pushing for general improvements like reducing violence and increasing cooperation. So I don’t quite agree with your take on this, though I wouldn’t claim the opposite either – it is not obvious to me whether hinginess increased or decreased. (By your inductive argument, that suggests that it’s not clear whether the future will be more or less hingy than the present.)
”A related, but more general, argument, is that the most pivotal point in time is when we develop techniques for engineering the motivations and values of the subsequent generation (such as through AI, but also perhaps through other technology, such as genetic engineering or advanced brainwashing technology), and that we’re close to that point.”
Similar to your recent point about how creating smarter-than-human intelligence has long been feasible, I’d guess that, given strong enough motivation, a lock-in would already be feasible via brainwashing, propaganda, and sufficiently ruthless oppression of opposition. (We’ve had these “technologies” for a long time.) The reason why this doesn’t quite work in totalitarian states is that a) what you want to lock in is usually the power of an individual dictator or some group of humans, but there’s no way to prevent death, and b) people are not fully aligned with the dictator even at the beginning, which limits what you can do (principal-agent problems etc.). The reason we don’t do it in liberal democracies is that a) we strongly disapprove of the necessary methods, b) we value free speech and personal autonomy, and c) most people don’t really mind moderate forms of value drift. So it’s to a large extent a question of motivation and taboos, and it is quite possible that people will reject the use of future lock-in technologies for similar reasons.
Thanks for this great map!