AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher
[[THIRD EDIT: Thanks so much for all of the questions and comments! There are still a few more I’d like to respond to, so I may circle back to them a bit later, but, due to time constraints, I’m otherwise finished up for now. Any further comments or replies to anything I’ve written are also still appreciated!]]
Hi!
I’m Ben Garfinkel, a researcher at the Future of Humanity Institute. I’ve worked on a mixture of topics in AI governance and in the somewhat nebulous area FHI calls “macrostrategy”, including: the long-termist case for prioritizing work on AI, plausible near-term security issues associated with AI, surveillance and privacy issues, the balance between offense and defense, and the obvious impossibility of building machines that are larger than humans.
80,000 Hours recently released a long interview I recorded with Howie Lempel, about a year ago, where we walked through various long-termist arguments for prioritizing work on AI safety and AI governance relative to other cause areas. The longest and probably most interesting stretch explains why I no longer find the central argument in Superintelligence, and in related writing, very compelling. At the same time, I do continue to regard AI safety and AI governance as high-priority research areas.
(These two slide decks, which were linked in the show notes, give more condensed versions of my views: “Potential Existential Risks from Artificial Intelligence” and “Unpacking Classic Arguments for AI Risk.” This piece of draft writing instead gives a less condensed version of my views on classic “fast takeoff” arguments.)
Although I’m most interested in questions related to AI risk and cause prioritization, feel free to ask me anything. I’m likely to eventually answer most questions that people post this week, on an as-yet-unspecified schedule. You should also feel free just to use this post as a place to talk about the podcast episode: there was a thread a few days ago suggesting this might be useful.
Have you considered doing a joint standup comedy show with Nick Bostrom?
Yes, but they’re typically invite-only.
I want to push back against this, from one of your slides:
I feel like the LW community did notice many important issues with the classic arguments. Personally, I was/am pessimistic about AI risk, but thought my reasons were not fully or mostly captured by those arguments, and I saw various issues/caveats with them that I talked about on LW. I’m going to just cite my own posts/comments because they’re the easiest to find, but I’m sure there were lots of criticisms from others too. 1 2 3 4
Of course I’m glad that you thought about and critiqued those arguments in a more systematic and prominent way, but it seems wrong to say or imply that nobody noticed their issues until now.
Hi Wei,
I didn’t mean to imply that no one had noticed any issues until now. I talk about this a bit more in the podcast, where I mention people like Robin Hanson and Katja Grace as examples of people who wrote good critiques more than a decade ago, and I believe mention you as someone who’s had a different take on AI risk.
Over the past 2-3 years, it seems like a lot of people in the community (myself included) have become more skeptical of the classic arguments. I think this has at least partly been the result of new criticisms or improved formulations of old criticisms surfacing. For example, Paul’s 2018 post arguing against a “fast takeoff” seems to have been pretty influential in shifting views within the community. But I don’t think there’s any clear reason this post couldn’t have been written in the mid-2000s.
Do you think there were any deficits in epistemic modesty in the way the EA community prioritised AI risk, or do you think it was more that no-one sat down and examined the object-level arguments properly? Alternatively, do you think that there was too much epistemic modesty in the sense that everyone just deferred to everyone else on AI risk?
I feel that something went wrong, epistemically, but I’m not entirely sure what it was.
My memory is that, a few years ago, there was a strong feeling within the longtermist portion of the EA community that reducing AI risk was far-and-away the most urgent problem. I remember there being a feeling that the risk was very high, that short timelines were more likely than not, and that the emergence of AGI would likely be a sudden event. I remember it being an open question, for example, whether it made sense to encourage people to get ML PhDs, since, by the time they graduated, it might be too late. There was also, in my memory, a sense that all existing criticisms of the classic AI risk arguments were weak. It seemed plausible that the longtermist EA community would pretty much just become an AI-focused community. Strangely, I’m a bit fuzzy on what my own views were, but I think they were at most only a bit out-of-step.
This might be an exaggerated memory. The community is also, obviously, large enough for my experience to be significantly non-representative. (I’d be interested in whether the above description resonates with anyone else.) But, in any case, I am pretty confident that there’s been a real shift in average views over the past three years: credences in discontinuous progress and very short timelines have decreased; people’s concerns about AI have become more diverse; a broad portfolio approach to long-termism has become more popular; and, overall, there’s less of a doom-y vibe.
One explanation for the shift, if it’s real, is that the community has been rationally and rigorously responding to available evidence, and the available evidence has simply changed. I don’t think this could be the whole explanation, though. As I wrote in response to another question, many of the arguments for continuous AI progress, which seem to have had a significant impact over the past couple years, could have been published more than a decade ago—and, in some cases, were. An awareness of the differences between the ML paradigm and the “good-old-fashioned-AI” (GOFAI) paradigm has been another source of optimism, but ML had already largely overtaken GOFAI by the time Superintelligence was published. I also don’t think that much novel evidence for long timelines has emerged over the past few years, beyond the fact that we still don’t have AGI.
It’s possible that the community’s updated views, including my own updated views, are wrong: but even in this case, there needs to have been an epistemic mishap somewhere down the line. (The mishap would just be more recent.) I’m unfortunately pretty unsure of what actually happened. I do think that more energy should have gone into critiquing the classic AI risk arguments, porting them into the ML paradigm, etc., in the few years immediately after Superintelligence was published, and I do think that there’s been too much epistemic deference within the community. As Asya pointed out in a comment on this post, I think that misperception has also been an important issue: people have often underestimated how much uncertainty and optimism prominent community members actually have about AI risk. Another explanation—although this isn’t a very fundamental explanation—is that, over the past few years, many people with less doom-y views have entered the community and had an influence. But I’m still confused, overall.
I think that studying and explaining the evolution of views within the community would be an interesting and valuable project in its own right.
[[As a side note, partly in response to below comment: It’s possible that the community has still made pretty much the right prioritization decisions over the past few years, even if there have been significant epistemic mistakes. Especially since AI safety/governance were so incredibly neglected in 2017, I’m less confident that the historical allocation of EA attention/talent/money to AI risk has actually substantially overshot the optimal level. We should still be nervous, though, if it turns out that the right decisions were made despite significantly miscalibrated views within the community.]]
FWIW, it mostly doesn’t resonate with me. (Of course, my experience is no more representative than yours.) Like you, I’d be curious to hear from more people.
I think what matches my impression most is that:
There has been a fair amount of arguably dysfunctional epistemic deference (more at the very end of this comment); and
Concerns about AI risk have become more diverse. (Though I think even this has been a mix of some people such as Allan Dafoe raising genuinely new concerns and people such as Paul Christiano explaining the concerns which for all I know they’ve always had more publicly.)
On the other points, my impression is that if there were consistent and significant changes in views they must have happened mostly among people I rarely interact with personally, or more than 3 years ago.
One shift in views that has had major real-world consequences is Holden Karnofsky, and by extension Open Phil, taking AI risk more seriously. He posted about this in September 2016, so presumably he changed his mind over the months prior to that.
I started to engage more deeply with public discussions on AI risk, and had my first conversations with EA-ish researchers in the area, in mid 2016. As far as I can remember, the main contours of the views prominent today were already discernible then. (Of course, since then a lot of detail has been added. E.g. today I encounter people who make fairly specific claims about how, say, GPT-3 is evidence for TAI soon, which obviously wasn’t possible in 2016. Though people did talk about AlphaGo when it came out.) E.g. there was a “MIRI view” on one hand, and Paul Christiano’s writing on prosaic AI alignment and IDA on the other hand. And the Concrete Problems in AI Safety paper appeared. Key writings on issues such as takeoff speeds, e.g. Superintelligence, Yudkowsky’s Intelligence Explosion Microeconomics, the Yudkowsky-Hanson FOOM debate, or some of Brian Tomasik’s posts, are even more dated. I didn’t get the impression that any view was particularly prominent.
Already in summer 2017, I witnessed a lot of talk of how the “Bostrom/Yudkowsky model of AI risk” had been replaced by something else, including by staff at key organizations and at the Leaders Forum. Note that this must refer to developments that happened a year before more publicly visible signs such as Paul Christiano’s post on takeoff speeds from February 2018. Similarly, Daniel Dewey’s post on his reservations about some of MIRI’s research appeared in summer 2017, which I think is ample evidence of fundamental disagreements on AI risk among people at key organizations; and again, the post surely is based on epistemic trajectories dating back even further.
In late 2017 / early 2018, at an AI-strategy-focused event which I think we both attended, I don’t recall that short timelines, rapid takeoff, or ‘sudden emergence’ were particularly common views.
I know people who are skeptical about the value of ML PhDs for unrelated reasons, but I don’t recall anyone seriously suggesting there might not be enough time to finish a PhD before AGI appears. (I only recall a joke to the opposite effect—i.e. saying there will be time to finish a PhD—with which Demis Hassabis dodged a question on his AI timelines on a panel at EAGx Oxford 2016.) [Though we both know a senior researcher whose median timelines come close to that implication, and I don’t think their timelines became any longer over the last 3 years, again contra the trend you perceived.]
Most people I can think of who in 2017 had any at least minimally considered view on questions such as probability of doom, takeoff speed, polarity, timelines, and which AI safety agendas are promising still hold roughly the same view as far as I can tell. E.g. I recall one influential AI safety researcher who in summer 2017 gave what I thought were extremely short timelines, and in 2018 they told me they had become even shorter. I also don’t think I have changed my views significantly—they do feel more nuanced, but my bottom line on e.g. timelines or probability of different scenarios hasn’t changed significantly as far as I can remember.
My impression is that there hasn’t so much been a shift in views within individual people as an influx of a younger generation who tend to have an ML background and, roughly speaking, tend to agree more with Paul Christiano than MIRI. Some of them are now somewhat prominent themselves (e.g. Rohin Shah, Adam Gleave, you), and so the distribution of views among the set of perceived “AI risk thought leaders” has changed. But arguably this is a largely sociological phenomenon (e.g. due to prominent ML successes there are just way more people with ML backgrounds in general). [ETA: As Rohin notes, neither he nor Paul or Adam had an ML background when they decided which kind of AI safety research to focus on—instead, they switched to ML because they thought that was the more promising approach. So the suggested sociological explanation fails in at least their cases.]
More broadly, my impression is that for years there have been intractable disagreements on several fundamental questions regarding AI risk, that there hasn’t been much progress on resolving them, that few people have changed their minds in major ways, and that sometimes people holding different views have mostly stopped talking to each other. E.g. for months I’ve shared an office with people who hold views I think are really off, but I’ve never talked to them about it; and more broadly I think we both know that even within just FHI there is an arguably extreme spread of views on issues pertaining to AI risk and longtermism/macrostrategy more generally.
(NB I don’t think this is necessarily bad. When disagreements prove intractable, it might be best if different groups make different bets and pursue their agendas separately. It might also not be that unusual for cases without decisive uncontroversial evidence, e.g. I’m sure there are protracted and intractable disagreements between, say, Keynesian and neoclassical economists or proponents of different quantum gravity theories.)
At the other extreme, I’ve seen dozens of collective person-hours being invested into experimenting with social technologies (e.g. certain ways of “facilitating” conversations) that were supposed to help people with different views understand each other, and to transmit some of that understanding to an audience of spectators. (I thought these were poorly executed and largely failures, but other thoughtful people seemed to disagree and expressed an eagerness to invest much more time into similar activities.)
I do recall instances of what I thought constituted exaggerated epistemic deference, especially in 2016 and to some extent 2017. Some of them were I think quite bizarre, with people essentially engaging in a long exegesis of brief, cryptic remarks that someone they know had relayed as something someone they know had heard as attributed to some presumed epistemic authority. Sometimes it wasn’t even clear who the supposed source of some information was, e.g. I recall a period where people around me were abuzz with the claim that “people at OpenAI had short timelines”, with both the identities of these people and the question of just how short their timelines were being unclear. Usually I think it would have been more productive for the participants (myself included) to take an online course in ML, to google for some relevant factual information, or to try to make their thoughts more precise by writing them down.
(Again, some amount of epistemic deference is of course healthy. And more specifically it does seem correct to give more weight to people who have more relevant expertise or experience.)
My experience matches Ben’s more than yours.
None of the people you named had an ML background. Adam and I have CS backgrounds (before we joined CHAI, I was a PhD student in programming languages, while Adam worked in distributed systems iirc). Ben is in international relations. If you were counting Paul, he did a CS theory PhD. I suspect all of us chose the “ML track” because we disagreed with MIRI’s approach and thought that the “ML track” would be more impactful.
(I make a point out of this because I sometimes hear “well if you started out liking math then you join MIRI and if you started out liking ML you join CHAI / OpenAI / DeepMind and that explains the disagreement” and I think that’s not true.)
I’ve heard this (might be a Bay Area vs. Europe thing).
Thanks, this seems like an important point, and I’ll edit my comment accordingly. I think I had been aware of at least Paul’s and your backgrounds, but made a mistake by not thinking of this and not distinguishing between your prior backgrounds and what you’re doing now.
(Nitpick: While Ben is doing an international relations PhD now, I think his undergraduate degree was in physics and philosophy.)
I still have the impression there is a larger influx of people with ML backgrounds, but my above comment overstates that effect, and in particular it seems clearly false to suggest that Adam / Paul / you preferring ML-based approaches has a primarily sociological explanation (which my comment at least implicitly does).
(Ironically, I have long been skeptical of the value of MIRI’s agent foundations research, and more optimistic about the value of ML-based approaches to AI safety, and Paul’s IDA agenda in particular, even though my background is in pure maths rather than ML—though I’m not particularly qualified to make such assessments, certainly less so than e.g. Adam and you. That maybe could have tipped me off …)
This Robin Hanson quote is perhaps also evidence for a shift in views on AI risk, somewhat contra my above comment, though neutral on the “people changed their minds vs. new people have different views” and “when exactly did it happen?” questions:
(I expect many people worried about AI risk think that Hanson, in the above quote and elsewhere, misunderstands current concerns. But perceiving some change seems easier than correctly describing the target of the change, so arguably the quote is evidence for change even if you think it misunderstands current concerns.)
I think that instead of talking about potential failures in the way the EA community prioritized AI risk, it might be better to talk about something more concrete, e.g.
The views of the average EA
How much money was given to AI
How many EAs shifted their careers to be AI-focused as opposed to something else that deserved more EA attention
I think if we think there were mistakes in the concrete actions people have taken, e.g. mistaken funding decisions or mistaken career changes (I’m not sure that there were), we should look at the process that led to those decisions, and address that process directly.
Targeting ‘the views of the average EA’ seems pretty hard. I do think it might be important, because it has downstream effects on things like recruitment, external perception, funding, etc. But then I think we need to have a story for how we affect the views of the average EA (as Ben mentions). My guess is that we don’t have a story like that, and that’s a big part of ‘what went wrong’—the movement is growing in a chaotic way that no individual is responsible for, and that can lead to collectively bad epistemics.
‘Encouraging EAs to defer less’ and ‘expressing more public uncertainty’ could be part of the story for helping the average EA have better views. It also seems possible to me that we want some kind of centralized official source for presenting EA beliefs that keeps up to date the best case for and against certain views (though this obviously has its own issues). Then we can be more sure that people have come to their views after being exposed to alternatives, and we can have something concrete to point to when we worry that there hasn’t been enough criticism.
I second this. I think Halstead’s question is an excellent one and finding an answer to it is hugely important. Understanding what went wrong epistemically (or indeed if anything did in fact go wrong epistemically) could massively help us going forward.
I wonder how we get the ball rolling on this...?
Which of the EA-related views you hold are least popular within the EA community?
I’m not sure how unpopular these actually are, but a few at least semi-uncommon views would be:
I’m pretty sympathetic to non-naturalism, in the context of both normativity and consciousness
Controlling for tractability, I think it’s probably more important to improve the future (conditional on humanity not going extinct) than to avoid human extinction. (The gap between a mediocre or bad future and the best possible future is probably vast.)
I don’t actually know what my credence is here, since I haven’t thought much about the issue, but I’m probably more concerned about growth slowing down and technological progress stagnating than the typical person in the community
What resources would you recommend on ethical non-naturalism? Seems like a plausible idea I don’t know much about.
Michael Huemer’s “Ethical Intuitionism” and David Enoch’s “Taking Morality Seriously” are both good; Enoch’s book is, I think, better, but Huemer’s book is a more quick and engaging read. Part Six of Parfit’s “On What Matters” is also good.
I don’t exactly think that non-naturalism is “plausible,” since I think there are very strong epistemological objections to it. (Since our brain states are determined entirely by natural properties of the world, why would our intuitions about non-natural properties track reality?) It’s more that I think the alternative positions are self-undermining or have implications that are unacceptable in other ways.
Parfit isn’t quite a non-naturalist (or rather, he’s a very unconventional kind of non-naturalist, not a Platonist): he’s a ‘quietist’. Essentially, it’s the view that there are normative facts, and they aren’t natural facts, but we don’t feel the need to say what category they fall into metaphysically, or we regard that question as meaningless.
I think a variant of that, where we say ‘we don’t currently have a clear idea what they are, just some hints that they exist because of normative convergence, and the internal contradictions of other views’ is plausible:
What are the key issues or causes that longtermists should invest in, in your view? And how much should we invest in them, relatively speaking? What issues are we currently under-investing in?
Have you had any responses from Bostrom or Yudkowsky to your critiques?
Would you rather be one or two dogs?
I’m sorry, but I consider that a very personal question.
Hi Ben. I just read the transcript of your 80,000 Hours interview and am curious how you’d respond to the following:
Analogy to agriculture, industry
You say that it would be hard for a single person (or group?) acting far before the agricultural revolution or industrial revolution to impact how those things turned out, so we should be skeptical that we can have much effect now on how an AI revolution turns out.
Do you agree that the goodness of this analogy is roughly proportional to how slow our AI takeoff is? For instance if the first AGI ever created becomes more powerful than the rest of the world, then it seems that anyone who influenced the properties of this AGI would have a huge impact on the future.
Brain-in-a-box
You argue that if we transition more smoothly from super powerful narrow AIs that slowly expand in generality to AGI, we’ll be less caught off guard / better prepared.
It seems that even in a relatively slow takeoff, you wouldn’t need that big of a discontinuity to result in a singleton AI scenario. If the first AGI that’s significantly more generally intelligent than a human is created in a world where lots of powerful narrow AIs exist, wouldn’t having a super smart thing at the center of control of a bunch of narrow AI tools plausibly be way more powerful than having human brains at the center of that control?
It seems plausible that even in a “smooth” scenario, the first group to create AGI and the second group to create an equally powerful system could be months apart. Do you think a months-long discontinuity is not enough for an AGI to pull sufficiently ahead?
Even if multiple groups create AGIs within a short time, isn’t having a bunch of unaligned AGIs all trying to get power at the same time also an existential risk? It doesn’t seem clear that they’d automatically keep each other in check. One might simply be better at growing or better at sabotaging other AIs. Or if they reach a stalemate they might start cooperating with each other to achieve unaligned goals as a compromise.
Maybe narrow AIs will work better
You say that since today’s AIs are narrow, and since there’s often benefit in specialization, maybe in the future specialized AIs will continue to dominate. You say “maybe the optimal level of generality actually isn’t that high.”
My model is: if you have a central control unit (a human brain, or group of human brains) that is deciding how to use a bunch of narrow AIs, then if you replace that central control unit with one that is more intelligent / faster acting, the whole system will be more effective.
The only way I can think of where that wouldn’t be true would be if the general AI required so many computational resources that the narrow AIs that were acting as tools of the AGI were crippled by lack of resources. Is that what you’re imagining?
Deadline model of AI progress
You say you disagree with the idea that the day when we create AGI acts as a sort of ‘deadline’, and if we don’t figure out alignment before then we’re screwed.
A lot of your argument is about how increasing AI capability and alignment are intertwined processes, so that as we increase an AI’s capabilities we’re also increasing its alignment. You discuss how it’s not like we’re going to create a super powerful AI and then give it a module with its goals at the end of the process.
I agree with that, but I don’t see it as substantially affecting the Bostrom/Yudkowsky arguments.
Isn’t the idea that we would have something that seemed aligned as we were training it (based on this continuous feedback we were giving it), but then only when it became extremely powerful we’d realize it wasn’t actually aligned?
This seems to be a disagreement about “how hard is AI alignment?”. I think Yudkowsky would say that it’s super hard such that your AI can look perfectly aligned when it’s less powerful than you, but you get something slightly wrong that only manifests itself when it has taken over. Do you agree that’s a crux?
You talk about how AIs can behave very differently in different environments. Isn’t the environment of an AI which happens to be the most powerful agent on earth fundamentally different from any environment we could provide when training an AI (in terms of resources at its disposal, strategies it might be aware of, etc.)?
Instrumental convergence
You talk about how even if almost all goals would result in instrumental convergence, we’re free to pick any goals we like, so we can pick from a very small subset of all goals which don’t result in instrumental convergence.
It seems like there’s a tradeoff between AI capability and not exhibiting instrumental convergence, since to avoid instrumental convergence you basically need to tell the AI “You’re not allowed to do anything in this broad class of things that will help you achieve your goals.” An AI that amasses power and is willing to kill to achieve its goals is by definition more powerful than one that eschews becoming powerful and killing.
In a situation where they may be many groups trying to create an AGI, doesn’t this imply that the first AGI that does exhibit instrumental convergence will have a huge advantage over any others?
Hi Elliot,
Thanks for all the questions and comments! I’ll answer this one in stages.
On your first question:
I agree with this.
To take the fairly extreme case of the Neolithic Revolution, I think that there are at least a few reasons why groups at the time would have had trouble steering the future. One key reason is that the world was highly “anarchic,” in the international relations sense of the term: there were many different political communities, with divergent interests and a limited ability to either coerce one another or form credible commitments. One result of anarchy is that, if the adoption of some technology or cultural/institutional practice would give some group an edge, then it’s almost bound to be adopted by some group at some point: other groups will need to either lose influence or adopt the technology/innovation to avoid subjugation. This explains why the emergence and gradual spread of agricultural civilization was close to inevitable, even though (there’s some evidence) people often preferred the hunter-gatherer way of life. There was an element of technological or economic determinism that put the course of history outside of any individual group’s control (at least to a significant degree).
Another issue, in the context of the Neolithic Revolution, is that norms, institutions, etc., tend to shift over time, even if there aren’t very strong selection pressures. This was even more true before the advent of writing. So we do have a few examples of religious or philosophical traditions that have stuck around, at least in mutated forms, for a couple thousand years; but this is unlikely in any individual case, and would have been even more unlikely 10,000 years ago. At least so far, we also don’t have examples of more formal political institutions (e.g. constitutions) that have largely stuck around for more than a few thousand years either.
There are a couple reasons why AI could be different. The first reason is that—under certain scenarios, especially ones with highly discontinuous and centralized progress—it’s perhaps more likely that one political community will become much more powerful than all others and thereby make the world less “anarchic.” Another is that, especially if the world is non-anarchic, values and institutions might naturally be more stable in a heavily AI-based world. It seems plausible that humans will eventually step almost completely out of the loop, even if they don’t do this immediately after extremely high levels of automation are achieved. At this point, if one particular group has disproportionate influence over the design/use of existing AI systems, then that one group might indeed have a ton of influence over the long-run future.
Thanks to Ben for doing this AMA, and to Elliot for this interesting set of questions!
Just wanted to mention two links that readers might find interesting in this context. Firstly, Tomasik’s Will Future Civilization Eventually Achieve Goal Preservation? Here’s the summary:
Secondly, Bostrom’s What is a Singleton? Here’s a quote:
I think there are a couple different bits to my thinking here, which I sort of smush together in the interview.
The first bit is that, when developing an individual AI system, its goals and capabilities/intelligence tend to take shape together. This is helpful, since it increases the odds that we’ll notice issues with the system’s emerging goals before they result in truly destructive behavior. Even if someone didn’t expect a purely dust-minimizing house-cleaning robot to be a bad idea, for example, they’ll quickly realize their mistake as they train the system. The mistake will be clear well before the point when the simulated robot learns how to take over the world; it will probably be clear even before the point when the robot learns how to operate door knobs.
The second bit is that there are many contexts in which pretty much any possible hand-coded reward function will either quickly reveal itself as inappropriate or be obviously inappropriate before the training process even begins. This means that sane people won’t proceed in developing and deploying things like house-cleaning robots or city planners until they’ve worked out alignment techniques to some degree; they’ll need to wait until we’ve moved beyond “hand-coding” preferences, toward processes that more heavily involve ML systems learning what behaviors users or developers prefer.
It’s still conceivable that, even given these considerations, people will accidentally develop AI systems that commit omnicide (or cause similarly grave harms). But the likelihood at least goes down. First, it needs to be the case that (a) training processes that use apparently promising alignment techniques will still converge on omnicidal systems. Second, it needs to be the case that (b) people won’t notice that these training processes have serious issues until they’ve actually made omnicidal AI systems.
I’m skeptical of both (a) and (b). My intuition, regarding (a), is that a method that involves learning human preferences would need to be really terrible to result in systems that do things on the order of mass murder, although some arguments related to mesa-optimization may push against this intuition.
Then my intuition, regarding (b), is that the techniques would likely display serious issues before anyone creates a system capable of omnicide. For example, if these techniques tend to induce systems to engage in deceptive behaviors, I would expect there to be some signs that this is an issue early on; I would expect some failed or non-catastrophic acts of deception to be observed first. However, again, my intuition is closely tied to my expectation that progress will be pretty continuous. A key thing to keep in mind about highly continuous scenarios is that there’s not just one single consequential ML training run, where the ML system might look benign at the start but turn around and take over the world at the end. We’re instead talking about countless training runs, used to develop a wide variety of different systems of intermediate generality and competency, deployed across a wide variety of domains, over a period of multiple years. We would have many more opportunities to notice issues with available techniques than we would in a “brain in a box” scenario. In a more discontinuous scenario, the risk would presumably be higher.
This might just be a matter of semantics, but I don’t think “how hard is AI alignment?” is the main question I have in mind here. I’m mostly thinking about the question of whether we’ll unwittingly create existentially damaging systems, if we don’t work out alignment techniques first. For example, if we don’t know how to make benign house cleaners, city planners, or engineers by year X, will we unwittingly create omnicidal systems instead? Certainly, the harder it is to work out alignment techniques, the higher the risks become. But it’s possible for accident risk to be low even if alignment techniques are very hard to work out.
I would say that, in a scenario with relatively “smooth” progress, there’s not really a clean distinction between “narrow” AI systems and “general” AI systems; the line between “we have AGI” and “we don’t have AGI” is either a bit blurry or a bit arbitrarily drawn. Even if the management/control of large collections of AI systems is eventually automated, I would also expect this process of automation to unfold over time rather than happening in a single go.
In general, the smoother things are, the harder it is to tell a story where one group gets way out ahead of the others. Although I’m unsure just how “unsmooth” things need to be for this outcome to be plausible.
I think that if there were multiple AGI or AGI-ish systems in the world, and most of them were badly misaligned (e.g. willing to cause human extinction for instrumental reasons), this would present an existential risk. I wouldn’t count on them balancing each other out, in the same way that endangered gorilla populations shouldn’t count on warring human communities to balance each other out.
I think the main benefits of smoothness have to do with risk awareness (e.g. by observing less catastrophic mishaps) and, especially, with opportunities for trial-and-error learning. At least when the concern is misalignment risk, I don’t think of the decentralization of power as a really major benefit in its own right: the systems in this decentralized world still mostly need to be safe.
I think it’s plausible that especially general systems would be especially useful for managing the development, deployment, and interaction of other AI systems. I’m not totally sure this is the case, though. For example, at least in principle, I can imagine an AI system that is good at managing the training of other AI systems—e.g. deciding how much compute to devote to different ongoing training processes—but otherwise can’t do much else.
What would you recommend as the best introduction to concerns (or lack thereof) about risks from AI?
If you have time and multiple recommendations, I would be interested in a taxonomy. (E.g. this is the best blog post for non-technical readers, this is the best book-length introduction for CS undergrads.)
I agree with Aidan’s suggestion that Human Compatible is probably the best introduction to risks from AI (for both non-technical readers and readers with CS backgrounds). It’s generally accessible and engagingly written, it’s up-to-date, and it covers a number of different risks. Relative to many other accounts, I think it also has the virtue of focusing less on any particular development scenario and expressing greater optimism about the feasibility of alignment. If someone’s too pressed for time to read Human Compatible, the AI risk chapter in The Precipice would then be my next best bet. Another very readable option, mainly for non-CS people, would be the AI risk chapters in The AI Does Not Hate You: I think they may actually be the cleanest distillation of the “classic” AI risk argument.
For people with CS backgrounds, hoping for a more technical understanding of the problems safety/alignment researchers are trying to solve, I think that Concrete Problems in AI Safety, Scalable Agent Alignment Via Reward Modeling, and Rohin Shah’s blog post sequence on “value learning” are especially good picks. Although none of these resources frames safety/alignment research as something that’s intended to reduce existential risks.
I think that AI Governance: A Research Agenda would be the natural starting point for social scientists, especially if they have a substantial interest in risks beyond alignment.
Of course, for anyone interested in digging into arguments around AI risk, I think that Superintelligence is still a really important read. (Even beyond its central AI risk argument, it also has a ton of interesting ideas on the future of intelligent life, ethics, and the strategic landscape that other resources don’t.) But it’s not where I think people should start.
FWIW, here’s an introduction to longtermism and AI risks I wrote for a friend. (My friend has some technical background, he had read Doing Good Better but not engaged further with EA, and I thought he’d be a good fit for AI Policy research but not technical research.)
Longtermism: Future people matter, and there might be lots of them, so the moral value of our actions is significantly determined by their effects on the long-term future. We should prioritize reducing “existential risks” like nuclear war, climate change, and pandemics that threaten to drive humanity to extinction, preventing the possibility of a long and beautiful future.
Quick intro to longtermism and existential risks from 80,000 Hours
Academic paper arguing that future people matter morally, and we have tractable ways to help them, from the Doing Good Better philosopher
Best resource on this topic: The Precipice, a book explaining what risks could drive us to extinction and how we can combat them, released earlier this year by another Oxford philosophy professor
Artificial intelligence might transform human civilization within the next century, presenting incredible opportunities and serious potential problems
Elon Musk, Bill Gates, Stephen Hawking, and many leading AI researchers worry that extremely advanced AI poses an existential threat to humanity (Vox)
Best resource on this topic: Human Compatible, a book explaining the threats, existential and otherwise, posed by AI. Written by Stuart Russell, CS professor at UC Berkeley and author of the leading textbook on AI. Daniel Kahneman calls it “the most important book I have read in quite some time”. (Or this podcast with Russell)
CS paper giving the technical explanation of what could go wrong (from Google/OpenAI/Berkeley/Stanford)
How you can help by working on US AI policy, explains 80,000 Hours
(AI is less morally compelling if you don’t care about the long-term future. If you want to focus on the present, maybe focus on other causes: global poverty, animal welfare, grantmaking, or researching altruistic priorities.)
Generally, I’d like to hear more about how different people introduce the ideas of EA, longtermism, and specific cause areas. There’s no clear-cut canon, and effectively personalizing an intro can be difficult, so I’d love to hear how others navigate it.
This seems like a promising topic for an EA Forum question. I would consider creating one and reposting your comment as an answer to it. A separate question is probably also a better place to collect answers than this thread, which is best reserved for questions addressed to Ben and for Ben’s answers to those questions.
Good idea, thanks! I’ve posted a question here.
More broadly, should AMA threads be reserved for direct questions to the respondent and the respondent’s answers? Or should they encourage broader discussion of those questions and ideas by everyone?
I’d lean towards AMAs as a starting point for broader discussion, rather than direct Q&A. Good examples include the AMAs by Buck Shlegeris and Luke Muehlhauser. But it does seem that most AMAs are more narrow, focusing on direct question and answer.
[For example, this question isn’t really directed towards Ben, but I’m asking anyways because the context and motivations are clearer here than they would be elsewhere, making productive discussion more likely. But I’m happy to stop distracting if there’s consensus against this.]
I personally would lean towards the “most AMAs” approach of having most dialogue be with the AMA-respondent. It’s not quite “questions after a talk”, since question-askers have much more capacity to respond and have a conversation, but I feel like it’s more in that direction than, say, a random EA social. Maybe something like the vibe of a post-talk mingling session?
I think this is probably more important early in a comment tree than later. Directly trying to answer someone else’s question seems odd/out-of-place to me, whereas chiming in 4 levels down seems less so. I think this mirrors how the “post-talk mingling” would work: if I was talking to a speaker at such an event, and I asked them a question, someone else answering before them would be odd/annoying – “sorry, I wasn’t talking to you”. Whereas someone else chiming in after a little back-and-forth would be much more natural.
Of course, you can have multiple parallel comment threads here, which alters things quite a bit. But that’s the kind of vibe that feels natural to me, and Pablo’s comment above suggests I’m not alone in this.
What do you think is the probability of AI causing an existential catastrophe in the next century?
I currently give it something in the 0.1%–1% range.
For reference: My impression is that this is on the low end, relative to estimates that other people in the long-termist AI safety/governance community would give, but that it’s not uniquely low. It’s also, I think, more than high enough to justify a lot of work and concern.
I am curious whether you are, in general, more optimistic about x-risks [say, than Toby Ord]. What are your estimates of total and unforeseen anthropogenic risks in the next century?
Toby’s estimate for “unaligned artificial intelligence” is the only one that I meaningfully disagree with.
I would probably give lower numbers for the other anthropogenic risks as well, since it seems really hard to kill virtually everyone, and since the historical record suggests that permanent collapse is unlikely. (Complex civilizations were independently developed multiple times; major collapses, like the Bronze Age Collapse or fall of the Roman Empire, were reversed after a couple thousand years; it didn’t take that long to go from the Neolithic Revolution to the Industrial Revolution; etc.) But I haven’t thought enough about civilizational recovery or, for example, future biological weapons to feel firm in my higher level of optimism.
Thanks for sharing your probability estimate; I’ve now added it to my database of existential risk estimates.
Your estimate is the second lowest one I’ve come across, with the lower one being from someone (James Fodor) who I don’t think is in the longtermist AI safety/governance community (though they’re an EA and engage with longtermist thinking). But I’m only talking about the relatively small number of explicit, public estimates people have given, not all the estimates that relevant people would give, so I’d guess that your statement is accurate.
(Also, to be clear, I don’t mean to imply we should be more skeptical of estimates that “stand out from the pack” than of those that are closer to other estimates.)
I’m curious as to whether most of that .1-1% probability mass is on existential catastrophe via something like the classic Bostrom/Yudkowsky type scenario, vs something like what Christiano describes in What failure looks like, vs deliberate misuse of AI, vs something else? E.g., is it like you still see the classic scenarios as the biggest cause for concern here? Or is it like you now see those scenarios as extremely unlikely, yet have a residual sense that something as massive as AGI could cause massive bad consequences somehow?
You said in the podcast that the drop was ‘an order of magnitude’, so presumably your original estimate was 1-10%? I note that this is similar to Toby Ord’s in The Precipice (~10%) so perhaps that should be a good rule of thumb: if you are convinced by the classic arguments your estimate of existential catastrophe from AI should be around 10% and if you are unconvinced by specific arguments, but still think AI is likely to become very powerful in the next century, then it should be around 1%?
Those numbers sound pretty reasonable to me, but, since they’re roughly my own credences, it’s probably unsurprising that I’m describing them as “pretty reasonable” :)
On the other hand, depending on what counts as being “convinced” of the classic arguments, I think it’s plausible they actually support a substantially higher probability. I certainly know that some people assign a significantly higher than 10% chance to an AI-based existential catastrophe this century. And I believe that Toby’s estimate, for example, involved weighing up different possible views.
What have you changed your mind about recently?
Suppose there was an operational long-term investment fund a la Phil Trammel. Where would you donate?
I would strongly consider donating to the long-term investment fund. (But I haven’t thought enough about this to be sure.)
From the podcast transcript:
I continue to have a lot of uncertainty about how likely it is that AI development will look like “there’s this separate project of trying to figure out what goals to give these AI systems” vs a development process where capability and goals are necessarily connected. (I didn’t find your arguments in favor of the latter very persuasive.) For example, it seems GPT-3 can be seen as more like the former than the latter. (See this thread for background on this.)
To the extent that AI development is more like the latter than the former, that might be bad news for (a certain version of) the orthogonality thesis, but it can be even worse news for the prospect of AI alignment. Because instead of disaster striking only if we can’t figure out the right goals to give to the AI, it can also be the case that we know what goals we want to give it, but due to constraints of the development process, we can’t give it those goals and can only build AI with unaligned goals. So it seems to me that the latter scenario can also be rightly described as “exogenous deadline of the creep of AI capability progress”. (In both cases, we can try to refrain from developing/deploying AGI, but it may be a difficult coordination problem for humanity to stay in a state where we know how to build AGI but choose not to, and in any case this consideration cuts equally across both scenarios.)
I think that the comment you make above is right. In the podcast, we only discuss this issue in a super cursory way:
Fortunately, I’m not too worried about this possibility. Partly, as background, I expect us to have moved beyond using hand-coded reward functions—or, more generally, what Stuart Russell calls the “standard model”—by the time we have the ability to create broadly superintelligent and highly agential/unbounded systems. There are really strong incentives to do this, since there are loads of useful applications that seemingly can’t be developed using hand-coded reward functions. This is some of the sense in which, in my view, capabilities and alignment research are mushed up. If progress is sufficiently gradual, I find it hard to imagine that the ability to create things like world-destroying paperclippers comes before (e.g.) the ability to make at least pretty good use of reward modeling techniques.
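[Editor’s note: to make the contrast with hand-coded reward functions concrete, here is a minimal toy sketch—my own illustration, not anything from the interview—of the reward-modeling idea referred to above. Instead of hand-coding a reward, a simple linear reward model is fit to pairwise preference comparisons, Bradley–Terry style. The setup, features, and “overseer” are all hypothetical.]

```python
import numpy as np

# Toy reward modeling from pairwise preferences (hypothetical setup).
# A linear reward r(x) = w . x is fit so that, under the Bradley-Terry
# model, the trajectory the overseer prefers tends to score higher.

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each "trajectory" is a 3-dim feature vector; the true reward
# (unknown to the learner) favors the first feature.
true_w = np.array([1.0, -0.5, 0.0])
pairs = []
for _ in range(500):
    a, b = rng.normal(size=3), rng.normal(size=3)
    # Simulated overseer labels which trajectory it prefers.
    pref_a = rng.random() < sigmoid(true_w @ a - true_w @ b)
    pairs.append((a, b) if pref_a else (b, a))

# Fit w by gradient ascent on the Bradley-Terry log-likelihood:
# P(a preferred over b) = sigmoid(w.a - w.b).
w = np.zeros(3)
for _ in range(200):
    grad = np.zeros(3)
    for a, b in pairs:  # a is the preferred trajectory
        p = sigmoid(w @ a - w @ b)
        grad += (1.0 - p) * (a - b)
    w += 0.01 * grad / len(pairs)

# The learned reward recovers the direction of the true preference,
# without anyone ever hand-coding a reward function.
print(np.sign(w[0]) == np.sign(true_w[0]))
```

The point of the sketch is only that the reward signal here is an output of a learning process over human judgments, which is the kind of move away from the “standard model” that the paragraph above describes.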
(To be clear, I recognize that loads of alignment researchers also think that there will be strong economic incentives for alignment research. I believe there’s a paragraph in Russell’s book arguing this. I think DM’s “scalable agent alignment” paper suggests that reward modeling is necessary to develop systems that can assist us in most “real world domains.” Although I don’t know how much optimism other people tend to take from this observation. I don’t actually know, for example, whether or not Russell is less optimistic than me.)
If we do end up in a world where people know they can create broadly superintelligent and highly agential/unbounded AI systems, but we still haven’t worked out alternatives to Russell’s “standard model,” then no sane person really has any incentive to create and deploy these kinds of systems. Training up a broadly superintelligent and highly agential system using something like a hand-coded reward function is likely to be an obviously bad idea; if it’s not obviously bad, a priori, then it will likely become obviously bad during the training process. There wouldn’t be much of a coordination problem, since, at least in normal circumstances, no one has an incentive to knowingly destroy themselves.
If I then try to tell a story where humanity goes extinct, due to a failure to move beyond the standard model in time, two main scenarios come to mind.
Doomsday Machine: States develop paperclipper-like systems, while thinking of them as doomsday machines, to serve as a novel alternative or complement to nuclear deterrents. They end up being used, either accidentally or intentionally.
Apocalyptic Residual: The ability to develop paperclipper-like systems diffuses broadly. Some of the groups that gain this ability have apocalyptic objectives. These groups intentionally develop and deploy the systems, with the active intention of destroying humanity.
The first scenario doesn’t seem very likely to me. Although this is obviously very speculative, paperclippers seem much worse than nuclear or even biological deterrents. First, your own probability of survival, if you use a paperclipper, may be much lower than your probability of survival if you used nukes or biological weapons. Second, and somewhat ironically, it may actually be hard to convince people that your paperclipper system can do a ton of damage; it seems hard to know that the result would be as bad as feared without prior real-world experience of its use. States would also likely be slow to switch to this new deterrence strategy, providing even more time for alignment techniques to be worked out. As a further bit of friction/disincentive, these systems might also just be extremely expensive (depending on compute or environment design requirements). Finally, for doomsday to occur, it’s actually necessary for a paperclipper system to be used—and for its effect to be as bad as feared. The history of nuclear weapons suggests that the annual probability of use is probably pretty low.
The second scenario also doesn’t seem very likely to me, since: (a) I think there would probably be an initial period where large quantities of resources (e.g. compute and skilled engineers) are required to make world-destroying paperclippers. (b) Only a very small portion of people want to destroy the world. (c) There would be unusually strong incentives for states to prevent apocalyptic groups or individuals from gaining access to the necessary resources.
Although see Asya’s “AGI in Vulnerable World” post for a discussion of some conditions under which malicious use concerns might loom larger.
(Apologies for the super long response!)
My guess would be that if you play with GPT-3, it can talk about human values (or AI alignment, for that matter) about as well as it can talk about anything else. In that sense, it seems like stronger capabilities for GPT-3 also potentially help solve the alignment problem.
Edit: More discussion here:
https://www.lesswrong.com/posts/BnDF5kejzQLqd5cjH/alignment-as-a-bottleneck-to-usefulness-of-gpt-3?commentId=vcPdcRPWJe2kFi4Wn
I don’t think I caught the point about GPT-3, although this might just be a matter of using concepts differently.
In my mind: To whatever extent GPT-3 can be said to have a “goal,” its goal is to produce text that it would be unsurprising to find on the internet. The training process both imbued it with this goal and made the system good at achieving it.
There are other things we might want spin-offs of GPT-3 to do: for example, compose better-than-human novels. Doing this would involve shifting both what GPT-3 is “capable” of doing and what its “goal” is. (There’s not really a clean practical or conceptual distinction between the two.) It would also probably require making progress on some sort of “alignment” technique, since we can’t (e.g.) write down a hand-coded reward function that quantifies novel quality.
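[Editor’s note: a toy illustration of the sense in which GPT-3’s “goal” is just its training objective—producing unsurprising text means minimizing next-token cross-entropy. The tiny vocabulary and probability table below are hypothetical, not anything from GPT-3 itself.]

```python
import numpy as np

# The only "goal" a GPT-style model is trained on is next-token
# prediction: minimizing the cross-entropy of the true next token
# under the model's distribution. Here the "model" is a fixed
# probability table over a 4-token vocabulary.

vocab = ["the", "cat", "sat", "down"]
# Hypothetical model: P(next token | current token), one row per token.
probs = np.array([
    [0.05, 0.60, 0.05, 0.30],  # after "the"
    [0.05, 0.05, 0.80, 0.10],  # after "cat"
    [0.10, 0.05, 0.05, 0.80],  # after "sat"
    [0.70, 0.10, 0.10, 0.10],  # after "down"
])

def cross_entropy(sequence):
    """Average negative log-probability of each next token."""
    ids = [vocab.index(t) for t in sequence]
    nlls = [-np.log(probs[a, b]) for a, b in zip(ids, ids[1:])]
    return float(np.mean(nlls))

# Text the model finds "unsurprising" gets low loss...
low = cross_entropy(["the", "cat", "sat", "down"])
# ...and surprising text gets high loss. Training pushes the table
# toward whatever distribution generated the corpus--nothing more.
high = cross_entropy(["down", "sat", "cat", "the"])
assert low < high
```

On this framing, getting a spin-off model to pursue a different “goal” (e.g. high novel quality) means substituting a different training signal, which is exactly where the hand-coded-reward difficulty above bites.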
Planned summary of the podcast episode for the Alignment Newsletter:
Planned opinion:
Rohin’s opinion:
Thanks for the great summary! A few questions about it
1. You call mesa-optimization “the best current case for AI risk”. As Ben noted at the time of the interview, this argument hasn’t yet really been fleshed out in detail. And as Rohin subsequently wrote in his opinion of the mesa-optimization paper, “it is not yet clear whether mesa optimizers will actually arise in practice”. Do you have thoughts on what exactly the “Argument for AI Risk from Mesa-Optimization” is, and/or a pointer to the places where, in your opinion, that argument has been made (aside from the original paper)?
2. I don’t entirely understand the remark about the reference class of ‘new intelligent species’. What species are in that reference class? Many species which we regard as quite intelligent (orangutans, octopuses, New Caledonian crows) aren’t risky. Probably, you mean a reference class like “new species as smart as humans” or “new ‘generally intelligent’ species”. But then we have a very small reference class and it’s hard to know how strong that prior should be. In any case, how were you thinking of this reference class argument?
3. ‘The Boss Baby’, starring Alec Baldwin, is available for rental on Amazon Prime Video for $3.99. I suppose this is more of a comment than a question.
1. Oh man, I wish. :( I do think there are some people working on making a crisper case, and hopefully as machine learning systems get more powerful we might even see early demonstrations. I think the crispest statement of it I can make is “Similar to how humans now optimize for goals other than the genetic fitness that evolution ‘wants,’ other systems which contain optimizers may start optimizing for goals other than the ones specified by the outer optimizer.”
Another related concept that I’ve seen (but haven’t followed up on) is what johnswentworth calls “Demons in Imperfect Search”, which basically advocates for the possibility of runaway inner processes in a variety of imperfect search spaces (not just ones that have inner optimizers). This arguably happened with metabolic reactions early in the development of life, with greedy genes, and with managers in companies. Basically, I’m convinced that we don’t know enough about how powerful search mechanisms work to be sure that we’re going to end up somewhere we want.
I should also say that I think these kinds of arguments feel like the best current cases for AI alignment risk. Even if AI systems end up perfectly aligned with human goals, I’m still quite worried about what the balance of power looks like in a world with lots of extremely powerful AIs running around.
2. Yeah, here I should have said ‘new species more intelligent than us’. I think I was thinking of two things here:
Humans causing the extinction of less intelligent species
Some folk intuition around intelligent aliens plausibly causing human extinction (I admit this isn’t the best example...).
Mostly I meant here that since we don’t actually have examples of existentially risky technology (yet), putting AI in the reference class of ‘new technology’ might make you think it’s extremely implausible that it would be existentially bad. But we do have examples of species causing the extinction of lesser species (and scarier intuitions around it), so in the sense that AI is a new, more intelligent species, we should think there’s at least some chance that it could be existentially bad.
3. Obviously not the same thing, but ‘The Boss Baby: Back in Business’, a spin-off of the original, not starring Alec Baldwin, is available on Netflix right now. I’ve watched about 20 seconds of it and feel comfortable saying that the money would be better spent on AI safety and governance work.
I have nothing to add to the discussion but wanted to say that this was my favourite episode, which given how big a fan I am of the podcast is a very high bar.
Thanks so much for letting me know! I’m really glad to hear :)
How entrenched do you think are old ideas about AI risk in the AI safety community? Do you think that it’s possible to have a new paradigm quickly given relevant arguments?
I’d guess that, like most scientific endeavours, there are many social aspects that make people more biased toward their own old ways of thinking. Research agendas and institutions are focused on some basic assumptions—which, if changed, could be disruptive to the people involved or the organisation. However, there seems to be a lot of engagement with the underlying questions about the paths to superintelligence and the consequences thereof, and the research community today is also heavily involved with the rationality community—both of these make me hopeful that more minds can be changed given appropriate argumentation.
I actually don’t think they’re very entrenched!
I think that, today, most established AI researchers have fairly different visions of the risks from AI—and of the problems that they need to solve—than the primary vision discussed in Superintelligence and in classic Yudkowsky essays. When I’ve spoken to AI safety researchers about issues with the “classic” arguments, I’ve encountered relatively low levels of disagreement. Arguments that heavily emphasize mesa-optimization or arguments that are more in line with this post seem to be more influential now. (The safety researchers I know aren’t a random sample, though, so I’d be interested in whether this sounds off to anyone in the community.)
I think that “classic” ways of thinking about AI risk are now more prominent outside the core AI safety community than they are within it. I think that they have an important impact on community beliefs about prioritization, on individual career decisions, etc., but I don’t think they’re heavily guiding most of the research that the safety community does today.
(Unfortunately, I probably don’t make this clear in the podcast.)
What is your theory of change for work on clarifying arguments for AI risk?
Is the focus more on immediate impact on funding/research or on the next generation? Do you feel this is important more to direct work to the most important paths or to understand how sure are we about all this AI stuff and grow the field or deprioritize it accordingly?
I think the work is mainly useful for EA organizations making cause prioritization decisions (how much attention should they devote to AI risk relative to other cause areas?) and young/early-stage people deciding between different career paths. The idea is mostly to help clarify and communicate the state of arguments, so that more fully informed and well-calibrated decisions can be made.
A couple other possible positive impacts:
Developing and shifting to improved AI risk arguments—and publicly acknowledging uncertainties/confusions—may, at least in the long run, cause other people to take the EA community and existential-risk-oriented AI safety communities more seriously. As one particular point, I think that a lot of vocal critics (e.g. Pinker) are mostly responding to the classic arguments. If the classic arguments actually have significant issues, then it’s good to acknowledge this; if other arguments (e.g. these) are more compelling, then it’s good to work them out more clearly and communicate them more widely. As another point, I think that sharing this kind of work might reduce perceptions that the EA community is more group-think-y/unreflective than it actually is. I know that people have sometimes pointed to my EAG talk from a couple years back, for example, in response to concerns that the EA community is too uncritical in its acceptance of AI risk arguments.
I think that it’s probably useful for the AI safety community to have a richer and more broadly shared understanding of different possible “AI risk threat models”; presumably, this would feed into research agendas and individual prioritization decisions to some extent. I think that work that analyzes newer AI risk arguments, especially, would be useful here. For example, it seems important to develop a better understanding of the role that “mesa-optimization” plays in driving existential risk.
(There’s also the possibility of negative impact, of course: focusing too much on the weaknesses of various arguments might cause people to downweight or de-prioritize risks more than they actually should.)
I haven’t thought very much about the timelines on which this kind of work is useful, but I think it’s plausible that the delayed impact on prioritization and perception is more important than the immediate impact.
You say that there hasn’t been much literature arguing for Sudden Emergence (the claim that AI progress will look more like the brain-in-a-box scenario than the gradual-distributed-progress scenario). I am interested in writing some things on the topic myself, but currently think it isn’t decision-relevant enough to be worth prioritizing. Can you say more about the decision-relevance of this debate?
Toy example: Suppose I write something that triples everyone’s credence in Sudden Emergence. How does that change what people do, in a way that makes the world better (or worse, depending on whether Sudden Emergence is true or not!)
I would be really interested in you writing on that!
It’s a bit hard to say what the specific impact would be, but beliefs about the magnitude of AI risk of course play at least an implicit role in lots of career/research-focus/donation decisions within the EA community; these beliefs also affect the extent to which broad EA orgs focus on AI risk relative to other cause areas. And I think that people’s beliefs about the Sudden Emergence hypothesis at least should have a large impact on their level of doominess about AI risk; I regard it as one of the biggest cruxes. So I’d at least be hopeful that, if everyone’s credences in Sudden Emergence changed by a factor of three, this had some sort of impact on the portion of EA attention devoted to AI risk. I think that credences in the Sudden Emergence hypothesis should also have an impact on the kinds of risks/scenarios that people within the AI governance and safety communities focus on.
I don’t, though, have a much more concrete picture of the influence pathway.
OK, thanks. Not sure I can pull it off, that was just a toy example. Probably even my best arguments would have a smaller impact than a factor of three, at least when averaged across the whole community.
I agree with your explanation of the ways this would improve things… I guess I’m just concerned about opportunity costs.
Like, it seems to me that a tripling of credence in Sudden Emergence shouldn’t change what people do by more than, say, 10%. When you factor in tractability, neglectedness, personal fit, doing things that are beneficial under both Sudden Emergence and non-Sudden Emergence, etc., a factor of 3 in the probability of Sudden Emergence probably won’t change the bottom line for what 90% of people should be doing with their time. For example, I’m currently working on acausal trade stuff, and I think that if my credence in Sudden Emergence decreased by a factor of 3 I’d still keep doing what I’m doing.
Meanwhile, I could be working on AI safety directly, or I could be working on acausal trade stuff (which I think could plausibly lead to a more than 10% improvement in EA effort allocation. Or at least, more plausibly than working on Sudden Emergence, it seems to me right now).
I’m very uncertain about all this, of course.
Did you end up writing this post? (I looked through your LW posts since the timestamp of the parent comment but it doesn’t seem like you did.) If not, I would be interested in seeing some sort of outline or short list of points even if you don’t have time to write the full post.
Thanks for following up. Nope, I didn’t write it, but comments like this one and this one are making me bump it up in priority! Maybe it’s what I’ll do next.
How confident are you in the brief arguments for rapid and general progress outlined in section 1.1 of GovAI’s research agenda? Have the arguments been developed further?
What is your overall probability that we will, in this century, see progress in artificial intelligence that is at least as transformative as the industrial revolution?
What is your probability for the more modest claim that AI will be at least as transformative as, say, electricity or railroads?
I think this is a little tricky. The main way in which the Industrial Revolution was unusually transformative is that, over the course of the IR, there were apparently unusually large pivots in several important trendlines. Most notably, GDP-per-capita began to increase at a consistently much higher rate. In more concrete terms, though, the late nineteenth and early twentieth centuries probably included even greater technological transformations.
From David Weil’s growth textbook (pg. 265-266):
I think it’s a bit unclear, then, how to think about AI progress that’s at least as transformative as the IR. If economic growth rates radically increase in the future, then we might apply the label “transformative AI” to the period where the change in growth rates becomes clear. But it’s also possible that growth rates won’t ultimately go up that much. Maybe the trend in the labor force participation rate is the one to look at, since there’s a good chance it will eventually decline to nearly zero; but it’s also possible the decline will be really protracted, without a particularly clean pivot.
None of this is an answer to your question, of course. (I will probably circle back and try to give you a probability later.) But I am sort of wary of “transformative AI” as a forecasting target; if I was somehow given access to a video recording of the future of AI, I think it’s possible I would have a lot of trouble labeling the decade where “AI progress as transformative as the Industrial Revolution” has been achieved.
Also a little bit tricky, partly because electricity underlies AI. As one operationalization, then, suppose we were to ask an economist in 2100: “Do you think that the counterfactual contribution of AI to American productivity growth between 2010 and 2100 was at least as large as the counterfactual contribution of electricity to American productivity growth between 1900 and 1940?” I think that the economist would probably agree—let’s say, 50% < p < 75%—but I don’t have a very principled reason for thinking this and might change my mind if I thought a bit more.
I agree that it’s tricky, and am quite worried about how the framings we use may bias our views on the future of AI. I like the GDP/productivity growth perspective but feel free to answer the same questions for your preferred operationalisation.
Another possible framing: given a crystal ball showing the future, how likely is it that people would generally say that AI is the most important thing that happens this century?
Interesting. So you generally expect (well, with 50-75% probability) AI to become a significantly bigger deal, in terms of productivity growth, than it is now? I have not looked into this in detail but my understanding is that the contribution of AI to productivity growth right now is very small (and less than electricity).
If yes, what do you think causes this acceleration? It could simply be that AI is early-stage right now, akin to electricity in 1900 or earlier, and the large productivity gains arise when key innovations diffuse through society on a large scale. (However, many forms of AI are already widespread.) Or it could be that progress in AI itself accelerates, or perhaps linear progress in something like “general intelligence” translates to super-linear impact on productivity.
I mostly have in mind the idea that AI is “early-stage,” as you say. The thought is that “general purpose technologies” (GPTs) like electricity, the steam engine, the computer, and (probably) AI tend to have very delayed effects.
For example, there was really major progress in computing in the middle of the 20th century, and lots of really major inventions throughout the 70s and 80s, but computers didn’t have a noticeable impact on productivity growth until the 90s. The first serious electric motors were developed in the mid-19th century, but electricity didn’t have a big impact on productivity until the early 20th. There was also a big lag associated with steam power; it didn’t really matter until the middle of the 19th century, even though the first steam engines were developed centuries earlier.
So if AI takes several decades to have a large economic impact, this would be consistent with analogous cases from history. It can take a long time for the technology to improve, for engineers to get trained up, for complementary inventions to be developed, for useful infrastructure to be built, for organizational structures to get redesigned around the technology, etc. I don’t think it’d be very surprising if 80 years was enough for a lot of really major changes to happen, especially since the “time to impact” for GPTs seems to be shrinking over time.
Then I’m also factoring in the additional possibility that there will be some unusually dramatic acceleration, which distinguishes AI from most earlier GPTs.
That seems plausible and is also consistent with Amara’s law (the idea that the impact of technology is often overestimated in the short run and underestimated in the long run).
I’m curious how likely you think it is that productivity growth will be significantly higher (i.e. levels at least comparable with electricity) for any reason, not just AI. I wouldn’t give this much more than 50%, as there is also some evidence that stagnation is on the cards (see e.g. 1, 2). But that would mean that you’re confident that the cause of higher productivity growth, assuming that this happens, would be AI? (Rather than, say, synthetic biotechnology, or genetic engineering, or some other technological advance, or some social change resulting in more optimisation for productivity.)
While AI is perhaps the most plausible single candidate, it’s still quite unclear, so I’d maybe say it’s 25-30% likely that AI in particular will cause significantly higher levels of productivity growth this century.
In the episode you say:
I was wondering what you think of the potential of broader attempts to influence the long-run future (e.g. promoting positive values, growing the EA movement) as opposed to the more targeted attempts to reduce x-risks that are most prominent in the EA movement.
In brief, I feel positively about these broader attempts!
It seems like some of these broad efforts could be useful, instrumentally, for reducing a number of different risks (by building up the pool of available talent, building connections, etc.) The more unsure about what risks matter most, as well, the more valuable broad capacity-building efforts are.
It’s also possible that some shifts in values, institutions, or ideas could actually be long-lasting. (This is something that Will MacAskill, for example, is currently interested in.) If this is right, then I think it’s at least conceivable that trying to positively influence future values/institutions/ideas is more important than reducing the risk of global catastrophes: the goodness of different possible futures might vary greatly.
Thanks for your reply! I also feel positively about broader attempts and am glad that these are being taken more seriously by prominent EA thinkers.
In “Unpacking Classic Arguments for AI Risk”, you defined The Process Orthogonality Thesis as: The process of imbuing a system with capabilities and the process of imbuing a system with goals are orthogonal.
Then you gave several examples of cases where this does not hold: the thermostat, Deep Blue, OpenAI Five, the human brain. Could you elaborate a bit on these examples?
I am a bit confused about these. In Deep Blue, I think that most of the progress came from general computational advances, with the evaluation function applied later. The human brain’s value system can be changed quite a lot without apparent changes in the capacity to achieve one’s goals (consider psychopaths as an extreme example here).
Also, general RL systems have had successes in applying themselves to many different circumstances. Say, the work of DeepMind on Atari. Doesn’t that point in favor of the Process Orthogonality Thesis?
I think that my description of the thesis (and, actually, my own thinking on it) is a bit fuzzy. Nevertheless, here’s roughly how I’m thinking about it:
First, let’s say that an agent has the “goal” of doing X if it’s sometimes useful to think of the system as “trying to do X.” For example, it’s sometimes useful to think of a person as “trying” to avoid pain, be well-liked, support their family, etc. It’s sometimes useful to think of a chess program as “trying” to win games of chess.
Agents are developed through a series of changes. In the case of a “hand-coded” AI system, the changes would involve developers adding, editing, or removing lines of code. In the case of an RL agent, the changes would typically involve a learning algorithm updating the agent’s policy. In the case of human evolution, the changes would involve genetic mutations.
If the “process orthogonality thesis” were true, then this would mean that we can draw a pretty clean line between “changes that affect an agent’s capabilities” and “changes that affect an agent’s goals.” Instead, I want to say that it’s really common for changes to affect both capabilities and goals. In practice, we can’t draw a clean line between “capability genes” and “goal genes” or between “RL policy updates that change goals” and “RL policy updates that change capabilities.” Both goals and capabilities tend to take shape together.
That being said, it is true that some changes do, intuitively, mostly just affect either capabilities or goals. I wouldn’t be surprised, for example, if it’s possible to introduce a minus sign somewhere into Deep Blue’s code and transform it into a system that looks like it’s trying to lose at chess; although the system will probably be less good at losing than it was at winning, it may still be pretty capable. So the processes of changing a system’s capabilities and changing its goals can still come apart to some degree.
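The minus-sign intuition can be made concrete with a toy sketch (purely illustrative; the function names are made up and nothing here corresponds to Deep Blue’s actual code). The search machinery, the “capability,” is untouched, and a single sign flip in the evaluation function reverses the apparent “goal”:

```python
# Toy sketch: a one-ply game-playing "agent" that picks whichever move
# scores highest under its evaluation function.

def choose_move(legal_moves, evaluate):
    # The agent's "capability": search over the legal moves and keep the
    # one that the evaluation function scores highest.
    return max(legal_moves, key=evaluate)

def material_score(move):
    # Stand-in evaluation function: higher is better for the player.
    return move

# The same search machinery, with the evaluation negated, "tries to lose":
best_move = choose_move([1, 5, 3], material_score)                 # picks 5
worst_move = choose_move([1, 5, 3], lambda m: -material_score(m))  # picks 1
```

The point being illustrated is just that flipping one sign changes the goal while leaving most of the capability (the search) intact, which is why this kind of change is an exception to the general pattern.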
It’s also possible to do fundamental research and engineering work that is useful for developing a wide variety of systems. For example, hardware progress has, in general, made it easier to develop highly competent RL agents in all sorts of domains. But, when it comes time to train a new RL agent, its goals and capabilities will still take shape together.
(Hope that clarifies things at least a bit!)
Thanks! This does clarify things for me, and I think that the definition of a “goal” is very helpful here. I do still have some uncertainty here about the claim of process orthogonality which I can better understand:
Let’s define an “instrumental goal” as a goal X for which there is a goal Y such that whenever it is useful to think of the agent as “trying to do X” it is in fact also useful to think of it as “trying to do Y”; In this case we can think that X is instrumental to Y. Instrumental goals can be generated at the development phase or by the agent itself (implicitly or explicitly).
I think that the (non-process) orthogonality thesis does not hold with respect to instrumental goals. A better selection of instrumental goals will enable better capabilities, and with greater capabilities comes greater planning capacity.
Therefore, the process orthogonality thesis does not hold as well for instrumental goals. This means that instrumental goals are usually not the goals of interest when trying to discern between process and non-process orthogonality theses, and we should focus on terminal goals (those which aren’t instrumental).
In the case of an RL agent or Deep Blue, I can only see one terminal goal—maximize the defined score or win at chess. These won’t really change together with capabilities.
I thought a bit about humans, but I feel that this is much more complicated and needs more nuanced definitions of goals. (Is avoiding suffering a terminal goal? It seems that way, but who is doing the thinking in which it is useful to think of one thing or another as a goal? Perhaps the goal is to reduce specific neuronal activity, for which avoiding suffering is merely instrumental?)
I’m actually not very optimistic about a more complex or formal definition of goals. In my mind, the concept of a “goal” is often useful, but it’s sort of an intrinsically fuzzy or fundamentally pragmatic concept. I also think that, in practice, the distinction between an “intrinsic” and “instrumental” goal is pretty fuzzy in the same way (although I think your definition is a good one).
Ultimately, agents exhibit behaviors. It’s often useful to try to summarize these behaviors in terms of what sorts of things the agent is fundamentally “trying” to do and in terms of the “capabilities” that the agent brings to bear. But I think this is just sort of a loose way of speaking. I don’t really think, for example, that there are principled/definitive answers to the questions “What are all of my cat’s goals?”, “Which of my cat’s goals are intrinsic?”, or “What’s my cat’s utility function?” Even if we want to move beyond behavioral definitions of goals, to ones that focus on cognitive processes, I think these sorts of questions will probably still remain pretty fuzzy.
(I think that this way of thinking—in which evolutionary or engineering selection processes ultimately act on “behaviors,” which can only somewhat informally or imprecisely be described in terms of “capabilities” and “goals”—also probably has an influence on my relative optimism about AI alignment.)
I was thinking it over, and I think that I was implicitly assuming that process orthogonality follows from orthogonality in some form or something like that.
The Deep Blue question still holds, I think.
The human brain should be thought of as designed by evolution. What I wrote concerns strictly (non-process) orthogonality. For example, the cognitive breakthrough might have been the enlargement of the neocortex, while civilization was responsible for the values.
I guess that the point is that there are examples of non-orthogonality? (Say, the evaluation function of Deep Blue being critical for its success.)
Do you still think that Robin Hanson’s critique of Christiano’s scenario is worth exploring in more detail?
I do think there’s still more thinking to be done here, but, since I recorded the episode, Alexis Carlier and Tom Davidson have actually done some good work in response to Hanson’s critique. I was pretty persuaded of their conclusion:
On a scale from 1 to 10 what would you rate The Boss Baby? :)
I actually haven’t seen The Boss Baby. A few years back, this ad was on seemingly all of the buses in Oxford for a really long time. Something about it made a lasting impression on me. Maybe it was the smug look on the boss baby’s face.
Reviewing it purely on priors, though, I’ll give it a 3.5 :)
What priorities for TAI strategy does your skepticism towards the classic work dictate? Some have argued that we have greater leverage over scenarios with discrete/discontinuous deployment.
From a long-termist perspective, I think that—the more gradual AI progress is—the more important concerns about “bad attractor states” and “instability” become relative to concerns about AI safety/alignment failures. (See slides).
I think it is probably true, though, that AI safety/alignment risk is more tractable than these other risks. To some extent, the solution to safety risk is for enough researchers to put their heads down and work really hard on technical problems; there’s probably some amount of research effort that would be enough, even if this quantity is very large. In contrast, the only way to avoid certain risks associated with “bad attractor states” might be to establish stable international institutions that are far stronger than any that have come before; there might be structural barriers, here, that no amount of research effort or insight would be enough to overcome.
I think it’s at least plausible that the most useful thing for AI safety and governance researchers to do is ultimately to focus on brain-in-a-box-ish AI risk scenarios, even if they’re not very likely relative to other scenarios. (This would still entail some amount of work that’s useful for multiple scenarios; there would also be instrumental reasons, related to skill-building and reputation-building, to work on present-day challenges.) But I have some not-fully-worked-out discomfort with this possibility.
One thing that I do feel comfortable saying is that more effort should go into assessing the tractability of different influence pathways, the likelihood of different kinds of risks beyond the classic version of AI risk, etc.
What writings have influenced your thinking the most?
What are the arguments that speeding up economic growth has a positive long run impact?
Partly, I had in mind a version of the astronomical waste argument: if you think that we should basically ignore the possibility of preventing extinction or premature stagnation (e.g. for Pascal’s mugging reasons), and you’re optimistic about where the growth process is bringing us, then maybe we should just try to develop an awesome technologically advanced civilization as quickly as possible so that more people can ultimately live in it. IIRC Tyler Cowen argues for something at least sort of in this ballpark, in Stubborn Attachments. I think you’d need pretty specific assumptions to make this sort of argument work, though.
Jumping the growth process forward can also reduce some existential risks. The risk of humanity getting wiped out by natural disasters, like asteroids, probably gets lower the more technologically sophisticated we become; so, for example, kickstarting the Industrial Revolution earlier would have meant a shorter “time of peril” for natural risks. Leo Aschenbrenner’s paper “Economic Growth and Existential Risk” considers a more complicated version of this argument in the context of anthropogenic risks, which takes into account the fact that growth can also contribute to these risks.
What do you think is the most important role people without technical/quantitative educational backgrounds can play in AI safety/governance?
I don’t have a single top pick; I think this will generally depend on a person’s particular interests, skills, and “career capital.”
I do just want to say, though, that I don’t think it’s at all necessary to have a strong technical background to do useful AI governance work. For example, if I remember correctly, most of the research topics discussed in the “AI Politics” and “AI Ideal Governance” sections of Allan Dafoe’s research agenda don’t require a significant technical background. A substantial portion of people doing AI policy/governance/ethics research today also have a primarily social science or humanities background.
Just as one example that’s salient to me, because I was a co-author on it, I don’t think anything in this long report on distributing the benefits of AI required substantial technical knowledge or skills.
(That being said, I do think it’s really important for pretty much anyone in the AI governance space to understand at least the core concepts of machine learning. For example, it’s important to know things like the difference between “supervised” and “unsupervised” learning, the idea of stochastic gradient descent, the idea of an “adversarial example,” and so on. Fortunately, I think this is pretty do-able even without a STEM background; it’s mostly the concepts, rather than the math, that are important. Certain kinds of research or policy work certainly do require more in-depth knowledge, but a lot of useful work doesn’t.)
Hi Ben—this episode really gave me a lot to think about! Of the ‘three classic arguments’ for AI X-risk you identify, I argued in a previous post that the ‘discontinuity premise’ comes from taking too literally a high-level argument that should only be used to establish that sufficiently capable AI will produce very fast progress, and assuming the ‘fast progress’ has to happen suddenly and in a specific AI.
Your discussion of the other two arguments led me to conclude that the same sort of mistake is at work in all of them, as I explain here—each is (I think) a case of ‘directly applying a (correct) abstract argument (incorrectly) to the real world’. So we shouldn’t say that the classic arguments are wrong, just overextended/incorrectly applied, as I argue here.
If rapid capability gain, the orthogonality thesis and instrumental convergence are good reasons to suggest AI might pose an existential risk, but were just interpreted too literally, and it’s also true that the ‘new’ arguments make use of these old arguments along with further premises and evidence, then that should raise our confidence that some basic issues have been correctly dealt with since the 2000s. You suggest something like this in the podcast episode, but the discussion never got far into exactly what the underlying intuitions might be:
Do you think there actually is an ‘intuitive core’ to the old arguments that is correct?
Hi Sammy,
Thanks for the links—both very interesting! (I actually hadn’t read your post before.)
I’ve tended to think of the intuitive core as something like: “If we create AI systems that are, broadly, more powerful than we are, and their goals diverge from ours, this would be bad—because we couldn’t stop them from doing things we don’t want. And it might be hard to ensure, as we’re developing increasingly sophisticated AI systems, that there aren’t actually subtle but extremely important divergences in some of these systems’ goals.”
At least in my mind, both the classic arguments and the arguments in “What Failure Looks Like” share this common core. Mostly, the challenge is to explain why it would be hard to ensure that there wouldn’t be subtle-but-extremely-important divergences; there are different possible ways of doing this. For example: Although an expectation of discontinuous (or at least very fast) progress is a key part of the classic arguments, I don’t consider it part of the intuitive core; the “What Failure Looks Like” picture doesn’t necessarily rely on it.
I’m not sure if there’s actually a good way to take the core intuition and turn it into a more rigorous/detailed/compelling argument that really works. But I do feel that there’s something to the intuition; I’ll probably still feel like there’s something to the intuition, even if I end up feeling like the newer arguments have major issues too.
[[Edit: An alternative intuitive core, which I sort of gesture at in the interview, would simply be: “AI safety and alignment issues exist today. In the future, we’ll have crazy powerful AI systems with crazy important responsibilities. At least the potential badness of safety and alignment failures should scale up with these systems’ power and responsibility. Maybe it’ll actually be very hard to ensure that we avoid the worst-case failures.”]]
Hi Ben,
Thanks for the reply! I think the intuitive core that I was arguing for is more-or-less just a more detailed version of what you say here:
The key difference is that I don’t think the orthogonality thesis, instrumental convergence, or progress being eventually fast are wrong—you just need extra assumptions in addition to them to get to the expectation that AI will cause a catastrophe.
My point in this comment (and follow up) was that the Orthogonality Thesis, Instrumental Convergence, and eventual fast progress are essential for any argument about AI risk, even if you also need other assumptions in there—you need to know the OT will apply to your method of developing AI, and you need more specific reasons to think the particular goals of your system look like those that lead to instrumental convergence.
If you approached the classic arguments with that framing, then perhaps it begins to look like less a matter of them being mistaken and more a case of having a vague philosophical picture that then got filled in with more detailed considerations—that’s how I see the development over the last 10 years.
The only mistake was in mistaking the vague initial picture for the whole argument—and that was a mistake, but it’s not the same kind of mistake as just having completely false assumptions. You might compare it to the early development of a new scientific field. Perhaps seeing it that way might lead you to have a different view about how much to update against trusting complicated conceptual arguments about AI risk!
This is how Stuart Russell likes to talk about the issue, and I have a go at explaining that line of thinking here.
Quick belated follow-up: I just wanted to clarify that I also don’t think that the orthogonality thesis or instrumental convergence thesis are incorrect, as they’re traditionally formulated. I just think they’re not nearly sufficient to establish a high level of risk, even though, historically, many presentations of AI risk seemed to treat them as nearly sufficient. Insofar as there’s a mistake here, the mistake concerns the way conclusions have been drawn from these theses; I don’t think the mistake is in the theses themselves. (I may not stress this enough in the interview/slides.)
On the other hand, progress/growth eventually becoming much faster might be wrong (this is an open question in economics). The ‘classic arguments’ also don’t just predict that growth/progress will become much faster. In the FOOM debate, for example, both Yudkowsky and Hanson start from the position that growth will become much faster; their disagreement is about how sudden, extreme, and localized the increase will be. If growth is actually unlikely to increase in a sudden, extreme, and localized fashion, then this would be a case of the classic arguments containing a “mistaken” (not just insufficient) premise.
Wow, I am quite surprised it took a year to produce. @80K, does it always take so long?
There’s often a few months between recording and release and we’ve had a handful of episodes that took a frustratingly long time to get out the door, but never a year.
The time between the first recording and release for this one was actually 9 months. The main reason was Howie and Ben wanted to go back and re-record a number of parts they didn’t think they got right the first time around, and it took them a while to both be free and in the same place so they could do that.
A few episodes were also pushed back so we could get out COVID-19 interviews during the peak of the epidemic.
Wait, what’s your probability that we’re past the peak (in terms of, eg, daily worldwide deaths)?
I think you know what I mean — the initial peak in the UK, the country where we are located, in late March/April.
Sorry if I sounded mean! I genuinely didn’t know what you meant! I live in the US and I assumed that most of 80k’s audience will be more concerned about worldwide numbers or their home country’s than that of 80k’s “base.” (I also didn’t consider the possibility that there are reasons other than audience interest for you to be prioritizing certain podcasts, like logistics.)
I really appreciate a lot of your interviews on covid-19, btw. Definitely didn’t intend my original comment in a mean way!
Poll time:
FWIW, I didn’t really think about what Rob meant when I read his first comment, but when I read Linch’s question, I thought “Eh, Rob probably meant something like ‘the point at which interest, confusion, and urgency seemed especially high, as people were now realising this was huge but hadn’t yet formed clear views on what to do about it’.” So Linch’s question felt somewhat off topic or unnecessary, but also not like it had an obvious answer (and apparently my guess about the answer was wrong).
(But I can also see why Linch saw the question as important, and didn’t think Linch’s question sounded snarky or whatever.)
As Michael says, common sense would indicate I must have been referring to the initial peak, or the peak in interest/panic/policy response, or the peak in the UK/Europe, or peak where our readers are located, or — this being a brief comment on an unrelated topic — just speaking loosely and not putting much thought into my wording.
FWIW it looks like globally the rate of new cases hasn’t peaked yet. I don’t expect the UK or Europe will return to a situation as bad as the one they went through in late March and early April. Unfortunately the US and Latin America are already doing worse than it was then.
This sounds like a status move. I asked a sincere question and maybe I didn’t think too carefully when I asked it, but there’s no need to rub it in.
Thanks, I appreciate the clarification! :)
Upvote this comment if Robert referring to “peak of the epidemic” as the initial peak in the UK was not a hypothesis that occurred to you.
Upvote this comment if you originally thought that Robert was referring to “peak of the epidemic” as the initial peak in the UK.
Hi Ben,
You suggested in the podcast that it’s not clear how to map some of the classic arguments—and especially their manifestation in thought experiments like the paper clip maximizer—to contemporary machine learning methods. I’d like to push back on that view.
Deep reinforcement learning is a popular contemporary ML approach for training agents that act in simulated and real-world environments. In deep RL, an agent is trained to maximize its reward (more precisely, the sum of discounted rewards over time steps), which perfectly fits the “agent” abstraction that is used throughout the book Superintelligence. I don’t see how classic arguments about the behavior of utility maximizing agents fail to apply to deep RL agents. Suppose we replace every occurrence of the word “agent” in the classic arguments with “deep RL agent”; are the modified arguments false? Here’s the result of doing just that for the instrumental convergence thesis (the original version is from Superintelligence, p. 109):
For the sake of concreteness, consider the algorithm that Facebook uses to create the feed that each user sees (which is an example that Stuart Russell has used). Perhaps there’s very little public information about that algorithm, but it’s reasonable to guess they’re using some deep RL algorithm and a reward function that roughly corresponds to user engagement. Conditioned on that, do you agree that in the limit (i.e. when using whatever algorithm and architecture they’re currently using, at a sufficiently large scale), the arguments about instrumental convergence seem to apply?
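To make the “agent” abstraction being invoked here concrete: in deep RL, the training objective is the (expected) sum of discounted rewards over time steps. A minimal sketch of that objective — the per-step “engagement” rewards and all names here are hypothetical illustrations, not anything about Facebook’s actual system:

```python
# Toy sketch of the deep RL objective under discussion: an agent is
# trained to maximize the sum of gamma**t * r_t over time steps t.
# The reward values below are made up for illustration.

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over time steps t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. hypothetical per-step engagement rewards over a short session
rewards = [1.0, 0.5, 0.25]
print(discounted_return(rewards, gamma=0.9))  # 1.0 + 0.45 + 0.2025 = 1.6525
```

Everything in the classic “utility maximizer” arguments is then a claim about policies that score highly on this quantity.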
Regarding the treacherous turn problem, you said:
Suppose Facebook’s scaled-up-algorithm-for-feed-creation would behave deceptively in some way. Suppose it uses some unacceptable technique to increase user engagement (e.g. making users depressed), but it refrains from doing so whenever there’s a risk that Facebook engineers would notice. How confident should we be that Facebook engineers would notice the deceptive behavior (i.e. the avoidance of unacceptable behavior in situations where the unacceptable behavior might be noticed)?
Hi Ofer,
Thanks for the comment!
I actually do think that the instrumental convergence thesis, specifically, can be mapped over fine, since it’s a fairly abstract principle. For example, this recent paper formalizes the thesis within a standard reinforcement learning framework. I just think that the thesis at most weakly suggests existential doom, unless we add in some other substantive theses. I have some short comments on the paper, explaining my thoughts, here.
Beyond the instrumental convergence thesis, though, I do think that some bits of the classic arguments are awkward to fit onto concrete and plausible ML-based development scenarios: for example, the focus on recursive self-improvement, and the use of thought experiments in which natural language commands, when interpreted literally and single-mindedly, lead to unforeseen bad behaviors. I think that Reframing Superintelligence does a good job of pointing out some of the tensions between classic ways of thinking and talking about AI risk and current/plausible ML engineering practices.
This may not be what you have in mind, but: I would be surprised if the FB newsfeed selection algorithm became existentially damaging (e.g. omnicidal), even in the limit of tremendous amounts of training data and compute. I don’t know how the algorithm actually works, but as a simplification: let’s imagine that it produces an ordered list of posts to show a user, from the set of recent posts by their friends, and that it’s trained using something like the length of the user’s FB browsing session as the reward. I think that, if you kept training it, nothing too weird would happen. It might produce some unintended social harms (like addiction, polarization, etc.), but the system wouldn’t, in any meaningful sense, have long-run objectives (due to the shortness of sessions). It also probably wouldn’t have the ability or inclination to manipulate the external world in pursuit of complex schemes. Figuring out how to manipulate the external world in precise ways would require a huge amount of very weird exploration, deep in a region of the space of possible policies where most policies are terrible at maximizing reward. In the unlikely event that the necessary exploration happened, and the policy started moving in this direction, I think it would be conspicuous well before the newsfeed selection algorithm did something like kill everyone to prevent ongoing FB sessions from ending (if that is even possible given the system’s limited space of possible actions).
Thanks for the thoughtful reply!
Would you say that the treacherous turn argument can also be mapped over to contemporary ML methods (similarly to the instrumental convergence thesis) due to it being a fairly abstract principle?
Also, why is “recursive self-improvement” awkward to fit onto concrete and plausible ML-based development scenarios? (If we ignore the incorrect usage of the word “recursive” here; the concept should have been called “iterative self-improvement”.) Consider the work that has been done on neural architecture search via reinforcement learning (this 2016 paper on that topic currently has 1,775 citations on Google Scholar, including 560 citations from 2020). It doesn’t seem extremely unlikely that such a technique will be used, at some point in the future, in some iterative self-improvement setup, in a way that may cause an existential catastrophe.
Regarding the example with the agent that creates the feed of each FB user:
I agree that the specified time horizon (and discount factor) is important, and that a shorter time horizon seems safer. But note that FB is incentivized to specify a long time horizon. For example, suppose the feed-creation-agent shows a user a horrible post by some troll, which causes the user to spend many hours in a heated back-and-forth with said troll. Consequently, the user decides FB sucks and ends up getting off FB for many months. If the specified time horizon is sufficiently short (or the discount factor is sufficiently small), then from the perspective of the training process the agent did well when it showed the user that post, and the agent’s policy network will be updated in a way that makes such decisions more likely. FB doesn’t want that. FB’s actual discount factor for users’ engagement time may be very close to 1 (i.e. a user spending an hour on FB today is not 100x more valuable to FB than the user spending an hour on FB next month). This situation is not unique to FB. Many companies that use RL agents that act in the real world have long-term preferences with respect to how their RL agents act.
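The claim that “FB’s actual discount factor may be very close to 1” can be made quantitative. A hedged sketch (per-day discounting and the specific gamma values are illustrative assumptions, not anything known about FB’s training setup): how much a fixed amount of engagement is worth to the training process when it arrives 30 days later, under different per-day discount factors.

```python
# Value of a fixed reward delayed by `days` time steps, under a
# per-step discount factor gamma. Purely illustrative numbers.

def delayed_value(gamma, days):
    return gamma ** days

for gamma in (0.9, 0.99, 0.999):
    print(gamma, round(delayed_value(gamma, 30), 4))
```

With gamma = 0.9 per day, engagement a month out is worth only a few percent of immediate engagement, so the troll-post scenario above looks fine to the training process; with gamma = 0.999 it retains about 97% of its value, which is much closer to the company’s actual preferences — and which gives the agent long-horizon incentives.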
Regarding the “inclination” part: Manipulating the “external world” (what other environment does the feed-creation-agent model?) in the pursuit of certain complex schemes is very useful for maximizing the user engagement metric (that by assumption corresponds to the specified reward function). Also, I don’t see how the “wouldn’t have the ability” part is justified in the limit as the amount of training compute (and architecture size) and data grows to infinity.
We expect the training process to update the policy network in a way that makes the agent more intelligent (i.e. better at modeling the world and causal chains therein, better at planning, etc.), because that is useful for maximizing the sum of discounted rewards. So I don’t understand how your above argument works, unless you’re arguing that there’s some upper bound on the level of intelligence that we can expect deep RL algorithms to yield, and that upper bound is below the minimum level for an agent to pose existential risk due to instrumental convergence.
We should expect a sufficiently intelligent agent [EDIT: that acts in the real world] to refrain from behaving in a way that is both unacceptable and conspicuous, as long as we can turn it off (that’s the treacherous turn argument). The question is whether the agent will do something sufficiently alarming and conspicuous before the point where it is intelligent enough to realize it should not cause alarm. I don’t think we can be very confident either way.
You discuss at one point in the podcast the claim that as AI systems take on larger and larger real world problems, the challenge of defining the reward function will become more and more important. For example for cleaning, the simple number-of-dust-particles objective is inadequate because we care about many other things e.g. keeping the house tidy and many side constraints e.g. avoiding damaging household objects. This isn’t quite an argument for AI alignment solving itself, but it is an argument that the attention and resources poured into AI alignment may naturally rise to the challenge without EA effort, and thus perhaps EA effort is misplaced.
First off, I think this is a great steel-man of the LeCun/Etzioni safety-skeptic position, and, importantly, I think it gives a more concrete/falsifiable position to argue against. On the other hand, this argument seems to go through only if most of the tasks worked on by AI researchers are of the kind described—i.e. the designer of the system has it in their own interest to deal with side constraints and fix reward function misspecification. In my view, this condition is unlikely to be met. It seems likely to me that most tasks AI companies work on will have a principal-agent complication: recommender systems, automated advertising, stock trading, etc. In all of these domains, work maximizes profit for the AI researchers’ company when it runs roughshod over side constraints. The side constraints here are mostly the preferences of users on the platform (for tech) and of other investors (for finance).
Does this seem right? If so, what are the upshots? Could the legal/lobbying work of strengthening the positions of these principals become a high-value task for EA to take on?
Sorry if this isn’t as polished as I’d hoped. Still a lot to read and think about, but posting as I won’t have time now to elaborate further before the weekend. Thanks for doing the AMA!
It seems like a crux that you have identified is how “sudden emergence” happens. How would a recursive self-improvement feedback loop start? Increasing optimisation capacity is a convergent instrumental goal. But how exactly is that goal reached? To give the most pertinent example—what would the nuts and bolts of it be for it happening in an ML system? It’s possible to imagine a sufficiently large pile of linear algebra enabling recursive chain reactions of both improvement in algorithmic efficiency, and size (e.g. capturing all global compute → nanotech → converting Earth to Computronium). Even more so since GPT-3. But what would the trigger be for setting it off?
Does the above summary of my take on this chime with yours? Do you (or anyone else reading) know of any attempts at articulating such a “nuts-and-bolts” explanation of “sudden emergence” of AGI in an ML system?
Or maybe there would be no trigger? Maybe a great many arbitrary goals would lead to sufficiently large ML systems brute-force stumbling upon recursive self-improvement as an instrumental goal (or mesa-optimisation)?
Responding to some quotes from the 80,000 Hours podcast:
What mechanism makes AI be attracted to benign things? Surely only through human direction? But to my mind the whole Bostrom/Yudkowsky argument is that it FOOMs out of control of humans (and e.g. converts everything into Computronium as a convergent instrumental goal.)
This reads like a bit of a strawman. My intuition for the problem of instrumental convergence is that in many take-off scenarios the AI will perform (a lot) more compute, and the way it will do this is by converting all available matter to Computronium (with human-existential collateral damage). From what I’ve read, you don’t directly touch on such scenarios. Would be interested to hear your thoughts on them.
Whilst you might not typically get radically different behaviours, in the cases where ML systems do fail, they tend to fail catastrophically (in ways that a human never would)! This also fits in with the notion of hidden proxy goals from “mesa optimisers” being a major concern (as well as accurate and sufficient specification of human goals).
Here is an argument for how GPT-X might lead to proto-AGI in a more concrete, human-aided, way:
Thoughts on modifications/improvements to The Windfall Clause?
What do you think about hardware-based forecasts for human-substitute AI?
I don’t currently give them very much weight.
It seems unlikely to me that hardware progress—or, at least, practically achievable hardware progress—will turn out to be sufficient for automating away all the tasks people can perform. If both hardware progress and research effort instead play similarly fundamental roles, then focusing on only a single factor (hardware) can only give us pretty limited predictive power.
Also, to a lesser extent: Even if it is true that compute growth is the fundamental driver of AI progress, I’m somewhat skeptical that we could predict the necessary/sufficient amount of compute very well.
Great interview, thanks for some really thought-provoking ideas. For the brain in the box section, it seemed like you were saying that we’d expect future worlds to have fairly uniform distributions of capabilities of AI systems, and so we’d learn from other similar cases. How uniform do you think the spread of capabilities of AI systems is now, and how wide do you think the gaps have to be in the future for the ‘brain in a box’ scenario to be possible?
Have you become more uncertain/optimistic about the arguments in favour of the importance of other x-risks as a result of scrutinising AI risk?
I don’t think it’s had a significant impact on my views about the absolute likelihood or tractability of other existential risks. I’d be interested if you think it should have, though!
Oh, I meant pessimistic. A reason for a weak update might be something similar to the Gell-Mann amnesia effect: after putting effort into the classic arguments, you noticed some important flaws. The fact that they had not been articulated before suggests that collective EA epistemology is weaker than expected. Because of that, one might become less certain about the quality of arguments in other EA domains.
I’d say nearly everyone’s ability to determine an argument’s strength is very weak. On the Forum, invalid meta-arguments* are pretty common, such as “people make logic mistakes so you might have too”, rather than actually identifying the weaknesses in an argument. There’s also a lot of pseudo-superforecasting, like “I have 80% confidence in this”, without any evidence backing up those credences. This seems to me like people are imitating sound arguments without actually understanding how they work. Effective altruists have centred around some ideas that are correct (longtermism, moral uncertainty, etc.), but outside of that, I’d say we’re just as wrong as anyone else.
*Some meta-arguments are valid, like discussions on logical grounding of particular methodologies, e.g. “Falsification works because of the law of contraposition, which follows from the definition of logical implication”.
From a Bayesian perspective, there is no particular reason why you have to provide more evidence if you provide credences, and in general I think there is a lot of value in people providing credences even if they don’t provide additional evidence, if only to avoid problems of ambiguous language.
I’m not sure I know what you mean by this.
I’d agree that you’re definitely not obligated to provide more evidence, and that your credence does fully capture how likely you think it is that X will happen.
But it seems to me that the evidence that informed your credence can also be very useful information for people, both in relation to how much they should update their own credences (as they may have info you lack regarding how relevant and valid those pieces of evidence are), and in relation to how—and how much—you might update your views (e.g., if they find out you just thought for 5 seconds and went with your gut, vs spending a year building expertise and explicit models). It also seems like sharing that evidence could help them with things like building their general models of the world or of how to make estimates.
(This isn’t an argument against giving explicit probabilities that aren’t based on much or that aren’t accompanied by explanations of what they’re based on. I’m generally, though tentatively, in favour of that. It just seems like also explaining what the probabilities are based on is often quite useful.)
(By the way, Beard et al. discuss related matters in the context of existential risk estimates, using the term “evidential reasoning”.)
This is in contrast to a frequentist perspective, or maybe something close to a “common-sense” perspective, which tends to bucket knowledge into separate categories that aren’t easily interchangeable.
Many people make a mental separation between “thinking something is true” and “thinking something is X% likely, where X is high”, with one falling into the category of lived experience, and the other falling into the category of “scientific or probabilistic assessment”. The first one doesn’t require any externalizable evidence and is a fact about the mind, the second is part of a collaborative scientific process that has at its core repeatable experiments, or at least recurring frequencies (i.e. see the frequentist discussion of it being meaningless to assign probabilities to one-time events).
Under some of these other non-bayesian interpretations of probability theory, an assignment of probabilities is not valid if you don’t associate it with either an experimental setup, or some recurring frequency. So under those interpretations you do have an additional obligation to provide evidence and context to your probability estimates, since otherwise they don’t really form even a locally valid statement.
Thanks for that answer. So just to check, you essentially just meant that it’s ok to provide credences without saying your evidence—i.e., you’re not obligated to provide evidence when you provide credences? Not that there’s no added value to providing your evidence alongside your credences?
If so, I definitely agree.
(And it’s not that your original statement seemed to clearly say something different, just that I wasn’t sure that that’s all it was meant to mean.)
Yep, that’s what I was implying.
This statement is just incorrect.
Sure there is: By communicating, we’re trying to update one another’s credences. You’re not going to be very successful in doing so if you provide a credence without supporting evidence. The evidence someone provides is far more important than someone’s credence (unless you know the person is highly calibrated and precise). If you have a credence that you keep to yourself, then yes, there’s no need for supporting evidence.
Ambiguous statements are bad, 100%, but so are clear, baseless statements.
As you say, people can legitimately have credences about anything. It’s how people should think. But if you’re going to post your credence, provide some evidence so that you can update other people’s credences too.
You seem to have switched from the claim that EAs often report their credences without articulating the evidence on which those credences rest, to the claim that EAs often lack evidence for the credences they report. The former claim is undoubtedly true, but it doesn’t necessarily describe a problematic phenomenon. (See Greg Lewis’s recent post; I’m not sure if you disagree.) The latter claim would be very worrying if true, but I don’t see reason to believe that it is. Sure, EAs sometimes lack good reasons for the views they espouse, but this is a general phenomenon unrelated to the practice of reporting credences explicitly.
Habryka seems to be talking about people who have evidence and are just not stating it, so we might be talking past one another. I said in my first comment “There’s also a lot of pseudo-superforecasting … without any evidence backing up those credences.” I didn’t say “without stating any evidence backing up those credences.” This is not a guess on my part. I’ve seen comments where they say explicitly that the credence they’re giving is a first impression, and not something well thought out. It’s fine for them to have a credence, but why should anyone care what your credence is if it’s just a first impression?
I completely agree with him. Imprecision should be stated and significant figures are a dumb way to do it. But if someone said “I haven’t thought about this at all, but I’m pretty sure it’s true”, is that really all that much worse than providing your uninformed prior and saying you haven’t really thought about it?
I agree that EAs put superforecasters and superforecasting techniques on a pedestal, more than is warranted.
Yes, I think it’s a lot worse. Consider the two statements:
And
The two statements are pretty similar in verbalized terms (and each falls under loose interpretations of what “pretty sure” means in common language), but ought to have drastically different implications for behavior!
I basically think EA and associated communities would be better off to have more precise credences, and be accountable for them. Otherwise, it’s difficult to know if you were “really” wrong, even after checking hundreds of claims!
Yes you’re right. But I’m making a distinction between people’s own credences and their ability to update the credences of other people. As far as changing the opinion of the reader, when someone says “I haven’t thought much about it”, it should be an indicator to not update your own credence by very much at all.
I fully agree. My problem is that this is not the current state of affairs for the majority of Forum users, in which case, I have no reason to update my credences because an uncalibrated random person says they’re 90% confident without providing any reasoning that justifies their position. All I’m asking for is for people to provide a good argument along with their credence.
I think that they should be emulated. But superforecasters have reasoning to justify their credences. They break problems down into components that they’re more confident in estimating. This is good practice. Providing a credence without any supporting argument is not.
I’m curious if you agree or disagree with this claim:
With a specific operationalization like:
It’s almost irrelevant: people should still provide the supporting argument for their credence, otherwise evidence can get “double counted” (and there are “flow-on” effects, where the first person who updates another person’s credence has an outsized effect on the overall credence of the population). For example, say I have arguments A and B supporting my 90% credence on something, and you have arguments A, B and C supporting your 80% credence on the same thing, and neither of us posts our reasoning; we just post our credences. It’s a mistake for you to then say “I’ll update my credence a few percent because FCCC might have other evidence.” For this reason, providing supporting arguments is a net benefit, irrespective of EA’s accuracy of forecasts.
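The double-counting worry can be made precise in a simple log-odds model, where each independent piece of evidence contributes an additive log-likelihood ratio. A toy sketch (the specific numbers and the independence assumption are hypothetical):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

prior = 0.5               # both parties start at even odds
A, B, C = 1.0, 1.0, 0.4   # hypothetical evidence, as log-likelihood ratios

p1 = sigmoid(logit(prior) + A + B)      # my credence, based on A and B
p2 = sigmoid(logit(prior) + A + B + C)  # your credence, based on A, B and C

# Correct pooling: each piece of shared evidence counts once, so the
# combined credence is just p2 (my evidence is a subset of yours).
correct = sigmoid(logit(prior) + A + B + C)

# Naive pooling: treating my reported credence as *extra* evidence on
# top of yours counts A and B twice, producing overconfidence.
naive = sigmoid(logit(p2) + (logit(p1) - logit(prior)))

print(round(correct, 3), round(naive, 3))
```

The naive pooled credence comes out noticeably higher than the correct one, which is exactly the failure mode described above: without seeing the supporting arguments, you can’t tell how much of someone’s credence rests on evidence you’ve already counted.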
I don’t find your arguments persuasive for why people should give reasoning in addition to credences. I think posting reasoning is on the margin of net value, and I wish more people did it, but I also acknowledge that people’s time is expensive so I understand why they choose not to. You list reasons why giving reasoning is beneficial, but not reasons for why it’s sufficient to justify the cost.
My earlier question probing the predictive ability of EAs was an attempt to set right what I consider to be an inaccuracy in the internal impressions EAs have about the ability of superforecasters. In particular, it’s not obvious to me that we should trust the judgments of superforecasters substantially more than we trust the judgments of other EAs.
My view is that giving explicit, quantitative credences plus stating the supporting evidence is typically better than giving explicit, quantitative credences without stating the supporting evidence (at least if we ignore time costs, information hazards, etc.), which is in turn typically better than giving qualitative probability statements (e.g., “pretty sure”) without stating the supporting evidence, and often better than just saying nothing.
Does this match your view?
In other words, are you essentially just arguing that “providing supporting arguments is a net benefit”?
I ask because I had the impression that you were arguing that it’s bad for people to give explicit, quantitative credences if they aren’t also giving their supporting evidence (and that it’d be better for them to, in such cases, either use qualitative statements or just say nothing). Upon re-reading the thread, I got the sense that others may have gotten that impression too, but also I don’t see you explicitly make that argument.
Basically, yeah.
But I do think it’s a mistake to update your credence based off someone else’s credence without knowing their argument and without knowing whether they’re calibrated. We typically don’t know the latter, so I don’t know why people are giving credences without supporting arguments. It’s fine to have a credence without evidence, but why are people publicising such credences?
I’d agree with a modified version of your claim, along the following lines: “You should update more based on someone’s credence if you have more reason to believe their credence will track the truth, e.g. by knowing they’ve got good evidence (even if you haven’t actually seen the evidence) or knowing they’re well-calibrated. There’ll be some cases where you have so little reason to believe their credence will track the truth that, for practical purposes, it’s essentially not worth updating.”
But your claim at least sounds like it’s instead that some people are calibrated while others aren’t (a binary distinction), and when people aren’t calibrated, you really shouldn’t update based on their credences at all (at least if you haven’t seen their arguments).
I think calibration increases in a quantitative, continuous way, rather than switching from off to on. So I think we should just update on credences more the more calibrated the person they’re from is.
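One simple way to model “update more, the more calibrated the person is” is to move toward their reported credence in log-odds space, scaled by a weight in [0, 1] reflecting how much you trust their calibration. This is a hypothetical sketch of the continuous view, not a standard formula:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def update_on_credence(my_p, their_p, calibration):
    """Shift my_p toward their_p in log-odds space, scaled by
    `calibration` (0 = ignore them entirely, 1 = adopt their credence)."""
    return sigmoid(logit(my_p) + calibration * (logit(their_p) - logit(my_p)))

print(update_on_credence(0.5, 0.9, 0.0))            # 0.5: uncalibrated stranger
print(round(update_on_credence(0.5, 0.9, 1.0), 3))  # ≈ 0.9: trusted forecaster
print(round(update_on_credence(0.5, 0.9, 0.5), 3))  # 0.75: partial trust
```

On this picture there’s no binary cutoff: a “random person’s” credence still moves you, just by an amount that shrinks smoothly toward zero as your reason to trust their calibration shrinks.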
Does that sound right to you?
I mean, very frequently it’s useful to just know what someone’s credence is. That’s often an order of magnitude cheaper to provide, and often is itself quite a bit of evidence. This is like saying that all statements of opinions or expressions of feelings are bad, unless they are accompanied with evidence, which seems like it would massively worsen communication.
I agree, but only if they’re a reliable forecaster. A superforecaster’s credence can shift my credence significantly. It’s possible that their credences are based off a lot of information that shifts their own credence by 1%. In that case, it’s not practical for them to provide all the evidence, and you are right.
But most people are poor forecasters (and sometimes they explicitly state they have no supporting evidence other than their intuition), so I see no reason to update my credence just because someone I don’t know is confident. If the credence of a random person has any value to my own credence, it’s very low.
That would depend on the question. Sometimes we’re interested in feelings for their own sake. That’s perfectly legitimate because the actual evidence we’re wanting is the data about their feelings. But if someone’s giving their feelings about whether there are an infinite number of primes, it doesn’t update my credences at all.
I think opinions without any supporting argument worsen discourse. Imagine a group of people thoughtfully discussing evidence, then someone comes in, states their feelings without any evidence, and then leaves. That shouldn’t be taken seriously. Increasing the proportion of those people only makes it worse.
Bayesians should want higher-quality evidence. Isn’t self-reported data unreliable? And that’s when the person was there when the event happened. So what is the reference class for people providing opinions without having evidence? It’s almost certainly even more unreliable. If someone has an argument for their credence, they should usually give that argument; if they don’t have an argument, I’m not sure why they’re adding to the conversation.
I’m not saying we need to provide peer-reviewed articles. I just want to see some line of reasoning demonstrating why you came to the conclusion you made, so that everyone can examine your assumptions and inferences. If we have different credences and the set of things I’ve considered is a strict subset of yours, you might update your credence because you mistakenly think I’ve considered something you haven’t.
Yes, but unreliability does not mean that you instead just use vague words instead of explicit credences. It’s a fine critique to say that people make too many arguments without giving evidence (something I also disagree with, but that isn’t the subject of this thread), but you are concretely making the point that it’s additionally bad for them to give explicit credences! But the credences only help, compared to vague and ambiguous terms that people would use instead.
I’m not sure how you think that’s what I said. Here’s what I actually said:
I thought I was fairly clear about what my position is. Credences have internal value (you should generate your own credence). Superforecasters’ credences have external value (their credence should update yours). Uncalibrated random people’s credences don’t have much external value (they shouldn’t shift your credence much). And an argument for your credence should always be given.
I never said vague words are valuable, and in fact I think the opposite.
This is an empirical question. Again, what is the reference class for people providing opinions without having evidence? We could look at all of the unsupported credences on the forum and see how accurate they turned out to be. My guess is that they’re of very little value, for all the reasons I gave in previous comments.
I demonstrated a situation where a credence without evidence is harmful:
The only way we can avoid such a situation is either by providing a supporting argument for our credences, OR not updating our credences in light of other people’s unsupported credences.
Here are two claims I’d very much agree with:
It’s often best to focus on object-level arguments rather than meta-level arguments, especially arguments alleging bias
One reason for that is that the meta-level arguments will often apply to a similar extent to a huge number of claims/people. E.g., a huge number of claims might be influenced substantially by confirmation bias.
(Here are two relevant posts.)
Is that what you meant?
But you say invalid meta-arguments, and then give the example “people make logic mistakes so you might have too”. That example seems perfectly valid, just often not very useful.
And I’d also say that that example meta-argument could sometimes be useful. In particular, if someone seems extremely confident about something based on a particular chain of logical steps, it can be useful to remind them that there have been people in similar situations in the past who’ve been wrong (though also some who’ve been right). They’re often wrong for reasons “outside their model”, so this person not seeing any reason they’d be wrong doesn’t provide extremely strong evidence that they’re not.
It would be invalid to say, based on that alone, “You’re probably wrong”, but saying they’re plausibly wrong seems both true and potentially useful.
(Also, isn’t your comment primarily meta-arguments of a somewhat similar nature to “people make logic mistakes so you might have too”? I guess your comment is intended to be a bit closer to a specific reference class forecast type argument?)
Describing that as pseudo-superforecasting feels unnecessarily pejorative. I think such people are just forecasting / providing estimates. They may indeed be inspired by Tetlock’s work or other work with superforecasters, but that doesn’t mean they’re necessarily trying to claim their estimates use the same methodologies or deserve the same weight as superforecasters’ estimates. (I do think there are potential downsides of using explicit probabilities, but I think each potential downside is debatable, and there are also potential upsides, and using seemingly pejorative terms probably doesn’t help.)
Did you mean “some ideas that are probably correct and very important”? If so, I’d agree. But I wouldn’t want to imply longtermism (and to a lesser extent moral uncertainty) are simply “correct” (rather than “quite likely” or “what we should act based on, given (meta)moral uncertainty and expected value reasoning”).
I’d disagree with that. I definitely don’t think EAs are perfect, but they do seem above-average in their tendency to have true beliefs and update appropriately on evidence, across a wide range of domains.
My definition of an invalid argument includes “arguments that don’t reliably differentiate between good and bad arguments”. “1+1=2” is also a correct statement, but that doesn’t make it a valid response to any given argument. Arguments need to have relevance. I dunno, I could be using “invalid” incorrectly here.
Yes, if someone believed that having a logical argument is a guarantee, and they’ve never had one of their logical arguments turn out to have a surprising flaw, it would be valid to point that out. That’s fair. But (as you seem to agree) the best way to do this is to actually point to the flaw in the specific argument they’ve made. And since most people who are proficient with logic already know that logical arguments can be unsound, it’s not useful to reiterate that point to them.
It is, but as I said, “Some meta-arguments are valid”. (I can describe how I delineate between valid and invalid meta-arguments if you wish.)
Ah sorry, I didn’t mean to offend. If they were superforecasters, their credence alone would update mine. But they’re probably not, so I don’t understand why they give their credence without a supporting argument.
The set of things I give 100% credence to is very, very small (i.e. claims that are true even if I’m a brain in a vat). I could say “There’s probably a table in front of me”, which is technically more correct than saying that there definitely is, but it doesn’t seem valuable to qualify every statement like that.
Why am I confident in moral uncertainty? People do update their morality over time, which means that they were wrong at some point (i.e. there is demonstrably moral uncertainty), or the definition of “correct” changes and nobody is ever wrong. I think “nobody is ever wrong” is highly unlikely, especially because you can point to logical contradictions in people’s moral beliefs (not just unintuitive conclusions). At that point, it’s not worth mentioning the uncertainty I have.
Yeah, I’m too focused on the errors. I’ll concede your point: Some proportion of EAs are here because they correctly evaluated the arguments. So they’re going to bump up the average, even outside of EA’s central ideas. My reference classes here were all the groups that have correct central ideas, and yet are very poor reasoners outside of their domain. My experience with EAs is too limited to support my initial claim.
Oh, when you said “Effective altruists have centred around some ideas that are correct (longtermism, moral uncertainty, etc.)”, I assumed (perhaps mistakenly) that by “moral uncertainty” you meant something vaguely like the idea that “We should take moral uncertainty seriously, and think carefully about how best to handle it, rather than necessarily just going with whatever moral theory currently seems best to us.”
So not just the idea that we can’t be certain about morality (which I’d be happy to say is just “correct”), but also the idea that that fact should change our behaviour in substantial ways. I think that both of those ideas are surprisingly rare outside of EA, but the latter one is rarer, and perhaps more distinctive to EA (though not unique to EA, as there are some non-EA philosophers who’ve done relevant work in that area).
On my “inside-view”, the idea that we should “take moral uncertainty seriously” also seems extremely hard to contest. But I move a little away from such confidence, and probably wouldn’t simply call it “correct”, due to the fact that most non-EAs don’t seem to explicitly endorse something clearly like that idea. (Though maybe they endorse somewhat similar ideas in practice, even just via ideas like “agree to disagree”.)
What are your thoughts on AI policy careers in government? I’m somewhat skeptical, for two main reasons:
1) It’s not clear that governments will become leading actors in AI development; by default I expect this not to happen. Unlike with nuclear weapons, governments don’t need to become experts in the technology to wield AI-based weapons; they can just purchase them from contractors. Beyond military power, competition between nations is mostly economic. Insofar as AI is an input to this, governments have an incentive to invest in domestic AI firms over government AI capabilities, because this is the more effective way to translate AI into GDP.
2) Government careers in AI policy also look compelling if the intersection of AI and war is crucial. But as you say in the interview, it’s not clear that AI is the best lever for reducing existentially damaging war. And in the EA community, it seems like this argument was generated as an additional reason to work on AI, and wasn’t the output of research trying to work out the best ways to reduce war.
Do you think the answer to this question should be a higher priority, especially given the growing number of EAs studying things like Security Studies in D.C.?
In brief, I do actually feel pretty positively.
Even if governments aren’t doing a lot of important AI research “in house,” and private actors continue to be the primary funders of AI R&D, we should expect governments to become much more active if really serious threats to security start to emerge. National governments are unlikely to be passive, for example, if safety/alignment failures become increasingly damaging—or, especially, if existentially bad safety/alignment failures ever become clearly plausible. If any important institutions, design decisions, etc., regarding AI get “locked in,” then I also expect governments to be heavily involved in shaping these institutions, making these decisions, etc. And states are, of course, the most important actors for many concerns having to do with political instability caused by AI. Finally, there are also certain potential solutions to risks—like creating binding safety regulations, forging international agreements, or plowing absolutely enormous amounts of money into research projects—that can’t be implemented by private actors alone.
Basically, in most scenarios where AI governance work turns out to be really useful from a long-termist perspective—because there are existential safety/alignment risks, because AI causes major instability, or because there are opportunities to “lock in” key features of the world—I expect governments to really matter.