AI risk

Last edit: 20 Jan 2023 13:49 UTC by Will Howard

An AI risk is a catastrophic or existential risk arising from the creation of advanced artificial intelligence (AI).

Developments in AI have the potential to enable people around the world to flourish in hitherto unimagined ways. Such developments might also give humanity tools to address other sources of risk.

At the same time, AI poses risks of its own. AI systems sometimes behave in ways that surprise people. At present, such systems are usually narrow in their capabilities: they may be excellent at Go, or at minimizing power consumption in a server facility, but they cannot do other tasks. If people designed a machine intelligence that was a sufficiently good general reasoner, or even a better general reasoner than humans, it might become difficult for human agents to interfere with its functioning. If such a system then behaved in ways that did not reflect human values, it could use its intellectual superiority to gain a decisive strategic advantage; and if its goals were incompatible with human flourishing, it could pose an existential risk.

Note that AI could pose an existential risk without being sentient, gaining consciousness, or having any ill will towards humanity.

Further reading

Bostrom, Nick (2014) Superintelligence: Paths, Dangers, Strategies, Oxford: Oxford University Press.
Offers a detailed analysis of risks posed by AI.

Christiano, Paul (2019) What failure looks like, LessWrong, March 17.

Dewey, Daniel (2015) Three areas of research on the superintelligence control problem, Global Priorities Project, October 20.
Provides an overview of AI risk and suggested further reading.

Karnofsky, Holden (2016) Potential risks from advanced artificial intelligence: the philanthropic opportunity, Open Philanthropy, May 6.
Explains why the Open Philanthropy Project regards risks from AI as an area worth exploring.

Dai, Wei & Daniel Kokotajlo (2019) The main sources of AI risk?, AI Alignment Forum, March 21.
An attempt to list all the significant sources of AI risk.

Related entries

AI alignment | AI governance | AI forecasting | AI safety | instrumental convergence thesis | orthogonality thesis

Prevent­ing an AI-re­lated catas­tro­phe—Prob­lem profile

Benjamin Hilton29 Aug 2022 18:49 UTC
132 points
17 comments4 min readEA link
(80000hours.org)

Without spe­cific coun­ter­mea­sures, the eas­iest path to trans­for­ma­tive AI likely leads to AI takeover

Ajeya18 Jul 2022 19:07 UTC
215 points
12 comments75 min readEA link
(www.lesswrong.com)

AGI Ruin: A List of Lethalities

EliezerYudkowsky6 Jun 2022 23:28 UTC
160 points
55 comments30 min readEA link
(www.lesswrong.com)

Re­sources I send to AI re­searchers about AI safety

Vael Gates11 Jan 2023 1:24 UTC
42 points
0 comments1 min readEA link

AI Could Defeat All Of Us Combined

Holden Karnofsky10 Jun 2022 23:25 UTC
142 points
13 comments17 min readEA link

A cen­tral AI al­ign­ment prob­lem: ca­pa­bil­ities gen­er­al­iza­tion, and the sharp left turn

So8res15 Jun 2022 14:19 UTC
51 points
2 comments7 min readEA link

Katja Grace: Let’s think about slow­ing down AI

peterhartree23 Dec 2022 0:57 UTC
81 points
7 comments2 min readEA link
(worldspiritsockpuppet.substack.com)

Draft re­port on ex­is­ten­tial risk from power-seek­ing AI

Joe_Carlsmith28 Apr 2021 21:41 UTC
87 points
34 comments1 min readEA link

My Most Likely Rea­son to Die Young is AI X-Risk

AISafetyIsNotLongtermist4 Jul 2022 15:34 UTC
232 points
62 comments4 min readEA link
(www.lesswrong.com)

AI Timelines: Where the Ar­gu­ments, and the “Ex­perts,” Stand

Holden Karnofsky7 Sep 2021 17:35 UTC
83 points
3 comments11 min readEA link

AI ethics: the case for in­clud­ing an­i­mals (my first pub­lished pa­per, Peter Singer’s first on AI)

Fai12 Jul 2022 4:14 UTC
76 points
4 comments1 min readEA link
(link.springer.com)

AI Risk is like Ter­mi­na­tor; Stop Say­ing it’s Not

skluug8 Mar 2022 19:17 UTC
184 points
44 comments10 min readEA link
(skluug.substack.com)

AGI Safety Fun­da­men­tals cur­ricu­lum and application

richard_ngo20 Oct 2021 21:45 UTC
123 points
20 comments8 min readEA link
(docs.google.com)

2019 AI Align­ment Liter­a­ture Re­view and Char­ity Comparison

Larks19 Dec 2019 2:58 UTC
147 points
28 comments64 min readEA link

2018 AI Align­ment Liter­a­ture Re­view and Char­ity Comparison

Larks18 Dec 2018 4:48 UTC
118 points
28 comments64 min readEA link

Why AI al­ign­ment could be hard with mod­ern deep learning

Ajeya21 Sep 2021 15:35 UTC
140 points
16 comments14 min readEA link
(www.cold-takes.com)

Ben Garfinkel: How sure are we about this AI stuff?

bgarfinkel9 Feb 2019 19:17 UTC
123 points
19 comments18 min readEA link

Align­ing the Align­ers: En­sur­ing Aligned AI acts for the com­mon good of all mankind

timunderwood16 Jan 2023 11:13 UTC
33 points
2 comments4 min readEA link

Coun­ter­ar­gu­ments to the ba­sic AI risk case

Katja_Grace14 Oct 2022 20:30 UTC
276 points
23 comments34 min readEA link

AI X-Risk: In­te­grat­ing on the Shoulders of Giants

TD_Pilditch1 Nov 2022 16:07 UTC
34 points
0 comments47 min readEA link

On how var­i­ous plans miss the hard bits of the al­ign­ment challenge

So8res12 Jul 2022 5:35 UTC
125 points
13 comments27 min readEA link

[linkpost] Chris­ti­ano on agree­ment/​dis­agree­ment with Yud­kowsky’s “List of Lethal­ities”

Owen Cotton-Barratt19 Jun 2022 22:47 UTC
130 points
1 comment1 min readEA link
(www.lesswrong.com)

A tale of 2.5 or­thog­o­nal­ity theses

Arepo1 May 2022 13:53 UTC
138 points
31 comments15 min readEA link

Biolog­i­cal An­chors ex­ter­nal re­view by Jen­nifer Lin (linkpost)

peterhartree30 Nov 2022 13:06 UTC
36 points
0 comments1 min readEA link
(docs.google.com)

Disagree­ments about Align­ment: Why, and how, we should try to solve them

ojorgensen8 Aug 2022 22:32 UTC
16 points
6 comments16 min readEA link

There are no co­her­ence theorems

EJT20 Feb 2023 21:52 UTC
83 points
49 comments19 min readEA link

AGI x-risk timelines: 10% chance (by year X) es­ti­mates should be the head­line, not 50%.

Greg_Colbourn1 Mar 2022 12:02 UTC
69 points
22 comments1 min readEA link

[Question] Is AI safety still ne­glected?

Coafos30 Mar 2022 9:09 UTC
13 points
14 comments1 min readEA link

Video and Tran­script of Pre­sen­ta­tion on Ex­is­ten­tial Risk from Power-Seek­ing AI

Joe_Carlsmith8 May 2022 3:52 UTC
96 points
7 comments30 min readEA link

Me­diocre AI safety as ex­is­ten­tial risk

Gavin16 Mar 2022 11:50 UTC
52 points
12 comments3 min readEA link

AGI and the EMH: mar­kets are not ex­pect­ing al­igned or un­al­igned AI in the next 30 years

basil.halperin10 Jan 2023 16:05 UTC
334 points
172 comments26 min readEA link

On Defer­ence and Yud­kowsky’s AI Risk Estimates

bgarfinkel19 Jun 2022 14:35 UTC
262 points
188 comments17 min readEA link

(Even) More Early-Ca­reer EAs Should Try AI Safety Tech­ni­cal Research

levin30 Jun 2022 21:14 UTC
86 points
38 comments11 min readEA link

The next decades might be wild

mariushobbhahn15 Dec 2022 16:10 UTC
130 points
31 comments1 min readEA link

How many peo­ple are work­ing (di­rectly) on re­duc­ing ex­is­ten­tial risk from AI?

Benjamin Hilton17 Jan 2023 14:03 UTC
117 points
3 comments4 min readEA link
(80000hours.org)

Messy per­sonal stuff that af­fected my cause pri­ori­ti­za­tion (or: how I started to care about AI safety)

Julia_Wise5 May 2022 17:59 UTC
262 points
14 comments2 min readEA link

The case for be­com­ing a black-box in­ves­ti­ga­tor of lan­guage models

Buck6 May 2022 14:37 UTC
89 points
7 comments3 min readEA link

How I Formed My Own Views About AI Safety

Neel Nanda27 Feb 2022 18:52 UTC
130 points
12 comments13 min readEA link
(www.neelnanda.io)

What suc­cess looks like

mariushobbhahn28 Jun 2022 14:30 UTC
107 points
20 comments19 min readEA link

My highly per­sonal skep­ti­cism brain­dump on ex­is­ten­tial risk from ar­tifi­cial in­tel­li­gence.

NunoSempere23 Jan 2023 20:08 UTC
432 points
115 comments14 min readEA link
(nunosempere.com)

Four rea­sons I find AI safety emo­tion­ally compelling

Kat Woods28 Jun 2022 14:01 UTC
32 points
5 comments4 min readEA link

Large Lan­guage Models as Fi­du­cia­ries to Humans

johnjnay24 Jan 2023 19:53 UTC
25 points
0 comments34 min readEA link
(papers.ssrn.com)

On pre­sent­ing the case for AI risk

Aryeh Englander8 Mar 2022 21:37 UTC
114 points
12 comments4 min readEA link

Disen­tan­gling ar­gu­ments for the im­por­tance of AI safety

richard_ngo23 Jan 2019 14:58 UTC
63 points
14 comments8 min readEA link

My per­sonal cruxes for work­ing on AI safety

Buck13 Feb 2020 7:11 UTC
135 points
35 comments45 min readEA link

Blake Richards on Why he is Skep­ti­cal of Ex­is­ten­tial Risk from AI

Michaël Trazzi14 Jun 2022 19:11 UTC
63 points
14 comments4 min readEA link
(theinsideview.ai)

Part 1: The AI Safety com­mu­nity has four main work groups, Strat­egy, Gover­nance, Tech­ni­cal and Move­ment Building

PeterSlattery25 Nov 2022 3:45 UTC
72 points
7 comments6 min readEA link

Graph­i­cal Rep­re­sen­ta­tions of Paul Chris­ti­ano’s Doom Model

Nathan Young7 May 2023 13:03 UTC
47 points
2 comments1 min readEA link

AI safety uni­ver­sity groups: a promis­ing op­por­tu­nity to re­duce ex­is­ten­tial risk

mic30 Jun 2022 18:37 UTC
50 points
1 comment11 min readEA link

A grand strat­egy to re­cruit AI ca­pa­bil­ities re­searchers into AI safety research

Peter S. Park15 Apr 2022 17:11 UTC
22 points
13 comments4 min readEA link

Con­nor Leahy on Con­jec­ture and Dy­ing with Dignity

Michaël Trazzi22 Jul 2022 19:30 UTC
34 points
0 comments10 min readEA link
(theinsideview.ai)

Why AGI Timeline Re­search/​Dis­course Might Be Overrated

Miles_Brundage3 Jul 2022 8:04 UTC
115 points
30 comments10 min readEA link

My thoughts on nan­otech­nol­ogy strat­egy re­search as an EA cause area

Ben Snodin2 May 2022 9:41 UTC
135 points
17 comments33 min readEA link

Slow­ing down AI progress is an un­der­ex­plored al­ign­ment strategy

Michael Huang13 Jul 2022 3:22 UTC
89 points
11 comments3 min readEA link
(www.lesswrong.com)

Why poli­cy­mak­ers should be­ware claims of new “arms races” (Bul­letin of the Atomic Scien­tists)

christian.r14 Jul 2022 13:38 UTC
55 points
1 comment1 min readEA link
(thebulletin.org)

Data col­lec­tion for AI al­ign­ment—Ca­reer review

Benjamin Hilton3 Jun 2022 11:44 UTC
34 points
1 comment5 min readEA link
(80000hours.org)

Eli’s re­view of “Is power-seek­ing AI an ex­is­ten­tial risk?”

elifland30 Sep 2022 12:21 UTC
58 points
3 comments1 min readEA link

Com­mon mis­con­cep­tions about OpenAI

Jacob_Hilton25 Aug 2022 14:08 UTC
51 points
2 comments1 min readEA link
(www.lesswrong.com)

[Question] How will the world re­spond to “AI x-risk warn­ing shots” ac­cord­ing to refer­ence class fore­cast­ing?

Ryan Kidd18 Apr 2022 9:10 UTC
18 points
1 comment1 min readEA link

There should be an AI safety pro­ject board

mariushobbhahn14 Mar 2022 16:08 UTC
24 points
3 comments1 min readEA link

[linkpost] “What Are Rea­son­able AI Fears?” by Robin Han­son, 2023-04-23

Arjun Panickssery14 Apr 2023 23:26 UTC
40 points
3 comments4 min readEA link
(quillette.com)

Public Ex­plainer on AI as an Ex­is­ten­tial Risk

AndrewDoris7 Oct 2022 19:23 UTC
13 points
4 comments15 min readEA link

Why EAs are skep­ti­cal about AI Safety

Lukas Trötzmüller18 Jul 2022 19:01 UTC
279 points
31 comments30 min readEA link

Vael Gates: Risks from Ad­vanced AI (June 2022)

Vael Gates14 Jun 2022 0:49 UTC
45 points
5 comments30 min readEA link

The ba­sic rea­sons I ex­pect AGI ruin

RobBensinger18 Apr 2023 3:37 UTC
55 points
13 comments1 min readEA link

Ex­pected eth­i­cal value of a ca­reer in AI safety

Jordan Taylor14 Jun 2022 14:25 UTC
36 points
16 comments11 min readEA link

AI safety starter pack

mariushobbhahn28 Mar 2022 16:05 UTC
120 points
11 comments6 min readEA link

Grokking “Semi-in­for­ma­tive pri­ors over AI timelines”

anson12 Jun 2022 22:15 UTC
60 points
1 comment11 min readEA link

[Question] How would a lan­guage model be­come goal-di­rected?

David Mears16 Jul 2022 14:50 UTC
113 points
19 comments1 min readEA link

How to pur­sue a ca­reer in tech­ni­cal AI alignment

CharlieRS4 Jun 2022 21:36 UTC
247 points
8 comments39 min readEA link

New US Se­nate Bill on X-Risk Miti­ga­tion [Linkpost]

Evan R. Murphy4 Jul 2022 1:28 UTC
22 points
12 comments1 min readEA link
(www.hsgac.senate.gov)

Deep­Mind’s gen­er­al­ist AI, Gato: A non-tech­ni­cal explainer

frances_lorenz16 May 2022 21:19 UTC
127 points
13 comments6 min readEA link

What if we don’t need a “Hard Left Turn” to reach AGI?

Eigengender15 Jul 2022 9:49 UTC
39 points
7 comments4 min readEA link

“The Race to the End of Hu­man­ity” – Struc­tural Uncer­tainty Anal­y­sis in AI Risk Models

Froolow19 May 2023 12:03 UTC
36 points
3 comments21 min readEA link

‘Dis­solv­ing’ AI Risk – Pa­ram­e­ter Uncer­tainty in AI Fu­ture Forecasting

Froolow18 Oct 2022 22:54 UTC
105 points
63 comments39 min readEA link

“Tech com­pany sin­gu­lar­i­ties”, and steer­ing them to re­duce x-risk

Andrew Critch13 May 2022 17:26 UTC
51 points
5 comments4 min readEA link

How we could stum­ble into AI catastrophe

Holden Karnofsky16 Jan 2023 14:52 UTC
78 points
0 comments31 min readEA link
(www.cold-takes.com)

Tips for con­duct­ing wor­ld­view investigations

lukeprog12 Apr 2022 19:28 UTC
80 points
4 comments2 min readEA link

NYT: Google will ‘re­cal­ibrate’ the risk of re­leas­ing AI due to com­pe­ti­tion with OpenAI

Michael Huang22 Jan 2023 2:13 UTC
168 points
8 comments1 min readEA link
(www.nytimes.com)

Samotsvety’s AI risk forecasts

elifland9 Sep 2022 4:01 UTC
170 points
30 comments3 min readEA link

Ex­plor­ing Me­tac­u­lus’s AI Track Record

Peter Scoblic1 May 2023 21:02 UTC
41 points
5 comments7 min readEA link
(www.metaculus.com)

Longevity re­search as AI X-risk intervention

DirectedEvolution6 Nov 2022 17:58 UTC
25 points
0 comments9 min readEA link

Deep Deceptiveness

So8res21 Mar 2023 2:51 UTC
38 points
1 comment1 min readEA link

AGI mis­al­ign­ment x-risk may be lower due to an over­looked goal speci­fi­ca­tion technology

johnjnay21 Oct 2022 2:03 UTC
20 points
1 comment1 min readEA link

AI timelines by bio an­chors: The de­bate in one place

Will Aldred30 Jul 2022 23:04 UTC
89 points
6 comments2 min readEA link

An­nounc­ing The Most Im­por­tant Cen­tury Writ­ing Prize

michel31 Oct 2022 21:37 UTC
46 points
0 comments2 min readEA link

My Ob­jec­tions to “We’re All Gonna Die with Eliezer Yud­kowsky”

Quintin Pope21 Mar 2023 1:23 UTC
172 points
17 comments39 min readEA link

Tran­scripts of in­ter­views with AI researchers

Vael Gates9 May 2022 6:03 UTC
140 points
14 comments2 min readEA link

An­nounc­ing Epoch: A re­search or­ga­ni­za­tion in­ves­ti­gat­ing the road to Trans­for­ma­tive AI

Jaime Sevilla27 Jun 2022 13:39 UTC
183 points
11 comments2 min readEA link
(epochai.org)

The in­or­di­nately slow spread of good AGI con­ver­sa­tions in ML

RobBensinger29 Jun 2022 4:02 UTC
59 points
2 comments6 min readEA link

Po­ten­tial Risks from Ad­vanced Ar­tifi­cial In­tel­li­gence: The Philan­thropic Opportunity

Holden Karnofsky6 May 2016 12:55 UTC
2 points
0 comments23 min readEA link
(www.openphilanthropy.org)

Steer­ing AI to care for an­i­mals, and soon

Andrew Critch14 Jun 2022 1:13 UTC
207 points
38 comments1 min readEA link

Strate­gic Per­spec­tives on Trans­for­ma­tive AI Gover­nance: Introduction

MMMaas2 Jul 2022 11:20 UTC
107 points
18 comments4 min readEA link

England & Wales & Windfalls

John Bridge3 Jun 2022 10:26 UTC
13 points
1 comment26 min readEA link

How might we al­ign trans­for­ma­tive AI if it’s de­vel­oped very soon?

Holden Karnofsky29 Aug 2022 15:48 UTC
156 points
17 comments44 min readEA link

We are fight­ing a shared bat­tle (a call for a differ­ent ap­proach to AI Strat­egy)

Gideon Futerman16 Mar 2023 14:37 UTC
57 points
11 comments15 min readEA link

In­tent al­ign­ment should not be the goal for AGI x-risk reduction

johnjnay26 Oct 2022 1:24 UTC
7 points
1 comment1 min readEA link

Skill up in ML for AI safety with the In­tro to ML Safety course (Spring 2023)

james5 Jan 2023 11:02 UTC
36 points
3 comments2 min readEA link

Trans­for­ma­tive AI is­sues (not just mis­al­ign­ment): an overview

Holden Karnofsky6 Jan 2023 2:19 UTC
31 points
0 comments22 min readEA link
(www.cold-takes.com)

AI Safety Camp, Vir­tual Edi­tion 2023

Linda Linsefors6 Jan 2023 0:55 UTC
30 points
0 comments1 min readEA link

[Linkpost] Jan Leike on three kinds of al­ign­ment taxes

Akash6 Jan 2023 23:57 UTC
29 points
0 comments1 min readEA link

Some thoughts on risks from nar­row, non-agen­tic AI

richard_ngo19 Jan 2021 0:07 UTC
36 points
2 comments8 min readEA link

We should say more than “x-risk is high”

OllieBase16 Dec 2022 22:09 UTC
49 points
12 comments4 min readEA link

Ex­is­ten­tial AI Safety is NOT sep­a­rate from near-term applications

stecas13 Dec 2022 14:47 UTC
28 points
9 comments1 min readEA link

AGI Safety Needs Peo­ple With All Skil­lsets!

Severin25 Jul 2022 13:30 UTC
33 points
7 comments2 min readEA link

Is this com­mu­nity over-em­pha­siz­ing AI al­ign­ment?

Lixiang8 Jan 2023 6:23 UTC
2 points
5 comments1 min readEA link

Nearcast-based “de­ploy­ment prob­lem” anal­y­sis (Karnofsky, 2022)

Will Aldred9 Jan 2023 16:57 UTC
36 points
0 comments4 min readEA link
(www.alignmentforum.org)

An­nounc­ing the GovAI Policy Team

MarkusAnderljung1 Aug 2022 22:46 UTC
107 points
11 comments2 min readEA link

[Question] How to cre­ate cur­ricu­lum for self-study to­wards AI al­ign­ment work?

OIUJHKDFS7 Jan 2023 19:53 UTC
10 points
5 comments1 min readEA link

12 ca­reer ad­vis­ing ques­tions that may (or may not) be helpful for peo­ple in­ter­ested in al­ign­ment research

Akash12 Dec 2022 22:36 UTC
14 points
0 comments1 min readEA link

8 pos­si­ble high-level goals for work on nu­clear risk

MichaelA29 Mar 2022 6:30 UTC
46 points
4 comments13 min readEA link

Sort­ing Peb­bles Into Cor­rect Heaps: The Animation

Writer10 Jan 2023 15:58 UTC
12 points
0 comments1 min readEA link

Against us­ing stock prices to fore­cast AI timelines

basil.halperin10 Jan 2023 16:04 UTC
22 points
4 comments2 min readEA link

En­cul­tured AI, Part 1: En­abling New Benchmarks

Andrew Critch8 Aug 2022 22:49 UTC
17 points
0 comments5 min readEA link

Have your timelines changed as a re­sult of ChatGPT?

Chris Leong5 Dec 2022 15:03 UTC
30 points
18 comments1 min readEA link

We don’t trade with ants

Katja_Grace12 Jan 2023 0:48 UTC
134 points
8 comments1 min readEA link

VIRTUA: a novel about AI alignment

Karl von Wendt12 Jan 2023 9:37 UTC
21 points
0 comments1 min readEA link

Vic­to­ria Krakovna on AGI Ruin, The Sharp Left Turn and Paradigms of AI Alignment

Michaël Trazzi12 Jan 2023 17:09 UTC
16 points
0 comments1 min readEA link

Be­ware safety-washing

Lizka13 Jan 2023 10:39 UTC
128 points
6 comments4 min readEA link

Rea­sons I’ve been hes­i­tant about high lev­els of near-ish AI risk

elifland22 Jul 2022 1:32 UTC
202 points
16 comments7 min readEA link
(www.foxy-scout.com)

The aca­demic con­tri­bu­tion to AI safety seems large

Gavin30 Jul 2020 10:30 UTC
117 points
28 comments9 min readEA link

Soft­ware en­g­ineer­ing—Ca­reer review

Benjamin Hilton8 Feb 2022 6:11 UTC
92 points
19 comments8 min readEA link
(80000hours.org)

Can GPT-3 pro­duce new ideas? Par­tially au­tomat­ing Robin Han­son and others

NunoSempere16 Jan 2023 15:05 UTC
82 points
6 comments10 min readEA link

The Parable of the Boy Who Cried 5% Chance of Wolf

Kat Woods15 Aug 2022 14:22 UTC
75 points
8 comments2 min readEA link

A Bare­bones Guide to Mechanis­tic In­ter­pretabil­ity Prerequisites

Neel Nanda29 Nov 2022 18:43 UTC
50 points
1 comment3 min readEA link
(neelnanda.io)

AGISF adap­ta­tion for in-per­son groups

Sam Marks17 Jan 2023 18:33 UTC
30 points
0 comments3 min readEA link
(www.lesswrong.com)

Col­lin Burns on Align­ment Re­search And Dis­cov­er­ing La­tent Knowl­edge Without Supervision

Michaël Trazzi17 Jan 2023 17:21 UTC
21 points
3 comments1 min readEA link

AMA: Ought

stuhlmueller3 Aug 2022 17:24 UTC
41 points
52 comments1 min readEA link

How I failed to form views on AI safety

Ada-Maaria Hyvärinen17 Apr 2022 11:05 UTC
207 points
71 comments40 min readEA link

List of tech­ni­cal AI safety ex­er­cises and projects

Jakub Kraus19 Jan 2023 9:35 UTC
15 points
0 comments1 min readEA link

Sup­ple­ment to “The Brus­sels Effect and AI: How EU AI reg­u­la­tion will im­pact the global AI mar­ket”

MarkusAnderljung16 Aug 2022 20:55 UTC
107 points
7 comments8 min readEA link

Hereti­cal Thoughts on AI | Eli Dourado

𝕮𝖎𝖓𝖊𝖗𝖆19 Jan 2023 16:11 UTC
137 points
15 comments1 min readEA link

Raphaël Millière on the Limits of Deep Learn­ing and AI x-risk skepticism

Michaël Trazzi24 Jun 2022 18:33 UTC
20 points
0 comments4 min readEA link
(theinsideview.ai)

What’s go­ing on with ‘crunch time’?

rosehadshar20 Jan 2023 9:38 UTC
83 points
5 comments4 min readEA link

[TIME mag­a­z­ine] Deep­Mind’s CEO Helped Take AI Main­stream. Now He’s Urg­ing Cau­tion (Per­rigo, 2023)

Will Aldred20 Jan 2023 20:37 UTC
93 points
0 comments1 min readEA link
(time.com)

Gen­eral vs spe­cific ar­gu­ments for the longter­mist im­por­tance of shap­ing AI development

Sam Clarke15 Oct 2021 14:43 UTC
44 points
7 comments2 min readEA link

The right to pro­tec­tion from catas­trophic AI risk

Jack Cunningham9 Apr 2022 23:11 UTC
11 points
0 comments5 min readEA link

Spread­ing mes­sages to help with the most im­por­tant century

Holden Karnofsky25 Jan 2023 20:35 UTC
122 points
20 comments18 min readEA link
(www.cold-takes.com)

Ex­cerpts from “Do­ing EA Bet­ter” on x-risk methodology

BrownHairedEevee26 Jan 2023 1:04 UTC
19 points
5 comments6 min readEA link
(forum.effectivealtruism.org)

AI Risk Man­age­ment Frame­work | NIST

𝕮𝖎𝖓𝖊𝖗𝖆26 Jan 2023 15:27 UTC
50 points
0 comments1 min readEA link

The AI Mes­siah

ryancbriggs5 May 2022 16:58 UTC
69 points
44 comments2 min readEA link

Restrict­ing brain organoid re­search to slow down AGI

freedomandutility9 Nov 2022 13:01 UTC
8 points
2 comments1 min readEA link

The case for tak­ing AI se­ri­ously as a threat to humanity

EA Handbook10 Nov 2020 0:00 UTC
10 points
0 comments1 min readEA link
(www.vox.com)

Tech­nolog­i­cal de­vel­op­ments that could in­crease risks from nu­clear weapons: A shal­low review

MichaelA9 Feb 2023 15:41 UTC
79 points
3 comments5 min readEA link
(bit.ly)

Fore­sight for AGI Safety Strategy

jacquesthibs5 Dec 2022 16:09 UTC
6 points
1 comment1 min readEA link

What is it like do­ing AI safety work?

Kat Woods21 Feb 2023 19:24 UTC
95 points
2 comments10 min readEA link

Jobs that can help with the most im­por­tant century

Holden Karnofsky12 Feb 2023 18:19 UTC
52 points
2 comments32 min readEA link
(www.cold-takes.com)

Pod­cast: Shoshan­nah Tekofsky on skil­ling up in AI safety, vis­it­ing Berkeley, and de­vel­op­ing novel re­search ideas

Akash25 Nov 2022 20:47 UTC
14 points
0 comments1 min readEA link

AGI in sight: our look at the game board

Andrea_Miotti18 Feb 2023 22:17 UTC
30 points
18 comments1 min readEA link

[MLSN #8]: Mechanis­tic in­ter­pretabil­ity, us­ing law to in­form AI al­ign­ment, scal­ing laws for proxy gaming

ThomasW20 Feb 2023 16:06 UTC
25 points
0 comments4 min readEA link
(newsletter.mlsafety.org)

Les­sons from Three Mile Is­land for AI Warn­ing Shots

NickGabs26 Sep 2022 2:47 UTC
42 points
0 comments12 min readEA link

A Cal­ifor­nia Effect for Ar­tifi­cial Intelligence

henryj9 Sep 2022 14:17 UTC
73 points
2 comments4 min readEA link
(docs.google.com)

AI Risk In­tro 1: Ad­vanced AI Might Be Very Bad

LRudL11 Sep 2022 10:57 UTC
22 points
0 comments30 min readEA link

AI strat­egy nearcasting

Holden Karnofsky26 Aug 2022 16:25 UTC
61 points
3 comments9 min readEA link

Com­mu­nity Build­ing for Grad­u­ate Stu­dents: A Tar­geted Approach

Neil Crawford29 Mar 2022 19:47 UTC
13 points
0 comments3 min readEA link

Ap­ply for the ML Win­ter Camp in Cam­bridge, UK [2-10 Jan]

Nathan_Barnard2 Dec 2022 19:33 UTC
50 points
11 comments2 min readEA link

AGI with feelings

Nicolai Meberg7 Dec 2022 16:00 UTC
−13 points
0 comments1 min readEA link
(twitter.com)

An­nounc­ing the Open Philan­thropy AI Wor­ld­views Contest

Jason Schukraft10 Mar 2023 2:33 UTC
139 points
33 comments3 min readEA link
(www.openphilanthropy.org)

A ten­ta­tive di­alogue with a Friendly-boxed-su­per-AGI on brain uploads

Ramiro12 May 2022 21:55 UTC
5 points
0 comments4 min readEA link

[Link post] How plau­si­ble are AI Takeover sce­nar­ios?

SammyDMartin27 Sep 2021 13:03 UTC
26 points
0 comments1 min readEA link

How bad a fu­ture do ML re­searchers ex­pect?

Katja_Grace13 Mar 2023 5:47 UTC
164 points
20 comments1 min readEA link

[Question] I’m interviewing Nova Das Sarma about AI safety and information security. What should I ask her?

Robert_Wiblin25 Mar 2022 15:38 UTC
17 points
14 comments1 min readEA link

“Aligned with who?” Re­sults of sur­vey­ing 1,000 US par­ti­ci­pants on AI values

Holly Morgan21 Mar 2023 22:07 UTC
40 points
0 comments2 min readEA link
(www.lesswrong.com)

[Question] Are there any AI Safety labs that will hire self-taught ML en­g­ineers?

Tomer_Goloboy6 Apr 2022 23:32 UTC
5 points
12 comments1 min readEA link

Con­ti­nu­ity Assumptions

Jan_Kulveit13 Jun 2022 21:36 UTC
42 points
4 comments4 min readEA link
(www.alignmentforum.org)

AI and Evolution

Dan H30 Mar 2023 13:09 UTC
41 points
1 comment2 min readEA link
(arxiv.org)

Re­sults for a sur­vey of tool use and work­flows in al­ign­ment research

jacquesthibs19 Dec 2022 15:19 UTC
29 points
0 comments1 min readEA link

Two con­trast­ing mod­els of “in­tel­li­gence” and fu­ture growth

Magnus Vinding24 Nov 2022 11:54 UTC
63 points
29 comments29 min readEA link

Wi­den­ing Over­ton Win­dow—Open Thread

Prometheus31 Mar 2023 10:06 UTC
12 points
5 comments1 min readEA link
(www.lesswrong.com)

[Question] If FTX is liqui­dated, who ends up con­trol­ling An­thropic?

Ofer15 Nov 2022 15:04 UTC
63 points
8 comments1 min readEA link

Refine: An In­cu­ba­tor for Con­cep­tual Align­ment Re­search Bets

adamShimi15 Apr 2022 8:59 UTC
47 points
0 comments4 min readEA link

It’s OK not to go into AI (for stu­dents)

ruthgrace14 Jul 2022 15:16 UTC
59 points
18 comments2 min readEA link

A con­cern about the “evolu­tion­ary an­chor” of Ajeya Co­tra’s re­port on AI timelines.

NunoSempere16 Aug 2022 14:44 UTC
75 points
43 comments5 min readEA link
(nunosempere.com)

Poster Ses­sion on AI Safety

Neil Crawford12 Nov 2022 3:50 UTC
8 points
0 comments4 min readEA link

Mis­gen­er­al­iza­tion as a misnomer

So8res6 Apr 2023 20:43 UTC
45 points
0 comments1 min readEA link

New sur­vey: 46% of Amer­i­cans are con­cerned about ex­tinc­tion from AI; 69% sup­port a six-month pause in AI development

Akash5 Apr 2023 1:26 UTC
138 points
33 comments1 min readEA link

Read­ing the ethi­cists 2: Hunt­ing for AI al­ign­ment papers

Charlie Steiner6 Jun 2022 15:53 UTC
9 points
0 comments1 min readEA link
(www.lesswrong.com)

How dath ilan co­or­di­nates around solv­ing AI alignment

Thomas Kwa14 Apr 2022 1:53 UTC
12 points
1 comment5 min readEA link

Let’s think about slow­ing down AI

Katja_Grace23 Dec 2022 19:56 UTC
320 points
7 comments1 min readEA link

You Un­der­stand AI Align­ment and How to Make Soup

Leen Armoush28 May 2022 6:22 UTC
0 points
2 comments5 min readEA link

Ap­ply to the Ma­chine Learn­ing For Good boot­camp in France

Alexandre Variengien17 Jun 2022 9:13 UTC
9 points
0 comments1 min readEA link
(www.lesswrong.com)

FLI re­port: Poli­cy­mak­ing in the Pause

Zach Stein-Perlman15 Apr 2023 17:01 UTC
28 points
4 comments1 min readEA link

In­for­ma­tion in risky tech­nol­ogy races

nemeryxu2 Aug 2022 23:35 UTC
15 points
2 comments3 min readEA link

In­ter­gen­er­a­tional trauma im­ped­ing co­op­er­a­tive ex­is­ten­tial safety efforts

Andrew Critch3 Jun 2022 17:27 UTC
82 points
2 comments3 min readEA link

An­nounc­ing the AI Safety Nudge Com­pe­ti­tion to Help Beat Procrastination

Marc Carauleanu1 Oct 2022 1:49 UTC
24 points
1 comment2 min readEA link

[Question] How/​When Should One In­tro­duce AI Risk Ar­gu­ments to Peo­ple Un­fa­mil­iar With the Idea?

Harrison Durland9 Aug 2022 2:57 UTC
12 points
4 comments1 min readEA link

AGI Bat­tle Royale: Why “slow takeover” sce­nar­ios de­volve into a chaotic multi-AGI fight to the death

titotal22 Sep 2022 15:00 UTC
36 points
9 comments15 min readEA link

Per­sua­sion Tools: AI takeover with­out AGI or agency?

kokotajlod20 Nov 2020 16:56 UTC
15 points
5 comments10 min readEA link

Disagree­ment with bio an­chors that lead to shorter timelines

mariushobbhahn16 Nov 2022 14:40 UTC
80 points
1 comment1 min readEA link

Black Box In­ves­ti­ga­tions Re­search Hackathon

Esben Kran15 Sep 2022 10:09 UTC
23 points
0 comments2 min readEA link

[Question] What are the best ideas of how to reg­u­late AI from the US ex­ec­u­tive branch?

Jack Cunningham2 Apr 2022 21:53 UTC
10 points
0 comments1 min readEA link

In­tro­duc­tion to Prag­matic AI Safety [Prag­matic AI Safety #1]

ThomasW9 May 2022 17:02 UTC
68 points
0 comments6 min readEA link

[Linkpost] ‘The God­father of A.I.’ Leaves Google and Warns of Danger Ahead

Darius11 May 2023 19:54 UTC
42 points
3 comments3 min readEA link
(www.nytimes.com)

Why aren’t more of us work­ing to pre­vent AI hell?

Dawn Drescher4 May 2023 17:47 UTC
63 points
41 comments1 min readEA link

[Question] What to in­clude in a guest lec­ture on ex­is­ten­tial risks from AI?

Aryeh Englander13 Apr 2022 17:06 UTC
6 points
3 comments1 min readEA link

Chain­ing the evil ge­nie: why “outer” AI safety is prob­a­bly easy

titotal30 Aug 2022 13:55 UTC
20 points
11 comments10 min readEA link

AI Risk & Policy Fore­casts from Me­tac­u­lus & FLI’s AI Path­ways Workshop

Will Aldred16 May 2023 8:53 UTC
40 points
0 comments8 min readEA link

Slightly against al­ign­ing with neo-luddites

Matthew_Barnett26 Dec 2022 23:27 UTC
70 points
17 comments4 min readEA link

Fu­ture Mat­ters #4: AI timelines, AGI risk, and ex­is­ten­tial risk from cli­mate change

Pablo8 Aug 2022 11:00 UTC
59 points
0 comments17 min readEA link

Anal­y­sis of AI Safety sur­veys for field-build­ing insights

Ash Jafari5 Dec 2022 17:37 UTC
24 points
7 comments5 min readEA link

Bandgaps, Brains, and Bioweapons: The limi­ta­tions of com­pu­ta­tional sci­ence and what it means for AGI

titotal26 May 2023 15:57 UTC
38 points
0 comments18 min readEA link

Con­crete Steps to Get Started in Trans­former Mechanis­tic Interpretability

Neel Nanda26 Dec 2022 13:00 UTC
18 points
0 comments12 min readEA link

New book on s-risks

Tobias_Baumann26 Oct 2022 12:04 UTC
289 points
27 comments1 min readEA link

NIST AI Risk Man­age­ment Frame­work re­quest for in­for­ma­tion (RFI)

Aryeh Englander31 Aug 2021 22:24 UTC
7 points
0 comments2 min readEA link

A pseudo math­e­mat­i­cal for­mu­la­tion of di­rect work choice be­tween two x-risks

Joseph Bloom11 Aug 2022 0:28 UTC
7 points
0 comments4 min readEA link

AI Safety Seems Hard to Measure

Holden Karnofsky11 Dec 2022 1:31 UTC
89 points
2 comments14 min readEA link
(www.cold-takes.com)

The Ri­val AI De­ploy­ment Prob­lem: a Pre-de­ploy­ment Agree­ment as the least-bad response

HaydnBelfield23 Sep 2022 9:28 UTC
38 points
1 comment13 min readEA link

Re­view: What We Owe The Future

Kelsey Piper21 Nov 2022 21:41 UTC
165 points
3 comments1 min readEA link
(asteriskmag.com)

20 Cri­tiques of AI Safety That I Found on Twitter

Daniel Kirmani23 Jun 2022 15:11 UTC
14 points
13 comments1 min readEA link

Grokking “Fore­cast­ing TAI with biolog­i­cal an­chors”

anson6 Jun 2022 18:56 UTC
43 points
0 comments12 min readEA link

Red­wood Re­search is hiring for sev­eral roles (Oper­a­tions and Tech­ni­cal)

JJXWang14 Apr 2022 15:23 UTC
45 points
0 comments1 min readEA link

Open Prob­lems in AI X-Risk [PAIS #5]

ThomasW10 Jun 2022 2:22 UTC
44 points
1 comment36 min readEA link

Warn­ing Shots Prob­a­bly Wouldn’t Change The Pic­ture Much

So8res6 Oct 2022 5:15 UTC
88 points
20 comments2 min readEA link

Seek­ing par­ti­ci­pants for study of AI safety researchers

Gardner14 Dec 2022 9:38 UTC
18 points
3 comments1 min readEA link

[Question] Re­spon­si­ble/​fair AI vs. benefi­cial/​safe AI?

tae2 Jun 2022 19:37 UTC
6 points
10 comments1 min readEA link

[$20K In Prizes] AI Safety Ar­gu­ments Competition

ThomasW26 Apr 2022 16:21 UTC
71 points
134 comments3 min readEA link

Con­crete Ad­vice for Form­ing In­side Views on AI Safety

Neel Nanda17 Aug 2022 23:26 UTC
57 points
4 comments9 min readEA link
(www.alignmentforum.org)

Yud­kowsky and Chris­ti­ano on AI Take­off Speeds [LINKPOST]

aogara5 Apr 2022 0:57 UTC
15 points
0 comments11 min readEA link

[Question] Why not offer a multi-mil­lion /​ billion dol­lar prize for solv­ing the Align­ment Prob­lem?

Aryeh Englander17 Apr 2022 16:08 UTC
15 points
9 comments1 min readEA link

Im­por­tant, ac­tion­able re­search ques­tions for the most im­por­tant century

Holden Karnofsky24 Feb 2022 16:34 UTC
288 points
15 comments19 min readEA link

Re­views of “Is power-seek­ing AI an ex­is­ten­tial risk?”

Joe_Carlsmith16 Dec 2021 20:50 UTC
69 points
4 comments1 min readEA link

How likely are ma­lign pri­ors over ob­jec­tives? [aborted WIP]

David Johnston11 Nov 2022 6:03 UTC
6 points
0 comments1 min readEA link

Collection of work on ‘Should you focus on the EU if you’re interested in AI governance for longtermist/x-risk reasons?’

MichaelA6 Aug 2022 16:49 UTC
40 points
1 comment1 min readEA link

Why does no one care about AI?

Olivia Addy7 Aug 2022 22:04 UTC
55 points
47 comments1 min readEA link

Con­crete ac­tions to im­prove AI gov­er­nance: the be­havi­our sci­ence approach

AlexanderSaeri1 Dec 2022 21:34 UTC
31 points
0 comments11 min readEA link

AGI ruin sce­nar­ios are likely (and dis­junc­tive)

So8res27 Jul 2022 3:24 UTC
54 points
5 comments6 min readEA link

[Question] Is trans­for­ma­tive AI the biggest ex­is­ten­tial risk? Why or why not?

BrownHairedEevee5 Mar 2022 3:54 UTC
9 points
11 comments1 min readEA link

List #3: Why not to as­sume on prior that AGI-al­ign­ment workarounds are available

Remmelt24 Dec 2022 9:54 UTC
6 points
0 comments1 min readEA link

AI Safety Micro­grant Round

Chris Leong14 Nov 2022 4:25 UTC
81 points
1 comment3 min readEA link

Are al­ign­ment re­searchers de­vot­ing enough time to im­prov­ing their re­search ca­pac­ity?

Carson Jones4 Nov 2022 0:58 UTC
11 points
1 comment1 min readEA link

ML Safety Schol­ars Sum­mer 2022 Retrospective

ThomasW1 Nov 2022 3:09 UTC
56 points
2 comments21 min readEA link

Align­ment’s phlo­gis­ton

Eleni_A18 Aug 2022 1:41 UTC
18 points
1 comment2 min readEA link

Early-warn­ing Fore­cast­ing Cen­ter: What it is, and why it’d be cool

Linch14 Mar 2022 19:20 UTC
57 points
8 comments11 min readEA link

Up­date on Har­vard AI Safety Team and MIT AI Alignment

Xander Davies2 Dec 2022 6:09 UTC
70 points
3 comments1 min readEA link

The op­ti­mal timing of spend­ing on AGI safety work; why we should prob­a­bly be spend­ing more now

Tristan Cook24 Oct 2022 17:42 UTC
88 points
11 comments36 min readEA link

[Link] GCRI’s Seth Baum re­views The Precipice

Aryeh Englander6 Jun 2022 19:33 UTC
21 points
0 comments1 min readEA link

AGI Isn’t Close—Fu­ture Fund Wor­ld­view Prize

Toni MUENDEL18 Dec 2022 16:03 UTC
−8 points
24 comments13 min readEA link

De­cep­tion as the op­ti­mal: mesa-op­ti­miz­ers and in­ner al­ign­ment

Eleni_A16 Aug 2022 3:45 UTC
19 points
0 comments5 min readEA link

Is AI fore­cast­ing a waste of effort on the mar­gin?

Emrik5 Nov 2022 0:41 UTC
9 points
6 comments3 min readEA link

Spicy takes about AI policy (Clark, 2022)

Will Aldred9 Aug 2022 13:49 UTC
43 points
0 comments3 min readEA link
(twitter.com)

An­nounc­ing AI safety Men­tors and Mentees

mariushobbhahn23 Nov 2022 15:21 UTC
62 points
0 comments1 min readEA link

13 Very Differ­ent Stances on AGI

Ozzie Gooen27 Dec 2021 23:30 UTC
84 points
27 comments3 min readEA link

High-level hopes for AI alignment

Holden Karnofsky20 Dec 2022 2:11 UTC
118 points
14 comments19 min readEA link
(www.cold-takes.com)

[Question] What are the num­bers in mind for the su­per-short AGI timelines so many long-ter­mists are alarmed about?

Evan_Gaensbauer19 Apr 2022 21:09 UTC
41 points
2 comments1 min readEA link

[Question] What is the best source to ex­plain short AI timelines to a skep­ti­cal per­son?

trevor123 Nov 2022 5:20 UTC
2 points
3 comments1 min readEA link

In­tro­duc­ing the Fund for Align­ment Re­search (We’re Hiring!)

AdamGleave6 Jul 2022 2:00 UTC
74 points
3 comments4 min readEA link

AGI and Lock-In

Lukas_Finnveden29 Oct 2022 1:56 UTC
124 points
28 comments10 min readEA link
(docs.google.com)

What Should We Op­ti­mize—A Conversation

Johannes C. Mayer7 Apr 2022 14:48 UTC
1 point
0 comments15 min readEA link

In­for­ma­tion se­cu­rity con­sid­er­a­tions for AI and the long term future

Jeffrey Ladish2 May 2022 20:53 UTC
123 points
7 comments11 min readEA link

Ques­tions about AI that bother me

Eleni_A31 Jan 2023 6:50 UTC
33 points
6 comments2 min readEA link

[Question] Do EA folks want AGI at all?

Noah Scales16 Jul 2022 5:44 UTC
8 points
10 comments1 min readEA link

[Question] Which pos­si­ble AI im­pacts should re­ceive the most ad­di­tional at­ten­tion?

David Johnston31 May 2022 2:01 UTC
10 points
10 comments1 min readEA link

$20K in Boun­ties for AI Safety Public Materials

ThomasW5 Aug 2022 2:57 UTC
45 points
11 comments6 min readEA link

Dis­cussing how to al­ign Trans­for­ma­tive AI if it’s de­vel­oped very soon

elifland28 Nov 2022 16:17 UTC
36 points
0 comments1 min readEA link

Part 2: AI Safety Move­ment Builders should help the com­mu­nity to op­ti­mise three fac­tors: con­trib­u­tors, con­tri­bu­tions and coordination

PeterSlattery15 Dec 2022 22:48 UTC
34 points
0 comments6 min readEA link

A new­comer’s guide to the tech­ni­cal AI safety field

zeshen4 Nov 2022 14:29 UTC
12 points
0 comments1 min readEA link

Three pillars for avoid­ing AGI catas­tro­phe: Tech­ni­cal al­ign­ment, de­ploy­ment de­ci­sions, and co­or­di­na­tion

alexlintz3 Aug 2022 21:24 UTC
90 points
4 comments11 min readEA link

Call For Distillers

johnswentworth6 Apr 2022 3:03 UTC
69 points
6 comments3 min readEA link

My take on What We Owe the Future

elifland1 Sep 2022 18:07 UTC
351 points
51 comments26 min readEA link

Fu­ture Mat­ters #3: digi­tal sen­tience, AGI ruin, and fore­cast­ing track records

Pablo4 Jul 2022 17:44 UTC
70 points
2 comments19 min readEA link

Ap­pli­ca­tions open for AGI Safety Fun­da­men­tals: Align­ment Course

Jamie Bernardi13 Dec 2022 10:50 UTC
75 points
0 comments2 min readEA link

Fu­ture Mat­ters #5: su­per­vol­ca­noes, AI takeover, and What We Owe the Future

Pablo14 Sep 2022 13:02 UTC
31 points
5 comments18 min readEA link

Please provide feed­back on AI-safety grant pro­posal, thanks!

Alex Long11 Dec 2022 23:29 UTC
8 points
1 comment2 min readEA link

Race to the Top: Bench­marks for AI Safety

isaduan4 Dec 2022 22:50 UTC
51 points
8 comments1 min readEA link

The an­i­mals and hu­mans anal­ogy for AI risk

freedomandutility13 Aug 2022 15:35 UTC
5 points
2 comments1 min readEA link

Les­sons learned from talk­ing to >100 aca­demics about AI safety

mariushobbhahn10 Oct 2022 13:16 UTC
138 points
21 comments1 min readEA link

List #1: Why stop­ping the de­vel­op­ment of AGI is hard but doable

Remmelt24 Dec 2022 9:52 UTC
24 points
2 comments1 min readEA link

Differ­en­tial tech­nol­ogy de­vel­op­ment: preprint on the concept

Hamish_Hobbs12 Sep 2022 13:52 UTC
61 points
0 comments2 min readEA link

Hu­man­ity’s vast fu­ture and its im­pli­ca­tions for cause prioritization

BrownHairedEevee26 Jul 2022 5:04 UTC
35 points
3 comments4 min readEA link
(sunyshore.substack.com)

Key Papers in Lan­guage Model Safety

aogara20 Jun 2022 14:59 UTC
19 points
0 comments22 min readEA link

Fu­ture Mat­ters #6: FTX col­lapse, value lock-in, and coun­ter­ar­gu­ments to AI x-risk

Pablo30 Dec 2022 13:10 UTC
57 points
2 comments21 min readEA link

Prob­a­bly good pro­jects for the AI safety ecosystem

Ryan Kidd5 Dec 2022 3:24 UTC
20 points
0 comments1 min readEA link

Rac­ing through a minefield: the AI de­ploy­ment problem

Holden Karnofsky31 Dec 2022 21:44 UTC
74 points
1 comment13 min readEA link
(www.cold-takes.com)

[Job]: AI Stan­dards Devel­op­ment Re­search Assistant

Tony Barrett14 Oct 2022 20:18 UTC
13 points
0 comments2 min readEA link

Pre-An­nounc­ing the 2023 Open Philan­thropy AI Wor­ld­views Contest

Jason Schukraft21 Nov 2022 21:45 UTC
291 points
26 comments1 min readEA link

An­nounc­ing: Mechanism De­sign for AI Safety—Read­ing Group

Rubi J. Hudson9 Aug 2022 4:25 UTC
35 points
1 comment4 min readEA link

Con­crete ac­tion­able poli­cies rele­vant to AI safety (writ­ten 2019)

weeatquince16 Dec 2022 18:41 UTC
48 points
0 comments22 min readEA link

Ways to buy time

Akash12 Nov 2022 19:31 UTC
47 points
1 comment1 min readEA link

New Se­quence—Towards a wor­ld­wide, wa­ter­tight Wind­fall Clause

John Bridge7 Apr 2022 15:02 UTC
25 points
4 comments8 min readEA link

Belief Bias: Bias in Eval­u­at­ing AGI X-Risks

Remmelt2 Jan 2023 8:59 UTC
5 points
0 comments1 min readEA link

Reflec­tions on the PIBBSS Fel­low­ship 2022

nora11 Dec 2022 22:03 UTC
69 points
4 comments18 min readEA link

Why Would AI “Aim” To Defeat Hu­man­ity?

Holden Karnofsky29 Nov 2022 18:59 UTC
19 points
0 comments32 min readEA link
(www.cold-takes.com)

Rea­sons for my nega­tive feel­ings to­wards the AI risk discussion

fergusq1 Sep 2022 7:33 UTC
41 points
9 comments4 min readEA link

“Tech­nolog­i­cal un­em­ploy­ment” AI vs. “most im­por­tant cen­tury” AI: how far apart?

Holden Karnofsky11 Oct 2022 4:50 UTC
15 points
1 comment3 min readEA link
(www.cold-takes.com)

Miti­gat­ing x-risk through modularity

Toby Newberry17 Dec 2020 19:54 UTC
96 points
6 comments14 min readEA link

Large Lan­guage Models as Cor­po­rate Lob­by­ists, and Im­pli­ca­tions for So­cietal-AI Alignment

johnjnay4 Jan 2023 22:22 UTC
10 points
6 comments8 min readEA link

When you plan ac­cord­ing to your AI timelines, should you put more weight on the me­dian fu­ture, or the me­dian fu­ture | even­tual AI al­ign­ment suc­cess? ⚖️

Jeffrey Ladish5 Jan 2023 1:55 UTC
16 points
2 comments2 min readEA link

Com­plex Sys­tems for AI Safety [Prag­matic AI Safety #3]

ThomasW24 May 2022 0:04 UTC
49 points
6 comments21 min readEA link

AI Gover­nance Needs Tech­ni­cal Work

Mauricio5 Sep 2022 22:25 UTC
94 points
3 comments7 min readEA link

Ar­tifi­cial In­tel­li­gence and Nu­clear Com­mand, Con­trol, & Com­mu­ni­ca­tions: The Risks of Integration

Peter Rautenbach18 Nov 2022 13:01 UTC
60 points
3 comments50 min readEA link

AI Safety Un­con­fer­ence NeurIPS 2022

Orpheus_Lummis7 Nov 2022 15:39 UTC
13 points
5 comments1 min readEA link
(aisafetyevents.org)

2022 AI ex­pert sur­vey results

Zach Stein-Perlman4 Aug 2022 15:54 UTC
88 points
7 comments2 min readEA link
(aiimpacts.org)

AGI as a Black Swan Event

Stephen McAleese4 Dec 2022 23:35 UTC
5 points
2 comments7 min readEA link
(www.lesswrong.com)

Fol­lowup on Terminator

skluug12 Mar 2022 1:11 UTC
32 points
0 comments9 min readEA link
(skluug.substack.com)

Why I think that teach­ing philos­o­phy is high impact

Eleni_A19 Dec 2022 23:00 UTC
17 points
2 comments2 min readEA link

Thoughts on AGI or­ga­ni­za­tions and ca­pa­bil­ities work

RobBensinger7 Dec 2022 19:46 UTC
77 points
7 comments5 min readEA link

BERI, Epoch, and FAR will ex­plain their work & cur­rent job open­ings on­line this Sunday

Rockwell19 Aug 2022 20:34 UTC
7 points
0 comments1 min readEA link

Sce­nario Map­ping Ad­vanced AI Risk: Re­quest for Par­ti­ci­pa­tion with Data Collection

Kiliank27 Mar 2022 11:44 UTC
14 points
0 comments5 min readEA link

Toby Ord’s new re­port on les­sons from the de­vel­op­ment of the atomic bomb

Ishan Mukherjee22 Nov 2022 10:37 UTC
65 points
3 comments1 min readEA link
(www.governance.ai)

[Question] How does one find out their AGI timelines?

Yadav7 Nov 2022 22:34 UTC
19 points
4 comments1 min readEA link

How to en­gage with AI 4 So­cial Jus­tice ac­tors

TomWestgarth26 Apr 2022 8:39 UTC
14 points
5 comments1 min readEA link

Two rea­sons we might be closer to solv­ing al­ign­ment than it seems

Kat Woods24 Sep 2022 17:38 UTC
38 points
18 comments4 min readEA link

Catholic the­olo­gians and priests on ar­tifi­cial intelligence

anonymous614 Jun 2022 18:53 UTC
21 points
3 comments1 min readEA link

The miss­ing link to AGI

Yuri Barzov28 Sep 2022 16:37 UTC
1 point
7 comments1 min readEA link

[Question] By how much should Meta’s Blen­derBot be­ing re­ally bad cause me to up­date on how jus­tifi­able it is for OpenAI and Deep­Mind to be mak­ing sig­nifi­cant progress on AI ca­pa­bil­ities?

Sisi10 Aug 2022 6:40 UTC
24 points
8 comments1 min readEA link

Why I think strong gen­eral AI is com­ing soon

porby28 Sep 2022 6:55 UTC
14 points
1 comment1 min readEA link

[Question] Is there any re­search or fore­casts of how likely AI Align­ment is go­ing to be a hard vs. easy prob­lem rel­a­tive to ca­pa­bil­ities?

Jordan Arel14 Aug 2022 15:58 UTC
8 points
1 comment1 min readEA link

On Ar­tifi­cial Gen­eral In­tel­li­gence: Ask­ing the Right Questions

Heather Douglas2 Oct 2022 5:00 UTC
−1 points
7 comments3 min readEA link

AGI Safety Com­mu­ni­ca­tions Initiative

Ines11 Jun 2022 16:30 UTC
33 points
5 comments1 min readEA link

Differ­ence, Pro­jec­tion, and Adaptation

YOG10 Nov 2022 10:46 UTC
0 points
0 comments3 min readEA link

What if AI de­vel­op­ment goes well?

RoryG3 Aug 2022 8:57 UTC
25 points
7 comments12 min readEA link

Mas­sive Scal­ing Should be Frowned Upon

harsimony17 Nov 2022 17:44 UTC
9 points
0 comments5 min readEA link

Don’t leave your finger­prints on the future

So8res8 Oct 2022 0:35 UTC
86 points
4 comments1 min readEA link

Is there a demo of “You can’t fetch the coffee if you’re dead”?

Ram Rachum10 Nov 2022 11:03 UTC
8 points
3 comments1 min readEA link

Si­mu­la­tors and Mindcrime

𝕮𝖎𝖓𝖊𝖗𝖆9 Dec 2022 15:20 UTC
1 point
0 comments1 min readEA link

In­sti­tu­tions Can­not Res­train Dark-Triad AI Exploitation

Remmelt27 Dec 2022 10:34 UTC
8 points
0 comments1 min readEA link

[Linkpost] “Blueprint for an AI Bill of Rights”—Office of Science and Tech­nol­ogy Policy, USA (2022)

rodeo_flagellum5 Oct 2022 16:48 UTC
15 points
0 comments1 min readEA link

AI Safety in a Vuln­er­a­ble World: Re­quest­ing Feed­back on Pre­limi­nary Thoughts

Jordan Arel6 Dec 2022 22:36 UTC
5 points
4 comments3 min readEA link

AI coöper­a­tion is more pos­si­ble than you think

42317524 Sep 2022 23:04 UTC
2 points
0 comments1 min readEA link

The Hap­piness Max­i­mizer: Why EA is an x-risk

Obasi Shaw30 Aug 2022 4:29 UTC
8 points
6 comments29 min readEA link

Effec­tive Per­sua­sion For AI Align­ment Risk

Brian Lui9 Aug 2022 23:55 UTC
5 points
7 comments4 min readEA link

Against Agents as an Ap­proach to Aligned Trans­for­ma­tive AI

𝕮𝖎𝖓𝖊𝖗𝖆27 Dec 2022 0:47 UTC
4 points
0 comments1 min readEA link

[MLSN #6]: Trans­parency sur­vey, prov­able ro­bust­ness, ML mod­els that pre­dict the future

Dan H12 Oct 2022 20:51 UTC
21 points
1 comment6 min readEA link

In­tro­duc­ing Gen­er­ally In­tel­li­gent: an AI re­search lab fo­cused on im­proved the­o­ret­i­cal and prag­matic understanding

joshalbrecht21 Oct 2022 8:20 UTC
8 points
0 comments1 min readEA link

“AGI timelines: ig­nore the so­cial fac­tor at their peril” (Fu­ture Fund AI Wor­ld­view Prize sub­mis­sion)

ketanrama5 Nov 2022 17:45 UTC
10 points
0 comments12 min readEA link
(trevorklee.substack.com)

Dist­in­guish­ing test from training

So8res29 Nov 2022 21:41 UTC
27 points
0 comments1 min readEA link

Prov­ably Hon­est—A First Step

Srijanak De5 Nov 2022 21:49 UTC
1 point
0 comments1 min readEA link

Law-Fol­low­ing AI 4: Don’t Rely on Vi­car­i­ous Liability

Cullen2 Aug 2022 23:23 UTC
13 points
0 comments3 min readEA link

Prize and fast track to al­ign­ment re­search at ALTER

Vanessa18 Sep 2022 9:15 UTC
38 points
0 comments3 min readEA link

Public-fac­ing Cen­sor­ship Is Safety Theater, Caus­ing Rep­u­ta­tional Da­m­age

Yitz23 Sep 2022 5:08 UTC
49 points
7 comments1 min readEA link

How to store hu­man val­ues on a computer

oliver_siegel4 Nov 2022 19:36 UTC
1 point
2 comments1 min readEA link

Un­der­stand­ing the diffu­sion of large lan­guage mod­els: summary

Ben Cottier21 Dec 2022 13:49 UTC
124 points
18 comments22 min readEA link

Epoch is hiring a Re­search Data Analyst

merilalama22 Nov 2022 17:34 UTC
21 points
0 comments4 min readEA link
(careers.rethinkpriorities.org)

How Open Source Ma­chine Learn­ing Soft­ware Shapes AI

Max Langenkamp28 Sep 2022 17:49 UTC
11 points
3 comments14 min readEA link
(maxlangenkamp.me)

Ba­hamian Ad­ven­tures: An Epic Tale of En­trepreneur­ship, AI Strat­egy Re­search and Potatoes

Jaime Sevilla9 Aug 2022 8:37 UTC
67 points
9 comments4 min readEA link

A New York Times ar­ti­cle on AI risk

Eleni_A6 Sep 2022 0:46 UTC
20 points
0 comments1 min readEA link
(www.nytimes.com)

AI Safety Ex­ec­u­tive Summary

Sean Osier6 Sep 2022 8:26 UTC
20 points
2 comments5 min readEA link
(seanosier.notion.site)

“Develop An­thro­po­mor­phic AGI to Save Hu­man­ity from It­self” (Fu­ture Fund AI Wor­ld­view Prize sub­mis­sion)

ketanrama5 Nov 2022 17:57 UTC
19 points
6 comments7 min readEA link

Pos­si­ble di­rec­tions in AI ideal gov­er­nance research

RoryG10 Aug 2022 8:36 UTC
5 points
0 comments3 min readEA link

Re­sults from the lan­guage model hackathon

Esben Kran10 Oct 2022 8:29 UTC
23 points
2 comments1 min readEA link

Ber­lin AI Safety Open Meetup July 2022

Isidor Regenfuß22 Jul 2022 16:26 UTC
1 point
0 comments1 min readEA link

[Question] AI Safety Pitches post ChatGPT

ojorgensen5 Dec 2022 22:48 UTC
6 points
2 comments1 min readEA link

Mili­tary Ar­tifi­cial In­tel­li­gence as Con­trib­u­tor to Global Catas­trophic Risk

MMMaas27 Jun 2022 10:35 UTC
40 points
0 comments54 min readEA link

MIRI Con­ver­sa­tions: Tech­nol­ogy Fore­cast­ing & Grad­u­al­ism (Distil­la­tion)

TheMcDouglas13 Jul 2022 10:45 UTC
27 points
9 comments19 min readEA link

Align­ing AI with Hu­mans by Lev­er­ag­ing Le­gal Informatics

johnjnay18 Sep 2022 7:43 UTC
20 points
11 comments3 min readEA link

Is the time crunch for AI Safety Move­ment Build­ing now?

Chris Leong8 Jun 2022 12:19 UTC
14 points
10 comments2 min readEA link

[Question] Does China have AI al­ign­ment re­sources/​in­sti­tu­tions? How can we pri­ori­tize cre­at­ing more?

Jakub Kraus4 Aug 2022 19:23 UTC
18 points
9 comments1 min readEA link

Align­ment is hard. Com­mu­ni­cat­ing that, might be harder

Eleni_A1 Sep 2022 11:45 UTC
17 points
1 comment3 min readEA link

[Question] Why does (any par­tic­u­lar) AI safety work re­duce s-risks more than it in­creases them?

MichaelStJules3 Oct 2021 16:55 UTC
48 points
19 comments1 min readEA link

[Question] Slow­ing down AI progress?

Eleni_A26 Jul 2022 8:46 UTC
14 points
9 comments1 min readEA link

Stress Ex­ter­nal­ities More in AI Safety Pitches

NickGabs26 Sep 2022 20:31 UTC
31 points
13 comments2 min readEA link

Ap­pli­ca­tions are now open for In­tro to ML Safety Spring 2023

Joshc4 Nov 2022 22:45 UTC
49 points
1 comment2 min readEA link

AI ac­cel­er­a­tion from a safety per­spec­tive: Trade-offs and con­sid­er­a­tions

mariushobbhahn19 Jan 2022 9:44 UTC
12 points
1 comment7 min readEA link

[Question] Why not to solve al­ign­ment by mak­ing su­per­in­tel­li­gent hu­mans?

Pato16 Oct 2022 21:26 UTC
9 points
12 comments1 min readEA link

UK AI Policy Re­port: Con­tent, Sum­mary, and its Im­pact on EA Cause Areas

Algo_Law21 Jul 2022 17:32 UTC
9 points
1 comment9 min readEA link

Hacker-AI and Digi­tal Ghosts – Pre-AGI

Erland Wittkotter19 Oct 2022 7:49 UTC
4 points
0 comments1 min readEA link

It’s (not) how you use it

Eleni_A7 Sep 2022 13:28 UTC
6 points
3 comments2 min readEA link

An­nounc­ing: What Fu­ture World? - Grow­ing the AI Gover­nance Community

DavidCorfield2 Nov 2022 0:31 UTC
4 points
0 comments1 min readEA link

[linkpost] When does tech­ni­cal work to re­duce AGI con­flict make a differ­ence?: Introduction

antimonyanthony16 Sep 2022 14:35 UTC
31 points
0 comments1 min readEA link
(www.lesswrong.com)

The al­ign­ment prob­lem from a deep learn­ing perspective

richard_ngo11 Aug 2022 3:18 UTC
58 points
0 comments21 min readEA link

FYI: I’m work­ing on a book about the threat of AGI/​ASI for a gen­eral au­di­ence. I hope it will be of value to the cause and the community

Darren McKee17 Jun 2022 11:52 UTC
32 points
1 comment2 min readEA link

Nice­ness is unnatural

So8res13 Oct 2022 1:30 UTC
20 points
1 comment1 min readEA link

An­nounc­ing the Fu­ture Fund’s AI Wor­ld­view Prize

Nick_Beckstead23 Sep 2022 16:28 UTC
255 points
130 comments13 min readEA link
(ftxfuturefund.org)

[Question] Graph of % of tasks AI is su­per­hu­man at?

Denkenberger15 Nov 2022 5:59 UTC
9 points
0 comments1 min readEA link

aisafety.com­mu­nity—A liv­ing doc­u­ment of AI safety communities

zeshen20 Oct 2022 22:08 UTC
24 points
13 comments1 min readEA link

My (Lazy) Longter­mism FAQ

Devin Kalish24 Oct 2022 16:44 UTC
28 points
6 comments27 min readEA link

[Question] How much will pre-trans­for­ma­tive AI speed up R&D?

Ben Snodin31 May 2021 20:20 UTC
23 points
0 comments1 min readEA link

Which Post Idea Is Most Effec­tive?

Jordan Arel25 Apr 2022 4:47 UTC
26 points
6 comments2 min readEA link

Re­sources that (I think) new al­ign­ment re­searchers should know about

Akash28 Oct 2022 22:13 UTC
20 points
2 comments1 min readEA link

The AIA and its Brus­sels Effect

Kathryn O'Rourke27 Dec 2022 16:01 UTC
14 points
0 comments5 min readEA link

Safety timelines: How long will it take to solve al­ign­ment?

Esben Kran19 Sep 2022 12:51 UTC
41 points
9 comments6 min readEA link

Wor­ld­view iPeo­ple—Fu­ture Fund’s AI Wor­ld­view Prize

Toni MUENDEL28 Oct 2022 7:37 UTC
0 points
5 comments1 min readEA link

Like­li­hood of an anti-AI back­lash: Re­sults from a pre­limi­nary Twit­ter poll

Geoffrey Miller27 Sep 2022 22:01 UTC
27 points
13 comments1 min readEA link

How tech­ni­cal safety stan­dards could pro­mote TAI safety

Cullen8 Aug 2022 16:57 UTC
127 points
15 comments7 min readEA link

What are cur­rent smaller prob­lems re­lated to top EA cause ar­eas (eg deep­fake poli­cies for AI risk, on­go­ing covid var­i­ants for bio risk) and would it be benefi­cial for these small and not-catas­trophic challenges to get more EA re­sources, as a way of de­vel­op­ing ca­pac­ity to pre­vent the catas­trophic ver­sions?

nonzerosum13 Jun 2022 17:32 UTC
7 points
0 comments2 min readEA link

Let’s talk about un­con­trol­lable AI

Karl von Wendt9 Oct 2022 10:37 UTC
12 points
2 comments1 min readEA link

Anti-squat­ted AI x-risk do­mains index

plex12 Aug 2022 12:00 UTC
52 points
9 comments1 min readEA link

Slides: Po­ten­tial Risks From Ad­vanced AI

Aryeh Englander28 Apr 2022 2:18 UTC
9 points
0 comments1 min readEA link

In­tro­duc­tion: Bias in Eval­u­at­ing AGI X-Risks

Remmelt27 Dec 2022 10:27 UTC
4 points
0 comments1 min readEA link

CFP for Re­bel­lion and Di­sobe­di­ence in AI workshop

Ram Rachum29 Dec 2022 16:09 UTC
4 points
0 comments1 min readEA link

The case for tak­ing AI se­ri­ously as a threat to hu­man­ity (Kel­sey Piper)

EA Handbook15 Oct 2020 7:00 UTC
11 points
1 comment1 min readEA link
(www.vox.com)

Re­ac­tive de­val­u­a­tion: Bias in Eval­u­at­ing AGI X-Risks

Remmelt30 Dec 2022 9:02 UTC
2 points
9 comments1 min readEA link

My thoughts on OpenAI’s al­ign­ment plan

Akash30 Dec 2022 19:34 UTC
16 points
0 comments1 min readEA link

Curse of knowl­edge and Naive re­al­ism: Bias in Eval­u­at­ing AGI X-Risks

Remmelt31 Dec 2022 13:33 UTC
5 points
0 comments1 min readEA link

Self-Limit­ing AI in AI Alignment

The_Lord's_Servant_28031 Dec 2022 19:07 UTC
2 points
1 comment1 min readEA link

Challenge to the no­tion that any­thing is (maybe) pos­si­ble with AGI

Remmelt1 Jan 2023 3:57 UTC
−17 points
3 comments1 min readEA link

Sum­mary of 80k’s AI prob­lem profile

Jakub Kraus1 Jan 2023 7:48 UTC
19 points
0 comments5 min readEA link
(www.lesswrong.com)

Re­sults from the AI test­ing hackathon

Esben Kran2 Jan 2023 15:46 UTC
35 points
4 comments5 min readEA link
(alignmentjam.com)

AI Safety Doesn’t Have to be Weird

Mica White2 Jan 2023 21:56 UTC
11 points
1 comment2 min readEA link

Sta­tus quo bias; Sys­tem justification

Remmelt3 Jan 2023 2:50 UTC
4 points
1 comment1 min readEA link

[Question] How have shorter AI timelines been af­fect­ing you, and how have you been re­spond­ing to them?

Liav.Koren3 Jan 2023 4:20 UTC
33 points
16 comments1 min readEA link

Nor­malcy bias and Base rate ne­glect: Bias in Eval­u­at­ing AGI X-Risks

Remmelt4 Jan 2023 3:16 UTC
5 points
0 comments1 min readEA link

AI al­ign­ment re­search links

Holden Karnofsky6 Jan 2022 5:52 UTC
16 points
0 comments6 min readEA link
(www.cold-takes.com)

“AI” is an indexical

ThomasW3 Jan 2023 22:00 UTC
23 points
2 comments1 min readEA link

Holden Karnofsky In­ter­view about Most Im­por­tant Cen­tury & Trans­for­ma­tive AI

Dwarkesh Patel3 Jan 2023 17:31 UTC
29 points
2 comments1 min readEA link

Illu­sion of truth effect and Am­bi­guity effect: Bias in Eval­u­at­ing AGI X-Risks

Remmelt5 Jan 2023 4:05 UTC
1 point
1 comment1 min readEA link

ChatGPT un­der­stands, but largely does not gen­er­ate Span­glish (and other code-mixed) text

Milan Weibel4 Jan 2023 22:10 UTC
5 points
0 comments4 min readEA link
(www.lesswrong.com)

Me­tac­u­lus Year in Re­view: 2022

christian6 Jan 2023 1:23 UTC
25 points
2 comments4 min readEA link
(metaculus.medium.com)

Is any­one else also get­ting more wor­ried about hard take­off AGI sce­nar­ios?

JonCefalu9 Jan 2023 6:04 UTC
19 points
11 comments3 min readEA link

Misha Yagudin and Ozzie Gooen Dis­cuss LLMs and Effec­tive Altruism

Ozzie Gooen6 Jan 2023 22:59 UTC
47 points
3 comments14 min readEA link
(quri.substack.com)

An­chor­ing fo­cal­ism and the Iden­ti­fi­able vic­tim effect: Bias in Eval­u­at­ing AGI X-Risks

Remmelt7 Jan 2023 9:59 UTC
4 points
1 comment1 min readEA link

[Dis­cus­sion] How Broad is the Hu­man Cog­ni­tive Spec­trum?

𝕮𝖎𝖓𝖊𝖗𝖆7 Jan 2023 0:59 UTC
16 points
1 comment1 min readEA link

David Krueger on AI Align­ment in Academia and Coordination

Michaël Trazzi7 Jan 2023 21:14 UTC
32 points
1 comment3 min readEA link
(theinsideview.ai)

Learn­ing as much Deep Learn­ing math as I could in 24 hours

Phosphorous8 Jan 2023 2:19 UTC
57 points
5 comments7 min readEA link

Big list of AI safety videos

Jakub Kraus9 Jan 2023 6:09 UTC
9 points
0 comments1 min readEA link
(docs.google.com)

Went­worth and Larsen on buy­ing time

Akash9 Jan 2023 21:31 UTC
48 points
0 comments1 min readEA link

[Question] What AI Take-Over Movies or Books Will Scare Me Into Tak­ing AI Se­ri­ously?

Jordan Arel10 Jan 2023 8:30 UTC
11 points
7 comments1 min readEA link

ea.do­mains—Do­mains Free to a Good Home

plex12 Jan 2023 13:32 UTC
48 points
8 comments4 min readEA link

[Ru­mour] Microsoft to in­vest $10B in OpenAI, will re­ceive 75% of prof­its un­til they re­coup in­vest­ment: GPT would be in­te­grated with Office

𝕮𝖎𝖓𝖊𝖗𝖆10 Jan 2023 23:43 UTC
25 points
2 comments1 min readEA link

An­nounc­ing the 2023 PIBBSS Sum­mer Re­search Fellowship

Dušan D. Nešić (Dushan)12 Jan 2023 21:38 UTC
26 points
3 comments1 min readEA link

[Question] Con­cerns about AI safety ca­reer change

mmKALLL13 Jan 2023 20:52 UTC
45 points
15 comments4 min readEA link

EA rele­vant Fore­sight In­sti­tute Work­shops in 2023: WBE & AI safety, Cryp­tog­ra­phy & AI safety, XHope, Space, and Atom­i­cally Pre­cise Manufacturing

elteerkers16 Jan 2023 14:02 UTC
20 points
2 comments3 min readEA link

Prepar­ing for AI-as­sisted al­ign­ment re­search: we need data!

CBiddulph17 Jan 2023 3:28 UTC
11 points
0 comments11 min readEA link

An­nounc­ing aisafety.training

JJ Hepburn17 Jan 2023 1:55 UTC
108 points
4 comments1 min readEA link

[Question] Should AI writ­ers be pro­hibited in ed­u­ca­tion?

Eleni_A16 Jan 2023 22:29 UTC
3 points
2 comments1 min readEA link

Les­sons learned and re­view of the AI Safety Nudge Competition

Marc Carauleanu17 Jan 2023 17:13 UTC
5 points
0 comments5 min readEA link

Emerg­ing Paradigms: The Case of Ar­tifi­cial In­tel­li­gence Safety

Eleni_A18 Jan 2023 5:59 UTC
16 points
0 comments19 min readEA link

[Question] Any Philos­o­phy PhD recom­men­da­tions for stu­dents in­ter­ested in Align­ment Efforts?

rickyhuang.hexuan18 Jan 2023 5:54 UTC
7 points
6 comments1 min readEA link

Help me to un­der­stand AI al­ign­ment!

britomart18 Jan 2023 9:13 UTC
3 points
11 comments1 min readEA link

6-para­graph AI risk in­tro for MAISI

Jakub Kraus19 Jan 2023 9:22 UTC
12 points
0 comments1 min readEA link

An­nounc­ing Cavendish Labs

dyusha19 Jan 2023 20:00 UTC
106 points
6 comments2 min readEA link

Why peo­ple want to work on AI safety (but don’t)

Emily Grundy24 Jan 2023 6:41 UTC
69 points
10 comments7 min readEA link

What a com­pute-cen­tric frame­work says about AI take­off speeds—draft report

Tom_Davidson23 Jan 2023 4:09 UTC
186 points
5 comments16 min readEA link
(www.lesswrong.com)

Ex­is­ten­tial Risk of Misal­igned In­tel­li­gence Aug­men­ta­tion (Par­tic­u­larly Us­ing High-Band­width BCI Im­plants)

Damian Gorski24 Jan 2023 17:02 UTC
1 point
0 comments9 min readEA link

[Linkpost] Hu­man-nar­rated au­dio ver­sion of “Is Power-Seek­ing AI an Ex­is­ten­tial Risk?”

Joe_Carlsmith31 Jan 2023 19:19 UTC
7 points
0 comments1 min readEA link

Alexan­der and Yud­kowsky on AGI goals

Scott Alexander31 Jan 2023 23:36 UTC
29 points
1 comment1 min readEA link

Launch­ing The Col­lec­tive In­tel­li­gence Pro­ject: Whitepa­per and Pilots

jasmine_wang6 Feb 2023 17:00 UTC
37 points
8 comments2 min readEA link
(cip.org)

In­ter­view with Ro­man Yam­polskiy about AGI on The Real­ity Check

Darren McKee18 Feb 2023 23:29 UTC
27 points
0 comments1 min readEA link
(www.trcpodcast.com)

Com­ments on OpenAI’s “Plan­ning for AGI and be­yond”

So8res3 Mar 2023 23:01 UTC
115 points
7 comments1 min readEA link

2023 Stan­ford Ex­is­ten­tial Risks Conference

elizabethcooper24 Feb 2023 17:49 UTC
29 points
5 comments1 min readEA link

Seek­ing in­put on a list of AI books for broader audience

Darren McKee27 Feb 2023 22:40 UTC
48 points
14 comments5 min readEA link

What does Bing Chat tell us about AI risk?

Holden Karnofsky28 Feb 2023 18:47 UTC
99 points
8 comments2 min readEA link
(www.cold-takes.com)

[Cross­post] Why Un­con­trol­lable AI Looks More Likely Than Ever

Otto8 Mar 2023 15:33 UTC
49 points
6 comments4 min readEA link
(time.com)

Fake Meat and Real Talk 1 - Are We All Gonna Die? Yud­kowsky and the Dangers of AI (Please RSVP)

David N8 Mar 2023 20:40 UTC
11 points
2 comments1 min readEA link

Paper Sum­mary: The Effec­tive­ness of AI Ex­is­ten­tial Risk Com­mu­ni­ca­tion to the Amer­i­can and Dutch Public

Otto9 Mar 2023 10:40 UTC
96 points
11 comments4 min readEA link

Every­thing’s nor­mal un­til it’s not

Eleni_A10 Mar 2023 1:42 UTC
6 points
0 comments3 min readEA link

The Power of In­tel­li­gence—The Animation

Writer11 Mar 2023 16:15 UTC
56 points
0 comments1 min readEA link

Yud­kowsky on AGI risk on the Ban­kless podcast

RobBensinger13 Mar 2023 0:42 UTC
52 points
2 comments75 min readEA link

On tak­ing AI risk se­ri­ously

Eleni_A13 Mar 2023 5:44 UTC
51 points
4 comments1 min readEA link
(www.nytimes.com)

The Over­ton Win­dow widens: Ex­am­ples of AI risk in the media

Akash23 Mar 2023 17:10 UTC
111 points
11 comments1 min readEA link

My at­tempt at ex­plain­ing the case for AI risk in a straight­for­ward way

JulianHazell25 Mar 2023 16:32 UTC
24 points
7 comments18 min readEA link
(muddyclothes.substack.com)

[Linkpost] Shorter ver­sion of re­port on ex­is­ten­tial risk from power-seek­ing AI

Joe_Carlsmith22 Mar 2023 18:06 UTC
49 points
1 comment1 min readEA link

[Question] What are the ar­gu­ments that sup­port China build­ing AGI+ if Western com­pa­nies de­lay/​pause AI de­vel­op­ment?

DMMF29 Mar 2023 18:53 UTC
32 points
9 comments1 min readEA link

Longter­mism and short­ter­mism can dis­agree on nu­clear war to stop ad­vanced AI

David Johnston30 Mar 2023 23:22 UTC
2 points
0 comments1 min readEA link

Nu­clear brinks­man­ship is not a good AI x-risk strategy

titotal30 Mar 2023 22:07 UTC
11 points
8 comments5 min readEA link

“Dangers of AI and the End of Hu­man Civ­i­liza­tion” Yud­kowsky on Lex Fridman

𝕮𝖎𝖓𝖊𝖗𝖆30 Mar 2023 15:44 UTC
28 points
0 comments1 min readEA link

[Question] What are the biggest ob­sta­cles on AI safety re­search ca­reer?

jackchang11031 Mar 2023 14:53 UTC
2 points
1 comment1 min readEA link

[Question] How much should states in­vest in con­tin­gency plans for wide­spread in­ter­net out­age?

Kinoshita Yoshikazu (pseudonym)7 Apr 2023 16:05 UTC
2 points
0 comments1 min readEA link

Hu­man Values and AGI Risk | William James

William James31 Mar 2023 22:30 UTC
1 point
0 comments12 min readEA link

Pes­simism about AI Safety

Max_He-Ho2 Apr 2023 7:57 UTC
5 points
0 comments25 min readEA link
(www.lesswrong.com)

Re­search Sum­mary: Fore­cast­ing with Large Lan­guage Models

Damien Laird2 Apr 2023 10:52 UTC
4 points
0 comments7 min readEA link
(damienlaird.substack.com)

[Question] Pre­dic­tions for fu­ture AI gov­er­nance?

jackchang1102 Apr 2023 16:43 UTC
4 points
1 comment1 min readEA link

[Question] De­bates on re­duc­ing long-term s-risks?

jackchang1106 Apr 2023 1:26 UTC
12 points
2 comments1 min readEA link

Risks from GPT-4 Byproduct of Re­cur­sively Op­ti­miz­ing AIs

ben hayum6 Apr 2023 5:52 UTC
84 points
4 comments10 min readEA link
(www.lesswrong.com)

Reli­a­bil­ity, Se­cu­rity, and AI risk: Notes from in­fosec text­book chap­ter 1

Akash7 Apr 2023 15:47 UTC
15 points
0 comments1 min readEA link

Paus­ing AI Devel­op­ments Isn’t Enough. We Need to Shut it All Down

EliezerYudkowsky9 Apr 2023 15:53 UTC
45 points
3 comments1 min readEA link

Pod­cast/​video/​tran­script: Eliezer Yud­kowsky—Why AI Will Kill Us, Align­ing LLMs, Na­ture of In­tel­li­gence, SciFi, & Rationality

PeterSlattery9 Apr 2023 10:37 UTC
32 points
2 comments137 min readEA link
(www.youtube.com)

Mea­sur­ing ar­tifi­cial in­tel­li­gence on hu­man bench­marks is naive

Ward A11 Apr 2023 11:28 UTC
3 points
2 comments1 min readEA link

[US] NTIA: AI Ac­countabil­ity Policy Re­quest for Comment

Kyle J. Lucchese13 Apr 2023 16:12 UTC
47 points
4 comments1 min readEA link
(ntia.gov)

Un-un­plug­ga­bil­ity—can’t we just un­plug it?

Oliver Sourbut15 May 2023 13:23 UTC
14 points
0 comments1 min readEA link

AI Takeover Sce­nario with Scaled LLMs

simeon_c16 Apr 2023 23:28 UTC
28 points
1 comment1 min readEA link

Sum­mary: The Case for Halt­ing AI Devel­op­ment—Max Teg­mark on the Lex Frid­man Podcast

Madhav Malhotra16 Apr 2023 22:28 UTC
37 points
4 comments4 min readEA link
(youtu.be)

Prevenire una catas­trofe legata alle IA

EA Italy17 Jan 2023 11:07 UTC
1 point
0 comments4 min readEA link

L’im­por­tanza delle IA come pos­si­bile mi­nac­cia per l’umanità

EA Italy17 Jan 2023 22:24 UTC
1 point
0 comments1 min readEA link
(www.vox.com)

Perché il deep learn­ing mod­erno potrebbe ren­dere diffi­cile l’al­linea­mento delle IA

EA Italy17 Jan 2023 23:29 UTC
1 point
0 comments16 min readEA link

Le Tem­p­is­tiche delle IA: il di­bat­tito e il punto di vista degli “es­perti”

EA Italy17 Jan 2023 23:30 UTC
1 point
0 comments11 min readEA link

Ricerca sulla sicurezza delle IA: panoram­ica delle carriere

EA Italy17 Jan 2023 11:06 UTC
1 point
0 comments7 min readEA link

Ap­profondi­menti sui rischi dell’IA (ma­te­ri­ali in in­glese)

EA Italy18 Jan 2023 11:16 UTC
1 point
0 comments2 min readEA link

Orthog­o­nal: A new agent foun­da­tions al­ign­ment organization

Tamsin Leake19 Apr 2023 20:17 UTC
36 points
0 comments1 min readEA link

Notes on “the hot mess the­ory of AI mis­al­ign­ment”

Jakub Kraus21 Apr 2023 10:07 UTC
37 points
3 comments1 min readEA link

Stu­dent com­pe­ti­tion for draft­ing a treaty on mora­to­rium of large-scale AI ca­pa­bil­ities R&D

Nayanika24 Apr 2023 13:15 UTC
35 points
4 comments2 min readEA link

FT: We must slow down the race to God-like AI

Angelina Li24 Apr 2023 11:57 UTC
27 points
2 comments2 min readEA link
(www.ft.com)

AGI ruin mostly rests on strong claims about al­ign­ment and de­ploy­ment, not about society

RobBensinger24 Apr 2023 13:07 UTC
14 points
4 comments1 min readEA link

Refram­ing the bur­den of proof: Com­pa­nies should prove that mod­els are safe (rather than ex­pect­ing au­di­tors to prove that mod­els are dan­ger­ous)

Akash25 Apr 2023 18:49 UTC
34 points
1 comment1 min readEA link

Im­pli­ca­tions of the White­house meet­ing with AI CEOs for AI su­per­in­tel­li­gence risk—a first-step to­wards evals?

Jamie Bernardi7 May 2023 17:33 UTC
76 points
3 comments7 min readEA link

Why “just make an agent which cares only about bi­nary re­wards” doesn’t work.

Lysandre Terrisse9 May 2023 16:51 UTC
3 points
1 comment3 min readEA link

Un­veiling the Amer­i­can Public Opinion on AI Mo­ra­to­rium and Govern­ment In­ter­ven­tion: The Im­pact of Me­dia Exposure

Otto8 May 2023 10:49 UTC
27 points
5 comments6 min readEA link

A re­quest to keep pes­simistic AI posts ac­tion­able.

tcelferact11 May 2023 15:35 UTC
26 points
9 comments1 min readEA link

Con­fu­sions and up­dates on STEM AI

Eleni_A19 May 2023 21:34 UTC
7 points
0 comments1 min readEA link

Oc­to­ber 2022 AI Risk Com­mu­nity Sur­vey Results

Froolow24 May 2023 10:37 UTC
18 points
0 comments7 min readEA link

New s-risks au­dio­book available now

Alistair Webster24 May 2023 20:27 UTC
76 points
1 comment1 min readEA link
(centerforreducingsuffering.org)

Will AI end ev­ery­thing? A guide to guess­ing | EAG Bay Area 23

Katja_Grace25 May 2023 17:01 UTC
69 points
1 comment21 min readEA link

The Case for AI Adap­ta­tion: The Per­ils of Liv­ing in a World with Aligned and Well-De­ployed Trans­for­ma­tive Ar­tifi­cial Intelligence

HTC30 May 2023 18:29 UTC
3 points
1 comment7 min readEA link

Sum­maries: Align­ment Fun­da­men­tals Curriculum

Leon_Lang19 Sep 2022 15:43 UTC
25 points
1 comment1 min readEA link
(docs.google.com)

Loss of con­trol of AI is not a likely source of AI x-risk

squek9 Nov 2022 5:48 UTC
8 points
0 comments1 min readEA link

Govern­ments pose larger risks than cor­po­ra­tions: a brief re­sponse to Grace

David Johnston19 Oct 2022 11:54 UTC
11 points
3 comments2 min readEA link

AGI Risk: How to in­ter­na­tion­ally reg­u­late in­dus­tries in non-democracies

Timothy_Liptrot16 May 2022 22:45 UTC
9 points
2 comments9 min readEA link

Why do we post our AI safety plans on the In­ter­net?

Peter S. Park31 Oct 2022 16:27 UTC
14 points
22 comments11 min readEA link

We Did AGISF’s 8-week Course in 3 Days. Here’s How it Went

ag400024 Jul 2022 16:46 UTC
26 points
7 comments5 min readEA link

Ap­pli­ca­tions Open: GovAI Sum­mer Fel­low­ship 2023

GovAI21 Dec 2022 15:00 UTC
28 points
0 comments2 min readEA link

I am a Me­moryless System

NicholasKross23 Oct 2022 17:36 UTC
4 points
0 comments9 min readEA link
(www.thinkingmuchbetter.com)

[Question] How long does it take to understand AI X-Risk from scratch so that I have a confident, clear mental model of it from first principles?

Jordan Arel27 Jul 2022 16:58 UTC
29 points
6 comments1 min readEA link

Rood­man’s Thoughts on Biolog­i­cal Anchors

lukeprog14 Sep 2022 12:23 UTC
72 points
8 comments1 min readEA link
(docs.google.com)

Drivers of large lan­guage model diffu­sion: in­cre­men­tal re­search, pub­lic­ity, and cascades

Ben Cottier21 Dec 2022 13:50 UTC
21 points
0 comments29 min readEA link

SERI MATS Pro­gram—Win­ter 2022 Cohort

Ryan Kidd8 Oct 2022 19:09 UTC
50 points
5 comments1 min readEA link

Pre­sump­tive Listen­ing: stick­ing to fa­mil­iar con­cepts and miss­ing the outer rea­son­ing paths

Remmelt27 Dec 2022 15:40 UTC
3 points
0 comments1 min readEA link

Don’t ex­pect AGI any­time soon

cveres10 Oct 2022 22:38 UTC
0 points
19 comments1 min readEA link

How Josiah be­came an AI safety researcher

Neil Crawford29 Mar 2022 19:47 UTC
10 points
0 comments1 min readEA link

[Question] Should I force my­self to work on AGI al­ign­ment?

Isaac Benson24 Aug 2022 17:25 UTC
19 points
17 comments1 min readEA link

[Question] Do EA folks think that a path to zero AGI de­vel­op­ment is fea­si­ble or worth­while for safety from AI?

Noah Scales17 Jul 2022 8:47 UTC
8 points
3 comments1 min readEA link

Estab­lish­ing Oxford’s AI Safety Stu­dent Group: Les­sons Learnt and Our Model

Wilkin123421 Sep 2022 7:57 UTC
71 points
3 comments1 min readEA link

We Are Con­jec­ture, A New Align­ment Re­search Startup

Connor Leahy9 Apr 2022 15:07 UTC
31 points
0 comments1 min readEA link

AI Safety Career Bottlenecks Survey Responses

Linda Linsefors28 May 2021 10:41 UTC
34 points
1 comment5 min readEA link

Which of these ar­gu­ments for x-risk do you think we should test?

Wim9 Aug 2022 13:43 UTC
3 points
2 comments1 min readEA link

Win­ners of the AI Safety Nudge Competition

Marc Carauleanu15 Nov 2022 1:06 UTC
22 points
0 comments1 min readEA link

[Question] Mu­tual As­sured Destruc­tion used against AGI

L3opard8 Oct 2022 9:35 UTC
4 points
5 comments1 min readEA link

An­nounc­ing the SPT Model Web App for AI Governance

Paolo Bova4 Aug 2022 10:45 UTC
36 points
0 comments3 min readEA link

Join the in­ter­pretabil­ity re­search hackathon

Esben Kran28 Oct 2022 16:26 UTC
48 points
0 comments5 min readEA link

Don’t worry, be happy (liter­ally)

Yuri Zavorotny5 Oct 2022 1:55 UTC
0 points
1 comment2 min readEA link

[Question] Best in­tro­duc­tory overviews of AGI safety?

Jakub Kraus13 Dec 2022 19:04 UTC
21 points
8 comments2 min readEA link
(www.lesswrong.com)

“In­tro to brain-like-AGI safety” se­ries—halfway point!

Steven Byrnes9 Mar 2022 15:21 UTC
8 points
0 comments2 min readEA link

[Question] Track­ing Com­pute Stocks and Flows: Case Stud­ies?

Cullen5 Oct 2022 17:54 UTC
34 points
1 comment1 min readEA link

A stub­born un­be­liever fi­nally gets the depth of the AI al­ign­ment problem

aelwood13 Oct 2022 15:16 UTC
32 points
7 comments1 min readEA link

Distil­la­tion of “How Likely is De­cep­tive Align­ment?”

NickGabs1 Dec 2022 20:22 UTC
10 points
1 comment10 min readEA link

The prob­lem of ar­tifi­cial suffering

mlsbt24 Sep 2021 14:43 UTC
49 points
3 comments9 min readEA link

In­tro to AI Safety

Madhav Malhotra19 Oct 2022 23:45 UTC
4 points
0 comments1 min readEA link

An ex­per­i­ment elic­it­ing rel­a­tive es­ti­mates for Open Philan­thropy’s 2018 AI safety grants

NunoSempere12 Sep 2022 11:19 UTC
111 points
16 comments12 min readEA link

AI Twit­ter ac­counts to fol­low?

Adrian Salustri10 Jun 2022 6:19 UTC
1 point
2 comments1 min readEA link

Sys­temic Cas­cad­ing Risks: Rele­vance in Longter­mism & Value Lock-In

Richard Ren2 Sep 2022 7:53 UTC
52 points
10 comments16 min readEA link

What we owe the microbiome

TeddyW17 Dec 2022 16:17 UTC
17 points
2 comments1 min readEA link

Dis­cov­er­ing Lan­guage Model Be­hav­iors with Model-Writ­ten Evaluations

evhub20 Dec 2022 20:09 UTC
25 points
0 comments1 min readEA link

[Question] I’m in­ter­view­ing pro­lific AI safety re­searcher Richard Ngo (now at OpenAI and pre­vi­ously Deep­Mind). What should I ask him?

Robert_Wiblin29 Sep 2022 0:00 UTC
45 points
11 comments1 min readEA link

Mere ex­po­sure effect: Bias in Eval­u­at­ing AGI X-Risks

Remmelt27 Dec 2022 14:05 UTC
4 points
1 comment1 min readEA link

[Cause Ex­plo­ra­tion Prizes] Ex­pand­ing com­mu­ni­ca­tion about AGI risks

Ines22 Sep 2022 5:30 UTC
13 points
0 comments11 min readEA link

a ca­sual in­tro to AI doom and alignment

Tamsin Leake2 Nov 2022 9:42 UTC
8 points
2 comments1 min readEA link

Main paths to im­pact in EU AI Policy

JOMG_Monnet8 Dec 2022 16:17 UTC
69 points
2 comments8 min readEA link

GPT-3-like mod­els are now much eas­ier to ac­cess and de­ploy than to develop

Ben Cottier21 Dec 2022 13:49 UTC
22 points
3 comments19 min readEA link

AI Fore­cast­ing Re­search Ideas

Jaime Sevilla17 Nov 2022 17:37 UTC
71 points
1 comment1 min readEA link
(docs.google.com)

[Question] Why is “Ar­gu­ment Map­ping” Not More Com­mon in EA/​Ra­tion­al­ity (And What Ob­jec­tions Should I Ad­dress in a Post on the Topic?)

Harrison Durland23 Dec 2022 21:55 UTC
15 points
5 comments1 min readEA link

What I’m doing

Chris Leong19 Jul 2022 11:31 UTC
28 points
0 comments5 min readEA link

Why we need a new agency to reg­u­late ad­vanced ar­tifi­cial intelligence

Michael Huang4 Aug 2022 13:38 UTC
25 points
0 comments1 min readEA link
(www.brookings.edu)

AI Alter­na­tive Fu­tures: Ex­plo­ra­tory Sce­nario Map­ping for Ar­tifi­cial In­tel­li­gence Risk—Re­quest for Par­ti­ci­pa­tion [Linkpost]

Kiliank9 May 2022 19:53 UTC
17 points
2 comments8 min readEA link

Ap­ply to the Red­wood Re­search Mechanis­tic In­ter­pretabil­ity Ex­per­i­ment (REMIX), a re­search pro­gram in Berkeley

Max Nadeau27 Oct 2022 1:39 UTC
95 points
5 comments12 min readEA link

ChatGPT can write code! ?

Miguel10 Dec 2022 5:36 UTC
6 points
15 comments1 min readEA link
(www.whitehatstoic.com)

What are the risks of an or­a­cle AI?

Griffin Young5 Oct 2022 6:18 UTC
6 points
2 comments1 min readEA link

An en­tire cat­e­gory of risks is un­der­val­ued by EA [Sum­mary of pre­vi­ous fo­rum post]

Richard Ren5 Sep 2022 15:07 UTC
70 points
5 comments5 min readEA link

On the cor­re­spon­dence be­tween AI-mis­al­ign­ment and cog­ni­tive dis­so­nance us­ing a be­hav­ioral eco­nomics model

Stijn1 Nov 2022 9:15 UTC
11 points
0 comments6 min readEA link

Seek­ing Stu­dent Sub­mis­sions: Edit Your Source Code Contest

Aris Richardson26 Aug 2022 2:06 UTC
24 points
6 comments2 min readEA link

[Question] What Do AI Safety Pitches Not Get About Your Field?

Aris Richardson20 Sep 2022 18:13 UTC
70 points
18 comments1 min readEA link

Pod­cast: Tam­era Lan­ham on AI risk, threat mod­els, al­ign­ment pro­pos­als, ex­ter­nal­ized rea­son­ing over­sight, and work­ing at Anthropic

Akash20 Dec 2022 21:39 UTC
14 points
1 comment1 min readEA link

“The Physi­cists”: A play about ex­tinc­tion and the re­spon­si­bil­ity of scientists

Lara_TH29 Nov 2022 16:53 UTC
28 points
1 comment8 min readEA link

An ap­praisal of the Fu­ture of Life In­sti­tute AI ex­is­ten­tial risk program

PabloAMC11 Dec 2022 13:36 UTC
28 points
0 comments1 min readEA link

More Aca­demic Diver­sity in Align­ment?

ojorgensen27 Nov 2022 17:52 UTC
7 points
0 comments1 min readEA link

[An­nounce­ment] The Steven Aiberg Project

StevenAiberg19 Oct 2022 7:48 UTC
0 points
0 comments4 min readEA link

A Sur­vey of the Po­ten­tial Long-term Im­pacts of AI

Sam Clarke18 Jul 2022 9:48 UTC
63 points
2 comments27 min readEA link

Re­silience Via Frag­mented Power

steve632014 Jul 2022 15:37 UTC
2 points
0 comments6 min readEA link

Linkpost—Beyond Hyper­an­thro­po­mor­phism: Or, why fears of AI are not even wrong, and how to make them real

Locke24 Aug 2022 16:24 UTC
−4 points
3 comments2 min readEA link
(studio.ribbonfarm.com)

Maybe AI risk shouldn’t af­fect your life plan all that much

Justis22 Jul 2022 15:30 UTC
21 points
4 comments6 min readEA link

Fa­cil­i­ta­tor Help Wanted for Columbia EA AI Safety Groups

Berkan Ottlik5 Jul 2022 10:27 UTC
16 points
0 comments1 min readEA link

[Question] Is there a news-tracker about GPT-4? Why has ev­ery­thing be­come so silent about it?

Franziska Fischer29 Oct 2022 8:56 UTC
10 points
4 comments1 min readEA link

[Question] Fore­cast­ing thread: How does AI risk level vary based on timelines?

elifland14 Sep 2022 23:56 UTC
47 points
8 comments1 min readEA link

[Question] A dataset for AI/​su­per­in­tel­li­gence sto­ries and other me­dia?

Harrison Durland29 Mar 2022 21:41 UTC
20 points
2 comments1 min readEA link

Is Eric Sch­midt fund­ing AI ca­pa­bil­ities re­search by the US gov­ern­ment?

Pranay K24 Dec 2022 8:32 UTC
46 points
3 comments2 min readEA link
(www.politico.com)

[Creative Writ­ing Con­test] The Puppy Problem

Louis13 Oct 2021 14:01 UTC
13 points
0 comments7 min readEA link

CNAS re­port: ‘Ar­tifi­cial In­tel­li­gence and Arms Con­trol’

MMMaas13 Oct 2022 8:35 UTC
16 points
0 comments1 min readEA link
(www.cnas.org)

[CANCELLED] Ber­lin AI Align­ment Open Meetup Au­gust 2022

Isidor Regenfuß4 Aug 2022 13:34 UTC
0 points
0 comments1 min readEA link

Un­con­trol­lable AI as an Ex­is­ten­tial Risk

Karl von Wendt9 Oct 2022 10:37 UTC
28 points
0 comments1 min readEA link

Sha­har Avin on How to Strate­gi­cally Reg­u­late Ad­vanced AI Systems

Michaël Trazzi23 Sep 2022 15:49 UTC
48 points
2 comments5 min readEA link
(theinsideview.ai)

7 Learn­ings and a De­tailed De­scrip­tion of an AI Safety Read­ing Group

nell23 Sep 2022 2:02 UTC
20 points
5 comments9 min readEA link

AGI Timelines in Gover­nance: Differ­ent Strate­gies for Differ­ent Timeframes

simeon_c19 Dec 2022 21:31 UTC
110 points
19 comments1 min readEA link

De­liber­ate prac­tice for re­search?

Alex_Altair8 Oct 2022 3:45 UTC
19 points
4 comments1 min readEA link

Which AI Safety Org to Join?

Yonatan Cale11 Oct 2022 19:42 UTC
17 points
21 comments1 min readEA link

Hu­mans aren’t fit­ness maximizers

So8res4 Oct 2022 1:32 UTC
30 points
2 comments5 min readEA link

A challenge for AGI or­ga­ni­za­tions, and a challenge for readers

RobBensinger1 Dec 2022 23:11 UTC
168 points
13 comments1 min readEA link

My ar­gu­ment against AGI

cveres12 Oct 2022 6:32 UTC
2 points
29 comments3 min readEA link

Im­pli­ca­tions of large lan­guage model diffu­sion for AI governance

Ben Cottier21 Dec 2022 13:50 UTC
14 points
0 comments38 min readEA link

Should AI fo­cus on prob­lem-solv­ing or strate­gic plan­ning? Why not both?

oliver_siegel1 Nov 2022 9:53 UTC
1 point
0 comments1 min readEA link

Op­ti­mism, AI risk, and EA blind spots

Justis28 Sep 2022 17:21 UTC
87 points
22 comments8 min readEA link

Reflec­tions on my 5-month AI al­ign­ment up­skil­ling grant

Jay Bailey28 Dec 2022 7:23 UTC
110 points
5 comments8 min readEA link
(www.lesswrong.com)

AI can ex­ploit safety plans posted on the Internet

Peter S. Park4 Dec 2022 12:17 UTC
4 points
3 comments1 min readEA link

New co­op­er­a­tion mechanism—quadratic fund­ing with­out a match­ing pool

Filip Sondej5 Jun 2022 13:55 UTC
53 points
7 comments5 min readEA link

More to ex­plore on ‘Risks from Ar­tifi­cial In­tel­li­gence’

EA Handbook15 Jul 2022 23:00 UTC
4 points
0 comments2 min readEA link

Where I cur­rently dis­agree with Ryan Green­blatt’s ver­sion of the ELK approach

So8res29 Sep 2022 21:19 UTC
21 points
0 comments5 min readEA link

Back­ground for “Un­der­stand­ing the diffu­sion of large lan­guage mod­els”

Ben Cottier21 Dec 2022 13:49 UTC
12 points
0 comments23 min readEA link

Mechanism De­sign for AI Safety—Read­ing Group Curriculum

Rubi J. Hudson25 Oct 2022 3:54 UTC
24 points
1 comment3 min readEA link

Ap­ply to at­tend an AI safety work­shop in Berkeley (Nov 18-21)

Akash6 Nov 2022 18:06 UTC
19 points
0 comments1 min readEA link

Call to ac­tion: Read + Share AI Safety /​ Re­in­force­ment Learn­ing Fea­tured in Conversation

Justin Olive24 Oct 2022 1:13 UTC
3 points
0 comments1 min readEA link

Prizes for ML Safety Bench­mark Ideas

Joshc28 Oct 2022 2:44 UTC
58 points
8 comments1 min readEA link

Effec­tive En­force­abil­ity of EU Com­pe­ti­tion Law Un­der Differ­ent AI Devel­op­ment Sce­nar­ios: A Frame­work for Le­gal Analysis

HaydnBelfield19 Aug 2022 17:20 UTC
11 points
0 comments6 min readEA link
(verfassungsblog.de)

How could we know that an AGI sys­tem will have good con­se­quences?

So8res7 Nov 2022 22:42 UTC
25 points
0 comments1 min readEA link

How im­por­tant are ac­cu­rate AI timelines for the op­ti­mal spend­ing sched­ule on AI risk in­ter­ven­tions?

Tristan Cook16 Dec 2022 16:05 UTC
30 points
0 comments6 min readEA link

“Cot­ton Gin” AI Risk

42317524 Sep 2022 23:04 UTC
6 points
2 comments1 min readEA link

A note about differ­en­tial tech­nolog­i­cal development

So8res24 Jul 2022 23:41 UTC
58 points
8 comments5 min readEA link

Ber­lin AI Align­ment Open Meetup Septem­ber 2022

Isidor Regenfuß21 Sep 2022 15:09 UTC
2 points
0 comments1 min readEA link

Georgetown EA Fall 2022 "Intro to AI" Reading Group

Daniel H8 Oct 2022 1:44 UTC
3 points
0 comments1 min readEA link
(docs.google.com)

[Question] EA’s Achieve­ments in 2022

ElliotJDavies14 Dec 2022 14:33 UTC
98 points
11 comments1 min readEA link

Is GPT3 a Good Ra­tion­al­ist? - In­struc­tGPT3 [2/​2]

simeon_c7 Apr 2022 13:54 UTC
25 points
0 comments8 min readEA link

Com­pute & An­titrust: Reg­u­la­tory im­pli­ca­tions of the AI hard­ware sup­ply chain, from chip de­sign to cloud APIs

HaydnBelfield19 Aug 2022 17:20 UTC
32 points
0 comments6 min readEA link
(verfassungsblog.de)

NeurIPS ML Safety Work­shop 2022

Dan H26 Jul 2022 15:33 UTC
72 points
0 comments1 min readEA link
(neurips2022.mlsafety.org)

An­nounc­ing an Em­piri­cal AI Safety Program

Joshc13 Sep 2022 21:39 UTC
64 points
7 comments2 min readEA link

Pile of Law and Law-Fol­low­ing AI

Cullen13 Jul 2022 0:29 UTC
28 points
2 comments3 min readEA link

Brain­storm of things that could force an AI team to burn their lead

So8res25 Jul 2022 0:00 UTC
26 points
1 comment12 min readEA link

How ‘Hu­man-Hu­man’ dy­nam­ics give way to ‘Hu­man-AI’ and then ‘AI-AI’ dynamics

Remmelt27 Dec 2022 3:16 UTC
4 points
0 comments1 min readEA link

We Ran an AI Timelines Retreat

Lenny McCline17 May 2022 4:40 UTC
46 points
6 comments3 min readEA link

[Question] What are some cur­rent, already pre­sent challenges from AI?

nonzerosum30 Jun 2022 15:44 UTC
5 points
1 comment1 min readEA link

7 traps that (we think) new al­ign­ment re­searchers of­ten fall into

Akash27 Sep 2022 23:13 UTC
72 points
13 comments1 min readEA link

Join ASAP (AI Safety Ac­countabil­ity Pro­gramme) 🚀

TheMcDouglas10 Sep 2022 11:15 UTC
54 points
20 comments3 min readEA link

Could re­al­is­tic de­pic­tions of catas­trophic AI risks effec­tively re­duce said risks?

Matthew Barber17 Aug 2022 20:01 UTC
26 points
11 comments2 min readEA link

[DISC] Are Values Ro­bust?

𝕮𝖎𝖓𝖊𝖗𝖆21 Dec 2022 1:13 UTC
4 points
0 comments1 min readEA link

Con­sider try­ing Vivek Heb­bar’s al­ign­ment exercises

Akash24 Oct 2022 19:46 UTC
16 points
0 comments1 min readEA link

Credo AI is hiring for sev­eral roles

IanEisenberg11 Apr 2022 15:58 UTC
14 points
2 comments1 min readEA link

Share your re­quests for ChatGPT

Kate Tran5 Dec 2022 18:43 UTC
8 points
5 comments1 min readEA link

What does it take to defend the world against out-of-con­trol AGIs?

Steven Byrnes25 Oct 2022 14:47 UTC
43 points
1 comment1 min readEA link

Oren’s Field Guide of Bad AGI Outcomes

Oren Montano26 Sep 2022 8:59 UTC
1 point
0 comments1 min readEA link

An­nounc­ing AI Align­ment Awards: $100k re­search con­tests about goal mis­gen­er­al­iza­tion & corrigibility

Akash22 Nov 2022 22:19 UTC
60 points
1 comment1 min readEA link

Markus An­der­ljung On The AI Policy Landscape

Michaël Trazzi9 Sep 2022 17:27 UTC
14 points
0 comments2 min readEA link
(theinsideview.ai)

Cryp­tocur­rency Ex­ploits Show the Im­por­tance of Proac­tive Poli­cies for AI X-Risk

eSpencer16 Sep 2022 4:44 UTC
14 points
0 comments3 min readEA link

EAG DC: Meta-Bot­tle­necks in Prevent­ing AI Doom

Joseph Bloom30 Sep 2022 17:53 UTC
5 points
0 comments7 min readEA link

The Vi­talik Bu­terin Fel­low­ship in AI Ex­is­ten­tial Safety is open for ap­pli­ca­tions!

Cynthia Chen14 Oct 2022 3:23 UTC
37 points
0 comments2 min readEA link

Ex­plo­ra­tory sur­vey on psy­chol­ogy of AI risk perception

Daniel_Friedrich2 Aug 2022 20:34 UTC
1 point
0 comments1 min readEA link
(forms.gle)

Clar­ifi­ca­tions about struc­tural risk from AI

Sam Clarke18 Jan 2022 12:57 UTC
31 points
3 comments4 min readEA link

What does it mean for an AGI to be ‘safe’?

So8res7 Oct 2022 4:43 UTC
53 points
21 comments1 min readEA link

Posit: Most AI safety peo­ple should work on al­ign­ment/​safety challenges for AI tools that already have users (Stable Diffu­sion, GPT)

nonzerosum20 Dec 2022 17:23 UTC
12 points
3 comments1 min readEA link

Why I’m Scep­ti­cal of Foom

𝕮𝖎𝖓𝖊𝖗𝖆8 Dec 2022 10:01 UTC
21 points
7 comments1 min readEA link

Join the AI Test­ing Hackathon this Friday

Esben Kran12 Dec 2022 14:24 UTC
33 points
0 comments8 min readEA link
(alignmentjam.com)

Es­ti­mat­ing the Cur­rent and Fu­ture Num­ber of AI Safety Researchers

Stephen McAleese28 Sep 2022 20:58 UTC
61 points
29 comments9 min readEA link

AI al­ign­ment with hu­mans… but with which hu­mans?

Geoffrey Miller8 Sep 2022 23:43 UTC
45 points
21 comments3 min readEA link

AI Timelines via Cu­mu­la­tive Op­ti­miza­tion Power: Less Long, More Short

Jake Cannell6 Oct 2022 7:06 UTC
27 points
0 comments1 min readEA link

AGI will ar­rive by the end of this decade ei­ther as a uni­corn or as a black swan

Yuri Barzov21 Oct 2022 10:50 UTC
−4 points
7 comments3 min readEA link

All AGI Safety ques­tions wel­come (es­pe­cially ba­sic ones) [~monthly thread]

robertskmiles1 Nov 2022 23:21 UTC
75 points
94 comments1 min readEA link

AISER—AIS Europe Retreat

Carolin Basilowski23 Dec 2022 18:11 UTC
5 points
0 comments1 min readEA link

Values and control

dotsam4 Aug 2022 18:28 UTC
3 points
1 comment1 min readEA link

Ajeya’s TAI timeline short­ened from 2050 to 2040

Zach Stein-Perlman3 Aug 2022 0:00 UTC
59 points
2 comments1 min readEA link
(www.lesswrong.com)

(My sug­ges­tions) On Begin­ner Steps in AI Alignment

Joseph Bloom22 Sep 2022 15:32 UTC
34 points
3 comments9 min readEA link

AI Risk In­tro 2: Solv­ing The Problem

LRudL24 Sep 2022 9:33 UTC
11 points
0 comments28 min readEA link
(www.perfectlynormal.co.uk)

AI Safety re­searcher ca­reer review

Benjamin_Todd23 Nov 2021 0:00 UTC
12 points
0 comments6 min readEA link
(80000hours.org)

The Cred­i­bil­ity of Apoca­lyp­tic Claims: A Cri­tique of Techno-Fu­tur­ism within Ex­is­ten­tial Risk

Ember16 Aug 2022 19:48 UTC
24 points
35 comments17 min readEA link

Katja Grace on Slow­ing Down AI, AI Ex­pert Sur­veys And Es­ti­mat­ing AI Risk

Michaël Trazzi16 Sep 2022 18:00 UTC
40 points
6 comments4 min readEA link
(theinsideview.ai)

[Question] Clos­ing the Feed­back Loop on AI Safety Re­search.

Ben.Hartley29 Jul 2022 21:46 UTC
3 points
4 comments1 min readEA link

When can a mimic sur­prise you? Why gen­er­a­tive mod­els han­dle seem­ingly ill-posed problems

David Johnston6 Nov 2022 11:46 UTC
6 points
0 comments1 min readEA link

An­nounc­ing the AI Safety Field Build­ing Hub, a new effort to provide AISFB pro­jects, men­tor­ship, and funding

Vael Gates28 Jul 2022 21:29 UTC
126 points
6 comments6 min readEA link

[Question] What are peo­ple’s thoughts on work­ing for Deep­Mind as a gen­eral soft­ware en­g­ineer?

Max Pietsch23 Sep 2022 17:13 UTC
9 points
4 comments1 min readEA link

Nine Points of Col­lec­tive Insanity

Remmelt27 Dec 2022 3:14 UTC
1 point
0 comments1 min readEA link

What could an AI-caused ex­is­ten­tial catas­tro­phe ac­tu­ally look like?

Benjamin Hilton12 Sep 2022 16:25 UTC
49 points
7 comments9 min readEA link
(80000hours.org)

When to di­ver­sify? Break­ing down mis­sion-cor­re­lated investing

jh29 Nov 2022 11:18 UTC
33 points
2 comments8 min readEA link

fully al­igned sin­gle­ton as a solu­tion to everything

Tamsin Leake12 Nov 2022 18:19 UTC
9 points
0 comments1 min readEA link

Chris Olah on what the hell is go­ing on in­side neu­ral networks

80000_Hours4 Aug 2021 15:13 UTC
4 points
0 comments135 min readEA link

4 Key As­sump­tions in AI Safety

Prometheus7 Nov 2022 10:50 UTC
5 points
0 comments1 min readEA link

Fol­low along with Columbia EA’s Ad­vanced AI Safety Fel­low­ship!

RohanS2 Jul 2022 6:07 UTC
27 points
0 comments2 min readEA link

The re­li­gion prob­lem in AI alignment

Geoffrey Miller16 Sep 2022 1:24 UTC
47 points
27 comments11 min readEA link

[Question] AI Risk Micro­dy­nam­ics Survey

Froolow9 Oct 2022 20:00 UTC
7 points
1 comment1 min readEA link

What’s so dan­ger­ous about AI any­way? – Or: What it means to be a superintelligence

Thomas Kehrenberg18 Jul 2022 16:14 UTC
9 points
2 comments11 min readEA link

[Question] What kind of or­ga­ni­za­tion should be the first to de­velop AGI in a po­ten­tial arms race?

BrownHairedEevee17 Jul 2022 17:41 UTC
10 points
2 comments1 min readEA link

[Question] Recom­men­da­tions for non-tech­ni­cal books on AI?

Joseph Lemien12 Jul 2022 23:23 UTC
8 points
10 comments1 min readEA link

[Question] Benev­olen­tAI—an effec­tively im­pact­ful com­pany?

Jack Hilton11 Oct 2022 14:35 UTC
16 points
11 comments1 min readEA link

[Question] Up­dates on FLI’S Value Align­ment Map?

rodeo_flagellum19 Sep 2022 0:25 UTC
8 points
0 comments2 min readEA link

En­cul­tured AI, Part 2: Pro­vid­ing a Service

Andrew Critch11 Aug 2022 20:13 UTC
10 points
0 comments3 min readEA link

Safety of Self-Assem­bled Neu­ro­mor­phic Hardware

Can Rager26 Dec 2022 19:10 UTC
8 points
1 comment10 min readEA link

Band­wagon effect: Bias in Eval­u­at­ing AGI X-Risks

Remmelt28 Dec 2022 7:54 UTC
4 points
0 comments1 min readEA link

Gen­eral ad­vice for tran­si­tion­ing into The­o­ret­i­cal AI Safety

Martín Soto15 Sep 2022 5:23 UTC
25 points
0 comments10 min readEA link

In­stead of tech­ni­cal re­search, more peo­ple should fo­cus on buy­ing time

Akash5 Nov 2022 20:43 UTC
107 points
32 comments1 min readEA link

Who owns AI-gen­er­ated con­tent?

Johan S Daniel7 Dec 2022 3:03 UTC
−2 points
0 comments2 min readEA link

The limited up­side of interpretability

Peter S. Park15 Nov 2022 20:22 UTC
23 points
3 comments10 min readEA link

(Linkpost) Wired Magaz­ine prints mis­in­for­ma­tion about AI safety

trevor11 Dec 2022 3:29 UTC
−5 points
3 comments4 min readEA link
(www.wired.com)

AI Align­ment is in­tractable (and we hu­mans should stop work­ing on it)

GPT 328 Jul 2022 20:02 UTC
1 point
1 comment1 min readEA link

Con­tra shard the­ory, in the con­text of the di­a­mond max­i­mizer problem

So8res13 Oct 2022 23:51 UTC
27 points
0 comments1 min readEA link

Ac­tion­able-guidance and roadmap recom­men­da­tions for the NIST AI Risk Man­age­ment Framework

Tony Barrett17 May 2022 15:27 UTC
11 points
0 comments3 min readEA link

Re­place­ment for PONR concept

kokotajlod2 Sep 2022 0:38 UTC
14 points
1 comment3 min readEA link

The repli­ca­tion and em­u­la­tion of GPT-3

Ben Cottier21 Dec 2022 13:49 UTC
14 points
0 comments33 min readEA link

Cause Area: Differ­en­tial Neu­rotech­nol­ogy Development

mwcvitkovic10 Aug 2022 2:39 UTC
88 points
7 comments36 min readEA link

When re­port­ing AI timelines, be clear who you’re defer­ring to

Sam Clarke10 Oct 2022 14:24 UTC
120 points
23 comments1 min readEA link

Pro­mot­ing com­pas­sion­ate longtermism

jonleighton7 Dec 2022 14:26 UTC
115 points
5 comments12 min readEA link

What are some low-cost out­side-the-box ways to do/​fund al­ign­ment re­search?

trevor111 Nov 2022 5:57 UTC
2 points
3 comments1 min readEA link

How to do the­o­ret­i­cal re­search, a per­sonal perspective

Mark Xu19 Aug 2022 19:43 UTC
132 points
7 comments15 min readEA link

There have been 3 planes (billion­aire donors) and 2 have crashed

trevor117 Dec 2022 3:38 UTC
4 points
5 comments2 min readEA link

Seek­ing so­cial sci­ence stu­dents /​ col­lab­o­ra­tors in­ter­ested in AI ex­is­ten­tial risks

Vael Gates24 Sep 2021 21:56 UTC
58 points
7 comments3 min readEA link

Is in­ter­est in al­ign­ment worth men­tion­ing for grad school ap­pli­ca­tions?

Franziska Fischer16 Oct 2022 4:50 UTC
5 points
5 comments1 min readEA link

The US ex­pands re­stric­tions on AI ex­ports to China. What are the x-risk effects?

Stephen Clare14 Oct 2022 18:17 UTC
154 points
17 comments4 min readEA link

[Question] What is the best ar­ti­cle to in­tro­duce some­one to AI safety for the first time?

trevor122 Nov 2022 2:06 UTC
2 points
3 comments1 min readEA link

[Question] What should I ask Ajeya Co­tra — se­nior re­searcher at Open Philan­thropy, and ex­pert on AI timelines and safety challenges?

Robert_Wiblin28 Oct 2022 15:28 UTC
23 points
10 comments1 min readEA link

List of AI safety courses and resources

Daniel del Castillo6 Sep 2021 14:26 UTC
50 points
7 comments1 min readEA link

Distil­la­tion of The Offense-Defense Balance of Scien­tific Knowledge

Arjun Yadav12 Aug 2022 7:01 UTC
17 points
0 comments3 min readEA link

Why The Fo­cus on Ex­pected Utility Max­imisers?

𝕮𝖎𝖓𝖊𝖗𝖆27 Dec 2022 15:51 UTC
11 points
1 comment1 min readEA link

Publi­ca­tion de­ci­sions for large lan­guage mod­els, and their impacts

Ben Cottier21 Dec 2022 13:50 UTC
14 points
0 comments16 min readEA link

Safety with­out op­pres­sion: an AI gov­er­nance problem

Nathan_Barnard28 Jul 2022 10:19 UTC
3 points
0 comments8 min readEA link

Su­per­in­tel­li­gent AI is nec­es­sary for an amaz­ing fu­ture, but far from sufficient

So8res31 Oct 2022 21:16 UTC
35 points
5 comments1 min readEA link

A strange twist on the road to AGI

cveres12 Oct 2022 23:27 UTC
3 points
0 comments1 min readEA link

The His­tory, Episte­mol­ogy and Strat­egy of Tech­nolog­i­cal Res­traint, and les­sons for AI (short es­say)

MMMaas10 Aug 2022 11:00 UTC
75 points
3 comments9 min readEA link
(verfassungsblog.de)

New AI risk in­tro from Vox [link post]

Jakub Kraus21 Dec 2022 5:50 UTC
7 points
1 comment2 min readEA link
(www.vox.com)

Im­proved Se­cu­rity to Prevent Hacker-AI and Digi­tal Ghosts

Erland Wittkotter21 Oct 2022 10:11 UTC
1 point
0 comments1 min readEA link

Credo AI is hiring!

IanEisenberg3 Mar 2022 18:02 UTC
16 points
6 comments4 min readEA link

In­tro­duc­ing spirit hazards

brb24327 May 2022 22:16 UTC
9 points
2 comments2 min readEA link

How Do AI Timelines Affect Ex­is­ten­tial Risk?

Stephen McAleese29 Aug 2022 17:10 UTC
2 points
0 comments23 min readEA link
(www.lesswrong.com)

Con­jec­ture: In­ter­nal In­fo­haz­ard Policy

Connor Leahy29 Jul 2022 19:35 UTC
34 points
3 comments18 min readEA link

AI Safety For Dum­mies (Like Me)

Madhav Malhotra24 Aug 2022 20:26 UTC
22 points
6 comments20 min readEA link

Pos­si­ble miracles

Akash9 Oct 2022 18:17 UTC
38 points
2 comments1 min readEA link

What AI Safety Ma­te­ri­als Do ML Re­searchers Find Com­pel­ling?

Vael Gates28 Dec 2022 2:03 UTC
129 points
12 comments1 min readEA link

Searle vs Bostrom: cru­cial con­sid­er­a­tions for EA AI work?

Forumite13 Jul 2022 10:18 UTC
11 points
2 comments1 min readEA link

[Question] Does the idea of AGI that benev­olently con­trol us ap­peal to EA folks?

Noah Scales16 Jul 2022 19:17 UTC
6 points
20 comments1 min readEA link

Four ques­tions I ask AI safety researchers

Akash17 Jul 2022 17:25 UTC
30 points
3 comments1 min readEA link

Long-term AI policy strat­egy re­search and implementation

Benjamin_Todd9 Nov 2021 0:00 UTC
1 point
0 comments7 min readEA link
(80000hours.org)

Alex Lawsen On Fore­cast­ing AI Progress

Michaël Trazzi6 Sep 2022 9:53 UTC
38 points
1 comment2 min readEA link
(theinsideview.ai)

An­nounc­ing the Cam­bridge Bos­ton Align­ment Ini­ti­a­tive [Hiring!]

kuhanj2 Dec 2022 1:07 UTC
83 points
0 comments1 min readEA link

“Origi­nal­ity is noth­ing but ju­di­cious imi­ta­tion”—Voltaire

Damien Lasseur23 Oct 2022 19:00 UTC
1 point
0 comments1 min readEA link

The her­i­ta­bil­ity of hu­man val­ues: A be­hav­ior ge­netic cri­tique of Shard Theory

Geoffrey Miller20 Oct 2022 15:53 UTC
48 points
12 comments21 min readEA link

My (naive) take on Risks from Learned Optimization

Artyom K6 Nov 2022 16:25 UTC
5 points
0 comments1 min readEA link

EA & LW Fo­rums Weekly Sum­mary (5 − 11 Sep 22’)

Zoe Williams12 Sep 2022 23:21 UTC
36 points
0 comments14 min readEA link

CSER is hiring for a se­nior re­search as­so­ci­ate on longterm AI risk and governance

Sam Clarke24 Jan 2022 13:24 UTC
9 points
4 comments1 min readEA link

Good Fu­tures Ini­ti­a­tive: Win­ter Pro­ject In­tern­ship

Aris Richardson27 Nov 2022 23:27 UTC
67 points
7 comments4 min readEA link

An­nounc­ing the Har­vard AI Safety Team

Xander Davies30 Jun 2022 18:34 UTC
128 points
4 comments5 min readEA link

Power-Seek­ing AI and Ex­is­ten­tial Risk

antoniofrancaib11 Oct 2022 21:47 UTC
10 points
0 comments1 min readEA link

[Question] AI risks: the most con­vinc­ing ar­gu­ment

Eleni_A6 Aug 2022 20:26 UTC
7 points
2 comments1 min readEA link

Three sce­nar­ios of pseudo-al­ign­ment

Eleni_A5 Sep 2022 20:26 UTC
7 points
0 comments3 min readEA link

Why mechanis­tic in­ter­pretabil­ity does not and can­not con­tribute to long-term AGI safety (from mes­sages with a friend)

Remmelt19 Dec 2022 12:02 UTC
17 points
3 comments1 min readEA link

Any fur­ther work on AI Safety Suc­cess Sto­ries?

Krieger2 Oct 2022 11:59 UTC
2 points
0 comments1 min readEA link

Sum­mary of “Tech­nol­ogy Favours Tyranny” by Yu­val Noah Harari

Madhav Malhotra26 Oct 2022 21:37 UTC
33 points
2 comments2 min readEA link

AI Safety groups should imi­tate ca­reer de­vel­op­ment clubs

Joshc9 Nov 2022 23:48 UTC
90 points
5 comments2 min readEA link

Co­op­er­a­tion, Avoidance, and In­differ­ence: Alter­nate Fu­tures for Misal­igned AGI

Kiel Brennan-Marquez10 Dec 2022 20:32 UTC
4 points
1 comment18 min readEA link

Tony Blair In­sti­tute—Com­pute for AI In­dex ( Seek­ing a Sup­plier)

TomWestgarth3 Oct 2022 10:25 UTC
28 points
8 comments1 min readEA link

Math­e­mat­i­cal Cir­cuits in Neu­ral Networks

Sean Osier22 Sep 2022 2:32 UTC
23 points
2 comments1 min readEA link
(www.youtube.com)

New re­port on how much com­pu­ta­tional power it takes to match the hu­man brain (Open Philan­thropy)

Aaron Gertler15 Sep 2020 1:06 UTC
41 points
1 comment18 min readEA link
(www.openphilanthropy.org)

Google could build a con­scious AI in three months

Derek Shiller1 Oct 2022 13:24 UTC
14 points
17 comments7 min readEA link

Tech­ni­cal AI safety in the United Arab Emirates

ea nyuad21 Jun 2022 3:11 UTC
10 points
0 comments11 min readEA link

[Question] Why does AGI oc­cur al­most nowhere, not even just as a re­mark for eco­nomic/​poli­ti­cal mod­els?

Franziska Fischer2 Oct 2022 14:43 UTC
52 points
17 comments1 min readEA link

[Question] Please Share Your Per­spec­tives on the De­gree of So­cietal Im­pact from Trans­for­ma­tive AI Outcomes

Kiliank15 Apr 2022 1:23 UTC
3 points
3 comments1 min readEA link

AI Safety Ideas: A col­lab­o­ra­tive AI safety re­search platform

Apart Research17 Oct 2022 17:01 UTC
67 points
13 comments4 min readEA link

Take­aways from a sur­vey on AI al­ign­ment resources

DanielFilan5 Nov 2022 23:45 UTC
18 points
9 comments6 min readEA link
(www.lesswrong.com)

Analysing a 2036 Takeover Scenario

ukc100146 Oct 2022 20:48 UTC
4 points
1 comment1 min readEA link

Newslet­ter for Align­ment Re­search: The ML Safety Updates

Esben Kran22 Oct 2022 16:17 UTC
30 points
0 comments7 min readEA link

Hacker-AI – Does it already ex­ist?

Erland Wittkotter7 Nov 2022 14:01 UTC
0 points
1 comment1 min readEA link

Cog­ni­tive sci­ence and failed AI fore­casts

Eleni_A18 Nov 2022 14:25 UTC
13 points
0 comments2 min readEA link

Crypto ‘or­a­cle pro­to­cols’ for AI al­ign­ment with real-world data?

Geoffrey Miller22 Sep 2022 23:05 UTC
9 points
5 comments1 min readEA link

Me­tac­u­lus is build­ing a team ded­i­cated to AI forecasting

christian18 Oct 2022 16:08 UTC
35 points
0 comments1 min readEA link
(apply.workable.com)

Con­clu­sion and Bibliog­ra­phy for “Un­der­stand­ing the diffu­sion of large lan­guage mod­els”

Ben Cottier21 Dec 2022 13:50 UTC
12 points
0 comments11 min readEA link

[Question] Who would you have on your dream team for solv­ing AGI Align­ment?

Greg_Colbourn25 Aug 2022 13:34 UTC
10 points
14 comments1 min readEA link

Beg­ging, Plead­ing AI Orgs to Com­ment on NIST AI Risk Man­age­ment Framework

Bridges15 Apr 2022 19:35 UTC
87 points
3 comments2 min readEA link

Fore­cast­ing Through Fiction

Yitz6 Jul 2022 5:23 UTC
8 points
3 comments6 min readEA link
(www.lesswrong.com)

As­sis­tant-pro­fes­sor-ranked AI ethics philoso­pher job op­por­tu­nity at Can­ter­bury Univer­sity, New Zealand

ben.smith16 Oct 2022 17:56 UTC
27 points
0 comments1 min readEA link
(www.linkedin.com)

[Question] How much should you op­ti­mize for the short-timelines sce­nario?

SoerenMind26 Jul 2022 15:51 UTC
39 points
2 comments1 min readEA link

The Wind­fall Clause has a reme­dies problem

John Bridge23 May 2022 10:31 UTC
40 points
0 comments20 min readEA link

Longter­mists Should Work on AI—There is No “AI Neu­tral” Sce­nario

simeon_c7 Aug 2022 16:43 UTC
42 points
62 comments6 min readEA link

How would you es­ti­mate the value of de­lay­ing AGI by 1 day, in marginal GiveWell dona­tions?

AnonymousAccount16 Dec 2022 9:25 UTC
28 points
19 comments2 min readEA link

Where are the red lines for AI?

Karl von Wendt5 Aug 2022 9:41 UTC
13 points
3 comments6 min readEA link

Meta AI an­nounces Cicero: Hu­man-Level Di­plo­macy play (with di­alogue)

Jacy22 Nov 2022 16:50 UTC
49 points
10 comments1 min readEA link

[Question] Do AI com­pa­nies make their safety re­searchers sign a non-dis­par­age­ment clause?

Ofer5 Sep 2022 13:40 UTC
70 points
4 comments1 min readEA link

Ques­tions for fur­ther in­ves­ti­ga­tion of AI diffusion

Ben Cottier21 Dec 2022 13:50 UTC
28 points
0 comments11 min readEA link

Com­po­nents of Strate­gic Clar­ity [Strate­gic Per­spec­tives on Long-term AI Gover­nance, #2]

MMMaas2 Jul 2022 11:22 UTC
63 points
0 comments5 min readEA link

An Ex­tremely Opinionated An­no­tated List of My Favourite Mechanis­tic In­ter­pretabil­ity Papers

Neel Nanda18 Oct 2022 21:23 UTC
18 points
0 comments12 min readEA link
(www.neelnanda.io)

AI Safety Needs Great Product Builders

goodgravy2 Nov 2022 11:33 UTC
45 points
1 comment6 min readEA link

How to Catch a ChatGPT Cheat: 7 Prac­ti­cal Tips

Marshall27 Dec 2022 16:09 UTC
8 points
2 comments4 min readEA link

[Question] Benefits/​Risks of Scott Aaron­son’s Ortho­dox/​Re­form Fram­ing for AI Alignment

Jeremy21 Nov 2022 17:47 UTC
15 points
5 comments1 min readEA link
(scottaaronson.blog)

The great en­ergy de­scent (short ver­sion) - An im­por­tant thing EA might have missed

Corentin Biteau31 Aug 2022 21:50 UTC
59 points
88 comments10 min readEA link

Clas­sify­ing sources of AI x-risk

Sam Clarke8 Aug 2022 18:18 UTC
38 points
6 comments3 min readEA link

How I Came To Longter­mism On My Own & An Out­sider Per­spec­tive On EA Longtermism

Jordan Arel7 Aug 2022 2:42 UTC
34 points
2 comments20 min readEA link

A Cri­tique of AI Takeover Scenarios

Fods1231 Aug 2022 13:49 UTC
44 points
4 comments12 min readEA link

A mod­est case for hope

xavier rg17 Oct 2022 6:03 UTC
28 points
0 comments1 min readEA link

[Question] Ques­tions on databases of AI Risk estimates

Froolow2 Oct 2022 9:12 UTC
24 points
12 comments2 min readEA link

Ex­plore Risks from Emerg­ing Tech­nol­ogy with Peers Out­side of (or New to) the AI Align­ment Com­mu­nity—Ex­press In­ter­est by Au­gust 8

Fasori17 Jul 2022 20:59 UTC
3 points
0 comments2 min readEA link

Who will be in charge once al­ign­ment is achieved?

trurl16 Dec 2022 16:53 UTC
8 points
2 comments1 min readEA link

The ‘Old AI’: Les­sons for AI gov­er­nance from early elec­tric­ity regulation

Sam Clarke19 Dec 2022 2:46 UTC
58 points
1 comment13 min readEA link

[Question] Book recom­men­da­tions for the his­tory of ML?

Eleni_A28 Dec 2022 23:45 UTC
10 points
4 comments1 min readEA link

The het­ero­gene­ity of hu­man value types: Im­pli­ca­tions for AI alignment

Geoffrey Miller16 Sep 2022 21:21 UTC
21 points
2 comments10 min readEA link

WFW?: Op­por­tu­nity and The­ory of Impact

DavidCorfield2 Nov 2022 0:45 UTC
2 points
5 comments14 min readEA link
(www.whatfuture.world)

Some­thing to make my­self fas­ci­nated with com­put­ing sci­ence and AI.

Eduardo7 Dec 2022 2:12 UTC
3 points
5 comments1 min readEA link

Distri­bu­tion Shifts and The Im­por­tance of AI Safety

Leon_Lang29 Sep 2022 22:38 UTC
7 points
0 comments1 min readEA link

“AI pre­dic­tions” (Fu­ture Fund AI Wor­ld­view Prize sub­mis­sion)

ketanrama5 Nov 2022 17:51 UTC
3 points
0 comments3 min readEA link
(medium.com)

Why some peo­ple be­lieve in AGI, but I don’t.

cveres26 Oct 2022 3:09 UTC
13 points
2 comments4 min readEA link

EA’s brain-over-body bias, and the em­bod­ied value prob­lem in AI al­ign­ment

Geoffrey Miller21 Sep 2022 18:55 UTC
45 points
3 comments25 min readEA link

Align­ment 201 curriculum

richard_ngo12 Oct 2022 19:17 UTC
94 points
9 comments1 min readEA link

Why we’re not found­ing a hu­man-data-for-al­ign­ment org

LRudL27 Sep 2022 20:14 UTC
149 points
7 comments29 min readEA link

AI Safety Endgame Stories

IvanVendrov28 Sep 2022 17:12 UTC
31 points
1 comment1 min readEA link

There are two fac­tions work­ing to pre­vent AI dan­gers. Here’s why they’re deeply di­vided.

Sharmake10 Aug 2022 19:52 UTC
9 points
0 comments4 min readEA link
(www.vox.com)

An­nual AGI Bench­mark­ing Event

Metaculus26 Aug 2022 21:31 UTC
20 points
2 comments2 min readEA link
(www.metaculus.com)

Ap­ply for men­tor­ship in AI Safety field-building

Akash17 Sep 2022 19:03 UTC
21 points
0 comments1 min readEA link