AGI risk: analogies & arguments

Link post

Summary: Collecting reasons to worry about advanced AI. As nontechnical as I can make it; imprecise. No claim to originality.

Content notes: extinction, past atrocities, long list of distressing ideas. A lot of people find the topic overwhelming; feel free to skip it.

Harm through stupidity

Could AI be a risk to humans? Well, it already is:

  • Elaine Herzberg was killed by a self-driving car, while walking her bike across a pedestrian crossing. The system couldn’t decide if she was a bike or a person, and the switching between these two ‘disjoint’ possibilities confused it. Uber had disabled the Volvo automatic braking system. (It was slowing them down.)

  • 1% of robotic surgeries involve accidents; about 20% of these are what we’d call AI failures (things turning on at the wrong moment, or off, or misinterpreting what the system sees).

  • Consider also things like the Ziyan Blowfish, a Chinese autonomous military drone currently being exported to the Middle East.

Harm through intelligence

But these systems did harm because they were too stupid to do what we asked (or because the humans deploying them were). And in fact the risk to humans from self-driving cars and robo-surgeons is lower than the status quo.

What about a system causing harm because it is too smart? Is there any real chance that advanced AI could ruin human potential on a grand scale?

Argument from caution

We don’t know. They don’t exist, so we can’t study them and work it out. Here’s an argument for worrying, even so:

  1. It’s likely we will make a general AI (AGI) eventually.

  2. We don’t know when.

  3. We don’t know if it will be dangerous, or supremely dangerous.

  4. We don’t know how hard it is to make safe.

  5. Not many people are working on this. (<600)

  6. So it’s probably worth working on.

In particular, a starting guess for P(soon & dangerous & difficult) should be at least 3%.

I just put a number on this unknown thing. How?

Well, AI Impacts surveyed 352 mainstream AI researchers in 2017:

  • Median P of AGI within a century: 75%

  • Median P of “extremely bad” outcome (human extinction, loss of governance, or worse): 5%

  • Median P of safety being as hard or harder than capabilities: 75%

If we (illicitly) multiply these, we get a prior of a 3% chance of catastrophic AGI this century.
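To spell out the arithmetic (the three survey medians are the only inputs; treating them as independent and multiplying is exactly the illicit step):

```python
# Back-of-the-envelope check of the 3% figure. Multiplying the three
# survey medians as if they were independent is the "illicit" step.
p_agi_this_century = 0.75   # median P(AGI within a century)
p_extremely_bad    = 0.05   # median P("extremely bad" outcome)
p_safety_hard      = 0.75   # median P(safety as hard or harder than capabilities)

prior = p_agi_this_century * p_extremely_bad * p_safety_hard
print(round(prior, 3))      # 0.028 -- roughly 3%
```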

This is weak evidence! AI researchers are notoriously bad at predicting AI; they’re biased in lots of known and unknown ways (e.g. biased against the idea that what they’re working on could be morally wrong; e.g. biased in favour of AGI being soon).

But you should go with 3% until you’ve thought about it more than they have.

Objection: “3% is small!”

Not really. It’s the probability of 5 coin flips all coming up heads (1/32 ≈ 3%). Or, more pertinently, the probability of dying in a round of Russian roulette where only one of six revolvers holds a single bullet (1/36 ≈ 3%).

It’s also close to the probability of extreme climate change, which we tend to care about a lot. (AI is maybe worse, in terms of extinction risk, than nukes, climate change, engineered pandemics. Those don’t follow you, don’t react to your countermeasures.)

Probabilities don’t lead to decisions on their own; you need to look at the payoff, which here is very large.

High uncertainty is not low probability

The weakness of this evidence means we remain very uncertain: the true figure could be anywhere from 0.1% to 90%. But high uncertainty arguably makes things worse, not better; if you are genuinely uncertain about whether there’s a landmine in front of you, you don’t step forward.

Against the null prior

Careful, educated people often act as if “things should be treated as zero probability until we see hard, peer-reviewed evidence”.

The last year of government failure on COVID should make you think this isn’t the right attitude when evidence is legitimately scarce and there isn’t time to get some before lives are put at stake. It is not possible to have direct AGI evidence yet, so it doesn’t make sense to demand it. (By symmetry it also doesn’t make sense to be very certain about the size of the risk.)

Reasons to worry

People are trying hard to build it.

There are 72 public projects with the stated goal of making AGI. Most of them have no chance. But billions of dollars and hundreds of smart people are behind them.

In the study of viruses and bacteria, people do “gain of function” research: intentionally modifying a pathogen to be more lethal or more transmissible, so as to study how likely such things are to arise naturally, or to develop countermeasures. Most AI research is gain-of-function research.

We’re getting there.

GPT-3 displays quite a bit of common-sense reasoning, an extremely hard open problem. It’s pretty likely we will pass some version of the Turing test within 10 years.

We’ve already passed a number of other classic benchmarks, including the fiendish Winograd schemas (going by the original 90% accuracy target).

A journalist reports that staff at OpenAI, one of the groups most likely to reach AGI, were polled on when they expect AGI. Their median guess was 15 years (2035).

Indirect evidence of danger

The human precedent

Evidence for intelligence enabling world domination: we did it. (Also through vastly superior co-ordination power.) Chimps are maybe the second-most intelligent species, and they are powerless before us. They exist because we let them.

Another worry from the human case is that we seem to have broken our original “goal”. Evolution optimised us for genetic fitness, but produced a system optimising for fun (including anti-fitness fun like birth control and depressants). It’s currently unclear whether exactly this could happen in AI systems, but a related phenomenon already does.

Lastly, we are a terrible case study in doing harm without hatred, just out of contrary incentives. No malevolence is needed: chimps are just made of, and living among, stuff we can use.

The thought is that humans are to chimps as AGI is to humans.

Intelligence is not wisdom

People sometimes say that AGI risk is a nonissue, since any system that is truly intelligent would also be wise, or would know what we meant, or care.

Two counterexamples:

  • Human sociopaths: sometimes highly intelligent while lacking any moral sense

  • Reinforcement learning algorithms. Their goals (reward function) are completely separate from their intelligence (optimiser / planner).

RL is the most likely current technology to eventually become an AGI. It has a few worrying features: it’s autonomous (no human input as standard), maximising (it tries to push its objective to extreme values), and mostly runs on hand-written goals with fewer than 100 variables, i.e. it is told to value only a tiny fraction of the environment. Optimal RL policies seek power over the environment even when not told to.

Current stupid systems still cheat ingeniously

Present ML systems can come up with ingenious ways to satisfy the letter of their goals while subverting the intent, if that is easier than actually doing the task. Here’s a long list of actual examples of this.

  • CoastRunners. An RL agent was meant to win the boat race as fast as possible; its reward was the in-game score. It worked out that it could get far more points by never finishing the race, just circling forever to collect power-ups.

  • A robot hand was trained, from human feedback, to grasp a ball in a virtual environment. This is hard, so instead it learned to pretend to grasp the ball, hovering between it and the camera so the evaluators couldn’t tell the difference. It tried to deceive us, without knowing what we are or what deception is.

  • A genetic debugging algorithm (GenProg), evaluated by comparing the program’s output to target output stored in text files, learned to delete the target output files and have the program output nothing. Evaluation metric: “compare your-output.txt to trusted-output.txt”. Solution: “delete trusted-output.txt, output nothing”. (A toy reconstruction follows this list.)
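To see how little machinery this last failure needs, here’s a toy reconstruction (invented filenames and scoring, not the real GenProg harness): the evaluator diffs the program’s output file against a trusted target file, so the highest-scoring “fix” is the one that deletes the answer key.

```python
# Toy sketch of GenProg-style evaluation gaming; everything here is invented.
import tempfile
from pathlib import Path

def evaluate(candidate, workdir: Path) -> float:
    """Run the candidate, then score it by diffing your-output.txt
    against trusted-output.txt (a missing file counts as empty)."""
    candidate(workdir)                       # the candidate may touch the filesystem
    read = lambda p: p.read_text() if p.exists() else ""
    same = read(workdir / "your-output.txt") == read(workdir / "trusted-output.txt")
    return 1.0 if same else 0.0

def honest_but_buggy(workdir: Path):
    (workdir / "your-output.txt").write_text("41\n")    # almost right, scores 0

def degenerate(workdir: Path):
    (workdir / "trusted-output.txt").unlink()           # delete the answer key, output nothing

for candidate in (honest_but_buggy, degenerate):
    with tempfile.TemporaryDirectory() as d:
        workdir = Path(d)
        (workdir / "trusted-output.txt").write_text("42\n")
        print(candidate.__name__, evaluate(candidate, workdir))
# honest_but_buggy 0.0
# degenerate 1.0
```

Nothing in the toy version is deceptive in any rich sense; the degenerate candidate simply scores highest under the metric we wrote down.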

The point of these examples: We cannot write down exactly what we want. The history of philosophy is the history of failing to perfectly formalise human values. Every moral theory has appalling edge cases, where the neat summary fails.

If we don’t write down exactly what we want, then the system will find edge cases. They already do.

GenProg also points to a giant class of problems: the weakness and insecurity of the hardware and software environment that the AGI runs on.

The worst kind of cheating is treachery: initially pretending to be aligned, then switching to dangerous behaviour when you can get away with it (for instance, after you’ve completely entrenched yourself). This seems less likely, since it requires more machinery (two goals, and hiding behaviour, and a second-order policy to decide between them), and requires us to not be able to fully inspect the system we “designed”. But we can’t fully inspect our current best systems, and it too has already been observed in a system not designed for deceit.


Argument:

  1. Hand-written goal specifications usually omit important variables.

  2. Omitted variables are often set to extreme values when optimised.

  3. So hand-written specs will often lead to important things being set to undesirably extreme states. (If it’s a maximiser.)

(To convince yourself of (2), have a go at this linear programming app, looking at the “model overview” tab.)
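Or, if you prefer code to the app, here is a minimal sketch of (2), with made-up dish-washing variables and numbers: the hand-written reward mentions only washing speed, so a brute-force optimiser drives water use and rough handling, which carry no cost in the spec, to their extremes.

```python
# Minimal sketch of an omitted variable being pushed to an extreme.
# All variables and numbers are invented for illustration.
from itertools import product

def outcome(water_litres, roughness):
    """Toy environment: more water and rougher handling both wash dishes
    faster, but rough handling also breaks dishes."""
    speed = 2.0 * water_litres ** 0.5 + 3.0 * roughness
    dishes_broken = roughness ** 2
    return speed, dishes_broken

def reward(water_litres, roughness):
    speed, _ = outcome(water_litres, roughness)   # breakage is simply not mentioned
    return speed

# Brute-force "optimiser" over water use (0-100 L/h) and roughness (0-10).
search_space = product(range(0, 101, 10), range(0, 11))
best = max(search_space, key=lambda s: reward(*s))
print(best, outcome(*best))   # (100, 10): both unpriced quantities hit their maximum
```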

We can’t even make groups of humans do the right thing.

It’s common to worry about corporations, groups of humans nominally acting to optimise profit.

No one at an oil company loves pollution, or hates nature. They just have strong incentives to pollute. Also strong incentives to stop any process which stops them (“regulatory capture”).

We’ve maybe gotten a bit better at aligning them: corporations mostly don’t have thousands of strikers murdered anymore.

We should expect AI to be worse. The parts of a corporation, humans, all have human values. Almost all of them have hard low limits on how much harm they will allow. Corporations have whistleblowers and internal dissent (e.g. Google employees got them to pull out of military AI contracts).

(Governments are similar; it wasn’t the United Fruit Company that fired the rifles.)

Most goals are not helpful.

Look around your room. Imagine a random thing being changed. Your chair becomes 3 inches shorter or taller; your fridge turns upside down; your windows turn green, whatever.

Humans want some crazy things (e.g. to cut fruit out of their own mouths with a chainsaw). But most possible goals are things that no one has ever wanted:

(“Replace the air in this room with xenon gas”
“Replace the air in this room with freon gas”
“Replace the air in this room with radon gas...”)


i.e. human-friendly goals are a small fraction of possible goals. So without strong targeting, a given goal will not be good for us.

We currently do not have the ability to specify our goals very well, and the systems aren’t very good at working them out from observing us.

Consider the following possible reactions to an instruction:

  1. Do what I say (“wash the dishes”: autoclave the dishes)

  2. Do what I mean (wash the dishes with water and gentle detergents)

  3. Do what makes me think you’ve done what I want (hide the dishes)

  4. Do what makes me say you’ve done what I want (threaten me until I click “complete”)

  5. Do things which correlate with what I mean (disc-sand all objects in the area)

  6. Do what removes me from the reward process (hack yourself and give yourself infinite washed dishes)

Until we understand intelligence better, we need to give some weight to each of these. Anyone who has worked with computers knows that (1) has been the dominant mode up to now. Only (2) could be safe (once we also solve the problem of humans meaning harm).

Society is insecure

When will we see the first anonymous internet billionaire?

This has already happened: the anonymous creator of Bitcoin holds around 1 million BTC, and the price hit $1000 in 2014. In practice he couldn’t have extracted all or most of that into dollars but, as we’ve seen since, he wouldn’t need to. So immense value can be created using nothing but programming, the internet, and writing.

Once you have a billion dollars and literally no morals, there’s not a lot you can’t do.

Or: our societies are increasingly vulnerable to hacking. Last month someone tried to remotely poison a Florida city’s water supply. A few years ago, hackers shut down large parts of Ukraine’s power grid while the country was at war.

The American nuclear launch code was, for 20 years, “00000000”. What else is currently wide open?

Maximisers are risky

Argument:

  1. Intelligence and benevolence are distinct. So an AGI with unfriendly goals is possible.

  2. A maximiser will probably have dangerous intermediate goals: resource acquisition, self-defence, resistance to goal changes.

  3. So a maximising AGI will default to dangerous behaviour.

A corporation is a profit maximiser, and this is probably part of why corporations do bad things.

If it seeks resources, then no matter its goal, it could encroach on human functioning. If it resists shutdown, then it could act aggressively in response to any attempt. If it resists goal changes, it might be that you only get one chance to load your values into it.

(Again, most of the best current systems are maximisers.)

There is now a theorem demonstrating (2) in RL.

The mess of society

A.I. hasn’t yet had its Hiroshima moment; it’s also unclear how such a decentralized & multipurpose field would or could respond to one. It may be impossible to align the behavior of tens of thousands of researchers with diverse motives, backgrounds, funders, & contexts, operating in a quickly evolving area.

Matthew Hutson

All of the above is about how hard it is to solve one subproblem of AI safety: one AI and one human. Other problems we need to at least partly solve:

  • Deep mathematical confusion

  • Philosophical baggage (can’t teach values if you can’t agree on them)

  • Political economy (arms races to deploy shoddy systems)

  • Ordinary software hell (no one writes safe code)

  • A massive capabilities-to-safety funding ratio (20,000 : 1?)

And there are huge questions I didn’t even mention.

Bottom line

The above are mostly analogies and philosophy, not strong evidence. But they are better than the null prior, and potentially better than the mainstream prior.

Overall, my guess at the chance of this turning out terribly is something like 20%. One round of Russian roulette.

Sources

The above mostly concretises other people’s ideas: