Rethink Priorities is working on a project called ‘Defense in Depth Against Catastrophic AI Failures’. “Defense in depth” refers to the use of multiple redundant layers of safety and/or security measures such that each layer reduces the chance of catastrophe. Our project is intended to (1) make the case for taking a defense in depth approach to ensuring safety when deploying near-term, high-stakes AI systems and (2) identify many defense layers/measures that may be useful for this purpose.
If you can think of any possible layers, please mention them below. We’re hoping to collect a very long list of such layers, either for inclusion in our main output or for potentially investigating further in future, so please err on the side of commenting even if the ideas are quite speculative, may not actually be useful, or may be things we’ve already thought of. Any relevant writing you can refer us to would also be useful.
If we end up including layers you suggest in our outputs, we’d be happy to either leave you anonymous or credit you, depending on your preference.
Some further info about the project: By “catastrophic AI failure”, we mean harmful accidents or harmful unintended use of computer systems that perform tasks typically associated with intelligent behavior (and especially of machine learning systems) that lead to at least 100 fatalities or $1 billion in economic loss. This could include failures in contexts like power grid management, autonomous weapons, or cyber offense (if you’re interested in more concrete examples, see here)
Defense layers can relate to any phase of a technology’s development and deployment, from early development to monitoring deployment to learning from failures, and can be about personnel, procedures, institutional setup, technical standards, etc.
Some examples of defense layers for AI include (find more here):
Procedures for vetting and deciding on institutional partners, investors, etc.
Methods for scaling human supervision and feedback during and after training high-stakes ML systems
Tools for blocking unauthorized use of developed/trained IP, akin to the PALs on nuclear weapons
Technical methods and process methods (e.g. certification; Cihon et al. 2021, benchmarks?) for gaining high confidence in certain properties of ML systems, and properties of the inputs to ML systems (e.g. datasets), at all stages of development (a la Ashmore et al. 2019)
Background checks & similar for people being hired or promoted to certain types of roles
Methods for avoiding or detecting supply chain attacks
Procedures for deciding when and how to engage one’s host government to help with security/etc.
Four layers come to mind for me:
Have strong theoretical reasons to think your method of creating the system cannot result in something motivated to take dangerous actions
Inspect the system thoroughly after creation, before deployment, to make sure it looks as expected and appears incapable of making dangerous decisions
Deploy the system in an environment where it is physically incapable of doing anything dangerous
Monitor the internals of the system closely during deployment to ensure operation is as expected, and that no dangerous actions are attempted
In response to an earlier version of this question (since taken down), weeatquince responded with the following helpful comment:
Regulatory type interventions (pre-deployment):
Regulatory restriction (rules on what can be done)
Regulatory oversight (regulators)
Industry self-regulation
Industry (& regulator) peer review systems
Fiduciary duties
Senior management regimes
Information sharing regimes
Whistleblowing regimes
Staff security clearances
Cybersecurity of AI companies
Standardisation (to support ease of oversight etc)
Clarity about liability & legal responsibility
Internal government oversight (all of the above applied internally by government to itself, e.g. internal military safety best practice)
Technical type interventions (pre-deployment):
AI safety research
Defence in depth type interventions (post-deployment):
Windfall clauses etc
Shut-off switches for AI systems
AIs policing other AIs’ behaviours
Internet / technology shut-off systems
The boring answers
Don’t give your AI system excess compute; ideally enforce this at the hardware level. Run it on a small isolated machine, not a 0.1% timeshare on a supercomputer.
Use the coding practices developed by NASA to minimize standard bugs.
Record all random seeds and input data to make everything reproducible.
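As a minimal sketch of what that could look like in Python, one might seed every RNG in use and log a manifest for the run; the manifest format, file names, and the choice of libraries here are illustrative assumptions, not a prescribed setup:

```python
import hashlib
import json
import random
import time

import numpy as np


def start_reproducible_run(seed, input_path, log_path="run_manifest.json"):
    """Seed the RNGs in use and log enough metadata to replay the run exactly."""
    random.seed(seed)
    np.random.seed(seed)
    # If a framework such as PyTorch were in use, its RNGs would need seeding too.

    # Hash the input data so a later replay can verify it is using identical inputs.
    with open(input_path, "rb") as f:
        input_sha256 = hashlib.sha256(f.read()).hexdigest()

    manifest = {
        "seed": seed,
        "input_path": input_path,
        "input_sha256": input_sha256,
        "timestamp": time.time(),
    }
    with open(log_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```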
Put hard-coded sanity checks between the AI and its outputs. For example, a robot arm isn’t allowed to move beyond safe limits, enforced by a simple min(AI_OUTPUT, MAXIMUM_ARM_ANGLE) type check.
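A minimal sketch of such a clamp sitting between the model’s output and the actuator; the angle limits and names are placeholders, not real hardware values:

```python
# Hypothetical safe limits for the arm, taken from the hardware spec,
# never from anything the AI outputs.
MIN_ARM_ANGLE = -90.0
MAX_ARM_ANGLE = 90.0


def clamp_arm_command(ai_output):
    """Clamp the AI's requested angle to the hard-coded safe range."""
    return max(MIN_ARM_ANGLE, min(ai_output, MAX_ARM_ANGLE))


# clamp_arm_command(135.0) -> 90.0; the actuator never sees the raw request.
```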
Keep humans in the loop to check outputs.
Hardware minimization of unneeded action space. Suppose you are programming a social media bot, and the AI is quite smart. If a single computer runs the AI and has a network connection, then by rowhammering it could send arbitrary internet packets. Instead, computer A runs the AI and sends raw ASCII down a wire to computer B. Computer B runs every simple check you would expect to pass in normal operation, and performs a hard shutdown if it spots anything odd (i.e. funny characters, or a post that is too long or too short). Computer A of course has no other connections, has a power supply resistant to modulated-energy-demand attacks, has a radio and audio noise generator next to it, and sits in a Faraday cage.
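A rough sketch of the kind of checks computer B might run on the bytes it receives from computer A; the length bounds and allowed character set are illustrative assumptions about what a “normal” post looks like:

```python
import string
import sys

# Illustrative assumptions about what a normal post looks like.
ALLOWED_CHARS = set(string.ascii_letters + string.digits + string.punctuation + " \n")
MIN_LENGTH = 10
MAX_LENGTH = 2000


def hard_shutdown(reason):
    # In a real setup this would cut power to computer A, not just exit.
    print("SHUTDOWN: " + reason, file=sys.stderr)
    sys.exit(1)


def check_post(raw_bytes):
    """Validate the bytes received from computer A; shut down hard on anything odd."""
    try:
        text = raw_bytes.decode("ascii")
    except UnicodeDecodeError:
        hard_shutdown("non-ASCII bytes received")
    if not MIN_LENGTH <= len(text) <= MAX_LENGTH:
        hard_shutdown("post length outside the expected range")
    if any(ch not in ALLOWED_CHARS for ch in text):
        hard_shutdown("unexpected characters in post")
    return text
```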
Use 2 or 3 redundant computers (to guard against cosmic ray errors).
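A toy illustration of how results from redundant replicas might be combined by majority vote; in practice the three results would come from physically separate machines:

```python
from collections import Counter


def majority_vote(results):
    """Return the value most replicas agree on; halt if there is no clear majority."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority across replicas; halt and investigate")
    return value


# majority_vote([42, 42, 41]) -> 42  (one replica hit a bit flip)
```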
Thermite on a timer. Slag the whole computer system. Stops the AI thinking too long. Makes it harder to hide info for future versions of itself.
The speculative answers.
Homomorphic encryption.
Run it on a quantum computer using amplitude amplification techniques. Any technique for breaking out of the box would have to be reversible to work, or rely on new physics.
Honeypots that look like good hacking targets.
A fairly easy way for any moderately smart AI to wirehead, setting its own utility to maximum (and triggering some alarm / delayed shutdown).
The silly answers
Chuck the AI system into a black hole, set to only start computing after it crosses the event horizon.