I’m surprised at how much agreement there is about the top ideas. The following ideas all got >70% “strongly agree” and at most 3% “strongly disagree” (note that not everyone answered each question, although most of these 14 did have all 51 responses):
Pre-deployment risk assessment. AGI labs should take extensive measures to identify, analyze, and evaluate risks from powerful models before deploying them.
Dangerous capability evaluations. AGI labs should run evaluations to assess their models’ dangerous capabilities (e.g. misuse potential, ability to manipulate, and power-seeking behavior).
Third-party model audits. AGI labs should commission third-party model audits before deploying powerful models.
Safety restrictions. AGI labs should establish appropriate safety restrictions for powerful models after deployment (e.g. restrictions on who can use the model, how they can use the model, and whether the model can access the internet).
Red teaming. AGI labs should commission external red teams before deploying powerful models.
Monitor systems and their uses. AGI labs should closely monitor deployed systems, including how they are used and what impact they have on society.
Alignment techniques. AGI labs should implement state-of-the-art safety and alignment techniques.
Security incident response plan. AGI labs should have a plan for how they respond to security incidents (e.g. cyberattacks).
Post-deployment evaluations. AGI labs should continually evaluate models for dangerous capabilities after deployment, taking into account new information about the model’s capabilities and how it is being used.
Report safety incidents. AGI labs should report accidents and near misses to appropriate state actors and other AGI labs (e.g. via an AI incident database).
Safety vs capabilities. A significant fraction of employees of AGI labs should work on enhancing model safety and alignment rather than capabilities.
Internal review before publication. Before publishing research, AGI labs should conduct an internal review to assess potential harms.
Pre-training risk assessment. AGI labs should conduct a risk assessment before training powerful models.
Emergency response plan. AGI labs should have and practice implementing an emergency response plan. This might include switching off systems, overriding their outputs, or restricting access.
The ideas that had the most disagreement seem to be:
49 (48 in the graphic above) - Avoid capabilities jumps. AGI labs should not deploy models that are much more capable than any existing models.
11% somewhat disagree, 5% strongly disagree, and only 22% strongly agree, 35% somewhat agree (37 responses)
48 (49 in the graphic above) - Inter-lab scrutiny. AGI labs should allow researchers from other labs to scrutinize powerful models before deployment.
13% somewhat disagree, 3% strongly disagree, 41% somewhat agree, 18% strongly agree (37 responses)
37 - No [unsafe] open-sourcing. AGI labs should not open-source powerful models, unless they can demonstrate that it is sufficiently safe to do so.
14% disagree (somewhat or strongly), 57% strongly agree, and 27% somewhat agree (51 responses)
42 - Treat updates similarly to new models. AGI labs should treat significant updates to a deployed model (e.g. additional fine-tuning) similarly to its initial development and deployment. In particular, they should repeat the pre-deployment risk assessment.
14% somewhat disagree, but 45% strongly agree and 35% somewhat agree (51 responses)
50 - Notify other labs. AGI labs should notify other labs before deploying powerful models.
11% somewhat disagree, 3% strongly disagree, 11% strongly agree, 32% somewhat agree (38 responses)
And the idea that got the most strong disagreement:
21 - Security standards. AGI labs should comply with information security standards (e.g. ISO/IEC 27001 or NIST Cybersecurity Framework). These standards need to be tailored to an AGI context.
6% strongly disagree (51 responses), although it’s overall still popular: 61% strongly agree and 18% somewhat agree.
(Ideas copied from here — thanks!)