Figuring out how to stop AI systems from making extremely bad judgments on images designed to fool them, and other work focused on avoiding their “worst case” behaviors.
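To make “images designed to fool them” concrete, here is a minimal sketch of the fast gradient sign method (FGSM), one standard way such adversarial images are constructed. The `model`, `image`, and `label` are placeholders for any PyTorch image classifier and a [0,1]-scaled input batch, and the epsilon value is illustrative:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """Fast gradient sign method (Goodfellow et al., 2014).

    Nudges `image` a small step in the direction that most increases
    the model's loss, often flipping the prediction while the change
    stays imperceptible to a human.
    """
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Take a step of size epsilon along the sign of the input gradient,
    # then clamp back to the valid [0,1] pixel range.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()
```

Even a small epsilon like this typically leaves the perturbation invisible to humans while reliably changing an undefended classifier’s output, which is what makes these failures a natural test case for worst-case robustness.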
I haven’t seen much about adversarial examples for AI alignment. Besides https://www.alignmentforum.org/tag/adversarial-examples (which only has four articles tagged), https://www.alignmentforum.org/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment, and https://cset.georgetown.edu/publication/key-concepts-in-ai-safety-robustness-and-adversarial-examples/, are there other good articles on this topic?
I’m not sure whether you’re asking for academic literature on adversarial examples (of which there is a lot) or for writing that connects adversarial examples to alignment (like most “link between X and alignment” topics, this hasn’t been written about much). The latter is discussed some in the recent paper Unsolved Problems in ML Safety and in An overview of 11 proposals for building safe advanced AI.