Figuring out how to stop AI systems from making extremely bad judgments on images designed to fool them, and other work focused on avoiding their “worst case” behaviors.
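To make “images designed to fool them” concrete, here is a minimal sketch of the fast gradient sign method (FGSM), one standard way such adversarial images are constructed. The `model`, `image`, and `label` are placeholders for any PyTorch image classifier and a [0,1]-scaled input batch, and the epsilon value is illustrative:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """Fast gradient sign method (Goodfellow et al., 2014).

    Nudges `image` a small step in the direction that most increases
    the model's loss, often flipping the prediction while the change
    stays imperceptible to a human.
    """
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Take a step of size epsilon along the sign of the input gradient,
    # then clamp back to the valid [0,1] pixel range.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()
```

Even a small epsilon like this typically leaves the perturbation invisible to humans while reliably changing an undefended classifier’s output, which is what makes these failures a natural test case for worst-case robustness.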
I haven’t seen much about adversarial examples for AI alignment. Besides https://www.alignmentforum.org/tag/adversarial-examples (which only has four articles tagged), https://www.alignmentforum.org/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment, and https://cset.georgetown.edu/publication/key-concepts-in-ai-safety-robustness-and-adversarial-examples/, are there other good articles on this topic?
I’m not sure whether you’re asking for academic literature on adversarial examples (of which there is a lot) or for writing that connects adversarial examples to alignment (like most “link between X and alignment” topics, this hasn’t been written about much). The latter is discussed some in the recent paper Unsolved Problems in ML Safety and in An overview of 11 proposals for building safe advanced AI.