AI alignment prize suggestion: Introduce AI Safety concepts into the ML community
Artificial Intelligence
Recently, several papers published at top ML conferences have introduced concepts from the AI safety community into the broader ML community. Such papers typically define a problem, explain why it matters, sometimes formalise it, often include extensive experiments to showcase the problem, and sometimes offer initial suggestions for remedies. Such papers are useful in several ways: they popularise AI alignment concepts, pave the way for further research, and demonstrate that researchers can do alignment research while also publishing in top venues. A great example is Optimal Policies Tend To Seek Power, published at NeurIPS. The Future Fund could advertise prizes for any paper that gets published at a top ML/NLP/Computer Vision conference (in ML, that would be NeurIPS, ICML, and ICLR) and introduces a key concept of AI alignment.
Risk:
The course presents possible solutions to these risks, the students feel like they “understood” AI risk, and in the future it will be harder to talk to these students about AI risk since they feel like they already have an understanding, even though it is wrong.
I am specifically worried about this because I try to imagine who would write the course and who would teach it. Will these people be able to point out the problems in the current approaches to alignment? Will they be able to “hold an argument” in class well enough to point out holes in the solutions that the students will suggest after thinking about the problem for five minutes?
I’m not saying this isn’t solvable; it’s just a risk.