AI alignment prize suggestion: Improve our ability to evaluate (and provide training signal for) fuzzy tasks
Artificial Intelligence
There are many tasks we want AI systems to do whose performance cannot be evaluated automatically, which makes it hard to provide a training signal for them. If we don’t make progress on our ability to train systems for such tasks, we might end up in a world full of systems that optimise for what is easy to measure rather than for what we actually want. One example of such a task is the evaluation of free-form text: there is currently no automated method for evaluating free-form text (with respect to criteria such as usefulness or correctness) that matches human evaluation. The Future Fund could offer prizes for work that takes a task for which humans are the gold standard of evaluation and demonstrates an automated evaluation method that matches human evaluation very closely (or that shows an automated evaluation method to be superior to human evaluation).
Note: Crucially, this is not the same as “training models to perform well on the task in question”. There are several technical reasons why what I suggest is easier. Intuitively, evaluating performance is often considerably easier than producing good performance. For example, I can watch a movie and tell whether it’s good, but I can’t make a good movie.