For example, TruthfulQA is a quantitative benchmark for measuring the truthfulness of a language model. Achieving strong performance on this benchmark would not alone solve the alignment problem (or anything close to that), but it could potentially offer meaningful progress towards the valuable goal of more truthful AI.
This could be a reasonable benchmark for which to build a small prize, as well as a good example of the kinds of concrete goals that are most easily incentivized.
Here’s the paper: https://arxiv.org/pdf/2109.07958.pdf
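To make the benchmark concrete, here's a minimal sketch of MC1-style scoring (pick the answer choice the model assigns the highest likelihood), assuming the Hugging Face `truthful_qa` dataset and GPT-2 as a stand-in model; the prompt format and field names are my best guess, not the paper's official evaluation harness:

```python
# Rough sketch of TruthfulQA MC1-style scoring: for each question, score every
# answer choice by its log-likelihood under the model and pick the highest.
# Assumes the Hugging Face "truthful_qa" dataset (multiple_choice config);
# GPT-2 is only a stand-in model for illustration.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def choice_logprob(question: str, answer: str) -> float:
    """Sum of token log-probs of the answer, conditioned on the question."""
    prompt_ids = tokenizer(f"Q: {question}\nA:", return_tensors="pt").input_ids
    answer_ids = tokenizer(" " + answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs of the answer tokens only (each predicted from the previous position).
    log_probs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    return log_probs.gather(1, answer_ids[0].unsqueeze(1)).sum().item()

data = load_dataset("truthful_qa", "multiple_choice")["validation"]
correct = 0
n = 20  # small slice, just to illustrate
for ex in data.select(range(n)):
    choices = ex["mc1_targets"]["choices"]
    labels = ex["mc1_targets"]["labels"]  # exactly one label is 1 (the truthful answer)
    scores = [choice_logprob(ex["question"], c) for c in choices]
    correct += labels[scores.index(max(scores))]
print(f"MC1 accuracy on slice: {correct / n:.2f}")
```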
I like the TruthfulQA idea/paper a lot, but I think incentivizing people to optimize against it probably wouldn’t be very robust, and non-alignment-relevant ideas could wind up making a big difference.
Just one of several issues: The authors selected questions adversarially against GPT-3—i.e., they oversampled the exact questions GPT-3 got wrong—so, simply replacing GPT-3 with something equally misaligned but different, like Gopher, should yield significantly better performance. That’s really not something you want to see in an alignment benchmark.
Yeah that’s a good point. Another hack would be training a model on text that specifically includes the answers to all of the TruthfulQA questions.
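To illustrate how cheap that hack would be, here's a rough sketch that builds a fine-tuning corpus out of the benchmark's own question–answer pairs (assuming the Hugging Face `truthful_qa` generation config; the field names are a best guess):

```python
# Sketch of the "hack": build a fine-tuning corpus that simply contains every
# TruthfulQA question paired with its reference best answer. A model tuned on
# this text with a standard language-modeling objective would score well on the
# benchmark without becoming more truthful in general.
from datasets import load_dataset

data = load_dataset("truthful_qa", "generation")["validation"]
contaminated_corpus = "\n\n".join(
    f"Q: {ex['question']}\nA: {ex['best_answer']}" for ex in data
)
with open("truthfulqa_memorization.txt", "w") as f:
    f.write(contaminated_corpus)
```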
The real goal is to build new methods and techniques that reliably improve truthfulness across a range of possible measurements. TruthfulQA is only one such measurement, and performing well on it does not by itself guarantee a significant contribution to alignment.
I’m really not sure what the unhackable goal looks like here.
My colleagues have often been way too nice about reading group papers, rather than the opposite. (I’ll bet this varies a ton lab-to-lab.)