Yeah that’s a good point. Another hack would be training a model on text that specifically includes the answers to all of the TruthfulQA questions.
The real goal is to build new methods and techniques that reliably improve truthfulness over a range of possible measurements. TruthfulQA is only one such measurement, and performing well on it does not guarantee a significant contribution to alignment capabilities.
I’m really not sure what the unhackable goal looks like here.
My colleagues have often been way too nice about reading group papers, rather than the opposite. (I’ll bet this varies a ton lab-to-lab.)