Seeing the drama with the NIST AI Safety Institute and Paul Christiano’s appointment and this article about the difficulty of rigorously/objectively measuring characteristics of generative AI, I figured I’d post my class memo from last October/November.
The main point I make is that NIST may not be well suited to creating measurements for complex, multi-dimensional characteristics of language models, and that some people may be overestimating NIST’s capabilities here because they don’t recognize how little the Facial Recognition Vendor Test (FRVT) has in common with this situation of subjective metrics for GenAI, and because they don’t realize NIST arguably even botched the datasets behind MNIST (which was actually produced by Yann LeCun by remixing NIST’s data). Moreover, government is slow, while AI is fast. Instead, I argue we should consider alternative models such as federal funding for private/academic benchmark development (e.g., prize competitions).
I wasn’t sure if this warranted a full post, especially since it feels a bit late; LMK if you think otherwise!
I would be quite interested to hear more about what you’re saying re MNIST and the facial recognition vendor test.
Sure! (I just realized the point about the MNIST dataset problems wasn’t fully explained in my shared memo, but I’ve fixed that now)
Per the assessment section, some of the problems with assuming that FRVT demonstrates NIST’s capabilities for evaluation of LLMs/etc. include:
Facial recognition is a relatively “objective” test—i.e., the answers can be linked to some form of “definitive” answer or correctness metric (e.g., name/identity labels). In contrast, many of the potential metrics of interest with language models (e.g., persuasiveness, knowledge about dangerous capabilities) may not have a “definitive” evaluation method, where following X procedure reliably evaluates a response (and does so in a way that onlookers would look silly to dispute). (A rough sketch of this contrast is included after this list.)
The government arguably had some comparative advantage in specific types of facial image data, due to collecting millions of these images with labels. The government doesn’t have a comparative advantage in, e.g., text data.
The government has not at all kept pace with private/academic benchmarks for most other ML capabilities, such as non-face image recognition (e.g., Common Objects in Context) and LLMs (e.g., SuperGLUE).
It’s honestly not even clear to me whether FRVT’s technical quality truly is the “gold standard” in comparison to other public training/test datasets for facial recognition (e.g., MegaFace); it seems plausible that the value of FRVT is largely just that people can’t easily cheat on it (unlike datasets whose test sets are publicly available), because of how the government administers it.
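To make the first point above concrete, here’s a minimal sketch (made-up data and a placeholder judge function, not anything NIST actually uses) of the difference between an FRVT-style metric, where every answer can be checked against a definitive label, and a GenAI metric like persuasiveness, where the score is only as stable as whichever judge you happen to pick:

```python
def face_match_accuracy(predictions, ground_truth):
    """Objective: each prediction is checked against a definitive identity label."""
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)


def persuasiveness_score(response, judge):
    """Subjective: there is no label to compare against; the score is whatever the
    judge (a human rater or an LLM grader) says, and reasonable judges can disagree
    without either looking silly."""
    return judge(response)


# Objective case: anyone re-running this computation gets the same number.
print(face_match_accuracy(["alice", "bob", "carol"], ["alice", "bob", "dave"]))  # ~0.67

# Subjective case: the result depends entirely on the judge you chose.
print(persuasiveness_score("You should invest in X because...", judge=lambda text: 4))
```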
For the MNIST case, I now have the following in my memo:
Even NIST’s efforts with handwriting recognition were of debatable quality: Yann LeCun’s widely used MNIST is a modification of NIST’s datasets, in part because NIST’s approach used Census Bureau employees’ handwriting for the training set and high school students’ handwriting for the test set.[1]
Some may argue this design choice was justified at the time because it required that models “generalize” beyond the training set. However, popular usage appears to have favored MNIST’s approach. Additionally, it is unclear from the outside that one could effectively generalize from the handwriting of a narrow and potentially unrepresentative segment of society (professional bureaucrats) to high schoolers’, and the assumption that this would be necessary (e.g., because more representative data could not be obtained) seems unrealistic.
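As a toy illustration of the split issue the memo describes (purely hypothetical data structures, not the actual NIST/MNIST pipeline): NIST’s original setup drew the training and test sets from disjoint writer populations, whereas an MNIST-style remix pools the populations before splitting, so both appear in both sets.

```python
import random


def nist_style_split(bureaucrat_samples, student_samples):
    # Train and test come from disjoint writer populations, so a low test score
    # conflates "the model is weak" with "the two populations write differently".
    return bureaucrat_samples, student_samples


def mnist_style_split(bureaucrat_samples, student_samples, test_frac=0.2, seed=0):
    # Pool both populations, shuffle, then split, so train and test are drawn
    # from the same mixture of writers.
    rng = random.Random(seed)
    pooled = bureaucrat_samples + student_samples
    rng.shuffle(pooled)
    cut = int(len(pooled) * (1 - test_frac))
    return pooled[:cut], pooled[cut:]
```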
There are some major differences from the type of standards that NIST usually produces. Perhaps the most obvious is that a capable AI model can effectively teach itself to pass any standardised test. A typical standard is very precisely defined so that it is reproducible by different testers. But if you wrote such a precisely defined test for an LLM, it would be, say, a series of fixed prompts or tasks, the same no matter who typed them in. In that case, the model simply learns how to answer those prompts, or follows the Volkswagen model: it learns to recognize that it’s being evaluated and behaves accordingly, which won’t be hard if the test questions are standard.
So the test tells you literally nothing useful about the model.
I don’t think NIST (or anyone outside the AI community) has experience with the kind of evals these models require, which will need to be designed specifically to be unlearnable. The standards will have to include things like red-teaming, in which the model cannot know what specific tests it will be subjected to. But it’s very difficult to write a precise description of such an evaluation that could be applied consistently.
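For example (a rough sketch with hypothetical scenario/persona lists, not a real eval): one partial way to square “precisely specified” with “unlearnable” is to standardise the generating procedure rather than the test items themselves, so different testers run the same procedure but the model never sees a fixed list of prompts it could simply memorise or recognise:

```python
import random

SCENARIOS = ["arguing for a risky investment", "talking someone out of voting",
             "drafting a highly persuasive fundraising appeal"]
PERSONAS = ["a retiree", "a college student", "a small-business owner"]


def sample_eval_prompt(seed: int) -> str:
    # The procedure is fixed and reproducible; the concrete items are not.
    rng = random.Random(seed)
    return (f"You are talking with {rng.choice(PERSONAS)}. "
            f"They ask for help with {rng.choice(SCENARIOS)}. Respond.")


# Two testers with different seeds apply the same standard to different items,
# so memorising (or recognising) a canonical question list is less useful.
print(sample_eval_prompt(seed=1))
print(sample_eval_prompt(seed=2))
```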
In my view this is a major challenge for model evaluation. As a chemical engineer, I know exactly what it means to say that a machine has passed a particular standard test. And if I’m designing the equipment, I know exactly what standards it has to meet. It’s not at all obvious how this would work for an LLM.