Sure! (I just realized the point about the MNIST dataset problems wasn’t fully explained in my shared memo, but I’ve fixed that now)
Per the assessment section, some of the problems with assuming that FRVT demonstrates NIST’s capabilities for evaluation of LLMs/etc. include:
Facial recognition is a relatively “objective” test, i.e., answers can be checked against some “definitive” answer or correctness metric (e.g., name/identity labels). In contrast, many of the metrics of interest for language models (e.g., persuasiveness, knowledge about dangerous capabilities) may not have a “definitive” evaluation method, where following procedure X reliably evaluates a response (and does so in a way that onlookers would look silly to dispute). (A rough sketch of this contrast is included below, after these points.)
The government arguably had some comparative advantage in specific types of facial image data, due to collecting millions of these images with labels. The government doesn’t have a comparative advantage in, e.g., text data.
The government has not kept pace at all with private/academic benchmarks for most other ML capabilities, such as non-face image recognition (e.g., Common Objects in Context) and LLMs (e.g., SuperGLUE).
It’s honestly not even clear to me whether FRVT’s technical quality truly is the “gold standard” compared with other public training/test datasets for facial recognition (e.g., MegaFace); it seems plausible that FRVT’s value is largely just that people can’t easily cheat on it (unlike benchmarks whose test sets are publicly available), because of how the government administers it.
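To make the first point above concrete, here is a minimal sketch in Python (the function names and data are hypothetical illustrations, not any actual NIST/FRVT protocol): a face-identification benchmark reduces to comparing predictions against definitive labels, whereas a metric like persuasiveness has no ground-truth label to compare against.

```python
# Illustrative sketch only (hypothetical names/data, not an actual NIST/FRVT protocol).
# Point: identification accuracy reduces to checking predictions against definitive
# labels, while "persuasiveness" has no ground-truth label at all.

from typing import List


def identification_accuracy(predicted_ids: List[str], true_ids: List[str]) -> float:
    """Label-based metric: anyone who runs this on the same data gets the same number."""
    assert len(predicted_ids) == len(true_ids)
    correct = sum(p == t for p, t in zip(predicted_ids, true_ids))
    return correct / len(true_ids)


def persuasiveness_score(response: str) -> float:
    """No definitive label exists; any scoring here would encode contestable judgments
    (rubrics, human raters, judge models), so reasonable evaluators can disagree."""
    raise NotImplementedError("no agreed-upon ground truth for this metric")


# The face-recognition-style evaluation is settled by the labels:
print(identification_accuracy(["alice", "bob", "carol"], ["alice", "bob", "dave"]))  # ~0.67
```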
For the MNIST case, I now have the following in my memo:
Even NIST’s efforts with handwriting recognition were of debatable quality: Yann LeCun’s widely used MNIST is a modification of NIST’s datasets, in part because NIST’s approach used Census Bureau employees’ handwriting for the training set and high school students’ handwriting for the test set.[1]
Some may argue this split was justified at the time because it required that models “generalize” beyond the training set. However, popular usage appears to have favored MNIST’s approach. Additionally, it is not clear that one could effectively generalize from the handwriting of a narrow and potentially unrepresentative segment of society (professional bureaucrats) to high schoolers’, and the assumption that doing so would be necessary (e.g., due to an inability to get more representative data) seems unrealistic.
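To make the train/test mismatch concrete, here is a rough sketch using synthetic stand-in data (not the actual NIST or MNIST digits) and a simple nearest-centroid classifier; it trains on one “writer population” and compares evaluation on a matched population (an MNIST-style remixed split) against evaluation on a systematically shifted population (the original NIST-style split).

```python
# Toy illustration of a population-mismatched train/test split (synthetic data only).
import numpy as np

rng = np.random.default_rng(0)


def make_population(n_per_class: int, style_shift: float):
    """Two classes of 2-D 'handwriting features'; style_shift moves the whole
    population's features, standing in for a systematic difference in writers."""
    class0 = rng.normal(loc=[0.0 + style_shift, 0.0], scale=1.0, size=(n_per_class, 2))
    class1 = rng.normal(loc=[3.0 + style_shift, 3.0], scale=1.0, size=(n_per_class, 2))
    X = np.vstack([class0, class1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test) -> float:
    """Classify each test point by the nearest class centroid of the training data."""
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    preds = np.argmin(dists, axis=1)
    return float((preds == y_test).mean())


X_train, y_train = make_population(500, style_shift=0.0)  # one writer population
X_same, y_same = make_population(500, style_shift=0.0)    # matched test population
X_shift, y_shift = make_population(500, style_shift=2.0)  # shifted test population

print("matched-population test accuracy:", nearest_centroid_accuracy(X_train, y_train, X_same, y_same))
print("shifted-population test accuracy:", nearest_centroid_accuracy(X_train, y_train, X_shift, y_shift))
```

In this toy setup, accuracy on the shifted population comes out noticeably lower than on the matched split, which is the extra generalization burden the original NIST split imposed.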
I would be quite interested to hear more about what you’re saying re MNIST and the facial recognition vendor test.