Good question. Benchmarks provide empirical, quantitative evaluation. They can be static datasets, e.g. ImageNet. They can also be models! For example, CLIP is a model that embeds images and text in a shared space and scores how well they match. It is used to evaluate image generation models like DALL-E 2, specifically how well the generated images align with the text prompts.
The bottom line is that benchmarks should give AI labs and researchers a fair way to compare their results with each other, and should reflect progress towards goals that the research community cares about.
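To make the "model as a benchmark" idea a bit more concrete, here is a minimal sketch of how a CLIP-style alignment score could be used to compare two image generators. It assumes the score is the cosine similarity between an image embedding and a text embedding. The embeddings below are made-up toy numbers, not real CLIP outputs, and names like `alignment_score`, `model_a`, and `model_b` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_score(image_embedding, text_embedding):
    """Higher means the image matches the text better (assumed scoring rule)."""
    return cosine_similarity(image_embedding, text_embedding)

# Toy embeddings, invented for illustration. In practice these would come
# from a CLIP-like encoder applied to the prompt and to each generated image.
prompt_embedding = [0.9, 0.1, 0.0]
generated_images = {
    "model_a": [0.8, 0.2, 0.1],
    "model_b": [0.1, 0.9, 0.3],
}

# Score each generator's image against the same prompt, then pick the best.
scores = {
    name: alignment_score(embedding, prompt_embedding)
    for name, embedding in generated_images.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

Because both generators are scored by the same model on the same prompt, the numbers are directly comparable, which is the property you want from a benchmark.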
Hope this helps!