I’m probably not the target audience for this post, but could you make it a bit more accessible by providing a definition of what a benchmark is? Unfortunately, the EA Forum also lacks a definition, and this link only provides examples.
Good question. Benchmarks provide empirical, quantitative evaluation. They can be static datasets, e.g. ImageNet. They can also be models! For example, CLIP is a model that scores how well an image matches a text description; it is used to evaluate image generation models like DALL-E 2, specifically how well the generated images align with their text prompts.
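To make that concrete, here's a minimal sketch of CLIP-based scoring using the Hugging Face transformers library. The image path and prompt are placeholders, and this is just one way to do it, not a standardized evaluation protocol:

```python
# Minimal sketch: score how well a generated image matches its text prompt
# with CLIP. Assumes the Hugging Face `transformers` library is installed;
# the checkpoint name is OpenAI's public CLIP release.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated_image.png")  # hypothetical generator output
prompt = "a corgi playing a trumpet"       # the text input that produced it

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image is the scaled cosine similarity between the image and
# text embeddings; a higher value means better image-text alignment.
print(outputs.logits_per_image.item())
```

Averaging this score over many prompt/image pairs is roughly how CLIP-style alignment evaluation works in practice.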
The bottom line: benchmarks should give AI labs and researchers a fair way to compare their work, reflecting research progress toward goals the community cares about.
Hope this helps!