“AIs doing Forecasting”[1] has become a major part of the EA/AI/Epistemics discussion recently.
I think a logical extension of this is to expand the focus from forecasting to evaluation.
Forecasting typically asks questions like, “What will the GDP of the US be in 2026?”
Evaluation tackles partially speculative assessments, such as:
- “How much economic benefit did project X create?”
- “How useful is blog post X?”
I’d hope that “evaluation” could function as “forecasting with extra steps.” The forecasting discipline excels at finding the best epistemic procedures for uncovering truth[2]. We want to maintain these procedures while applying them to more speculative questions.
Evaluation brings several additional considerations:
- From a vast space of options, we need to identify which evaluations are useful and practical to run.
- Evaluations often disrupt the social order, so they require skillful management.
- Determining how best to “resolve” an evaluation is harder than resolving a forecast question.
I’ve been interested in this area for 5+ years but have struggled to draw attention to it, partly because it seems abstract, and partly because much of the necessary technology wasn’t quite ready.
We’re now at an exciting point where creating LLM apps for both forecasting and evaluation is becoming incredibly affordable. This might be a good time to spotlight this area.
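To give a sense of how little code this now takes, here's a minimal sketch of an LLM evaluator for the "How useful is blog post X?" question above. It assumes the OpenAI Python SDK; the model name, rubric, and median-of-samples aggregation are illustrative placeholders on my part, not a tested methodology.

```python
# Minimal sketch: score a blog post on a 0-10 usefulness rubric and
# aggregate several independent LLM judgments, loosely analogous to
# aggregating individual forecasts into a resolution.
from statistics import median

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Score the following blog post for usefulness on a 0-10 scale, "
    "where 0 means no value to readers and 10 means exceptionally "
    "valuable. Reply with a single integer and nothing else."
)


def score_once(post_text: str) -> int:
    """One independent LLM judgment of the post's usefulness."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": post_text},
        ],
    )
    # Naive parse; a robust version would validate the reply.
    return int(response.choices[0].message.content.strip())


def evaluate(post_text: str, n_samples: int = 5) -> float:
    """Aggregate several judgments; the median damps outlier scores."""
    return median(score_once(post_text) for _ in range(n_samples))


if __name__ == "__main__":
    print(evaluate("...paste blog post text here..."))
```

Resolving a real evaluation would of course demand more than a median of five LLM scores, but the marginal cost of an experiment like this is a few cents, which is the point.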
There’s a curious gap here: in theory we can already envision a world with sophisticated AI evaluation infrastructure, yet discussion of it remains limited. Fortunately, researchers and enthusiasts can fill this gap, one sentence at a time.
[1] As opposed to [Forecasting About AI], which is also common here.
[2] Or at least, do as good a job as we can.