(Potential research project, curious to get feedback)
I’ve been thinking a lot about how to do quantitative LLM evaluations of the value of various (mostly-EA) projects.
We’d have LLMs give their best guesses at the value of various projects/outputs. These would be mediocre at first, but help us figure out how promising this area is, and where we might want to go with it.
The first idea that comes to mind is “Estimate the value in terms of [dollars, from a certain EA funder] as a [probability distribution]”. But this quickly becomes a mess. I think this couples a few key uncertainties into one value. This is probably too hard for early experiments.
A more elegant example would be “relative value functions”. This is theoretically nicer and helps split up some of the key uncertainties, but the infrastructure would require a lot more technical investment.
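To illustrate what I mean by a relative value function, here's a toy sketch: instead of estimating each project's value in absolute dollars, you estimate the *ratio* of value between pairs of items as a distribution. The item names, the lognormal choice, and all the numbers below are made up purely for illustration.

```python
# Toy sketch of a relative value function: a distribution over
# value(project_a) / value(project_b). All parameters are made up.
import numpy as np

def relative_value(project_a: str, project_b: str, n_samples: int = 10_000) -> np.ndarray:
    """Return samples of value(project_a) / value(project_b)."""
    # In a real system these per-pair parameters would come from elicited estimates;
    # here we hard-code a lognormal guess for a single hypothetical pair.
    estimates = {
        ("post_a", "post_b"): dict(median=3.0, sigma=1.0),  # "a is ~3x as valuable as b, very uncertain"
    }
    params = estimates[(project_a, project_b)]
    return np.random.lognormal(mean=np.log(params["median"]), sigma=params["sigma"], size=n_samples)

samples = relative_value("post_a", "post_b")
print(np.percentile(samples, [5, 50, 95]))  # roughly [0.6, 3.0, 16]
```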
One option that might be interesting is asking for a simple rank order. “Just order these projects in terms of the expected value.” We can definitely score rank orders, even though doing so is a bit inelegant.
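To make “score rank orders” concrete, here's a minimal sketch of one way to do it, assuming Python with scipy and Kendall's tau as the agreement metric. Both choices are illustrative; Spearman correlation or simple pairwise-agreement counts would also work.

```python
# Minimal sketch of scoring a submitted rank order against a reference ordering.
from scipy.stats import kendalltau

def score_ordering(submission: list[str], reference: list[str]) -> float:
    """Return a score in [-1, 1]: 1 = identical ordering, -1 = fully reversed."""
    assert set(submission) == set(reference), "Both orderings must cover the same items"
    # Convert each ordering into ranks over a shared item list.
    items = sorted(reference)
    submission_ranks = [submission.index(item) for item in items]
    reference_ranks = [reference.index(item) for item in items]
    tau, _p_value = kendalltau(submission_ranks, reference_ranks)
    return tau

# Example: a submission that mostly agrees with the reference.
reference = ["post_a", "post_b", "post_c", "post_d"]
submission = ["post_a", "post_c", "post_b", "post_d"]
print(score_ordering(submission, reference))  # ~0.67
```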
So one experiment I’m imagining is:
1. We come up with a list of interesting EA outputs. Say, a combination of blog posts, research articles, interventions, etc. From this, we form a list of maybe 20 to 100 elements. These become public.
2. We then ask people to compete to rank these. A submission would be [an ordering of all the elements] and an optional [document defending their ordering].
3. We feed all of the entries in (2) into an LLM evaluation system. This would come with a lengthy predefined prompt. It would take in all of the provided orderings and all the provided defenses. It then outputs its own ordering.
4. We then score all of the entries in (2), based on how well they match the result of (3). (A rough sketch of steps 3 and 4 is below.)
5. The winner gets a cash prize. Ideally, all submissions would become public.
This is similar to this previous competition we did.
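For concreteness, here's a rough sketch of what step 3 could look like in code. The prompt wording, the model name, and the use of the OpenAI chat API are all placeholder assumptions rather than a settled design.

```python
# Rough sketch of step 3: feed every submitted ordering and defense to an LLM
# behind a predefined prompt, and get back the LLM's own ordering.
# The model name, prompt wording, and OpenAI client usage are placeholder assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_ordering(items: list[str], submissions: list[dict]) -> list[str]:
    """Ask the LLM for its best-guess ordering, given all submissions and defenses."""
    prompt = (
        "You are judging a ranking competition over the following EA outputs:\n"
        + json.dumps(items, indent=2)
        + "\n\nHere are all submitted orderings and their written defenses:\n"
        + json.dumps(submissions, indent=2)
        + "\n\nReturn your own best-guess ordering, from most to least valuable, "
        "as a JSON list of item names and nothing else."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable model would do
        messages=[{"role": "user", "content": prompt}],
    )
    # In practice you'd want to validate this output (models can return
    # malformed or incomplete lists).
    return json.loads(response.choices[0].message.content)
```

Step 4 would then just be running something like the `score_ordering` sketch above for each submission against the LLM's ordering.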
Questions:
1. “How would you choose which projects/items to analyze?”
One option could be to begin with a mix of well-regarded posts on the EA Forum. Maybe we keep things to a limited domain for now (just X-risk), but cover a spectrum of different karma levels.
2. “Wouldn’t the LLM do a poor job? Why not humans?”
Having human judges at the end of this would add a lot of cost; it could easily make the project 2x as expensive. Also, I think it’s good for us to learn how to use LLMs for evaluating these competitions, as that approach has more long-term potential.
3. “The resulting lists would be poor quality”
I think the results would be interesting, for a few reasons. I’d expect them to be better than what many individuals would come up with on their own. I also think it’s really important that we start somewhere. It’s very easy to delay things until we have something perfect, and then for that to never happen.
Really like the idea. Also, I would say yes, you need to keep this to an extremely limited domain; otherwise, I would assume the main crux will just be the LLM-vs-human analysis of different cause areas' relative value. Agree with 2/3 though.
Seems like there are two different broad ideas at play here: how good the blog post is, holding the topic fixed, and how important the topic is. I suppose you can try to tackle both at once, but I feel like that might be biting off a lot at once?
A topic I personally like to think about, and for which I could gather 20 quite related posts, is the % chance of extinction risk and/or the relative value of s-risk vs. extinction risk reduction.