Why do you think participants largely didn’t change their minds?
Damien Laird
I agree with your concerns about using a pure Brier score on open platforms. Right now I expect it makes the most sense within “tournaments” where participants answer every question. Technically, I think some sort of objective, proper scoring rule is a prerequisite for a more advanced scoring system that conveys more useful information in open contexts.
I’ve seen a “relative Brier score” referenced frequently in the associated research (definitely in the Good Judgment Project papers, at a minimum) that scores forecasters based on the difficulty of each question, as determined by the performance of the others who forecasted it. This seems promising, and I expect there are a lot of options in that direction.
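To make that concrete, here’s a rough sketch (in Python) of one way a difficulty-adjusted score could work, comparing each forecaster against the average Brier score on the same question. This is just my illustration of the idea, not the exact method from the Good Judgment Project papers:

```python
def brier(prob: float, outcome: int) -> float:
    """Binary Brier score: (p - o)^2, lower is better."""
    return (prob - outcome) ** 2

# forecasts[question_id][forecaster_id] = probability assigned to "yes"
forecasts = {
    "q1": {"alice": 0.9, "bob": 0.6, "carol": 0.4},
    "q2": {"alice": 0.2, "bob": 0.3, "carol": 0.7},
}
outcomes = {"q1": 1, "q2": 0}  # 1 = resolved yes, 0 = resolved no

relative_scores: dict[str, list[float]] = {}
for qid, probs in forecasts.items():
    scores = {name: brier(p, outcomes[qid]) for name, p in probs.items()}
    # Use the crowd's average score on this question as a proxy for difficulty.
    question_mean = sum(scores.values()) / len(scores)
    for name, score in scores.items():
        # Negative = better than the crowd on this question.
        relative_scores.setdefault(name, []).append(score - question_mean)

for name, rel in relative_scores.items():
    print(name, round(sum(rel) / len(rel), 3))
```

The appeal for open platforms is that a negative average means you beat the crowd on whichever questions you chose to answer, which makes scores more comparable when people answer different subsets of questions.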
Project Idea: Profiles Aggregating Forecasting Performance Metrics
Research Summary: Forecasting with Large Language Models
I would say it a little differently. I would say that “judgmental” forecasting, the kind typically done on Metaculus, Good Judgment Open, or similar platforms, CAN involve mathematical models, but oftentimes people are just doing some simple math, if any at all. In cases where people do use models, sure, it would make sense to link to them as sources, and I agree that would also be valuable to track for similar reasons. Guesstimate seems like the obvious place to do that.
I think that is separate from the proposition I intended to communicate, which was about primarily text-based research.
I also wasn’t anticipating any need to do scraping if this were implemented by the two platforms themselves. It should be easy enough for them to tell whether a citation links to an EA Forum post? Metaculus doesn’t have a footnote/citation formatting tool today like the EA Forum’s. (Although if you were to scrape, finding EA Forum links within citations on this forum seems pretty well defined and achievable? idk, I don’t write much code, thus me floating this out here for feedback.)
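To gesture at what I have in mind, here’s a rough sketch of the link-detection step in Python. It assumes you already have the rendered HTML of a question or comment body; how you’d actually fetch that from Metaculus or the EA Forum isn’t covered here, and the example post ID is made up:

```python
import re

# EA Forum post URLs look like forum.effectivealtruism.org/posts/<id>/<slug>
EA_FORUM_LINK = re.compile(
    r"https?://(?:www\.)?forum\.effectivealtruism\.org/posts/([A-Za-z0-9]+)[^\s\"'<>]*"
)

def cited_ea_forum_posts(html: str) -> set[str]:
    """Return the set of EA Forum post IDs linked anywhere in the text."""
    return {match.group(1) for match in EA_FORUM_LINK.finditer(html)}

example = (
    '<p>See <a href="https://forum.effectivealtruism.org/posts/abc123XYZ/'
    'nuclear-risk">this post</a> for the base rates I used.</p>'
)
print(cited_ea_forum_posts(example))  # {'abc123XYZ'}
```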
Thanks for the thoughts!
Good points.
This post comes to mind, which I cited in my nuclear GCR forecasts here, along with many other posts from that series. In general I expect posts from Rethink Priorities to be relevant, and I’ve seen similarly high-quality posts for AI risks and pandemics here. Most of my familiarity is with GCRs, but I expected there to be strong overlap between popular forecasting topics and popular EA Forum topics more generally. There are lots of GCR-related questions on Metaculus, and you can find many cited in that link with my forecasts.
Still, I think you’re right that this wouldn’t be applicable to the majority of EA Forum posts. Maybe it’s only displayed once a post is actually cited in a forecast, or only posts with a particular tag are eligible, in order to simplify the implementation.
I do think making people’s forecasting performance more obvious in different contexts would be very useful for the community (re: your Brier scores in EAF profiles idea). I would love a central site that’s sort of like a minimum viable LinkedIn: it would consolidate relevant metrics for an individual across the top forecasting platforms and offer an API that makes it easy to connect to other accounts, use with Discord bots, etc. I may write about this soon.
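As a very rough sketch of the kind of record such a site might expose through its API (all platform names, fields, and URLs below are hypothetical placeholders, not an existing schema):

```python
from dataclasses import dataclass, field

@dataclass
class PlatformRecord:
    platform: str            # e.g. "Metaculus", "Good Judgment Open"
    profile_url: str
    questions_resolved: int
    brier_score: float       # platform-reported accuracy metric

@dataclass
class ForecasterProfile:
    handle: str
    records: list[PlatformRecord] = field(default_factory=list)

    def to_api_response(self) -> dict:
        """Shape that a Discord bot or another site could consume."""
        return {
            "handle": self.handle,
            "platforms": [vars(r) for r in self.records],
        }

# Hypothetical example record, not real data.
profile = ForecasterProfile(
    handle="example_forecaster",
    records=[
        PlatformRecord("Metaculus", "https://www.metaculus.com/accounts/profile/0/", 120, 0.11),
    ],
)
print(profile.to_api_response())
```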
Generating forecasts associated with a post is interesting and I’m sure there are UX opportunities to make this easier / more common, but I need to think more about it.
Thanks for the thoughtful response!
I was also a participant and have my own intuitions from my limited experience. I’ve had lots of great conversations with people where we both learned new things and updated our beliefs… but I don’t know that I’ve ever had one in an asynchronous comment-thread format. Especially given the complexity of the topics, I’m just not sure that format was up to the task. During the whole tournament I found myself wanting to create a Discord server and set up calls to dig deeper into assumptions and disagreements. I totally understand the logistical challenges something like that would impose, as well as how much harder it would make analyzing the communication between participants, but my biggest open question after the tournament was how much better our outputs could have been with a richer collaboration environment.
I asked the original question to try to get at the intuitions of the researchers, who have seen all of the data. They outline possible causes and directions for investigation in the paper, which is the right thing to do, but I’m still interested in what they believe happened this time.