Thanks Ben, this is interesting. I think we disagree somewhat on the extent to which relative Brier avoids the question selection problem (see Nuno’s comment on this), and also on whether it’s desirable to award no points for agreeing with the crowd. That said, I think the case for relative Brier being the best option is reasonable, and you have made it well.
I’m particularly interested in your comment that extremising could make the overconfidence incentive in some tournament scoring systems desirable. My understanding is that the qualitative argument for extremising is that if several people independently rate an event as almost certain, they may have different, independent reasons for doing so, so the combined evidence justifies more confidence than any individual forecast. It seems the benefit of extremising may be much smaller, and possibly non-existent, if the crowd can see the aggregate forecast, and perhaps more so if the crowd can see every individual forecast that’s been made. Do you know of any research on this? I’d be interested to see some.
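To spell out the toy version of that independence argument (my own sketch, not something from your post): two forecasters who each reach 90% from independent evidence, against a 50% prior, jointly justify a probability well above 90%.

```python
def combine_independent(p1: float, p2: float) -> float:
    """Naive-Bayes combination of two forecasts, assuming a 50% prior
    and that the two forecasters' evidence is independent."""
    return (p1 * p2) / (p1 * p2 + (1 - p1) * (1 - p2))

print(combine_independent(0.9, 0.9))  # ~0.988, more extreme than either input
```

The independence assumption is doing all the work here, which is why I’d expect visibility of other forecasts to erode the benefit.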
As far as I know, the Metaculus algorithm does not “deliberately” extremise; however, the exact procedure is not public, and it did recently produce a very confident set of predictions!
Re: question selection—I agree that there are some edge cases where the scoring system doesn’t have perfect incentives around question selection (Nuno’s being a good example). But for us, getting people to forecast at all in these tournaments has been a much, much bigger problem than any question-selection nuances inherent in the scoring system. If improving overall system accuracy is the primary goal, we’ll get much more juice (IMO) out of focusing time/resources/effort on increasing overall participation.
Re: extremizing—I haven’t read specific papers on this (though there are probably some out there from the IARPA ACE program, if I had to guess). This might be related, but I admit I haven’t actually read it :) - https://arxiv.org/pdf/1506.06405.pdf
But we’ve seen improvements in the aggregate forecast’s Brier score when we apply very basic extremization to it (i.e., anything below 50% gets pushed closer to 0%, anything above 50% gets pushed closer to 100%; see the sketch below). This was true even when we showed the crowd forecast to individuals. But I’ll also be the first to admit that connecting this to the idea that an overconfidence incentive is a good thing is purely speculative, and not something we’ve explicitly tested or investigated.
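For concreteness, here’s a minimal sketch of the kind of transform I mean (the exponent and the data below are hypothetical, purely for illustration—our actual setup differs in the details):

```python
import numpy as np

def extremize(p: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Push probabilities away from 50%: values below 0.5 move toward 0,
    values above 0.5 move toward 1. k > 1 sets the strength (k = 1 is a no-op)."""
    return p**k / (p**k + (1 - p)**k)

def brier(p: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean Brier score against binary outcomes (lower is better)."""
    return float(np.mean((p - outcomes) ** 2))

# Hypothetical aggregate forecasts and their binary resolutions
p = np.array([0.2, 0.7, 0.9, 0.4])
y = np.array([0, 1, 1, 0])
print(brier(p, y))             # raw aggregate
print(brier(extremize(p), y))  # extremized aggregate
```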