I was also going to recommend this, but I’ll just add an implementation idea (which IDK if I fully endorse): you could try to recruit a few superforecasters or subject-matter experts (SMEs) in a given field to provide forecasts on the questions at the same time, then have a reciprocal scoring element (i.e., who came closest to the superforecasters’/SMEs’ forecasts). This is basically what was done in the 2022 Existential Risk Persuasion/Forecasting Tournament (XPT), which Philip Tetlock ran (and I participated in). IDK when the study results for that tournament will be out, and maybe it won’t end up recommending reciprocal scoring, but it definitely seems worth considering.
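To make the reciprocal scoring idea concrete, here’s a minimal sketch of one way you *could* operationalize “who came closest”: score each team by its squared distance to the median SME/superforecaster probability on each question. This is just an illustration of the general idea; the actual rule used in the XPT may differ, and the function name and data below are made up.

```python
import statistics

def reciprocal_score(team_forecasts, sme_forecasts):
    """Score a team's probability forecasts by their squared distance to the
    SME/superforecaster median on each question (lower is better).

    team_forecasts: {question_id: probability}
    sme_forecasts:  {question_id: [probabilities, one per SME]}
    """
    total = 0.0
    for qid, p_team in team_forecasts.items():
        p_benchmark = statistics.median(sme_forecasts[qid])
        total += (p_team - p_benchmark) ** 2  # squared distance, Brier-style
    return total / len(team_forecasts)

# Hypothetical example: two questions, three SMEs
smes = {"q1": [0.20, 0.25, 0.30], "q2": [0.60, 0.55, 0.70]}
team = {"q1": 0.35, "q2": 0.50}
print(reciprocal_score(team, smes))  # lower = closer to the SME consensus
```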
A separate idea (which again IDK if I fully endorse, but was also in the XPT): have people provide dense rationales for a few big forecasts, then rate participants on the merits of those rationales. (Yes, this involves subjectivity, but it’s not very different from judging speech and debate tournaments. The bigger problem could be the time required to review the rationales, but even that seems manageable, especially if you provide a clear rubric, as is common in some competitive speech leagues.)
A trial of #2 would have some information value: you could see how strongly the rationale scores correlate with the final standings, which would help you decide whether rationales are a good way to produce a same-week result.
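If you did run that trial, the check itself is simple, e.g., a rank correlation between rationale scores and final standings. A rough sketch below, with hypothetical per-team data and Spearman’s rho as my choice of statistic (not something taken from the XPT):

```python
from scipy.stats import spearmanr

# Hypothetical data: one entry per team.
rationale_scores = [8.5, 7.0, 9.0, 5.5, 6.0]       # judges' rubric scores (higher = better)
final_brier      = [0.12, 0.10, 0.18, 0.25, 0.22]  # resolution-based Brier scores (lower = better)

# Spearman rank correlation; negate the Brier scores so "higher = better" on both axes.
rho, p_value = spearmanr(rationale_scores, [-b for b in final_brier])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```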
Maybe you could also combine the two: use idea #1 for the main scoring, with only the top-scoring teams making it to the rationale round, to cut down on time spent scoring rationales?
TBH, I think the time spent scoring rationales is probably quite manageable: I don’t think it should take longer than 30 person-minutes to decently judge each rationale (e.g., three judges each spending 10 minutes on it), maybe less. It might be difficult to have results within 1-2 hours if you don’t have that many judges, but they should probably be available by the end of the day.
To be clear, I was thinking that only a small number (no more than three, maybe just two) of the total questions should be “rationale questions.”
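For what it’s worth, a quick back-of-the-envelope check of the judging workload, with purely hypothetical team and judge counts, suggests the end-of-day estimate is plausible:

```python
# Rough judging-workload estimate (all numbers are hypothetical).
teams = 20                    # teams reaching the rationale round
rationale_questions = 2       # "rationale questions" per team
judges = 6
minutes_per_rationale = 30    # e.g., three judges x 10 minutes each

total_person_minutes = teams * rationale_questions * minutes_per_rationale
wall_clock_hours = total_person_minutes / judges / 60
print(f"{total_person_minutes} person-minutes ~ {wall_clock_hours:.1f} hours with {judges} judges")
# -> 1200 person-minutes ~ 3.3 hours with 6 judges
```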
But the information value of “do rationale scores correlate with performance?” would definitely be interesting! I’m not sure whether the literature has ever looked at this (I don’t think I’ve encountered anything like it, but I haven’t actively searched for it).