Thanks for highlighting Beadle (2022), I will add it to our review!
I wonder how FFI superforecasters were selected. It’s important to first select forecasters who perform well and then evaluate their performance on new questions, to avoid the problem of “training and testing on the same data.”
Good question! There were many differences between the approaches of FFI and the GJP. One of them is that no superforecasters were selected and grouped together during the FFI tournament.
Here is Google’s translation of a relevant passage: “In FFI’s tournament, the super forecasters consist of the 60 best participants overall. FFI’s tournament was not conducted one year at a time, but over three consecutive years, where many of the questions were not decided during the current year and the participants were not divided into experimental groups. It is therefore not appropriate to identify new groups of super forecasters along the way” (2022, 168). You can translate the entirety of section 5.4 here for further clarification on how Beadle defines superforecasters in the FFI tournament.
So it’s fair to say that FFI supers were selected and evaluated on the same data? That seems concerning. Specifically, on which questions were the top 60 selected, and on which questions were the scores below calculated? Did these sets of questions overlap?
The standardised Brier scores of FFI superforecasters (–0.36) were almost identical to those of the initial forecasts of superforecasters in the GJP (–0.37).[17] Moreover, even though regular forecasters in the FFI tournament were less accurate than GJP forecasters overall (probably because they did not update, train, or work in teams), the relative accuracy of FFI’s superforecasters compared with regular forecasters (–0.06) and with defence researchers with access to classified information (–0.10) was strikingly similar.[18]
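For readers unfamiliar with the metric: a standardised Brier score compares each forecaster to the field, question by question, so negative values mean better-than-average accuracy. A minimal sketch of one common way to compute it (the function name and the exact standardisation convention are my assumptions, not taken from Beadle or the GJP):

```python
import numpy as np

def brier(prob, outcome):
    """Brier score for a binary forecast: squared error of the
    stated probability against the 0/1 resolution. Lower is better."""
    return (prob - outcome) ** 2

def standardised_briers(probs, outcomes):
    """Standardise Brier scores question by question: subtract the mean
    score of all forecasters on that question and divide by the standard
    deviation, then average across questions per forecaster.

    probs: (n_forecasters, n_questions) array of probabilities
    outcomes: (n_questions,) array of 0/1 resolutions
    Returns one standardised score per forecaster; negative = better
    than the average forecaster.
    """
    scores = brier(probs, outcomes)              # (n_forecasters, n_questions)
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    return z.mean(axis=1)
```

For example, a forecaster who says 0.9 on a question that resolves yes gets a raw Brier score of 0.01, well below a 0.5 forecaster's 0.25, and correspondingly a negative standardised score.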
Yes, the 60 FFI supers were selected and evaluated on the same 150 questions (Beadle, 2022, 169-170). Beadle also identified the top 100 forecasters based on the first 25 questions and evaluated their performance on the remaining 125 questions, to see whether their accuracy was stable over time or due to luck. As in the GJP studies, he found that they were consistent over time (Beadle, 2022, 128-131).
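The split-sample check described above can be sketched as follows. All numbers and data here are simulated for illustration (not from the report); the point is just that if skill is persistent, forecasters selected on an early batch of questions stay ahead on held-out questions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-forecaster Brier scores on 150 questions:
# a baseline, plus a persistent skill component, plus question noise.
n_forecasters, n_questions = 1000, 150
skill = rng.normal(0.0, 0.05, size=(n_forecasters, 1))
scores = 0.2 + skill + rng.normal(0.0, 0.1, size=(n_forecasters, n_questions))

# Select the top 100 on the first 25 questions only (lower Brier = better)...
train_mean = scores[:, :25].mean(axis=1)
top100 = np.argsort(train_mean)[:100]

# ...then compare them to the whole field on the 125 held-out questions.
test_mean = scores[:, 25:].mean(axis=1)
advantage = test_mean.mean() - test_mean[top100].mean()
```

With persistent skill, `advantage` comes out positive; if early performance were pure luck, it would hover around zero. That out-of-sample gap is what distinguishes Beadle's stability check from selecting and scoring on the same 150 questions.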
I should note that I have not studied the report very thoroughly, so I may be mistaken about this. I’ll have a closer look when I have the time and correct the answer above if it is wrong!