+1 to comments about the paucity of details or checks. There are a range of issues that I can see.
Am I understanding the technical report correctly? It says “For each question, we sample 5 forecasts. All metrics are averaged across these forecasts.” It is difficult to interpret this precisely, but the most likely meaning I take from it is that you calculated accuracy metrics for 5 individual human forecasts per question, then averaged those accuracy metrics. That is not measuring the accuracy of “the wisdom of the crowd” — the crowd forecast would be scored by aggregating the individual forecasts first and then scoring the aggregate. What you have instead is a (very high variance) estimate of the accuracy of “an average forecaster on Metaculus”. If that interpretation is correct, all you’ve achieved is a bot that does better than an average Metaculus forecaster.
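To make the distinction concrete, here is a toy simulation (all numbers made up, purely illustrative) contrasting the two quantities: the average of the individual forecasters’ Brier scores versus the Brier score of the aggregated crowd forecast. Because the squared error is convex, Jensen’s inequality guarantees the aggregate is never worse, so beating the first quantity is a weaker claim than beating the second:

```python
import random

random.seed(0)

def brier(forecast, outcome):
    return (forecast - outcome) ** 2

n_questions = 10_000
per_forecaster_scores = []   # the "average forecaster" quantity
crowd_scores = []            # the "wisdom of the crowd" quantity

for _ in range(n_questions):
    # Made-up generative model: each question has a true probability p,
    # and each of 5 forecasters reports p plus noise, clipped to [0, 1].
    p = random.random()
    outcome = 1 if random.random() < p else 0
    forecasts = [min(1.0, max(0.0, p + random.gauss(0, 0.2))) for _ in range(5)]

    # Score each forecaster individually, then separately score the
    # aggregated (mean) forecast.
    per_forecaster_scores.extend(brier(f, outcome) for f in forecasts)
    crowd = sum(forecasts) / len(forecasts)
    crowd_scores.append(brier(crowd, outcome))

avg_individual = sum(per_forecaster_scores) / len(per_forecaster_scores)
avg_crowd = sum(crowd_scores) / len(crowd_scores)
print(f"average individual Brier: {avg_individual:.4f}")
print(f"crowd (aggregate) Brier:  {avg_crowd:.4f}")
# The crowd's Brier score is lower (better) than the average individual's.
```

The gap between the two numbers is exactly the gap between the two claims, which is why the averaging procedure in the report matters so much.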
I think it is likely that searches for historical articles will be biased by Google’s current search rankings. For example, if Israel actually did end up invading Lebanon, you might expect historical articles speculating about a possible invasion to be linked to more often by present-day articles, and therefore to rank higher in search results even when the query is restricted to articles written before the cutoff date. This would bias the model’s data collection towards articles that anticipated the actual outcome, and could partially explain good performance when predicting historical events.
Assuming that you have not made the mistake I described in my first point above, it’d be useful to dig into the result data a bit more to check how performance varies. Where does the bot tend to beat the wisdom of the crowd? For example, are there particular topics it performs better on? Does it tend to be more conservative, or more confident, than a crowd of human forecasters? How does its calibration curve compare to that of humans? These are questions I would expect to be answered in a technical report claiming to demonstrate superhuman forecasting ability.
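For reference, the calibration comparison I have in mind is just a reliability diagram: bin the (forecast, outcome) pairs by forecast probability and compare the mean forecast in each bin to the observed frequency, once for the bot and once for the crowd. A minimal sketch using hypothetical data (real data would come from the resolved questions in the evaluation set):

```python
def calibration_curve(pairs, n_bins=10):
    """Bin (forecast, outcome) pairs by forecast probability and return
    (mean forecast, observed frequency, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for forecast, outcome in pairs:
        idx = min(int(forecast * n_bins), n_bins - 1)  # clamp forecast=1.0
        bins[idx].append((forecast, outcome))
    curve = []
    for b in bins:
        if b:
            mean_f = sum(f for f, _ in b) / len(b)
            freq = sum(o for _, o in b) / len(b)
            curve.append((mean_f, freq, len(b)))
    return curve

# Hypothetical example: a small, well-calibrated set of forecasts.
pairs = [(0.1, 0), (0.1, 0), (0.5, 1), (0.5, 0), (0.9, 1), (0.9, 1)]
for mean_f, freq, n in calibration_curve(pairs):
    print(f"mean forecast {mean_f:.2f} -> observed frequency {freq:.2f} (n={n})")
```

Plotting the bot’s curve and the crowd’s curve on the same axes would directly answer the over/under-confidence question above.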
It might be worth validating that the knowledge cutoff of the LLM is actually the one you expect from the documentation. I do not trust public docs to stay up to date, and a stale cutoff date seems like a super easy failure mode for an evaluation like this.
I think the real proof will be in genuinely prospective forecasting: give 539 a Metaculus account and see how it performs.
Honestly, at a higher level, your approach is very unscientific. You have a demo and UI mockups illustrating how your tool could be used, and grandiose messaging across different forums, yet your technical report has no details whatsoever. Even the section on Platt scoring gives no motivation for why I should care about those metrics. This is a hype-driven approach to research that I am (not) surprised to see come out of ‘the centre for AI safety’.
Fwiw Metaculus has an AI Forecasting Benchmark Tournament. The Q3 contest ends soon, but another should come out afterwards and it would be helpful to see how 539 performs compared to the other bots.