I think this is neat! It’s also lengthy; I like the write-up.
Some quick thoughts:
1. I’d be curious if the source code or specific prompts that pgodzinai used are publicly available. It seems like it took the author less than 40 hours, so maybe they could be paid for this, worst case.
2. I find it interesting that the participants included commercial entities, academic researchers, etc. I’m curious if this means that there’s a budding industry of AI forecasting tools.
3. It sounds like a lot of the bots are using similar techniques, and it also seems like these techniques aren’t too complicated. Here the fact that pgodzinai took fewer than 40 hours comes to mind: “The top bot (pgodzinai) spent between 15 and 40hr on his bot.” At the very least, it seems like it should be doable to make a bot similar to pgodzinai and have it be an available open-source standard that others could begin experimenting with (see the rough sketch after this list). I assume we want there to be some publicly available forecasting bots that are ideally close to SOTA (especially if this is fairly cheap, anyway). One thing this could do is act as a “baseline” for future forecasting experiments by others.
4. I’m curious about techniques that could be used to do this for far more questions, like 100k questions. I imagine that there could be a bunch of narrower environments with limited question types, but in conditions where we could much more rapidly test different setups.
5. I imagine it’s a matter of time until someone can set up some RL environment that deeply optimizes a simple forecasting agent like this (though would do so in a limited setting).
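To give a sense of what I mean by “not too complicated,” here is a minimal sketch of what such a baseline bot could look like. This is purely illustrative and assumes an OpenAI-style chat API; the model name, prompt, and probability parsing are my own placeholders, not anyone’s actual bot.

```python
# Illustrative sketch only: NOT any tournament bot's actual code.
# Assumes the openai Python package (>=1.0) and OPENAI_API_KEY in the environment.
import re
from openai import OpenAI

client = OpenAI()

def forecast(question: str, background: str = "") -> float:
    """Ask an LLM for a probability on a binary question and parse it out."""
    prompt = (
        "You are a careful forecaster. Consider base rates and relevant context, "
        "then give a final probability.\n\n"
        f"Question: {question}\n"
        f"Background: {background}\n\n"
        "End your answer with a line of the form 'Probability: XX%'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    match = re.search(r"Probability:\s*(\d+(?:\.\d+)?)\s*%", text)
    if match is None:
        return 0.5  # fall back to maximum uncertainty if parsing fails
    # Clamp away from 0 and 1 to avoid extreme scoring penalties.
    return min(max(float(match.group(1)) / 100.0, 0.01), 0.99)

if __name__ == "__main__":
    print(forecast("Will it rain in London tomorrow?"))
```

Obviously a competitive bot would add retrieval of recent news, multiple samples, aggregation, and so on, but the core loop really is about this simple.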
Thanks Ozzie! Phil Godzin’s code isn’t public, but our simple template bot is. The o1-preview and Claude 3.5-powered template bots did pretty well relative to the rest of the bots.
As I think about it, this surprises me a bit. Did participants have access to these early on?
If so, it seems like many participants underperformed the examples/defaults? That seems kind of underwhelming. I guess it’s easy to make a lot of changes that seem good at the time but wind up hurting performance when tested. Of course, this raises the concern that there wasn’t any faster/cheaper way of testing these bots first. Something seems a bit off here.
Yes, they’ve had access to the template from the get-go, and I believe a lot of people built their bots on the template. I guess it doesn’t surprise me that much. Just another case of KISS.
That said, pgodzinai did layer quite a lot of things, albeit in under 40 hours, and did remarkably well, peer score-wise (compared to his bot peers). And no one did any fine-tuning afaik, which plausibly could improve performance.
As for a faster/cheaper way to test the bots: we’re working on something to address this!
That’s useful, thanks!