Thanks for doing this work, this seems like a particularly useful benchmark to track the world model of AI systems.
I found it pretty interesting to read the prompts you use, which are quite extensive and give a lot of useful structure to the reasoning. I was surprised to see in table 16 that the zero-shot prompts had almost the same performance level. I imagine the prompting introduces a fair amount of variance, and I wonder whether I should expect scaffolding (like what https://futuresearch.ai/ are presumably focusing on) to cause significant improvements.