I think the reasons we didn’t go deeper on this are basically a mix of:
Eh, I’m not sure how much signal you get from the simple prototypes. Like for sure you can get some, but mostly what you’re testing is “Are the LLMs already good enough that they can be quite useful here even with minimal scaffolding?”
A lot of the research was done 6-9 months ago (when Claude Code was significantly weaker)
Questions of comparative advantage—it being unclear that we’re the best people to be exploring this (although I agree that Claude Code makes this more plausible than it would have been in the past)
We didn’t want to let the perfect be the enemy of the good—indeed in many ways Claude Code improvements make it more attractive to get out, since it’s more plausible that someone else will casually run with and develop one of these ideas
We didn’t; although two of us were involved in running the AI for Human Reasoning fellowship, and some of the fellows on that did.