Recent generations of Claude seem better at understanding blog posts and making fairly subtle judgment calls than most smart humans. These days, when I read an article that presumably sounds reasonable to most people but has what seems to me to be a glaring conceptual mistake, I can paste it into Claude, ask it to identify the mistake, and more often than not Claude lands on the same mistake as the one I identified.
I think before Opus 4 this was essentially impossible. The Claude 3.x models could sometimes identify small errors, but it was a crapshoot whether they could identify central mistakes, and they certainly couldn’t judge them well.
It’s possible I’m wrong about the mistakes here and Claude is just being sycophantic, identifying which thing I’d regard as the central mistake; but if that’s true, in some ways it’s even more impressive.
Interestingly, both Gemini and ChatGPT failed at these tasks. (They can sometimes directionally approach the error I identified, but their formulation is imprecise and broad, and they bury it in a longer list of potential quibbles rather than zeroing in on the most damning issue.)
For clarity, here are the 3 articles I recently asked Claude to reassess (Claude got the central error in 2 of the 3). I’m also a little curious what the LW baseline is here; I did not include my own comments in my prompts to Claude.
https://terrancraft.com/2021/03/21/zvx-the-effects-of-scouting-pillars/
https://www.clearerthinking.org/post/what-can-a-single-data-point-teach-you
https://www.lesswrong.com/posts/vZcXAc6txvJDanQ4F/the-median-researcher-problem-1
EDIT: I noticed that in my examples I primed Claude a little, and when unprimed, Claude does not reliably (or even usually) get to the answer. However, the Claude 4.x models are still notable for how little handholding they need on this class of conceptual errors: the Gemini models often take something like 5 hints where Claude usually gets it with one. And my impression was that the Claude 3.x models were kinda hopeless (they often didn’t get it even with short explanations from me, and when they did, I’m not confident they actually got it rather than just wanting to agree).
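For concreteness, the unprimed version of the test has roughly the following shape (a minimal sketch using the Anthropic Python SDK; the model name, prompt wording, and article placeholder are illustrative assumptions, not the exact prompt I ran):

```python
# Sketch of an "unprimed" critique test: give the model the full article
# and ask only for the single most central conceptual mistake, without
# hinting at what you think it is.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

article_text = "<paste the full article here>"  # placeholder

response = client.messages.create(
    model="claude-opus-4-20250514",  # placeholder; any Claude 4.x model
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": (
            "Here is a blog post:\n\n" + article_text +
            "\n\nWhat, if anything, is the single most important "
            "conceptual mistake in this post? Be specific."
        ),
    }],
)
print(response.content[0].text)
```

A “primed” variant is the same thing with a hint sentence appended to the prompt; the hint count I mention above is how many such sentences it takes before the model lands on the error.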
what prompt did you use?
This resonates a lot. I’m keen to connect with others who are actively thinking about when it becomes justified to hand off specific parts of their work to AI.
Reading this, it seems like the key discovery wasn’t “Claude is good at critique in general,” but that a particular epistemic function — identifying important conceptual mistakes in a text — crossed a reliability threshold. The significance, as I read it, is that you can now trust Claude roughly like a reasonable colleague for spotting such mistakes, both in your own drafts and in texts you rely on at work or in life.
I’m interested in concrete ways people are structuring this kind of exploration in practice: choosing which tasks to stress-test for delegation, running those tests cheaply and repeatably, and deciding when a workflow change is actually warranted rather than premature.
My aim is simple: produce higher-quality output more quickly without giving up epistemic control. If others are running similar experiments, have heuristics for this, or want to collaborate on lightweight evaluation approaches, I’d be keen to compare notes.
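To make this concrete, the kind of lightweight harness I have in mind looks something like the sketch below (assuming the Anthropic Python SDK; the model name, test cases, and the pass/fail judgment are all placeholders you’d supply yourself):

```python
# Minimal delegation stress-test: for each article where you already know
# the central mistake, ask the model blind and record whether its answer
# matches your own diagnosis. The "matched" judgment stays with the human.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

MODEL = "claude-opus-4-20250514"  # placeholder model name

# (article text, your own one-line diagnosis) -- fill in your own cases
cases = [
    ("<full text of article 1>", "confuses correlation with causation"),
    ("<full text of article 2>", "headline statistic uses the wrong base rate"),
]

for article, my_diagnosis in cases:
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Here is a blog post:\n\n{article}\n\n"
                       "What is the single most important conceptual "
                       "mistake in this post? Be specific.",
        }],
    )
    print("Model:", response.content[0].text[:200])
    print("Mine: ", my_diagnosis)
    # Score by hand: did the model land on the same central mistake?
    verdict = input("Match? [y/n] ").strip().lower() == "y"
    print("MATCH" if verdict else "MISS", "\n")
```

Re-running the same cases after each model release would give a cheap before/after comparison, which seems like the minimum needed to decide when a workflow change is actually warranted.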
The significance, as I read it, is that you can now trust Claude roughly like a reasonable colleague for spotting such mistakes, both in your own drafts and in texts you rely on at work or in life.

I wouldn’t go quite this far, at least based on my comment. There’s a saying in startups, “never outsource your core competency”, and unfortunately reading blog posts and spotting conceptual errors of a certain form is a core competency of mine. Nonetheless, I’d encourage other Forum users less good at spotting errors (which is most people) to try something like this: paste posts that seem a little fishy into Claude and see if it’s helpful.[1]
For me, Claude is more helpful for identifying factual errors, and for challenging my own blog posts at different levels (e.g. spelling, readability, conceptual clarity, logical flow). I wouldn’t bet on it spotting conceptual/logical errors in my posts that I missed, but again, I have a very high opinion of myself here.
[1] To be clear, I’m not sure the false-positive/false-negative ratio is good enough for other people.