Aidan Kankyoku comments on Announcing RoastMyPost: LLMs Eval Blog Posts and More

Aidan Kankyoku 19 Dec 2025 17:59 UTC
7 points
3 ∶ 0
I love this idea! I just took it for a spin and the quality of the feedback isn’t at a point I would find it very useful yet. My sense is that it’s limited by the quality of the agents rather than anything about the design of the app, though maybe changes in the scaffold could help.
Most of the critiques were myopic, such as:
- It labeled one sentence in my intro as a “hasty generalization/unsupported claim” when I spend most of the post supporting that statement.
- In one sentence, it raised a flag for “missing context” about a study I reference, with a different flag affirming that the link embedded in the sentence provides the context
- I make a claim about how the majority of people view an issue, providing support for the claim, then discussing the problems with the view. It raised a flag on the claim I’m critiquing, calling it unscientific– even from the sentences immediately before and after, it should be clear that was exactly my point!
I could list several more examples, most of the flags I clicked on were misunderstandings in similar ways. Article is here if you want to take a look: https://www.roastmypost.org/docs/jr1MShmVhsK6Igp0yJCL2/reader
- Jamie_Harris 31 Dec 2025 14:14 UTC
  6 points
  1 ∶ 0
  Parent
  I just tried it out and had a similar impression. I think this is a cool idea and am excited Ozzie created it, but suspect it needs active development (and/or improvements in the underlying models over time) before it’s useful to me, due to similarish issues to what you found. I’ll likely try a couple more times with drafts though!
- Ozzie Gooen 19 Dec 2025 21:04 UTC
  3 points
  0 ∶ 0
  Parent
  Thanks for the feedback!
  
  I did a quick look at this. I largely agree there were some incorrect checks.
  
  It seems like these specific issues were mostly from the Fallacy Check? That one is definitely too aggressive (in addition to having limited context), I’ll work on tuning it down. Note that you can choose which evaluators to run on each post, so going forward you might want to just skip that one at this point.
  - Aidan Kankyoku 19 Dec 2025 21:13 UTC
    2 points
    0 ∶ 0
    Parent
    It looks like maybe 60% fallacy check and 40% fact check. For instance, fact check:
    claims there are more farmed chickens than shrimps (!)
    Claims ICAW does not use aggressive tactics, apparently basing that on vague copy on their website
    - Ozzie Gooen 19 Dec 2025 21:26 UTC
      2 points
      0 ∶ 0
      Parent
      I’m looking now at the Fact Check. It did verify most of the claims it investigated on your post as correct, but not all (almost no posts get all, especially as the error rate is significant).
      
      It seems like with chickens/shrimp it got a bit confused by numbers killed vs. numbers alive at any one time or something.
      
      In the case of ICAWs, it looked like it did a short search via Perplexity, and didn’t find anything interesting. The official sources claim they don’t use aggressive tactics, but a smart agent would have realized it needed to search more. I think to get this one right would have involved a few more searches—meaning increased costs. There’s definitely some tinkering/improvements to do here.
      - Aidan Kankyoku 19 Dec 2025 22:17 UTC
        3 points
        0 ∶ 0
        Parent
        That makes sense, I don’t want to be overly fussy if it was getting most things right. I guess the thing is, it’s not helpful if it mostly recognizes true facts as true but mistakes some true facts as false, if it does not accurately flag a significant number of incorrect facts, which in clicking through a bunch of flags I didn’t see almost any I thought necessitated an edit.