Ozzie Gooen comments on Announcing RoastMyPost: LLMs Eval Blog Posts and More

Ozzie Gooen 19 Dec 2025 21:04 UTC
3 points
0 ∶ 0
Thanks for the feedback!

I took a quick look at this. I largely agree that there were some incorrect checks.

It seems like these specific issues were mostly from the Fallacy Check? That one is definitely too aggressive (in addition to having limited context), so I’ll work on tuning it down. Note that you can choose which evaluators to run on each post, so for now you might want to just skip that one.
- Aidan Kankyoku 19 Dec 2025 21:13 UTC
  2 points
  0 ∶ 0
  It looks like maybe 60% fallacy check and 40% fact check. For instance, fact check:
  - Claims there are more farmed chickens than shrimp (!)
  - Claims ICAW does not use aggressive tactics, apparently basing that on vague copy on their website
  - Ozzie Gooen 19 Dec 2025 21:26 UTC
    2 points
    0 ∶ 0
    I’m looking now at the Fact Check. It did verify most of the claims it investigated on your post as correct, but not all (almost no posts get all of their claims verified, especially as the checker’s error rate is significant).
    
    It seems like with chickens/shrimp it got a bit confused by numbers killed vs. numbers alive at any one time or something.
    
    In the case of ICAW, it looks like it did a short search via Perplexity and didn’t find anything interesting. The official sources claim they don’t use aggressive tactics, but a smart agent would have realized it needed to search more. I think getting this one right would have involved a few more searches, meaning increased costs. There’s definitely some tinkering/improvements to do here.
    - Aidan Kankyoku 19 Dec 2025 22:17 UTC
      3 points
      0 ∶ 0
      That makes sense, and I don’t want to be overly fussy if it was getting most things right. I guess the thing is, it’s not helpful if it mostly recognizes true facts as true but mistakes some true facts for false ones, unless it also accurately flags a significant number of incorrect facts. Clicking through a bunch of flags, I saw almost none that I thought necessitated an edit.