Huh interesting, I just tried that direction and it worked fine as well. This isn't super important, but if you wanted to share the conversation I'd be interested to see the prompt you used.
I got an error trying to look at your link:
For the first attempt at hangman, when the word was "butterfly", the prompt I gave was just:
After o4-mini picked a word, I added:
It said the word was an animal.
I guessed B, it said there was no B, and at the end said the word was "butterfly".
The second time, when the word was "schmaltziness", the prompt was:
o4-mini responded:
I said:
There were three words where the clue was so obvious I guessed the word on the first try.
Clue: "This animal 'never forgets.'"
Answer: Elephant
Clue: "A hopping marsupial native to Australia."
Answer: Kangaroo
After kangaroo, I said:
Clue: "A tactic hidden beneath the surface."
Answer: Subterfuge.
A little better, but I still guessed the word right away.
I prompted again:
o4-mini gave the clue "A character descriptor" and this began the disastrous attempt where it said the word "schmaltziness" had no vowels.
Fixed the link. I also tried your original prompt and it worked for me.
But interesting! The "Harder word, much vaguer clue" prompt seems to push it to stop actually playing hangman and instead antagonistically construct a word post hoc after each guess so that your guess comes out wrong. I asked "Did you come up with a word when you first told me the number of letters, or are you changing it after each guess?" and it said "I picked the word up front when I told you it was 10 letters long, and I haven't changed it since. You're playing against that same secret word the whole time." (Despite the fact that I can see from its reasoning trace that this is not what it's doing.) When I say I give up, it says "I'm sorry – I actually lost track of the word I'd originally picked and can't accurately reveal it now." (Because it realized that there was no word consistent with its clues, as you noted.)
So I don't think it's correct to say that it doesn't know how to play hangman. (It knows, as you noted yourself.) It just wants so badly to make you lose that it lies about the word.
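To be clearer about what antagonistic-but-consistent play would even look like: there's a classic trick (sometimes called "evil hangman") where the program never commits to one word at all, it just keeps every word still compatible with what it has said so far. Here's a rough Python sketch with a made-up word list, purely to illustrate the idea (it's not a claim about what o4-mini actually does internally):

```python
from collections import defaultdict

# Adversarial-but-honest hangman: never commit to a single word. Keep every
# word still consistent with what has been revealed, and answer each guess
# using whichever consistent subset makes the guess wrong, if possible.
def answer_guess(candidates, guess):
    families = defaultdict(list)
    for word in candidates:
        # Key each word by where (if anywhere) the guessed letter appears.
        positions = tuple(i for i, ch in enumerate(word) if ch == guess)
        families[positions].append(word)
    if () in families:                      # some candidates lack the letter,
        return (), families[()]             # so "no <letter>" is truthful
    # Otherwise reveal the positions shared by the largest remaining family.
    return max(families.items(), key=lambda kv: len(kv[1]))

# Tiny made-up word list of 9-letter animals:
candidates = ["butterfly", "chameleon", "orangutan", "crocodile"]
revealed, candidates = answer_guess(candidates, "b")
print(revealed, candidates)  # () ['chameleon', 'orangutan', 'crocodile']
```

Played that way, every answer stays consistent with some real word, so the game can be maximally unhelpful without ever contradicting itself. What I saw instead was the model ending up in a state where no word fit its own clues.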
There is some ambiguity in claims about whether an LLM knows how to do something. The spectrum of knowing how to do things ranges all the way from "Can it do it at least once, ever?" to "Does it do it reliably, every time, without fail?".
My experience was that I tried to play hangman with o4-mini twice, and it failed both times in the same really goofy way: it counted guesses as wrong even when I guessed a letter that was in the word it later said I had been supposed to be guessing.
When I played the game with o4-mini where it said the word was "butterfly" (and also said there was no "B" in the word when I guessed "B"), I didn't prompt it to make the word hard. I just said, after it claimed to have picked the word:
"E. Also, give me a vague hint or a general category."
o4-mini said:
"It's an animal."
So, maybe asking for a hint or a category is the thing that causes it to fail. I don't know.
Even if I accepted the idea that the LLM "wants me to lose" (which sounds dubious to me), it doesn't know how to do that properly, either. In the "butterfly" example, it could, in theory, have chosen a word retroactively that filled in the blanks but didn't conflict with any guesses it said were wrong. But it didn't do that.
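To spell out what a consistent retroactive choice would have required: there has to be at least one word that matches the blanks it showed and avoids every letter it rejected. A quick sketch of that check (Python; the word list and pattern here are just illustrative, not the actual game state):

```python
def consistent_words(word_list, pattern, rejected_letters):
    """Words that fit the revealed pattern ('_' = blank) and contain none of
    the letters the game claimed were absent."""
    rejected = set(rejected_letters)
    return [
        w for w in word_list
        if len(w) == len(pattern)
        and not rejected & set(w)
        and all(p in ('_', ch) for p, ch in zip(pattern, w))
    ]

# Once the game has said there is no "B", "butterfly" can never be a
# consistent answer, no matter what the blanks look like:
print(consistent_words(["butterfly"], "____e____", rejected_letters="b"))  # []
```

By that check, "butterfly" stopped being a legal answer the moment it said there was no "B", so whatever it was doing, it wasn't a coherent retroactive strategy.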
In the attempt where the word was "schmaltziness", o4-mini's response about which letters were where in the word (which I pasted in a footnote to my previous comment) was borderline incoherent. I could hypothesize that this was part of a secret strategy on its part to follow my directives, but much more likely, I think, is that it just lacks the capability to execute the task reliably.
Fortunately, we don't have to dwell on hangman too much, since there are rigorous benchmarks like ARC-AGI-2 that show more conclusively that the reasoning abilities of o3 and o4-mini are poor compared to typical humans.