There is some ambiguity in claims about whether an LLM knows how to do something. The spectrum of knowing how to do things ranges all the way from "Can it do it at least once, ever?" to "Does it do it reliably, every time, without fail?".
My experience was that I tried to play hangman with o4-mini twice, and it failed both times in the same goofy way: it counted a guess as wrong even though the letter was in the word it later said I had been trying to guess.
When I played the game with o4-mini where it said the word was "butterfly" (and also said there was no "B" in the word when I guessed "B"), I didn't prompt it to make the word hard. I just said, after it claimed to have picked the word:
"E. Also, give me a vague hint or a general category."
o4-mini said:
"It's an animal."
So, maybe asking for a hint or a category is the thing that causes it to fail. I don't know.
Even if I accepted the idea that the LLM "wants me to lose" (which sounds dubious to me), it doesn't know how to do that properly, either. In the "butterfly" example, it could, in theory, have retroactively chosen a word that filled in the blanks without conflicting with any guesses it had called wrong. But it didn't do that.
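To make concrete what that retroactive choice would look like: here is a minimal sketch, assuming the host has an ordinary word list (the list and function name below are mine, purely for illustration), of how a hangman host could limit itself to words consistent with the blanks it has revealed and the guesses it has called wrong:

```python
def consistent_words(word_list, pattern, wrong_guesses):
    """Return the words that still fit the current game state.

    pattern: revealed positions, with "_" for unrevealed letters,
             e.g. "_UTT_____" if U and T have been revealed.
    wrong_guesses: letters the host has already said are not in the word.
    """
    pattern = pattern.replace(" ", "").upper()
    wrong = {g.upper() for g in wrong_guesses}
    matches = []
    for word in word_list:
        w = word.upper()
        if len(w) != len(pattern):
            continue
        # Every revealed letter has to line up with the candidate word.
        if any(p != "_" and p != c for p, c in zip(pattern, w)):
            continue
        # The candidate must not contain any letter already called wrong.
        if any(c in wrong for c in w):
            continue
        matches.append(word)
    return matches


# Tiny demo list, obviously not a real dictionary.
demo_list = ["butterfly", "cuttlefish", "jellyfish"]

# Once the host has said "B" is not in the word, "butterfly" is no longer a
# consistent choice, but "jellyfish" still is.
print(consistent_words(demo_list, "_________", wrong_guesses=["b"]))
# -> ['jellyfish']
```

By that standard, the moment it said there was no "B", "butterfly" was off the table, and it would have had to settle on some other consistent animal. o4-mini did nothing of the sort.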
In the attempt where the word was "schmaltziness", o4-mini's response about which letters were where in the word (which I pasted in a footnote to my previous comment) was borderline incoherent. I could hypothesize that this was part of a secret strategy to follow my directives, but I think it's much more likely that it simply lacks the capability to execute the task reliably.
Fortunately, we don't have to dwell on hangman too much, since there are rigorous benchmarks like ARC-AGI-2 that show more conclusively that the reasoning abilities of o3 and o4-mini are poor compared to those of typical humans.