There is some ambiguity in claims about whether an LLM knows how to do something. The spectrum of knowing how to do things ranges all the way from "Can it do it at least once, ever?" to "Does it do it reliably, every time, without fail?".
My experience was that I tried to play hangman with o4-mini twice, and it failed both times in the same goofy way: it counted a guess as wrong even though the letter was in the word it later said I had been trying to guess.
When I played the game with o4-mini where it said the word was "butterfly" (and also said there was no "B" in the word when I guessed "B"), I didn't prompt it to make the word hard. I just said, after it claimed to have picked the word:
"E. Also, give me a vague hint or a general category."
o4-mini said:
"It's an animal."
So, maybe asking for a hint or a category is the thing that causes it to fail. I don't know.
Even if I accepted the idea that the LLM "wants me to lose" (which sounds dubious to me), it doesn't know how to do that properly, either. In the "butterfly" example, it could, in theory, have retroactively chosen a word that filled in the blanks without conflicting with any guesses it had called wrong. But it didn't do that.
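To make concrete what that retroactive choice would look like: here is a minimal sketch, assuming the host has an ordinary word list (the list and function name below are mine, purely for illustration), of how a hangman host could limit itself to words consistent with the blanks it has revealed and the guesses it has called wrong:

```python
def consistent_words(word_list, pattern, wrong_guesses):
    """Return the words that still fit the current game state.

    pattern: revealed positions, with "_" for unrevealed letters,
             e.g. "_UTT_____" if U and T have been revealed.
    wrong_guesses: letters the host has already said are not in the word.
    """
    pattern = pattern.replace(" ", "").upper()
    wrong = {g.upper() for g in wrong_guesses}
    matches = []
    for word in word_list:
        w = word.upper()
        if len(w) != len(pattern):
            continue
        # Every revealed letter has to line up with the candidate word.
        if any(p != "_" and p != c for p, c in zip(pattern, w)):
            continue
        # The candidate must not contain any letter already called wrong.
        if any(c in wrong for c in w):
            continue
        matches.append(word)
    return matches


# Tiny demo list, obviously not a real dictionary.
demo_list = ["butterfly", "cuttlefish", "jellyfish"]

# Once the host has said "B" is not in the word, "butterfly" is no longer a
# consistent choice, but "jellyfish" still is.
print(consistent_words(demo_list, "_________", wrong_guesses=["b"]))
# -> ['jellyfish']
```

By that standard, the moment it said there was no "B", "butterfly" was off the table, and it would have had to settle on some other consistent animal. o4-mini did nothing of the sort.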
In the attempt where the word was "schmaltziness", o4-mini's response about which letters were where in the word (which I pasted in a footnote to my previous comment) was borderline incoherent. I could hypothesize that this was part of a secret strategy to follow my directives, but I think it's much more likely that it simply lacks the capability to execute the task reliably.
Fortunately, we don't have to dwell on hangman too much, since there are rigorous benchmarks like ARC-AGI-2 that show more conclusively that the reasoning abilities of o3 and o4-mini are poor compared to those of typical humans.