Out of curiosity, what do you think of my argument that LLMs can’t pass a rigorous Turing test because a rigorous Turing test could include ARC-AGI-2 as a subset (and, indeed, any competent panel of judges should include it), and LLMs can’t pass that? Do you agree? Do you think that’s a higher level of rigour than a Turing test should have, and that it amounts to shifting the goalposts?
I think we both agree that there are ways to tell apart a human from an LLM of 2025, including handing ARC-AGI-2 to each.
Whether or not that fact means “LLMs of 2025 cannot pass the Turing Test” seems to be purely an argument about the definition / rules of “Turing Test”. Since that’s a pointless argument over definitions, I don’t really care to hash it out further. You can have the last word on that. Shrug :-P
Okay, since you’re giving me the last word, I’ll take it.
There are some ambiguities in how to interpret the concept of the Turing test, and people have disagreed about what the rules should be. I will say that in Turing’s original paper, he did introduce the idea of testing the computer via sub-games:
Q: Do you play chess?
A: Yes.
Q: I have K at my K1, and no other pieces. You have only K at K6 and R at R1. It is your move. What do you play?
A: (After a pause of 15 seconds) R-R8 mate.
Including other games or puzzles, like the ARC-AGI-2 puzzles, seems in line with this.
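As an aside, Turing’s example checks out if the descriptive notation is read from each player’s own side: the answerer’s K6 is e3 and R1 is (say) h8 in modern coordinates, and R-R8 becomes Rh1, a back-rank mate. Here’s a quick verification sketch using the python-chess library; the FEN is my translation of the position, and the rook’s file is an assumption, since “R1” doesn’t specify it.

```python
# Verify Turing's endgame answer with python-chess (pip install chess).
# Questioner (White): K on e1. Answerer (Black): K on e3, R on h8,
# reading the descriptive notation from each player's own side.
import chess

board = chess.Board("7r/8/8/8/8/4k3/8/4K3 b - - 0 1")

# "R-R8" = rook to Black's eighth rank, i.e. White's first rank: Rh1.
board.push_san("Rh1")

print(board.is_checkmate())  # True: the quoted answer is indeed mate
```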
My understanding of the Turing test has always been that there should be basically no restrictions at all — no time limit, no restrictions on what can be asked, no word limit, no question limit.
In principle, I don’t see why you wouldn’t allow sending images, but even if you allowed only text-based questions, I suppose a judge could tediously write out the ARC-AGI-2 tasks, since they consist of grids of coloured squares (up to 30 x 30), and ask the interlocutor to re-create them in Paint.
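For what it’s worth, the transcription wouldn’t even be that tedious: the tasks are distributed as JSON, with each grid a list of rows of colour indices from 0 to 9, so a judge could paste the raw arrays or read them out colour by colour. A minimal sketch (the grid here is made up for illustration, and the colour names follow the usual community mapping rather than anything official):

```python
# Spell out an ARC-style grid as plain text. In the task files each
# grid is a JSON list of rows, with cells as colour indices 0-9.
import json

COLOURS = ["black", "blue", "red", "green", "yellow",
           "grey", "magenta", "orange", "cyan", "maroon"]

grid = [[0, 0, 3],
        [0, 3, 0],
        [3, 0, 0]]  # tiny made-up example, not a real task

print(json.dumps(grid))  # the raw form, as it appears in the files

for row in grid:  # or dictated row by row
    print(", ".join(COLOURS[cell] for cell in row))
```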
To be clear, I don’t think ARC-AGI-2 is anywhere near the only thing you could use to make an LLM fail the Turing test; it’s just an easy example.
In Daniel Dennett’s 1985 essay “Can Machines Think?” on the Turing test (included in the anthology Brainchildren), Dennett says that “the unrestricted test” is “the only test that is of any theoretical interest at all”. He emphasizes that judges should be able to ask anything:
People typically ignore the prospect of having the judge ask off-the-wall questions in the Turing test, and hence they underestimate the competence a computer would have to have to pass the test. But remember, the rules of the imitation game as Turing presented it permit the judge to ask any question that could be asked of a human being—no holds barred.
He also warns:
Cheapened versions of the Turing test are everywhere in the air. Turing’s test is not just effective, it is entirely natural—this is, after all, the way we assay the intelligence of each other every day. And since incautious use of such judgments and such tests is the norm, we are in some considerable danger of extrapolating too easily, and judging too generously, about the understanding of the systems we are using.
It’s true that before we had LLMs, we had lower expectations of what computers could do and asked easier questions. But it doesn’t seem right to me to say that as computers get better at natural language, we shouldn’t be able to ask harder questions.
I do think the definition and conception of the Turing test is important. If people say that LLMs have passed the Turing test and that’s not true, it gives a false impression of LLMs’ capabilities, just like when people falsely claim LLMs are AGI.
You could qualify this by saying LLMs can pass a restricted, weak version of the Turing test — but not an unrestricted, adversarial Turing test — which was also true of older computer systems before deep learning. This would sidestep the question of defining the “true” Turing test and still give accurate information.
Thanks!