I don't think we should say AI has passed the Turing test until it has passed the test under conditions similar to this:
But I do really like that these researchers have put the test online for people to try!
https://turingtest.live/
I've had one conversation as the interrogator, and I was able to easily pick out the human in two questions. My opener was:
"Hi, how many words are there in this sentence?"
The AI said "8", I said "are you sure?", and it reiterated its incorrect answer after claiming to have recounted.
The human said "9", I said "are you sure?", and they said "yes?"… indicating confusion and annoyance at being challenged on such an obvious question.
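For anyone who wants to double-check the count, here's a quick Python sanity check. I'm counting whitespace-separated tokens, so punctuation stays attached to the adjacent word:

```python
# Count the words in my opener by splitting on whitespace,
# so "Hi," and "sentence?" each count as one word.
opener = "Hi, how many words are there in this sentence?"
print(len(opener.split()))  # prints 9, matching the human's answer
```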
Maybe I was paired with one of the worse LLMs… but unless it's using hidden chain of thought under the hood (which it doesn't sound like it is), I don't think even GPT-4.5 can accurately perform counting tasks without writing out its full working.
My current job involves trying to get LLMs to automate business tasks, and my impression is that current state-of-the-art models are still a fair way from being truly indistinguishable from an average human, even when confronted with relatively simple questions! (Not saying they won't quickly close the gap, though; maybe they will!)
Thanks for sharing the original definition! I didn't realise Turing had defined the parameters so precisely, and that they weren't actually that strict!
I probably need to stop saying that AI hasn't passed the Turing test yet, then. I guess it has! You're right that this ends up being an argument over semantics, but it seems fair to let Alan Turing define what the term "Turing Test" should mean.
But I do think that the stricter form of the Turing test defined in that Metaculus forecast is still a really useful metric for deciding when AGI has been achieved, whereas this much weaker Turing test probably isn't.
(Also, for what it's worth, the business tasks I have in mind here aren't really "complex"; they're the kind of tasks that an average human could quite easily do well on within a 5-minute window, possibly as part of a Turing-test-style setup, but that LLMs struggle with.)