There’s some evidence humans are also likely to fabricate post-hoc reasons for doing something. For example:
Split-brain patients have acted in ways consistent with input presented to one visual field while giving verbal explanations based only on input presented to the other visual field (the one available to the verbal hemisphere): https://en.wikipedia.org/wiki/Split-brain.
Giving people a small incentive to lie about enjoying a tedious task has led them to later report liking it more than people given a larger incentive did: https://en.wikipedia.org/wiki/Forced_compliance_theory
Patients with Korsakoff syndrome have severe memory deficits and often confabulate, e.g. answering questions they cannot possibly know the answer to with made-up answers.
In some experiments, people asked to choose between two identical items came up with reasons for their choice; in others, people asked to choose between two options gave coherent reasons for their “choice” even when they were actually handed the option they didn’t choose (choice blindness). https://bigthink.com/mind-brain/confabulation-why-telling-ourselves-stories-makes-us-feel-ok/ ; https://www.verywellmind.com/what-is-choice-blindness-2795019
That humans can confabulate things is not really relevant. The point is that the model’s textual output is being claimed as clear and direct evidence of the model’s actual capacities, purposes, or intentions, but you can generate an indistinguishable response when the model is not, and cannot be, reporting on its actual intentions. So the test is simply not a good test.
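One way to make this precise is a quick Bayesian sketch (the symbols H and R are mine, not from the original comment): let H be the hypothesis that the output is a genuine report of the model's intentions, and R the observed response. If an indistinguishable R can be produced whether or not H holds, the likelihood ratio is roughly 1:

\[
\underbrace{\frac{P(H \mid R)}{P(\neg H \mid R)}}_{\text{posterior odds}}
=
\underbrace{\frac{P(R \mid H)}{P(R \mid \neg H)}}_{\text{likelihood ratio} \,\approx\, 1}
\times
\underbrace{\frac{P(H)}{P(\neg H)}}_{\text{prior odds}}
\]

A likelihood ratio near 1 means observing R leaves the odds on H essentially where they started, which is exactly the sense in which the test fails to discriminate.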