Kind of nerdy point, but I think the 0.8 effect size is likely inflated. A recent analysis of meta-analyses on psychotherapy for depression found that the average summary effect size across meta-analyses was g = 0.56.
There seemed to be evidence that lower-quality studies inflate the estimates:
Meta-analyses that did not exclude high risk of bias studies (mean g = 0.61, 95% CI [0.27, 0.95], k = 2413 included samples) produced larger effect size estimates than meta-analyses including only low risk of bias studies (mean g = 0.45, 95% CI [0.19, 0.72], k = 1034).
Also, many studies used inadequate controls:
Meta-analyses that included samples compared with a wait-list control group (mean g = 0.66, 95% CI [0.35, 0.96], k = 836) produced larger effect size estimates than those comparing treatment with care-as-usual (mean g = 0.52, 95% CI [0.22, 0.82], k = 1194).
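To get a feel for what these g values mean in practice, here's a quick sketch (my own illustration, not from the analysis) that converts Hedges' g into the probability that a randomly chosen treated person does better than a randomly chosen control, using the standard conversion Φ(g/√2) and assuming normally distributed outcomes:

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def prob_superiority(g: float) -> float:
    """P(random treated person outscores random control),
    assuming normal outcomes: Phi(g / sqrt(2))."""
    return normal_cdf(g / sqrt(2))

for label, g in [("claimed", 0.80),
                 ("average across meta-analyses", 0.56),
                 ("low risk of bias only", 0.45),
                 ("care-as-usual controls", 0.52)]:
    print(f"g = {g:.2f} ({label}): P(superiority) = {prob_superiority(g):.2f}")
```

Roughly: g = 0.8 means a treated person beats a control about 71% of the time, while the less biased estimates around 0.45–0.52 bring that down to about 62–64%. Still a real effect, just noticeably less impressive.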
This analysis was specifically on therapy for depression, but I would expect the main sources of effect size inflation to apply to other mental health problems as well. FWIW, therapy for anxiety seems to have larger effect sizes than therapy for depression, so I would expect better performance on anxiety even after accounting for possible publication bias and inadequate controls.
My intuition after playing around with many of these models is that GPT-3.5 is probably not good enough at general reasoning to produce consistent results. It seems likely to me that either GPT-4 or Claude 2 would be good enough. FWIW, in a recent video Nathan Labenz said that when people asked him for recommendations, he suggested starting with GPT-4 and going from there. The tradeoff is more complicated with Claude 2 (perhaps slightly weaker at reasoning, but with a longer context window).