You can extract the raw scores from the Appendix with some work, or I could pass them along. Alternatively, you could tell me how to do this sort of analysis in Python, if you would rather I did it!
Ah, thanks for the suggestion! To be honest, I only have basic knowledge of stats, so I do not know how to do the more complex analysis you described. My (quite possibly flawed) intuition for analysing all questions would be:
Determine, for each subject, “overall score” = (“score of question 1” + “score of question 2” + “score of question 3”)/3.
If a subject did not answer all 3 questions, “overall score” = “sum of the scores of the answered questions”/“number of answered questions”.
Calculate the mean and standard error for each of the AI safety materials.
Repeat the calculation of the p-value as I illustrated above for the pairs of AI safety materials (best, 2nd best), (2nd best, 3rd best), …, and (2nd worst, worst), or just analyse all possible pairs.
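In case it helps, here is a minimal Python sketch of the steps above. Everything specific in it is an assumption for illustration: the material names and scores are made up, `None` stands for an unanswered question, and the p-value uses a two-sided normal approximation on the difference between two means, which may not match the calculation illustrated earlier in the thread.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist, mean, stdev


def overall_score(scores):
    """Average the answered questions; None marks a skipped question."""
    answered = [s for s in scores if s is not None]
    return sum(answered) / len(answered) if answered else None


def mean_and_se(values):
    """Return the sample mean and the standard error of the mean."""
    return mean(values), stdev(values) / sqrt(len(values))


def two_sided_p(mean_a, se_a, mean_b, se_b):
    """Normal-approximation p-value for a difference between two means."""
    z = (mean_a - mean_b) / sqrt(se_a**2 + se_b**2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical data: material name -> one list of question scores per subject.
responses = {
    "material_A": [[5, 4, None], [3, 4, 4], [5, 5, 4], [4, None, None]],
    "material_B": [[2, 3, 3], [3, None, 2], [4, 3, 3], [2, 2, 3]],
    "material_C": [[4, 4, 3], [3, 3, None], [5, 4, 4], [3, 4, 4]],
}

# Steps 1-3: overall score per subject, then mean and standard error per material.
summary = {}
for material, subjects in responses.items():
    overall = [overall_score(s) for s in subjects]
    overall = [o for o in overall if o is not None]  # drop subjects with no answers
    summary[material] = mean_and_se(overall)

# Step 4: p-value for every possible pair of materials.
p_values = {
    (a, b): two_sided_p(*summary[a], *summary[b])
    for a, b in combinations(summary, 2)
}

for pair, p in p_values.items():
    print(pair, round(p, 4))
```

With only a handful of subjects per material the normal approximation is rough, so a t-test would probably be the safer choice there. Also, analysing all possible pairs means running many comparisons, so some of the small p-values could arise by chance alone.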