Thanks for sharing!
I was wondering how likely it was that the results were a fluke, so I calculated the p-value for the null hypothesis that the true means of the scores for the question "Overall, how much did you like this content?" for Steinhardt (S) and Gates (G) were equal.
Assumption: S and G follow normal distributions.
Sample sizes: n_S = n_G = 29.
Sample means: mu_S = 5.7, mu_G = 5.4.
Standard errors of the sample means: SE_S = SE_G = 0.2.
T-score: t = (mu_S - mu_G)/(SE_S^2 + SE_G^2)^0.5 = 1.06.
Degrees of freedom: D = n_S + n_G - 2 = 56.
P-value: 2*(1 - T.DIST(t, D, 1)) = 29.3%.
The value will be lower if we compare Steinhardt with authors who got a lower mean score. I guess it would be nice to include some statistical analysis of this type in the report, so that it is easier to quickly assess how robust the conclusions are.
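For reference, a minimal Python sketch (assuming scipy is available) that reproduces these numbers from the summary statistics, using the same pooled degrees of freedom as the spreadsheet formula above:

```python
from scipy import stats

n_S = n_G = 29
mu_S, mu_G = 5.7, 5.4
SE_S = SE_G = 0.2  # standard errors of the sample means

t = (mu_S - mu_G) / (SE_S**2 + SE_G**2) ** 0.5  # ~1.06
dof = n_S + n_G - 2                             # 56
p = 2 * stats.t.sf(t, dof)                      # two-tailed p-value, ~0.293
print(f"t = {t:.2f}, p = {p:.1%}")
```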
Nice, yeah! I wouldn't have expected a statistically significant difference between a mean of 5.7 and 5.4 with those standard errors, but it's nice to see it here.
I considered doing a statistical test, and then spent some time googling how to do something like a "3-paired" ANOVA on data that looks like ("s" is subject, "r" is reading):
[s1 r1 "like"] [s1 r1 "agreement"] [s1 r1 "informative"]
[s2 r1 "like"] [s2 r1 "agreement"] [s2 r1 "informative"]
...
[s28 r1 "like"] [s28 r1 "agreement"] [s28 r1 "informative"]
[s1 r2 "like"] [s1 r2 "agreement"] [s1 r2 "informative"]
[s2 r2 "like"] [s2 r2 "agreement"] [s2 r2 "informative"]
...
because I'd like to do an ANOVA on the raw scores, rather than the means. I did not resolve my confusion about what to do with the 3-paired data (I guess you could lump each subject's data in one column, or do it separately by "like", "agreement", and "informative", but I'm interested in how good each of the readings is summed across the three metrics). I then gave up and just presented the summary statistics. (You can extract the raw scores from the Appendix if you put some work into it though, or I could pass along the raw scores, or you could tell me how to do this sort of analysis in Python if you wanted me to do it!)
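Something like the sketch below is roughly what I had in mind, on fake illustrative data rather than the real scores (the column names, the simulated DataFrame, and the choice of statsmodels' AnovaRM are all assumptions on my part; AnovaRM also needs balanced data, i.e. every subject rating every reading):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Fake long-format data: one row per (subject, reading, metric) score.
rng = np.random.default_rng(0)
subjects = [f"s{i}" for i in range(1, 29)]
readings = ["r1", "r2", "r3"]
metrics = ["like", "agreement", "informative"]
rows = [(s, r, m, float(rng.integers(1, 8)))
        for s in subjects for r in readings for m in metrics]
df = pd.DataFrame(rows, columns=["subject", "reading", "metric", "score"])

# "Summed across the three metrics": average the metrics per subject x reading,
# then test the effect of "reading" within subjects.
overall = df.groupby(["subject", "reading"], as_index=False)["score"].mean()
print(AnovaRM(overall, depvar="score", subject="subject", within=["reading"]).fit())

# Alternative: keep "metric" as a second within-subject factor.
print(AnovaRM(df, depvar="score", subject="subject", within=["reading", "metric"]).fit())
```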
When I look at these tables, I'm also usually squinting at the median rather than mean, though I look at both. You can see the distributions in the Appendix, which I like even better. But point taken about how it'd be nice to have stats.
Ah, thanks for the suggestion! To be honest, I only have basic knowledge about stats, so I do not know how to do the more complex analysis you described. My (quite possibly flawed) intuition for analysing all questions would be (a rough Python sketch follows the list):
Determine, for each subject, "overall score" = ("score of question 1" + "score of question 2" + "score of question 3")/3.
If some subjects did not answer all 3 questions, "overall score" = "sum of the scores of the answered questions"/"number of answered questions".
Calculate the mean and standard error for each of the AI safety materials.
Repeat the calculation of the p-value as I illustrated above for the pairs of AI safety materials (best, 2nd best), (2nd best, 3rd best), ..., and (2nd worst, worst), or just analyse all possible pairs.
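In Python, these steps could look something like the following (fake illustrative data; the material names, column names, and DataFrame layout are placeholders for whatever the raw scores actually look like):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Fake long-format data: one row per answered question.
rng = np.random.default_rng(0)
materials = ["material_A", "material_B", "material_C"]
rows = [(m, f"s{i}", q, float(rng.integers(1, 8)))
        for m in materials for i in range(1, 30)
        for q in ["like", "agreement", "informative"]]
df = pd.DataFrame(rows, columns=["material", "subject", "question", "score"])

# 1. Overall score per subject = mean of the questions they answered
#    (a plain mean already handles subjects with missing answers).
overall = df.groupby(["material", "subject"], as_index=False)["score"].mean()

# 2. Mean and standard error of the overall score for each material.
summary = overall.groupby("material")["score"].agg(["mean", "sem", "count"])
ranked = summary.sort_values("mean", ascending=False)
print(ranked)

# 3. p-value for each adjacent pair in the ranking (same t-test as above).
for a, b in zip(ranked.index[:-1], ranked.index[1:]):
    t = (ranked.loc[a, "mean"] - ranked.loc[b, "mean"]) / np.sqrt(
        ranked.loc[a, "sem"] ** 2 + ranked.loc[b, "sem"] ** 2)
    dof = ranked.loc[a, "count"] + ranked.loc[b, "count"] - 2
    p = 2 * stats.t.sf(abs(t), dof)
    print(f"{a} vs {b}: t = {t:.2f}, p = {p:.1%}")
```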