Thanks for organising this! I think the survey is very valuable! I was wondering if you could say more on why you “will not be making an anonymised data set available to the community”? That initially seems to me like an interesting and useful thing for community members to have, and I was wondering whether it was just a lack of resources, or the difficulty involved, that meant you weren’t doing this anymore.
Thanks!
Roughly speaking, there seem to be two main benefits and two main costs to making an anonymised dataset public. The main costs are: (i) time, and (ii) people being put off the EA Survey because they believe their data will be publicly available and identifiable. The main benefits are: (iii) the community being able to access information that isn’t included in our public reports, and (iv) transparency and validation from people being able to replicate our results.
Unfortunately, the dataset is so heavily anonymised to reduce cost (ii) (while simultaneously increasing cost (i)) that it seems impossible for people to replicate many of our analyses even with the public dataset, because the data is so heavily obscured; this essentially vitiates (iv). We have considered, and are still considering, other options, such as producing a simulated dataset for future surveys so that people could conduct their own analyses, if there were sufficient demand, but this would come at an even higher time cost. Conversely, it seems that benefit (iii) can largely be attained without releasing a public dataset, simply by producing additional aggregate analyses on request (where possible).
Of course, we’ll see how this system works this year and may revisit it in the future.
To add to that: if there are concerns about data being de-anonymised, there are statistical techniques to mitigate that risk.
Do you or anybody else reading this have experience with differential privacy techniques on relatively small datasets (less than 10k people, say)?
I’ve only heard of differential privacy used in the context of machine learning and massive datasets.
Well, I am far from an expert, but my understanding is that differential privacy operates on queries rather than on individual datapoints. However, there are tools such as randomized response that provide plausible deniability for individual responses.
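As a toy illustration of that second point, here is a minimal sketch of the classic Warner-style randomized response scheme. All names and parameters here are my own for illustration; this isn’t taken from any survey tooling. Each respondent flips a coin: with probability `p_truth` they answer honestly, otherwise they give a uniformly random answer. No individual answer can be taken at face value, yet the aggregate “yes” rate can still be recovered by inverting the known noise.

```python
import random

def randomized_response(true_answer: bool, p_truth: float = 0.5) -> bool:
    """Report the true answer with probability p_truth;
    otherwise report a uniformly random yes/no answer."""
    if random.random() < p_truth:
        return true_answer
    return random.random() < 0.5

def estimate_true_rate(responses, p_truth: float = 0.5) -> float:
    """Invert the noise: the observed yes-rate r satisfies
    r = p_truth * true_rate + (1 - p_truth) * 0.5,
    so true_rate = (r - (1 - p_truth) / 2) / p_truth."""
    r = sum(responses) / len(responses)
    return (r - (1 - p_truth) / 2) / p_truth

# Simulate 100,000 respondents of whom 30% would truthfully answer "yes".
random.seed(0)
responses = [randomized_response(random.random() < 0.3) for _ in range(100_000)]
estimate = estimate_true_rate(responses)  # close to 0.3 at this sample size
```

The trade-off is exactly the one raised above about small datasets: with a few thousand respondents rather than 100,000, the sampling noise in the recovered rate grows substantially, so the technique is much better suited to estimating broad aggregates than to supporting fine-grained subgroup analyses.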