(Cleaned) EA Forum data for your interest and enjoyment
Note/disclaimer
I have not rigorously scrutinized my code and the graphs below for accuracy. This post is primarily intended to share the data and a few examples of what can be done. But if you spot an inaccuracy, please let me know. Thanks!
Edit: also adding my code, which is hideously messy but maybe better than nothing!
Part 1: data
A few weeks ago, Jacques Thibs scraped the forum and put some data in the EA Twitter group for anyone to check out.
Long story short, I’ve cleaned the data a bit and figured some other users here might like to check it out! The data files are:
1. Original raw data (175 MB, .jsonl)
2. Original data, but as a .csv (~800 MB, don’t ask me why it’s larger)
3. My cleaned version (~90 MB, .csv)
Note: some information is lost (like the text of each comment). For the full data, check out one of the above two data sets! Screenshot of what you’re getting:
4. Small sample version of #3 (.csv, 41 KB; you can open this one in Excel, unlike the others, at least on my computer)
100 random posts and the first 100 characters of text in each
For all of these, each row or observation is one forum post, and each column or variable is some bit of information about that post, like its author, text, or date of publication.
I’m posting the cleaned version because the raw data .csv has ~21,000 variables, some of which have rather unfriendly names like
(variable 626), or
(variable 20,112)
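For anyone wondering how a .csv can end up with that many variables: one plausible mechanism (an assumption, not a confirmed account of how this particular file was produced) is that each nested field in the .jsonl gets its own column when the records are flattened. A post with three tags then needs three times as many tag columns as a post with one, and every other row carries empty cells for them. Below is a minimal sketch of that effect in Python (the charts in this post were made in R). The field names (`title`, `user`, `tags`) and the two records are made up for illustration and are not taken from the real data.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict with dotted column names."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        # List positions become part of the column name: tags.0, tags.1, ...
        items = ((str(i), v) for i, v in enumerate(value))
    else:
        return {prefix: value}

    flat = {}
    for key, child in items:
        name = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, name))
    return flat


# Two made-up records standing in for lines of a .jsonl file.
lines = [
    '{"title": "Post A", "user": {"name": "alice"}, "tags": [{"name": "x"}]}',
    '{"title": "Post B", "user": {"name": "bob"}, '
    '"tags": [{"name": "x"}, {"name": "y"}, {"name": "z"}]}',
]

rows = [flatten(json.loads(line)) for line in lines]

# The CSV header is the union of every column seen in any row.
columns = sorted({key for row in rows for key in row})
print(columns)
```

Here the second record alone adds two columns (`tags.1.name`, `tags.2.name`) that the first record leaves empty, so the column count is driven by the largest record rather than the typical one.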
Part 2: charts
Below are just a few graphs I made using ggplot2 and Shiny in R:
Monthly EA Forum activity over time
January 2013-May 2022
Zoomed in to exclude the last ~2022 spike
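If you want to rebuild a monthly activity series yourself, the core step is just bucketing each post's publication date by calendar month. Here is a rough sketch in Python (not the R code behind the charts above). It assumes the dates are ISO 8601 strings such as `2022-05-01T00:00:00Z`; that format and the sample timestamps are assumptions for illustration, so check the actual date column before relying on it.

```python
from collections import Counter
from datetime import datetime


def monthly_counts(timestamps):
    """Count items per calendar month from ISO 8601 timestamp strings."""
    counts = Counter()
    for ts in timestamps:
        # Older Python versions reject a trailing "Z", so spell out the UTC offset.
        dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        counts[dt.strftime("%Y-%m")] += 1
    # Sort by the "YYYY-MM" key so months come out in chronological order.
    return dict(sorted(counts.items()))


# Made-up publication dates, not real Forum data.
sample = [
    "2022-05-17T08:30:00Z",
    "2022-04-30T23:59:00Z",
    "2022-05-01T00:00:00Z",
]
print(monthly_counts(sample))
```

Months with no posts simply do not appear in the result, so fill those in with zeros before plotting if you want a continuous line.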
Most prolific Forum authors
Showing full data
Zoomed in because Aaron Gertler is a God
Interactive chart on popular tags
You can find that here, which is an interactive version of the following screenshot:
Call for proposals
Oh yeah, if there are any questions you think I should look into using this data, I’d love to know! I’m half-decent at data cleaning and reshaping in R, econometrics in R and Stata, and data visualization, and not a whole lot else.
Thanks for reading and I look forward to whatever insights or visualizations others generate!