EA Forum: Data analysis and deep learning
(Cross-posted from my blog)
Here’s a fun project I undertook this month:
Scrape all posts from the Effective Altruism (EA) Forum
Explore overall trends in the data, e.g. posts with the greatest number of comments, authors with the greatest number of posts, etc.
Build a wordcloud to visualize the most used words
Fine-tune GPT2 on the EA Forum text corpus and generate text. Here’s a preview of the text GPT2 produced:
GITC’s Vaccination Prevention Research Project This is the first post of a three part series on the development of effective vaccines. This series will start with a list of possible vaccines that can be developed by the GPI team
Code and data for this project are available at this GitHub repo.
1. Scraping
The robots.txt file of the EA Forum disallows crawling/scraping data from forum.effectivealtruism.org/allPosts. To get around this, I did the following:
Manually loaded yearly links from /allPosts (this required manually clicking each year followed by “Load More”)
Used a link extractor in Chrome to extract links from the page into a .csv file
Used Scrapy to scrape the following fields from each link: ‘date’, ‘author’, ‘title’, ‘number of comments’, ‘number of karma’, and ‘content’. I extracted data for posts published between 01-01-2013 and 05-04-2020. Posts with low karma (below −10) were ignored.
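For readers who want to check what a robots.txt file permits before scraping, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The rules string is a hypothetical stand-in for the real file (I have not reproduced the forum's actual robots.txt here), and the post URL is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, assumed for illustration only; the real robots.txt
# of the EA Forum may contain different or additional directives.
ASSUMED_RULES = """\
User-agent: *
Disallow: /allPosts
"""

parser = RobotFileParser()
parser.parse(ASSUMED_RULES.splitlines())

BASE = "https://forum.effectivealtruism.org"

# The listing page is disallowed under the assumed rules...
listing_allowed = parser.can_fetch("*", BASE + "/allPosts")
# ...while an individual post URL (hypothetical path) is not covered by them.
post_allowed = parser.can_fetch("*", BASE + "/posts/abc123/some-post")

print(listing_allowed, post_allowed)
```

In a live setting, `RobotFileParser.set_url(...)` followed by `read()` would fetch the actual file instead of parsing an inline string.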
I cleaned the data and restricted subsequent analyses to posts published between 01-01-2013 and 04-15-2020, since more recent posts were unlikely to have accumulated comments.
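The karma and date filters described above could look roughly like the sketch below. It uses only the standard library; the field names and the toy records are hypothetical and are not taken from the actual dataset.

```python
from datetime import date

MIN_KARMA = -10               # posts below this karma were ignored
START = date(2013, 1, 1)      # 01-01-2013
END = date(2020, 4, 15)       # 04-15-2020

def keep_post(post):
    """Return True if a post passes both the karma and the date filter."""
    return post["karma"] >= MIN_KARMA and START <= post["date"] <= END

# Hypothetical records, invented purely to exercise the filter.
toy_posts = [
    {"title": "kept", "karma": 12, "date": date(2019, 6, 1)},
    {"title": "too recent", "karma": 30, "date": date(2020, 5, 1)},
    {"title": "low karma", "karma": -15, "date": date(2018, 3, 3)},
]

filtered = [p for p in toy_posts if keep_post(p)]
print([p["title"] for p in filtered])
```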
2. Exploratory Data Analysis
2.1 Number of yearly posts
2.2 Posts with the most comments
date | title | author | num_comments |
---|---|---|---|
4/23/2019 | Long-Term Future Fund: April 2019 grant recommendations | Habryka | 240 |
10/26/2017 | Why & How to Make Progress on Diversity & Inclusion in EA | Kelly_Witwicki | 235 |
11/15/2019 | I’m Buck Shlegeris, I do research and outreach at MIRI, AMA | Buck | 231 |
10/24/2016 | Concerns with Intentional Insights | Jeff_Kaufman | 186 |
2/26/2019 | After one year of applying for EA jobs: It is really, really hard to get hired by an EA organisation | EA applicant | 182 |
1/16/2020 | Growth and the case against randomista development | HaukeHillebrandt | 168 |
9/15/2014 | Open Thread | RyanCarey | 163 |
11/11/2017 | An Exploration of Sexual Violence Reduction for Effective Altruism Potential | Kathy_Forth | 156 |
10/22/2014 | Should Giving What We Can change its Pledge? | Michelle_Hutchinson | 144 |
9/3/2019 | Are we living at the most influential time in history? | William_MacAskill | 140 |
2.3 Posts with the most karma
date | title | author | num_karma |
---|---|---|---|
2/26/2019 | After one year of applying for EA jobs: It is really, really hard to get hired by an EA organisation | EA applicant | 285 |
1/16/2020 | Growth and the case against randomista development | HaukeHillebrandt | 269 |
1/13/2020 | EAF’s ballot initiative doubled Zurich’s development aid | Jonas Vollmer | 254 |
9/26/2019 | Some personal thoughts on EA and systemic change | Carl_Shulman | 183 |
9/3/2019 | Are we living at the most influential time in history? | William_MacAskill | 174 |
6/2/2019 | Is EA Growing? EA Growth Metrics for 2018 | Peter_Hurford | 168 |
3/7/2019 | SHIC Will Suspend Outreach Operations | cafelow | 165 |
8/20/2019 | List of ways in which cost-effectiveness estimates can be misleading | saulius | 155 |
6/20/2019 | Information security careers for GCR reduction | ClaireZabel | 153 |
8/14/2019 | Ask Me Anything! | William_MacAskill | 150 |
2.4 Authors with the most posts
author | num_posts |
---|---|
Aaron Gertler | 87 |
Milan_Griffes | 83 |
Peter_Hurford | 74 |
RyanCarey | 66 |
Tom_Ash | 58 |
2.5 Authors with the highest mean karma
Authors with <2 posts were excluded
author | mean_post_karma |
---|---|
Buck | 92.2 |
Jonas Vollmer | 77.0 |
Luisa_Rodriguez | 74.7 |
saulius | 73.5 |
sbehmer | 73.0 |
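
The summaries in sections 2.2 to 2.5 boil down to sorting and grouping. Below is a minimal sketch in plain Python (a pandas `groupby` would do the same job). The records, author names, and helper function names are all hypothetical; only the column names mirror the tables above.

```python
from collections import Counter, defaultdict

# Hypothetical records, invented for illustration.
posts = [
    {"title": "Post A", "author": "author_x", "num_comments": 12, "num_karma": 40},
    {"title": "Post B", "author": "author_x", "num_comments": 3, "num_karma": 20},
    {"title": "Post C", "author": "author_y", "num_comments": 30, "num_karma": 90},
    {"title": "Post D", "author": "author_z", "num_comments": 7, "num_karma": 10},
    {"title": "Post E", "author": "author_z", "num_comments": 1, "num_karma": 50},
]

def top_posts(posts, key, n=10):
    """Posts with the largest value of `key` (e.g. num_comments or num_karma)."""
    return sorted(posts, key=lambda p: p[key], reverse=True)[:n]

def posts_per_author(posts):
    """Number of posts written by each author."""
    return Counter(p["author"] for p in posts)

def mean_karma_per_author(posts, min_posts=2):
    """Mean post karma per author, excluding authors with fewer than min_posts posts."""
    karma = defaultdict(list)
    for p in posts:
        karma[p["author"]].append(p["num_karma"])
    return {a: sum(k) / len(k) for a, k in karma.items() if len(k) >= min_posts}

print(top_posts(posts, "num_comments", n=1)[0]["title"])
print(posts_per_author(posts).most_common(2))
print(mean_karma_per_author(posts))
```

Excluding single-post authors matters here because one highly upvoted post would otherwise dominate the mean-karma ranking.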
3. Word Clouds
My next goal was to make a word cloud representing the most commonly used words in the EA Forum. I preprocessed the post content as follows:
Tokenized words
Expanded word contractions e.g. ‘don’t’ → ‘do not’
Converted all words to lowercase
Removed tokens that were only punctuation
Filtered out stop words using nltk
Removed any tokens containing numbers
Removed any tokens containing ‘http’ or ‘www’
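The preprocessing steps above can be sketched as follows. This is a simplified stand-in and not the original code: it uses a regex tokenizer plus a tiny hand-written contraction map and stop-word set (both hypothetical) instead of nltk, so that it runs without downloading any resources.

```python
import re
import string

# Tiny illustrative stand-ins; the original analysis used nltk's stop-word list.
CONTRACTIONS = {"don't": "do not", "can't": "can not", "it's": "it is"}
STOP_WORDS = {"the", "a", "an", "is", "it", "do", "not", "to", "of", "and", "in"}

# URLs first, then words (optionally with an apostrophe), then single punctuation marks.
TOKEN_RE = re.compile(r"(?:https?://|www\.)\S+|\w+(?:'\w+)?|[^\w\s]")

def preprocess(text):
    text = text.replace("’", "'")                      # normalise curly apostrophes
    tokens = TOKEN_RE.findall(text)                    # 1. tokenize
    expanded = []
    for tok in tokens:                                 # 2. expand contractions
        expanded.extend(CONTRACTIONS.get(tok.lower(), tok).split())
    tokens = [t.lower() for t in expanded]             # 3. lowercase
    tokens = [t for t in tokens                        # 4. drop punctuation-only tokens
              if not all(ch in string.punctuation for ch in t)]
    tokens = [t for t in tokens if t not in STOP_WORDS]                 # 5. stop words
    tokens = [t for t in tokens if not any(ch.isdigit() for ch in t)]   # 6. numbers
    tokens = [t for t in tokens if "http" not in t and "www" not in t]  # 7. links
    return tokens

sample = "Don't visit https://example.org in 2020, the work is important!"
print(preprocess(sample))
```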
The resulting word cloud was built using the Python word_cloud package on ~2.6 million tokens:
The most common words appeared to be ‘one’ and ‘work’. I thought it would be instructive to see whether these were over-represented in the EA Forum specifically, or were simply common across blogs/forums in general. To generate a control, I scraped all posts from Slate Star Codex (SSC) and performed identical text preprocessing to generate ~1.4 million tokens.
Using R’s wordcloud package I built a “comparative” word cloud showing words over-represented in the EA Forum versus SSC and vice-versa.
What about words that were common between the EA Forum and SSC?
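To give a rough idea of what "over-represented" and "common" mean numerically, here is a sketch of the underlying arithmetic. It rests on my understanding (an assumption worth checking against the package documentation) that R's `comparison.cloud` sizes a word by how far its rate in one corpus exceeds its average rate across corpora, and that `commonality.cloud` uses the minimum rate across corpora. The token lists and function names are toy examples, not real data or the package's API.

```python
from collections import Counter

def word_rates(tokens):
    """Relative frequency of each word within one corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def compare_corpora(tokens_a, tokens_b):
    """Return (over-represented in A, over-represented in B, shared-rate) dicts."""
    rates_a, rates_b = word_rates(tokens_a), word_rates(tokens_b)
    over_a, over_b, common = {}, {}, {}
    for word in set(rates_a) | set(rates_b):
        pa, pb = rates_a.get(word, 0.0), rates_b.get(word, 0.0)
        mean = (pa + pb) / 2
        if pa > pb:
            over_a[word] = pa - mean      # deviation above the cross-corpus average
        elif pb > pa:
            over_b[word] = pb - mean
        common[word] = min(pa, pb)        # rate shared by both corpora
    return over_a, over_b, common

# Toy token lists, invented for illustration.
ea_tokens = ["charity", "work", "one", "charity", "impact"]
ssc_tokens = ["one", "work", "psychiatry", "one"]

over_ea, over_ssc, shared = compare_corpora(ea_tokens, ssc_tokens)
print(max(over_ea, key=over_ea.get), max(over_ssc, key=over_ssc.get))
```

Working with rates rather than raw counts keeps the comparison fair when the corpora differ in size, as they do here (~2.6 million versus ~1.4 million tokens).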
4. GPT2
Finally, I used the text corpus from the EA Forum to fine-tune GPT2. GPT2 is a text-generating language model trained on 8 million web pages on the internet. You can play around with it on Talk to Transformer where the model completes the sentence given a prompt.
For fine-tuning I used this very convenient template Colab notebook made with gpt2-simple. I fine-tuned both the “small” (124 M parameters) and “medium” (355 M parameters) GPT2 models and tried a few temperature settings for text generation.
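To make the role of the temperature setting concrete, the sketch below shows the general technique of temperature-scaled softmax sampling. It is not gpt2-simple's code, and the logits are made-up numbers; it only illustrates why a lower temperature gives more conservative text and a higher one gives more varied text.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
toy_logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(toy_logits, temperature=0.5)  # peaked distribution
hot = softmax_with_temperature(toy_logits, temperature=1.5)   # flatter distribution
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

At low temperature the most likely token dominates, which tends to produce repetitive but coherent text; at high temperature the probabilities even out and the output becomes more surprising.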
GPT2 can work with or without a prompt, and typically produces “good” text in roughly 1 out of 25 tries. I’ve cherry-picked some outputs I obtained from the fine-tuned models:
Without a prompt:
What is Rationality in AI? Background Rationality is an ideology of AI, a belief system based on the belief that humans are naturally sentient. It argues that all human actions are responsible for the wellbeing of all sentient beings. This view is based on a large body of evidence. It is not a new, classicist theory of AI. Instead it is a mixture of old and new philosophy.
TL;DR’s for the EA Forum/Welcome: ”Effective altruists are trying to figure out how to build a more effective AI, using paperclips, but we’re not really sure how it’s possible to do so.
GITC’s Vaccination Prevention Research Project This is the first post of a three part series on the development of effective vaccines. This series will start with a list of possible vaccines that can be developed by the GPI team, ending with a brief overview of the science behind vaccine development. We will then address a variety of questions in the areas of biosecurity, biosecurity technologies and improved vaccine safety.This is a work in progress and we hope to get back to you soon.
With a prompt - ‘Introduction to effective altruism’:
Introduction to effective altruism [ edit ] The most basic assumption about effective altruism is that we must act rationally. It follows from this that you should maximize the benefit of the action you’re taking. The absolute value of a given intervention is the number of lives saved or improved. If the total benefit of the intervention is too small, you should not take it.
Introduction to effective altruism as a means to furthering one’s life, many people I have spoken with who are currently living on less than $10,000 per year have completed some EA-related training. The basic idea of effective altruism is that one should donate your money to the most effective charities. In practice, many people who are involved with EA don’t do this, and they will only use other methods to do it.
Ouch.
This is awesome.
Would it be possible to make 2-grams and 3-grams as well? Maybe that would provide more insight.
P.S. It takes more time and more RAM to generate them. The cost does not scale linearly with the n-gram size.