Some other resources that come to mind, not sure if they would all be useful and I’m probably forgetting tons (a rough scraping sketch follows the list):
- https://forum.effectivealtruism.org/library
- https://forum.effectivealtruism.org/topics/all
- https://blog.givewell.org/ (maybe including comments)
- Besides the blog, there’s lots of other great stuff and links to documents around the GiveWell website, random samples:
  - https://www.givewell.org/how-we-work/our-criteria/cost-effectiveness/comparing-moral-weights
  - https://docs.google.com/document/d/1ZKq-MNU-xtn_48uN33L6VvBEZRAduvjwWMeaEffL4K4
  - https://docs.google.com/document/d/1Jwe0PzDhCIIE3ymH_1Ct8btAaQ8_C_wXl23AxDgZi9M
- https://www.givingwhatwecan.org/ (the blog, but also other pages)
- https://80000hours.org/all-articles/
- https://www.openphilanthropy.org/ has much more content than the grants database
- https://www.lesswrong.com/tag/effective-altruism
- Posts/comments in Facebook groups, Slack groups, and Discord groups?
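Since the question is how these sources could be turned into a dataset, here’s a minimal sketch of the scraping step, using requests and BeautifulSoup. The two URLs are just examples pulled from the list; a real pipeline would need to crawl individual posts, rate-limit, and (per the consent question below) filter out comments people haven’t agreed to share:

```python
import requests
from bs4 import BeautifulSoup

# Two example URLs from the list above; add the rest as needed.
PAGES = [
    "https://blog.givewell.org/",
    "https://80000hours.org/all-articles/",
]

def page_text(url: str) -> str:
    """Fetch a page and return its visible text, stripped of HTML."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # drop script/style noise
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

for url in PAGES:
    text = page_text(url)
    print(url, len(text.encode("utf-8")), "bytes")
```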
Thanks for these sources.
How should the GiveWell blog and the 80,000 Hours blog be weighted against each other? My instinct is to weight by the number of views.
Does the EA community have a norm that these comments are public? I want to make sure the consent of participants is obtained.
That’s a very good point, and I think it’s definitely not the norm; I didn’t think about text potentially getting leaked from the training set.
What do you mean by “against each other”? Do you mean compared to everything else, including the forum posts/comments?
I have no idea; I think the number of views might lead to a better representation of the wider community, while the more technical posts might be more representative of the more “professional” parts of the movement.
What percentage of the training mix should come from the GiveWell blog, and what percentage from the 80,000 Hours blog? In other words, how many bytes of blog posts should be used from each, relative to the entire dataset?
What kinds of posts are on each blog? Which best reflects the wider EA community, and which the professional EA community? How can this be used to create a dataset?
I also checked, and neither blog publishes view counts, so some other proxy metric would be needed.
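To make the “how many bytes from each” framing concrete: assuming each source had already been scraped into a local text file (the file paths below are hypothetical, purely for illustration), each source’s share of the training mix is just its byte count divided by the total. A minimal sketch:

```python
import os

# Hypothetical local dumps of each scraped source.
SOURCES = {
    "givewell_blog": "data/givewell_blog.txt",
    "80000_hours_blog": "data/80000_hours_blog.txt",
    "ea_forum": "data/ea_forum.txt",
}

def byte_shares(sources: dict[str, str]) -> dict[str, float]:
    """Each source's fraction of the total dataset, measured in bytes."""
    sizes = {name: os.path.getsize(path) for name, path in sources.items()}
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}

for name, share in byte_shares(SOURCES).items():
    print(f"{name}: {share:.1%} of the training mix")
```

Hitting a target mix would then just mean truncating or upsampling sources until the shares match.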
Hmmm. You’re focused on the input text.
Maybe you want to focus on the “output” instead, and define some metric[1] relating that output to your targeted performance of the model? Focusing on the mix of input data seems like a different question.
For example, it’s not clear whether a pass over a batch of GiveWell content will shift GPT-3 more or less than a same-size batch of 80,000 Hours content. It’s also not clear that the length of the input text would be a good measure, versus something like the perplexity of the fine-tuning text under the current GPT-3 model. I haven’t fine-tuned a GPT-3 model, though, so I’m not sure.
Although, in some sense, it’s really hard to think about what this metric would be, beyond something trivial like perplexity. Maybe this difficulty is what you want to avoid?
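For what it’s worth, the “perplexity of the fine-tune text under the current model” idea is straightforward to sketch: perplexity is exp of the mean negative log-likelihood the model assigns to the text, so lower means the model already “expects” that text. GPT-3’s weights aren’t public, so the sketch below uses GPT-2 via Hugging Face as a stand-in (with the OpenAI API you’d work from the returned token logprobs instead); everything here is illustrative, not something from this thread:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# GPT-2 stands in for GPT-3 here, since GPT-3's weights aren't public.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """exp(mean negative log-likelihood) of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy loss over next-token predictions.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Compare how "surprising" a sample from each source is to the base model;
# a higher score hints that fine-tuning on it would shift the model more.
print(perplexity("GiveWell searches for the charities that save lives."))
```

One could score a same-size sample of GiveWell text and 80,000 Hours text this way and see whether “same number of bytes” really means “same amount of update”.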
I honestly really don’t know :/
I know it doesn’t help you, but I would expect both blogs (and all the other stuff on the websites that’s not in the blogs) to have some content aimed at a wider audience and some content that goes more into depth for a narrower audience.