JoyOptimizer comments on Training a GPT model on EA texts: what data?

JoyOptimizer Jun 4, 2022, 4:24 PM
1 point
Thanks for these sources.

How should the GiveWell blog and the 80,000 Hours blog be weighted against each other? My instinct is to weight by the number of views.

Posts/comments in Facebook groups, Slack groups, and Discord groups?

Does the EA community have the norm that these comments are public? I want to make sure the consent of participants is obtained.
- Lorenzo Buonanno🔸 Jun 4, 2022, 4:44 PM
  1 point
  “Does the EA community have the norm that these comments are public? I want to make sure the consent of participants is obtained.”
  
  That’s a very good point, and I think it’s definitely not the norm. I didn’t think about text potentially getting leaked from the training set.
  “How should the GiveWell blog and the 80,000 Hours blog be weighted against each other?”
  What do you mean by “against each other”? Do you mean compared to everything else, including the forum posts/comments?
  I have no idea. I think weighting by the number of views might lead to a better representation of the wider community, while the more technical posts might be more representative of the more “professional” parts of the movement.
  - JoyOptimizer Jun 4, 2022, 5:04 PM
    1 point
    What percentage of the training mix should be the GiveWell blog, and what percentage should be the 80,000 Hours blog? In other words, how many bytes of blog posts should be used from each, relative to the entire dataset?
    What kinds of posts are on each blog? Which blog best reflects the wider EA community, and which reflects the professional EA community? How can this be used to create a dataset?
    I also checked, and neither blog has a direct view count measure, so some other proxy metric would need to be used.
    - Charles He Jun 4, 2022, 5:52 PM
      2 points
      Hmmm. You’re focused on the input text.
      Maybe you want to focus on the “output” instead, and define some metric[1] relating that output to your targeted performance of the model?
      Focusing on the mix of input data seems like a different thing from focusing on the output.
      For example, it’s not clear that a pass with a batch of GiveWell content will shift GPT-3 more or less than a same-size batch of 80k content. It’s not clear that the length of the input text would be a good measure, versus something like “perplexity of the fine-tuning text relative to the current GPT-3 output”. I haven’t trained a GPT-3 model, though, so I’m not sure.
      [1] Although, in some sense, it’s really hard/crazy to think about what this metric would be, besides something trivial like perplexity. Maybe this difficulty is what you want to avoid?
    - Lorenzo Buonanno🔸 Jun 4, 2022, 5:38 PM
      1 point
      I honestly really don’t know :/
      I know it doesn’t help you, but I would expect both blogs (and all the other stuff on the websites that’s not in the blogs) to have some content aimed at a wider audience and some content that goes more into depth for a narrower audience.
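
The two framings raised in this thread, splitting the dataset by bytes according to some proxy weight and scoring text by perplexity, can be sketched roughly as follows. This is only an illustration: the source names, the proxy weights, and the token log-probabilities are made-up placeholders, and the helper functions `byte_budgets` and `perplexity` are hypothetical names introduced here, not part of any library or of anyone’s actual pipeline.

```python
import math


def byte_budgets(weights, total_bytes):
    """Split total_bytes across sources in proportion to non-negative weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    return {
        name: int(total_bytes * weight / total_weight)
        for name, weight in weights.items()
    }


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical proxy weights standing in for view counts (not real data).
proxy_weights = {"givewell_blog": 3.0, "80000_hours_blog": 1.0}
budgets = byte_budgets(proxy_weights, total_bytes=1_000_000)

# Hypothetical per-token log-probabilities, as a language model might
# assign to a sample of text. Here every token has probability 0.25.
sample_logprobs = [math.log(0.25)] * 8
sample_ppl = perplexity(sample_logprobs)

print(budgets)
print(sample_ppl)
```

In practice the log-probabilities would have to come from the model being fine-tuned, and whether perplexity is a better guide than raw byte counts is exactly the open question in the comments above.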