What percentage of the training mix should come from the GiveWell blog and what percentage from the 80,000 Hours blog? In other words, how many bytes of blog posts should be used from each, relative to the entire dataset?
What kinds of posts are on each blog? Which best reflects the wider EA community, and which the professional EA community? How can this be used to create a dataset?
I also checked, and neither blog exposes view counts directly, so some other proxy metric would need to be used.
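To make the byte-ratio framing concrete, here’s a rough sketch of the kind of mixing I have in mind (the file names, the total size, and the 60/40 split are just placeholder assumptions, not a proposal):

```python
# Minimal sketch: assemble a fine-tuning mix at a fixed byte ratio.
# File names, total size, and the 60/40 split are placeholders.
import random

def load_posts(path: str) -> list[str]:
    """Read one post per line from a plain-text dump of a blog."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def take_bytes(posts: list[str], byte_budget: int, seed: int = 0) -> list[str]:
    """Sample whole posts until roughly byte_budget bytes of UTF-8 text are collected."""
    rng = random.Random(seed)
    shuffled = posts[:]
    rng.shuffle(shuffled)
    out, used = [], 0
    for post in shuffled:
        size = len(post.encode("utf-8"))
        if used + size > byte_budget:
            break
        out.append(post)
        used += size
    return out

total_bytes = 50_000_000          # hypothetical overall dataset size
givewell_share = 0.6              # placeholder split; this is the quantity in question
givewell = take_bytes(load_posts("givewell_posts.txt"), int(total_bytes * givewell_share))
eighty_k = take_bytes(load_posts("80k_posts.txt"), int(total_bytes * (1 - givewell_share)))
mix = givewell + eighty_k
random.Random(0).shuffle(mix)     # interleave the two sources before training
```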
Hmmm. You’re focused on the input text.
Maybe it seems like you want to focus on the “output” instead, and define some metric[1] relating that output to your targeted performance of the model?
Focusing on the mix of input data seems like a different question from focusing on the output.
For example, it’s not clear whether a pass over a batch of GiveWell content will shift GPT-3 more or less than a same-size batch of 80k content. It’s also not clear that the length of the input text is a good measure, versus something like “the perplexity of the fine-tuning text under the current GPT-3 model”. I haven’t fine-tuned a GPT-3 model myself, though, so I’m not sure.
Although, in some sense, it’s really hard to think about what this metric would be, beyond something simple like perplexity. Maybe this difficulty is what you want to avoid?
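As a very rough illustration of the perplexity idea (GPT-3’s weights aren’t public, so this uses GPT-2 from Hugging Face’s transformers as a stand-in, and the file names are placeholders):

```python
# Rough sketch of "perplexity of the fine-tuning text under the current model".
# GPT-2 is used here only as an openly available stand-in for GPT-3;
# the input files are placeholders.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Per-token perplexity of `text` under the base model (truncated to the context window)."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return math.exp(loss.item())

for name, path in [("GiveWell", "givewell_posts.txt"), ("80k", "80k_posts.txt")]:
    sample = open(path, encoding="utf-8").read()
    print(name, perplexity(sample))
```

If the two sources scored very differently on something like this, equal byte counts presumably wouldn’t shift the model equally, which is the kind of asymmetry I mean.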
I honestly really don’t know :/
I know it doesn’t help you, but I would expect both blogs (and the other non-blog content on both sites) to have some content aimed at a wider audience and some content that goes into more depth for a narrower audience.