Does the EA community have the norm that these comments are public? I want to make sure the consent of participants is obtained.
That's a very good point, and I think it's definitely not the norm. I didn't think about text potentially getting leaked from the training set.
How should the GiveWell blog and the 80,000 Hours blog be weighted against each other?
What do you mean against each other? Do you mean compared to everything else, including the forum posts/comments? I have no idea. I think the number of views might lead to a better representation of the wider community, while the more technical posts might be more representative of the more "professional" parts of the movement.
What percentage of the training mix should be the GiveWell blog, and what percentage should be the 80,000 Hours blog? In other words, how many bytes of blog posts should be used from each, relative to the entire dataset?
What kinds of posts are on each blog, and which best reflects the wider EA community, and which reflects the professional EA community? How can this be used to create a dataset?
I also checked and neither blog has a direct view count measure, so some other proxy metric would need to be used.
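To make the byte-share question concrete, here is a minimal sketch of turning relative weights into a byte budget per source. The source names, the weights, and the total budget are made-up placeholders, not a recommendation for the actual mix.

```python
def bytes_per_source(total_bytes, weights):
    """Split a total byte budget across sources in proportion to their weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return {
        name: int(total_bytes * weight / total_weight)
        for name, weight in weights.items()
    }

# Hypothetical weights, purely for illustration.
weights = {
    "givewell_blog": 1.0,
    "80000_hours_blog": 1.0,
    "forum_posts": 3.0,
}

# Hypothetical total dataset size of 10 MB.
budget = bytes_per_source(10_000_000, weights)
```

Changing a single weight and rerunning shows how much text each blog would contribute, which may be easier to argue about than an abstract percentage.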
Hmmm. You're focused on the input text.
Maybe it seems like you want to focus on the "output" instead (and define some metric[1] relative to this output and your targeted performance of the model)?
In contrast to focusing on the output, focusing on the mix of input data seems different.
For example, it's not clear that a pass with a batch of GiveWell content will shift GPT-3 more or less than a same-size batch of 80k content. It's not clear that the input length of text would be a good measure, versus something like "perplexity of the fine-tune text to the current GPT-3 output". I haven't trained a GPT-3 model though, so I'm not sure.
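For reference, perplexity is the exponential of the average negative log-probability the model assigns to each token. A minimal sketch, assuming you already have per-token natural-log probabilities for each batch from the base model (how you get them depends on the API and is not shown here, and the numbers below are invented):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the tokens of a text."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Invented log-probs standing in for two same-size batches.
givewell_batch = [math.log(0.5)] * 4    # model finds each token fairly likely
eighty_k_batch = [math.log(0.25)] * 4   # model is more surprised by each token

ppl_givewell = perplexity(givewell_batch)
ppl_80k = perplexity(eighty_k_batch)
```

A batch with higher perplexity is more surprising to the current model, so it might shift the model more than a same-size batch with lower perplexity, though that is an untested guess rather than something this sketch demonstrates.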
Although, in some sense, it's really hard/crazy to think about what this metric would be, besides something trivial like perplexity. Maybe this difficulty is what you want to avoid?
I honestly really don't know :/
I know it doesn't help you, but I would expect both blogs (and all the other stuff on the websites that's not in the blogs) to have some content aimed at a wider audience and some content that goes more into depth for a narrower audience.