Like Lizka said, glossaries seem to be a great idea!
Drawing on the posts and projects for software here, here, here, and here, there seems to be a concrete, accessible software project for creating a glossary procedurally.
(Somewhat technical stuff below; I wrote this quickly and it’s sort of long.)
Sketch of project
You can programmatically create an EA jargon glossary that complements, rather than replaces, a human-written glossary. It can continuously refresh itself, capturing new words as time passes.
This would mean writing a Python script or module that finds EA Forum jargon words and associates them with definitions.
To be concrete, here is one sketch of how to build this:
Essentially, the project is just counting words, filtering ones that appear a lot in EA content, and then attaching definitions to these words.
To get these words, essentially all you need to do is get a set of EA content (EA Forum and LessWrong comments/posts, which is accessible using the GraphQL API) and compare it to the words that appear in a normal corpus (which can come from Reddit or Wikipedia; e.g., see the Pushshift dumps here).
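As a toy illustration of that comparison step (pure Python, with two made-up one-sentence “corpora” standing in for the real ones), one could flag words whose relative frequency in the EA text is much higher than in the general text:

```python
import re
from collections import Counter

def word_freqs(text):
    """Lowercase, tokenize, and return each word's relative frequency."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def jargon_candidates(ea_text, general_text, ratio=5.0, floor=1e-6):
    """Words far more frequent in the EA corpus than in the general one.
    `floor` stands in for words the general corpus never contains."""
    ea = word_freqs(ea_text)
    general = word_freqs(general_text)
    return sorted(w for w, f in ea.items() if f / general.get(w, floor) >= ratio)

# Tiny made-up "corpora" just to show the mechanics:
ea = "longtermism and counterfactual impact matter; longtermism is discussed a lot"
general = "the weather matters a lot and the impact of rain is discussed"
print(jargon_candidates(ea, general))  # → ['counterfactual', 'longtermism', 'matter']
```

Note that “matter” sneaks in only because “matters” wasn’t lemmatized, which is exactly the kind of thing the NLP preprocessing step below would handle.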
You’d want to do some normal “NLP preprocessing”, plus things like TF-IDF (which essentially adjusts for words that appear a lot everywhere) or n-grams (which capture two-word concepts like “great reflection”). Synonym detection with word vectors, and more advanced extensions, could be added too.
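A minimal stdlib-only sketch of those two ideas (bigrams plus TF-IDF with a common smoothed-IDF variant; a real version would more likely use an NLP library):

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def ngrams(toks, n=2):
    """Adjacent n-word sequences, e.g. ('great', 'reflection')."""
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def tf_idf(docs):
    """Per-document TF-IDF scores over unigrams and bigrams.
    Terms common to every document get an IDF (and score) of zero."""
    term_lists = [tokens(d) + ngrams(tokens(d)) for d in docs]
    n_docs = len(docs)
    df = Counter(t for terms in term_lists for t in set(terms))
    scores = []
    for terms in term_lists:
        tf = Counter(terms)
        total = len(terms)
        scores.append({
            t: (c / total) * math.log((1 + n_docs) / (1 + df[t]))
            for t, c in tf.items()
        })
    return scores
```

Here a distinctive bigram like `("great", "reflection")` scores above zero in the document containing it, while a word appearing in every document scores zero.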
Pairing words with definitions is harder, and human input may be required. The script could probably help by making dictionary lookups (words like “grok”, “differential”, and “delta” can probably be found in normal dictionaries) and by producing snippets of recent contexts where each word was used.
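The snippet side is easy to sketch as a small keyword-in-context helper (the dictionary-lookup part would need an external API, so it’s left out here):

```python
import re

def usage_snippets(term, posts, window=40, max_snippets=3):
    """Pull short context windows around each occurrence of `term`,
    to help a human write (or sanity-check) a definition."""
    snippets = []
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    for post in posts:
        for m in pattern.finditer(post):
            start = max(0, m.start() - window)
            end = min(len(post), m.end() + window)
            snippets.append("..." + post[start:end] + "...")
            if len(snippets) >= max_snippets:
                return snippets
    return snippets
```

In the real pipeline, `posts` would be the recent post/comment bodies pulled from the forum, and the snippets could be shown next to each candidate term in whatever review interface the humans use.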
For the end output, as Lizka suggested, you could integrate this into the wiki, or even some kind of “view” for the forum, like a browser plug-in or LessWrong extension.
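For example, the finished glossary (as a term-to-definition mapping) could be rendered to markdown for pasting into, or syncing with, a wiki page; the layout below is just one guess at something wiki-friendly:

```python
def render_glossary_markdown(glossary):
    """Render {term: definition} as a markdown glossary,
    grouped under one heading per starting letter."""
    lines = ["# EA Jargon Glossary (auto-generated)", ""]
    current_letter = None
    for term in sorted(glossary, key=str.lower):
        letter = term[0].upper()
        if letter != current_letter:
            current_letter = letter
            lines += [f"## {letter}", ""]
        lines += [f"**{term}** - {glossary[term]}", ""]
    return "\n".join(lines)
```

A browser plug-in would instead consume the same mapping as JSON and highlight matching terms in forum pages.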
Because the core work is essentially word counting, while the later steps can be very sophisticated, this project would be accessible to people newer to NLP and could also interest more advanced practitioners.
By the way, it seems like this could totally get funded with an infrastructure grant. If you wanted to go in this direction, optionally:
You might want to submit the grant with someone as a “lead”, a sort of “project manager” who organizes people (not necessarily someone with formal or technical credentials, just someone friendly who creates collaboration among EAs).
There are different styles of doing this, but you could set this up as an open-source project with paid commitments, and try to tag as many EA software devs as is reasonable.
There may be reasons to get an EA infrastructure grant to do this:
This could help create a natural reason for collaboration and get EAs together
The formal grant might help encourage the project to get shipped (since names are on it and money has been paid)
It seems plausible this gives EAs some experience with collaborations for the future.
Anyways, apologies for the length. I sometimes get excited and like to write about ideas like this. Feel free to ignore me and just do it!