A search engine for micro-level data
Macro-level data is easy to find these days. If you want to know the historical GDP of China or carbon emissions of the U.S., you can find the information on many non-profit and for-profit sites via Google.
But suppose you want to quickly look up micro-level data such as “people’s satisfaction with their daily lives” or “the amount they spend on food.” You’d have to read dozens of papers, locate the names of the datasets they used, find out where the survey data is hosted (if it’s available at all), create an account on the hosting site, download the data, and check whether the variable matches what you were looking for. This process wastes researchers’ time and stifles novel, cross-disciplinary use of existing data.
I’d like to see/build a search engine that catalogs the variable names and other pieces of metadata for every dataset humans have ever created. (Google’s product https://datasetsearch.research.google.com fails to catalog many important datasets and doesn’t allow variable-level search, which I think is the main value proposition of this hypothetical search engine.)
Using this hypothetical search engine, researchers could quickly look up datasets that contain the variable they want and filter by relevant parameters such as age, country, and year of data collection. Many academic journals now require authors to make their data public (e.g. https://dataverse.harvard.edu), so we should build on this momentum to further increase the value of open data. Re-use of existing data is very limited today because researchers have no tool for discovery: knowledge of “what data is available on X topic” largely lives in experts’ heads and spreads by word of mouth.
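To make the idea concrete, here is a minimal sketch (in Python) of what a variable-level catalog entry and a filtered search might look like. The field names, dataset names, and URLs are invented for illustration; they are not a real schema or real data.

```python
from dataclasses import dataclass

# Hypothetical sketch of one variable-level catalog entry.
# All fields and example values are illustrative assumptions.
@dataclass
class VariableRecord:
    dataset: str            # name of the source dataset
    variable: str           # variable name as it appears in the codebook
    description: str        # human-readable label from the codebook
    country: str            # where the data was collected
    years: tuple[int, int]  # first and last year of data collection
    url: str                # landing page of the hosting repository

# A toy in-memory "index" standing in for the search engine's catalog.
catalog = [
    VariableRecord("Example Household Survey", "life_sat",
                   "Satisfaction with daily life (1-10 scale)",
                   "DE", (2005, 2020), "https://example.org/datasets/ehs"),
    VariableRecord("Example Expenditure Panel", "food_exp",
                   "Monthly household spending on food",
                   "US", (2010, 2022), "https://example.org/datasets/eep"),
]

def search(query: str, country: str | None = None,
           year: int | None = None) -> list[VariableRecord]:
    """Return entries whose description matches the query,
    optionally filtered by country and year of data collection."""
    hits = []
    for rec in catalog:
        if query.lower() not in rec.description.lower():
            continue
        if country is not None and rec.country != country:
            continue
        if year is not None and not (rec.years[0] <= year <= rec.years[1]):
            continue
        hits.append(rec)
    return hits

# e.g. find datasets that measure life satisfaction collected in 2015
print(search("satisfaction", year=2015))
```

The real service would of course need full-text and synonym matching rather than substring search, but the core object is the same: a row per variable, not per dataset.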
Another reason this search engine should be funded philanthropically is that it lacks commercial viability: the amount of manual labor doesn’t decrease with scale. The datasets to be catalogued come in all sorts of formats, and their codebooks don’t follow a fixed machine-readable template. (I assume large language models won’t be of much help either.) Thus, if we think such a search engine ought to exist, it will have to be funded by philanthropy.