Decentralized Historical Data Preservation and Why EA Should Care

Key Points:

Trustworthy historical data preservation is crucial across disciplines, from academic research and journalism to jurisprudence, informing our actions, including EA initiatives.
Historical data, such as newspaper archives and audio/video recordings, often exist in a single physical copy and a digital copy held by a single entity.
With advancements in AI, falsifying digital records in any format (text, pictures, audio, video) will soon become easier.
It is important to be able to prove that a digital file already existed before the advent of advanced AI, so that its authenticity can be established later.
Decentralized systems could be an effective solution for securing historical data reliably, though challenges include:
- Determining what to store and how to curate it.
- Finding tens of thousands of participants (or more) for such a network.
- Building trust within the network itself.

Individuals and states have historically engaged in book burning and record doctoring, and our current era is no different. Much historical research, whether in academic or judicial contexts, relies on public records, including mundane items like old newspapers.

Previously, altering records required expensive, time-consuming forgery. Now, AI enables a single individual with malicious intent and access to falsify data easily. The most exposed data is infrequently used, generally uninteresting, poorly secured due to funding shortages, and often stored by a single entity such as a state library.

For example, Vanuatu, a small island nation affected by climate change, is pursuing climate justice at the International Court of Justice. The case partially relies on the analysis of old media, newspapers, and archival records from the US and other high-emission countries to demonstrate that they were aware of the changing climate and knew its consequences. This is not to say that anything was altered in this particular case, or that addressing climate change in court is the best approach. Rather, the point is that establishing some form of truth and reaching an agreement in any framework requires trustworthy data.

EA, with its research-informed giving strategies, also relies on such data sources. Historical perspective is not always necessary, but whenever it is assessed, it depends crucially on understanding the social discourse of the time. If the underlying data is discredited, many research methods may become unusable because sufficient trust in the data can no longer be built.

There is a significant difference between a professional researcher's trust in their data source and the trust of the end user. Today, trust varies significantly among people with different belief systems and backgrounds. While a researcher, with a likely liberal bias, might maintain trust in institutions like libraries, others might be more skeptical. With the prevalence of AI, critics of any study could argue that the raw data may have been forged by the state or others.

Decentralization appears to be a modern solution to this problem. Distributed ledger technology, which rests on mathematics, could serve not only to safeguard data but also to build trust in it. Mathematics, arguably one of the least politicized and most trusted sciences, offers a reservoir of public trust we can tap into. The contemporary world, with its digital-first nature, large volumes, and significant noise (already largely driven by AI), may be hopeless to preserve in real time in a trustworthy manner. Old records, however, have already been collected and partially digitized by libraries and states, and can still be preserved in their current digital form.
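To make the "based on mathematics" point concrete, here is a minimal sketch of how a collection of digitized records could be reduced to one short fingerprint using a Merkle tree over SHA-256 hashes. This is only an illustration under my own assumptions, not a proposed design: the function names and the sample records are hypothetical, and how the resulting root would be published to a ledger or replicated across participants is left out entirely.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    """Return the raw 32-byte SHA-256 digest of data."""
    return hashlib.sha256(data).digest()


def merkle_root(records: list[bytes]) -> bytes:
    """Compute a Merkle root over a list of records (hypothetical helper).

    Leaves and inner nodes get different prefixes so that a leaf can
    never be confused with an inner node.
    """
    if not records:
        raise ValueError("need at least one record")
    level = [sha256(b"\x00" + record) for record in records]
    while len(level) > 1:
        # With an odd number of nodes, pair the last node with itself.
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]


if __name__ == "__main__":
    # Made-up stand-ins for scanned newspaper pages.
    archive = [b"page 1 scan", b"page 2 scan", b"page 3 scan"]
    print(merkle_root(archive).hex())
```

If many independent parties recorded the same root at the same time, any later change to a single record would produce a different root, and the mismatch would be visible to anyone who recomputes it.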

To examine the feasibility of a decentralized approach, take the Library of Congress in the US as an example: this would require distributing 21 petabytes. With 10x redundancy that is about 210 PB, which could be held by 20,000-30,000 people with 10 TB drives each. The hard drives would cost $10-20 million to start ($50-100 per TB), and then $1-2 million per year in replacements (assuming a 10-year lifespan for a hard drive). However, this is just one library in one country.
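The back-of-the-envelope numbers above can be reproduced with a short sketch. The function and parameter names are my own, and it assumes 1 PB = 1,000 TB, one full drive per participant, and drive replacement as the only recurring cost (no bandwidth, electricity, or coordination costs).

```python
import math


def storage_estimate(archive_pb, redundancy, drive_tb, cost_per_tb, lifespan_years):
    """Rough participant count and drive costs for a replicated archive."""
    total_tb = archive_pb * 1000 * redundancy      # total stored data, in TB
    participants = math.ceil(total_tb / drive_tb)  # one drive per participant
    upfront_cost = total_tb * cost_per_tb          # initial drive purchase
    annual_cost = upfront_cost / lifespan_years    # drives replaced over their lifespan
    return participants, upfront_cost, annual_cost


if __name__ == "__main__":
    for cost_per_tb in (50, 100):
        people, upfront, annual = storage_estimate(21, 10, 10, cost_per_tb, 10)
        print(f"${cost_per_tb}/TB: {people:,} participants, "
              f"${upfront:,.0f} upfront, ${annual:,.0f} per year")
```

At exactly full drives this gives 21,000 participants; the 20,000-30,000 range in the text leaves room for drives that are not completely filled.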

While a “citizen librarian” model, analogous to citizen science, would be ideal from a decentralization perspective (and have Fahrenheit 451 vibes), it may be unrealistic to expect enough people with the necessary hardware to participate. However, it is not impossible as there are movements that store large databases of books on torrents. An alternative could involve a few hundred diverse institutions: universities, schools, private companies, and possibly even churches and hospitals, specifically chosen to cover the widest range of trust systems.

The challenge extends beyond merely preserving historical digital materials. It is essential to maintain public trust in their authenticity. Addressing this would be costly, but it is still feasible today. If we do not, doubt will always linger about the authenticity of what we are looking at.

I look forward to hearing your perspectives on this matter!