I think the compute/network cost of hashing (reading through literally all content) is large, possibly multiple orders of magnitude more than the cost implied here.
Yeah, they say[1] they have over 100 PB of content. That is quite a bit, and if it’s not in an in-house datacenter, going through it will be expensive.
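For scale, here is a rough back-of-envelope sketch. The 100 PB figure is from their post; the throughput and egress prices are assumptions I’ve picked purely for illustration:

```python
# Back-of-envelope: what does hashing ~100 PB cost?
# All rate/price numbers below are assumed, not measured.
PETABYTE = 10**15

total_bytes = 100 * PETABYTE        # ~100 PB of archived content
hash_throughput = 10**9             # assume ~1 GB/s of SHA-256 per core

core_hours = total_bytes / hash_throughput / 3600
print(f"{core_hours:,.0f} core-hours of hashing")   # ~27,778 core-hours

# If the data sits in commodity cloud storage, egress can dominate compute.
egress_cost_per_gb = 0.05           # assumed $/GB
egress_cost = total_bytes / 10**9 * egress_cost_per_gb
print(f"${egress_cost:,.0f} in egress alone")       # ~$5,000,000
```

Under these assumptions the compute itself is modest, but moving the bytes out of third-party storage is not, which is why “in-house datacenter or not” matters so much here.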
If 20% of content remains untimestamped, do issues about the credibility of content remain?
If 20% of content remains untimestamped, then one wouldn’t consider all non-timestamped content suspicious on that account alone. The benefits show up in other ways:
If 80% of content is timestamped, then all that content is protected from suspicion that newer AI might have created it.
If the Internet Archive is known to have timestamped all of its content, then non-timestamped content purportedly from an old enough version of a website that is in the archive becomes suspicious.
One might still consider non-timestamped content suspicious in a future where AI and/or institutional decline has begun eroding the prior (default, average, general) trust in all content.
There’s probably content with many tiny variations, and it would be better to group that content together? … Finding the algorithm/implementation to do this seems important, but also orders of magnitude more costly?
It might be important, but it’s probably not as urgent. Timestamping has to happen at the time you want the timestamp for. Investigating, and convincing people of, which pieces of content are equivalent from some inexact (or exact but higher-level) point of view can be done later. I imagine this is one possible future application for which the timestamps will be valuable. Applications such as these I would probably put out of scope, though.
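To illustrate why this can be deferred: any “exact but higher-level” equivalence can be computed later from the raw content, as long as the raw content was timestamped. A minimal sketch, where the specific normalization (lowercasing and collapsing whitespace) is a made-up example rather than a recommendation:

```python
import hashlib

def normalized_fingerprint(content: bytes) -> str:
    """Hash content after collapsing case and whitespace, so trivially
    differing variants share a fingerprint. The normalization chosen here
    is illustrative only; real grouping would need a more careful scheme."""
    text = content.decode("utf-8", errors="replace")
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode()).hexdigest()

variants = [b"Hello,  World!", b"hello, world!", b"Goodbye"]
groups: dict[str, list[bytes]] = {}
for v in variants:
    groups.setdefault(normalized_fingerprint(v), []).append(v)
# The first two variants fall into one group, the third into its own.
```

Nothing in this grouping step needs to exist at timestamping time; it only consumes the content that the timestamps already protect.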
Let me clarify the cryptography involved:
There is cryptographic signing, that lets Alice sign a statement X so that Bob is able to cryptographically verify that Alice claims X. X could for example be “Content Y was created in 2023”. This signature is evidence for X only to the extent that Bob trusts Alice. This is NOT what I suggest we use, at least not primarily.
There is cryptographic time-stamping, that lets Alice timestamp content X at time T so that Bob is able to cryptographically verify that content X existed before time T. Bob does not need to trust Alice, or anyone else at all, for this to work. This is what I suggest we use.
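The standard way to make this cheap at scale is to batch many documents into a Merkle tree and publish only the root somewhere widely witnessed (this is, roughly, what systems like OpenTimestamps do). A minimal sketch:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold a list of documents into a single 32-byte commitment."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:              # duplicate last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

docs = [b"content A", b"content B", b"content C"]
root = merkle_root(docs)
# Publishing `root` at time T (e.g. embedding it in a public blockchain)
# commits to every document at once. Later, an inclusion path from
# h(doc) up to `root` proves that document existed before T, without
# trusting whoever built the tree.
```

This is why Bob doesn’t need to trust Alice: the proof rests on the hash function and on the public record of when the root appeared, not on anyone’s honesty.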
Back-dating content is therefore cryptographically impossible when using cryptographic time-stamping. That is kind of the point; otherwise I wouldn’t be convinced that the value of the timestamps would grow over time. To the extent we use cryptographic time-stamping, the argument is ‘back-dating will be entirely impossible in the future’.
However, cryptographic time-stamping and cryptographic signing can be combined in interesting ways:
We could sign first and then timestamp, achieving a cryptographic proof that in or before 2023, archive.org claimed that content X was created in 1987. This might be valuable if the organization or its cryptographic key were later compromised, e.g. by corruption, hacking, or government overreach. Timestamps created before an organization is compromised can still be trusted: you can always know the content was created in or before 2023, even if you have reason to doubt a claim made at that time.
We could timestamp, then sign, then timestamp. This allows anyone to cryptographically verify that e.g. sometime between 2023-01-20 and 2023-01-30, Alice claimed that content X was created in 1987. This could be valuable if we later learn we have reason to distrust the organization before a certain date. Again, we will always know X was created before 2023-01-30, no matter anyone’s trustworthiness.
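The two combinations above can be sketched as data flow. The primitives here are stand-ins I’ve chosen for illustration: HMAC stands in for a real public-key signature, and `timestamp()` stands in for committing a digest to a widely witnessed log at the stated time:

```python
import hashlib, hmac, json

KEY = b"archive-signing-key"   # stand-in; a real scheme would use a keypair

def sign(data: bytes) -> bytes:
    """HMAC as a placeholder for a public-key signature."""
    return hmac.new(KEY, data, hashlib.sha256).digest()

def timestamp(data: bytes, when: str) -> dict:
    """Placeholder for anchoring a digest in a public log at time `when`."""
    return {"digest": hashlib.sha256(data).hexdigest(), "time": when}

content = b"content X"
claim = content + b"|claimed-origin:1987"

# Sign, then timestamp: proves the signed claim existed by 2023-01-30.
sig = sign(claim)
proof_a = timestamp(claim + sig, "2023-01-30")

# Timestamp, sign, timestamp: brackets *when* the claim was made.
ts1 = timestamp(content, "2023-01-20")
sig2 = sign(json.dumps(ts1).encode())
proof_b = timestamp(json.dumps(ts1).encode() + sig2, "2023-01-30")
# proof_b shows the signed claim was made between 2023-01-20 and
# 2023-01-30; proof_a and proof_b both survive later key compromise.
```

In both cases the outer timestamp is what remains trustworthy unconditionally; the signature only adds a revocable layer of “who said so, and when”.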
As for the issue of 2023 timestamps being misleading for 1995 content: this issue is probably very real, but it’s less urgent. Making the timestamps is urgent. On top of the underlying data and cryptographic proofs, different UIs can be built and improved over time.