I think the compute/network cost for the hashing (going through literally all content) seems large, possibly multiple orders of magnitude more than the cost implied here.
Yeah, they say[1] they have over 100 PB of content. That is quite a bit, and if it isn't in an in-house datacenter, going through all of it will be expensive.
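To make the cost split a bit more concrete, here is a minimal sketch (my assumption of how it could be done, not the Internet Archive's actual pipeline): hash each file in a streaming fashion and fold the per-file digests into one Merkle root per batch, so that only a 32-byte digest per batch needs to be published or anchored externally. The cost of reading all content once remains either way; the `warc_batch` path and file pattern are just placeholders.

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path, chunk_size: int = 1 << 20) -> bytes:
    """Hash a file in fixed-size chunks so memory use stays constant."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Combine leaf digests pairwise until a single root remains."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate last node on odd-sized levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# Example batch (placeholder path): only the root needs timestamping externally.
batch = [file_sha256(p) for p in sorted(Path("warc_batch").glob("*.warc.gz"))]
root = merkle_root(batch)
print(root.hex())
```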
If 20% of content remains not timestamped, do issues about the credibility of content remain?
If 20% of content remains not timestamped, then one wouldn't consider all non-timestamped content suspicious on that account alone. The benefits come in other ways:
If 80% of content is timestamped, then all that content is protected from suspicion that newer AI might have created it.
If the Internet Archive is known to have timestamped all of its content, then non-timestamped content that purports to come from an old enough version of a website that is in the archive becomes suspicious.
One might still consider non-timestamped content suspicious in a future where AI and/or institutional decline has begun eroding the prior (default, average, general) trust in all content.
There’s probably content with many tiny variations, and it would be better to group that content together? … Finding the algorithm/implementation to do this seems important but also orders of magnitude more costly?
It might be important, but it’s probably not as urgent. Timestamping has to happen at the time you want the timestamp for. Investigating, and convincing people about, which pieces of content are equivalent from some inexact (or exact but higher-level) point of view can be done later. I imagine that this is one possible future application for which these timestamps will be valuable, but I would probably put applications like these out of scope.
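To illustrate the "equivalence can be done later" point, here is a rough sketch under the assumption that exact SHA-256 hashes were timestamped up front. The `canonicalize` function is hypothetical and stands in for whatever equivalence notion people agree on later; the existing timestamps never need to change.

```python
import hashlib

def canonicalize(content: bytes) -> bytes:
    # Hypothetical, later-agreed equivalence notion: ignore whitespace differences.
    return b" ".join(content.split())

def variant_matches_timestamped_blob(variant: bytes,
                                     archived_blob: bytes,
                                     timestamped_hash: bytes) -> bool:
    """Accept a variant if (a) the archived blob is really the one whose exact
    hash was timestamped, and (b) the variant is equivalent to that blob under
    the canonicalization chosen later."""
    exact_ok = hashlib.sha256(archived_blob).digest() == timestamped_hash
    equivalent = canonicalize(variant) == canonicalize(archived_blob)
    return exact_ok and equivalent
```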
[1] https://archive.devcon.org/archive/watch/6/universal-access-to-all-knowledge-decentralization-experiments-at-the-internet-archive