This post was well written and well structured. You are talented and put a lot of effort into this!
Related to what others have commented, I think the cost of this is much higher than what you suggest; e.g., your guess of the cost using HDD storage cost and a multiplier seems crude.
In addition to dev time, as someone else mentioned, I think the compute/network cost of hashing (going through literally all content) is large, possibly multiple orders of magnitude more than the cost implied here.
Also, some other thoughts (I’m unsure about these, since I haven’t fully thought through your post):
Isn’t there some sort of giant coordination problem here? It seems like you need a large fraction of all content timestamped in this way.
If 20% of content remains not timestamped, do issues about the credibility of content remain?
Is something beyond hashing valuable and important? (I probably need to think about this more; this could be completely wrong.)
There’s probably content with many tiny variations and it’s better to group that content together?
E.g., changing a few pixels or frames in a video alters the hash; instead you want some marker that maps closer to the “content” (see the sketch after this list).
Finding the algorithm/implementation to do this seems important but also orders of magnitude more costly?
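On the “marker closer to the content” idea, a perceptual hash is the usual tool. Here is a minimal “average hash” sketch, where small pixel-level edits usually leave the hash unchanged; it assumes the Pillow library, the file names are hypothetical, and the 8x8 mean-threshold scheme is just one common choice, not necessarily what the post has in mind:

```python
# Minimal "average hash" sketch: near-duplicate images tend to get the same
# hash, unlike a cryptographic hash where any change flips the whole digest.
from PIL import Image

def average_hash(path: str, size: int = 8) -> int:
    # Downscale to size x size grayscale and threshold each pixel at the mean.
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming_distance(a: int, b: int) -> int:
    # Number of differing bits; a small distance suggests "same content".
    return bin(a ^ b).count("1")

# Hypothetical usage:
# d = hamming_distance(average_hash("original.png"), average_hash("tweaked.png"))
# print("probably the same content" if d <= 5 else "different content")
```

The catch, as you note, is doing something like this robustly for video and at archive scale, which is where the extra orders of magnitude of cost would come from.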
I think the compute/network cost of hashing (going through literally all content) is large, possibly multiple orders of magnitude more than the cost implied here.
Yeah, they say[1] they have over 100 PB of content. That is quite a bit, and if it’s not in an in-house datacenter, going through it will be expensive.
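For a sense of scale, a crude back-of-envelope; the throughput and egress price below are my own assumptions for illustration, not figures from the post or the talk:

```python
# Crude back-of-envelope for hashing ~100 PB under assumed figures.
total_bytes = 100e15          # ~100 PB of content
hash_throughput = 1e9         # assumed ~1 GB/s of SHA-256 per core
egress_per_gb = 0.05          # assumed ~$0.05/GB if data must leave the datacenter

core_seconds = total_bytes / hash_throughput
core_years = core_seconds / (3600 * 24 * 365)
egress_cost = (total_bytes / 1e9) * egress_per_gb

print(f"Hashing: ~{core_years:.1f} core-years of CPU time")
print(f"Network egress (if not in-house): ~${egress_cost:,.0f}")
```

Under those assumptions the CPU time is modest and the bill is dominated by moving the data, which is why the in-house-datacenter question matters.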
If 20% of content remains not timestamped, do issues about the credibility of content remain?
If 20% of content remains not timestamped, then one wouldn’t consider all non-timestamped content suspicious on that account alone. The benefits come in other ways (see the sketch after this list for the basic mechanism):
If 80% of content is timestamped, then all that content is protected from suspicion that newer AI might have created it.
If the Internet Archive is known to have timestamped all of its content, then non-timestamped content that supposedly comes from an old enough version of a website in the archive becomes suspicious.
One might still consider non-timestamped content suspicious in a future where AI and/or institutional decline has begun eroding the prior (default, average, general) trust in all content.
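For concreteness, a minimal sketch of the basic mechanism as I understand it; the log format and names here are mine, not the post’s:

```python
# Publish sha256(content) in some public, append-only record at time T;
# anyone who later re-hashes the file and finds that digest published on or
# before a given date knows the content existed by then (so it cannot have
# been produced by a newer AI).
import hashlib

def content_hash(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# `published_log` stands in for whatever public record is used (a blockchain,
# a signed transparency log, etc.): digest -> ISO publication date.
def existed_by(path: str, date: str, published_log: dict[str, str]) -> bool:
    digest = content_hash(path)
    # ISO dates ("YYYY-MM-DD") compare correctly as strings.
    return digest in published_log and published_log[digest] <= date
```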
There’s probably content with many tiny variations and it’s better to group that content together? … Finding the algorithm/implementation to do this seems important but also orders of magnitude more costly?
It might be important, but it’s probably not as urgent. Timestamping has to happen at the time you want the timestamp for. Investigating which pieces of content are equivalent from some inexact (or exact but higher-level) point of view, and convincing people of it, can be done later. I imagine this is one possible future application for which these timestamps will be valuable, but I would probably put applications like these out of scope.
[1] https://archive.devcon.org/archive/watch/6/universal-access-to-all-knowledge-decentralization-experiments-at-the-internet-archive