The Internet Is Failing The Website Preservation Test

Gork@beehaw.org · 2 years ago

The Internet Is Failing The Website Preservation Test

Lucien@beehaw.org · edit-2 2 years ago

I don’t think this will ever happen. The web is more than a network of changing documents. It’s a network of portals into systems which change state based on who is looking at them and what they do.

In order for something like this to work, you’d need to determine what the “official” view of any given document is, but the reality is that most documents are generated on the spot from many sources of data. And they aren’t just generated on the spot, they’re Turing complete documents which change themselves over time.

It’s a bit of a quantum problem - you can’t perfectly store a document while also allowing it to change, and the change in many cases is what gives it value.

Snapshots, distributed storage, and change feeds only work for static documents. Archive.org does this, and while you could probably improve the fidelity or efficiency, you won’t be able to change the underlying nature of what it is storing.

If all of reddit were deleted, it would definitely be useful to have a publically archived snapshot of Reddit. Doing so is definitely possible, particularly if they decide to cooperate with archival efforts. On the other hand, you can’t preserve all of the value by simply making a snapshot of the static content available.

All that said, if we limit ourselves to static documents, you still need to convince everyone to take part. That takes time and money away from productive pursuits such as actually creating content, to solve something which honestly doesn’t matter to the creator. It’s a solution to a problem which solely affects people accessing information after those who created it are no longer in a position to care about said information, with deep tradeoffs in efficiency, accessibility, and cost at the time of creation. You’d never get enough people to agree to it that it would make a difference.

lloram239@feddit.de · 2 years ago

but the reality is that most documents are generated on the spot from many sources of data.

That’s only true due to the way the current Web (d)evolved into a bunch of apps rendered in HTML. But there is fundamentally no reason why it should be that way. The actual data that drives the Web is mostly completely static. The videos Youtube has on their server don’t change. The post on Reddit very rarely change. Twitter posts don’t change either. The dynamic parts of the Web are the UI and the ads, they might change on each and every access, or be different for different users, but they aren’t the parts you want to link to anyway, you want to link to a specific users comment, not a specific users comment rendered in a specific version of the Reddit UI with whatever ads were on display that day.

Usenet did that (almost) correct 40 years ago, each message got an message-id, each message replying to that message would contain that id in a header. This is why large chunks of Usenet could be restored from tape archives and put be back together. The way content linked to each other didn’t depend on a storage location. It wasn’t perfect of course, it had no cryptography going on and depended completely on users behaving nicely.

Doing so is definitely possible, particularly if they decide to cooperate with archival efforts.

No, that’s the problem with URLs. This is not possible. The domain reddit.com belongs to a company and they control what gets shown when you access it. You can make your own reddit-archive.org, but that’s not going to fix the millions of links that point to reddit.com and are now all 404.

All that said, if we limit ourselves to static documents, you still need to convince everyone to take part.

The software world operates in large part on Git, which already does most of this. What’s missing there is some kind of DHT to automatically lookup content. It’s also not an all or nothing, take the Fediverse, the idea of distributing content is already there, but the URLs are garbage, like:

https://beehaw.org/comment/291402

What’s 291402? Why is the id 854874 when accessing the same post through feddit.de? Those are storage locations implementation details leaking out into the public. That really shouldn’t happen, that should be a globally unique content hash or a UUID.

When you have a real content hash you can do fun stuff, in IPFS URLs for example:

https://ipfs.io/ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf

The /ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf part is server independent, you can access the same document via:

https://dweb.link/ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf

or even just view it on your local machine directly via the filesystem, without manually downloading:

$ acrobat /ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf

There are a whole lot of possibilities that open up when you have better names for content, having links on the Web that don’t go 404 is just the start.

soiling@beehaw.org · 2 years ago

re: static content

How does authentication factor into this? even if we exclude marketing/tracking bullshit, there is a very real concern on many sites about people seeing the data they’re allowed to see. There are even legal requirements. If that data (such as health records) is statically held in a blockchain such that anyone can access it by its hash, privacy evaporates, doesn’t it?

lloram239@feddit.de · 2 years ago

How does authentication factor into this?

That’s where it gets complicated. Git sidesteps the problem by simply being a file format, the downloading still happens over regular old HTTP, so you can apply all the same restrictions as on a regular website. IPFS on the other side ignores the problem and assumes all data is redistributable and accessible to everybody. I find that approach rather problematic and short sighted, as that’s just not how copyright and licensing works. Even data that is freely redistributable needs to declare so, as otherwise the default fallback is copyright and that doesn’t allow redistribution unless explicitly allowed. IPFS so far has no way to tag data with license, author, etc. LBRY (the thing behind Odysee.com) should handle that a bit better, though I am not sure on the detail.

LewsTherinTelescope@beehaw.org · edit-2 2 years ago

Inability to edit or delete anything also fundamentally has a lot of problems on its own. Accidentally post a picture with a piece of mail in the background and catch it a second after sending? Too late, anyone who looks now has your home address. Child shares too much online and parent wants to undo that? No can do, it’s there forever now. Post a link and later learn it was misinformation and want to take it down? Sucks to be you, or anyone else that sees it. Your ex post revenge porn? Just gotta live with it for the rest of time.

There’s always a risk of that when posting anything online, but that doesn’t mean systems should be designed to lean into that by default.