Zombse

The Zombie Stack Exchanges That Just Won't Die

View the Project on GitHub anjackson/zombse

How to handle sensibly lossful digital long term preservation?

Is there an application, or a module/plug-in in a long time preservation framework, which makes a storage system systematically lose information you don't need any longer while preserving the data you will need again in the future?

I think of a system that could distinguish accesses to data which indicate that it was used (e.g. reading in a file for a longer time that suggests that a user really reads it, or re-opening the same file without using search in prior, or with a certain frequency per month, year, ...) from accesses which are not likely to suggest the data to be of any worth any more (e.g. virus scanners or search systems inspecting those files or users flipping through files found from search results for a short time only, then going on to the next hit suggesting they didn't find the information they were looking for in them) and finally forgetting / overwriting uninteresting data in case that storage becomes short while keeping the worthwhile data.

I observe this for a couple of times whenever I migrated to new desktop computer hardware, I kept a backup but usually I only need very few things out of it later on, either because old software cannot be used in a sensible way any longer (what to do with a DOS screen saver on Windows?) or, for personal data, because of interests simply shifting. (For example, I am no longer interested in things I cared for as a teenager.) The same can be observed in libraries: It may happen that, if you randomly pick a book from a shelf, its pages stick together, indicating no one has read it for a couple of centuries.

Are there (problably software) approaches that address the problem of sensible forgetting in digital age?

Paramaeleon

Comments

Answer by Paramaeleon

Freenet is an open source software that uses a decentralized distributed data store to store information. Servers contribute memory from their own hardware to “the net”, but the data store is limited to the sum of these contributions. Whenever a user requests a file it is copied peer-to-peer, leaving copies on the systems it passes. So files often requested are highly available, files never requested fall off aback after some time.

Comments

Answer by Paul Wheatley

De-selection or disposal of unwanted data is a difficult challenge. Your question assumes that data becomes less valuable the longer it remains stored without access by a user. However, this is not always a good measure of its value. Interest in data by users can remain low or none existent for a considerable period of time, before current events or some other triggers create new interest in the data, leading to a spike in access. This is indeed quite common in library and archival collections.

Manual de-selection can be very time consuming and costly, but automated de-selection may not do the job of "sensible forgetting" well, potentially putting important data at risk. Unfortunately there are no easy answers.

Comments

Answer by wizzard0

There are generally 2 approaches:

  1. "exponential memory", which assumes we know how to aggregate two pieces of data into one, maybe as simple as throwing one out, or averaging 2 data points

    • for one year, we store 100 points of data, for next year we throw out 50, then 25 and so on
  2. "store everything", because the data volumes grow exponentially and storage becomes cheaper exponentially, so the total cost is limited even when storing infinite history.

Approach 1 needs cherry-picking of nontrivial data (such as books), which can make it even more costly than approach 2. So, I vote to save everything with the highest precision possible.

Comments

Answer by Andy Jackson

This system already exists - it's called the Internet! The popular, widely used stuff gets copied, shared and re-shared. It won't get lost, but then it will also be difficult to tell if it is authentic.

The main problem, however, is that only keeping the popular or frequently used things may be too simplistic as a way of judging the value of content. Is a resource most people use rarely more or less valuable than a resource some people use daily?

Perhaps the popular, mainstream stuff will largely look after itself. Maybe we should worry more about making sure the smaller, quieter voices are not lost in the noise.

Comments