Zombse

The Zombie Stack Exchanges That Just Won't Die


What types of data compression tactics have a place in digital preservation practice?

Various tactics for data compression can decrease the cost of long-term preservation by reducing the amount of storage space required. At the same time, different compression approaches bring with them their own risks. How should one weigh the risks and rewards in the case of the following kinds of compression?

Three types of compression to consider (feel free to suggest others as a comment)

  1. File compression: using a compression algorithm suited to the file type (consider both lossy and lossless compression).
  2. Hardware compression: usually compression performed by a tape drive as the data is written to tape.
  3. Disk compression: performed by many newer storage appliances, typically combining compression with de-duplication.

Trevor Owens


Answer by Joe Atzberger

You will need further specifications, including:

Focusing on disk- and file-level representation as something you will personally improve by adding compression and decompression may be the wrong approach. Many of the best platforms hide this level of detail from the user and let the platform handle complications such as replication, availability, indexing, data integrity, and caching.

All significant compression gains involve some trade-off between read and write costs. You may introduce performance-limiting processor costs merely by repeatedly retrieving and decompressing a popular set of files (without caching), or by repeatedly writing "just one more" file to an already compressed set. It can get worse: file-compressing rich media that is already encoded with a compressing codec frequently produces a larger file, which then requires additional space and serial processing.
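The point about already-encoded media can be checked with a rough sketch like the one below. It uses Python's standard `zlib` module. The sample data is an assumption made for illustration: repetitive text stands in for catalogue-style records, and seeded pseudo-random bytes stand in for the high-entropy output of a media codec. The helper names `compression_ratio` and `pseudo_random_bytes` are hypothetical and do not come from any particular preservation tool.

```python
import random
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (below 1.0 means savings)."""
    return len(zlib.compress(data, level)) / len(data)


def pseudo_random_bytes(n: int, seed: int = 0) -> bytes:
    """Deterministic high-entropy bytes, an assumed stand-in for codec-encoded media."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


# Hypothetical sample data: repetitive text versus high-entropy bytes.
text_like = b"title,creator,date,rights\n" * 4000
media_like = pseudo_random_bytes(100_000)

if __name__ == "__main__":
    print(f"text-like data:  {compression_ratio(text_like):.3f}")
    print(f"media-like data: {compression_ratio(media_like):.3f}")
```

On data like this, the text-like input should shrink dramatically, while the media-like input should come out no smaller than it went in, because the compressor finds no redundancy and still adds its own framing bytes. Real media files will vary, so it is worth measuring a sample of your own holdings before committing to a policy.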
