
Fixity checking in digital preservation systems

If your institution archives digital content and actively records fixity information (checksums) for files, how often and how many of the files are checked? Some surveys have been done before, but I am specifically interested in how many files are checked: are all files checked, or is a sample checked, and if the latter, how is the sample chosen?

Ed Summers

Comments

Answer by Paul Wheatley

Matthew Addis suggests that answering this question requires a trade-off between risk and cost, which he blogs about here and presents in a more formal write-up, with some visualisation and links to costing tools, here. As is usually the case in digital preservation, guaranteeing longevity with an unlimited budget is straightforward. Doing it within the usual (tight) budget constraints is harder. Deciding how much to spend to achieve high (if not absolute) confidence in longevity is the holy grail.
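To make that trade-off concrete, here is a toy calculation, not Addis's model: assume two copies of every object, a small independent chance per month that any one copy silently corrupts, a fixed cost per scrub pass, and a loss only when both copies corrupt within the same check interval (otherwise the scrub repairs the damage from the good copy). Every number below is invented; only the shape of the curve, cheap-but-risky long intervals versus expensive-but-safe short ones, is the point.

```python
"""Toy model of the scrub-frequency trade-off.

All figures are invented for illustration; this is not Addis's model.
Assumptions: two copies of each object, each copy has an independent chance
of silently corrupting each month, and an object is lost only if both copies
corrupt within the same check interval.
"""

N_OBJECTS = 10_000_000          # objects under management (invented)
P_CORRUPT_PER_MONTH = 1e-4      # chance one copy corrupts in a month (invented)
COST_PER_SCRUB_PASS = 500.0     # cost of one full pass over the collection (invented)
COST_PER_LOST_OBJECT = 1_000.0  # cost assigned to an unrecoverable object (invented)


def expected_annual_cost(check_interval_months: float) -> float:
    """Expected yearly spend on scrubbing plus expected yearly cost of losses."""
    passes_per_year = 12.0 / check_interval_months
    scrub_cost = passes_per_year * COST_PER_SCRUB_PASS
    # Probability that both copies of one object corrupt within a single interval.
    p_both = (P_CORRUPT_PER_MONTH * check_interval_months) ** 2
    expected_losses = N_OBJECTS * p_both * passes_per_year
    return scrub_cost + expected_losses * COST_PER_LOST_OBJECT


if __name__ == "__main__":
    for months in (0.5, 1, 2, 3, 6, 12, 24):
        print(f"check every {months:4} months -> expected cost/year: "
              f"{expected_annual_cost(months):9.0f}")
```

With these made-up figures the total bottoms out at roughly a two-to-three-month interval; the costing tools Addis links to are essentially doing this calculation with defensible inputs.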

Comments

Answer by Declan Fleming

ACE is the fixity checker that Chronopolis uses: https://wiki.umiacs.umd.edu/adapt/index.php/Ace:Main

I'm not really sure how long it takes to get from one end of our set of objects to the other, but I believe it's running all the time.
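For a feel of what "running all the time" looks like, here is a minimal sketch of a rolling audit, not ACE itself: walk a manifest of path/checksum pairs, verify each file, and begin the next pass as soon as one finishes. The manifest format, the throttle, and the use of SHA-256 are all assumptions.

```python
import hashlib
import time
from pathlib import Path

MANIFEST = Path("manifest.tsv")  # hypothetical lines of "relative/path<TAB>sha256"
PAUSE_BETWEEN_FILES = 0.1        # seconds; throttle so auditing never saturates I/O


def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def one_pass() -> None:
    """Verify every entry in the manifest once, reporting any mismatch."""
    for line in MANIFEST.read_text().splitlines():
        rel_path, expected = line.split("\t")
        if sha256(Path(rel_path)) != expected:
            print(f"FIXITY MISMATCH: {rel_path}")
        time.sleep(PAUSE_BETWEEN_FILES)


if __name__ == "__main__":
    while True:  # "running all the time": start the next pass as soon as one finishes
        one_pass()
```

ACE adds more on top of this kind of loop (notably integrity tokens that let you audit the auditor), which a sketch like this does not attempt.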

Comments

Answer by Andy Jackson

At the UK Web Archive we store our 56TB of data on HDFS, where each block is replicated three times, giving ~170TB in total. We use the default HDFS data-scanning frequency of three weeks, i.e. every block on every node is checked once every three weeks. For comparison, this best practice page indicates that weekly-to-monthly fixity checking (a.k.a. 'scrubbing') is recommended for ZFS.
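As a back-of-the-envelope check on what a three-week full scan implies, the figures above translate into a modest sustained read rate once spread across a cluster. The node count below is invented; only the arithmetic matters.

```python
def required_scan_rate_mb_s(total_bytes: float, period_days: float) -> float:
    """Sustained aggregate read rate (MB/s) needed to touch every byte once per period."""
    return total_bytes / (period_days * 86_400) / 1e6


if __name__ == "__main__":
    aggregate = required_scan_rate_mb_s(170e12, 21)  # ~170TB of blocks, 3-week scan period
    nodes = 28                                       # hypothetical datanode count
    print(f"aggregate: {aggregate:.0f} MB/s, per node: {aggregate / nodes:.1f} MB/s")
```

A few MB/s per node is background noise for a disk cluster, which is why the point below about storage architecture matters: the same scan against tapes on shelves would be dominated by mount and seek time rather than read bandwidth.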

In general, as @paul-wheatley says, it's a trade-off between the risk of (multiple) failures and the cost of checking the data, and it is therefore controlled by the time between fixity checks. This also has consequences for your choice of storage architecture: high-latency storage (e.g. tapes on shelves) is difficult to scan effectively compared to a hard-disk cluster with plenty of CPU time per TB.

UPDATE: I should point out that my answer only applies to the routine fixity checking that is used to ensure that stable resources are unchanged. Further fixity checks can and should be performed when data is copied, moved or rearranged. HDFS can only ensure the data has not changed since it was added to HDFS. The process of moving data into or around HDFS requires its own workflow-level checks to ensure that the data has been transferred correctly.
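A minimal sketch of that kind of workflow-level check, assuming a single file copy is the unit of transfer: hash the source, copy it, re-hash the destination, and fail loudly on mismatch. The paths and the choice of SHA-256 are illustrative, not anyone's actual workflow.

```python
import hashlib
import shutil
from pathlib import Path


def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_with_fixity_check(src: Path, dst: Path) -> str:
    """Copy src to dst, verify the copy matches the source digest, return the digest."""
    digest_before = sha256(src)
    shutil.copy2(src, dst)
    if sha256(dst) != digest_before:
        raise IOError(f"fixity mismatch after copy: {src} -> {dst}")
    return digest_before  # worth recording as the reference value for later scrubbing


if __name__ == "__main__":
    digest = copy_with_fixity_check(Path("incoming/report.pdf"), Path("archive/report.pdf"))
    print("stored with sha256", digest)
```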

Comments

Answer by Bill Lefurgy

According to the Portico website describing the "Processing and Archival Deposit" step, "All content processing includes: 1) Validating files against their format specifications; 2) Verifying checksums; ...." In addition, Portico states in this PDF that for "Monitoring and management" they "perform regular fixity and completeness checks of preserved content and restore any content identified as damaged from an archive replica."

So it seems that this approach involves fixity-checking all files upon arrival and then again on a periodic basis once under management.
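A sketch of that ingest-then-periodic pattern, with an invented manifest format and no claim to reflect Portico's actual pipeline: verify the depositor-supplied checksums on arrival (format validation is left out here) and keep a per-file fixity record that later scheduled checks can update.

```python
import hashlib
import json
import time
from pathlib import Path


def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_deposit(deposit_dir: Path, supplied: dict) -> dict:
    """Check every supplied checksum at ingest; return a fixity record for later re-checks."""
    record = {}
    for rel_path, expected in supplied.items():
        actual = sha256(deposit_dir / rel_path)
        if actual != expected:
            raise ValueError(f"deposit rejected, checksum mismatch: {rel_path}")
        record[rel_path] = {"sha256": actual, "last_verified": time.time()}
    return record


if __name__ == "__main__":
    # Hypothetical depositor manifest: {"images/page-001.tif": "<sha256>", ...}
    supplied = json.loads(Path("deposit/manifest.json").read_text())
    fixity = verify_deposit(Path("deposit"), supplied)
    Path("archive/fixity.json").write_text(json.dumps(fixity, indent=2))
```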

Comments

Answer by dsalo

For teensy collections, this can be straightforward. DSpace checks checksums on a cron job; everywhere I've used it, the job has run every night, checking everything in the repository.

(Yes, I know from experience it works. Tripped the checksum mismatch error once -- long story, it was my fault, I fixed it.)
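The cron-driven version needs very little beyond the checking itself, since cron will mail any output the job produces. A bare-bones sketch, with an invented checksum store and no relation to DSpace's actual checker:

```python
#!/usr/bin/env python3
# Run nightly from cron, e.g.:  0 2 * * *  /usr/local/bin/nightly_fixity_check.py
import hashlib
import json
import sys
from pathlib import Path

STORE = Path("/var/lib/repo/checksums.json")  # hypothetical {"relative/path": "<sha256>", ...}
ASSETS = Path("/var/lib/repo/assets")


def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def main() -> int:
    expected = json.loads(STORE.read_text())
    mismatches = [p for p, digest in expected.items() if sha256(ASSETS / p) != digest]
    for p in mismatches:
        # Anything printed here ends up in cron's mail to the administrator.
        print(f"checksum mismatch: {p}", file=sys.stderr)
    return 1 if mismatches else 0


if __name__ == "__main__":
    sys.exit(main())
```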

Comments

Answer by trevormunoz

In a recent presentation on the SafeArchive, one of the presenters made the somewhat obvious but potentially overlooked point that it also makes sense to vary the frequency of fixity checks based on how often a set of data changes; in other words, how often a system is interacting with some stored data. A data archive that takes in web crawls may have different needs from one that takes in more infrequent data dumps from institutional sources.

This is particularly important in cases where fixity checking is part of data replication systems, e.g., a LOCKSS network. Of course, they hastened to add that this must be balanced against making sure that the intervals between fixity checks do not get too long. Since validating checksums takes computation time and something like a LOCKSS network takes a fairly "laid-back" approach to requesting server responses, an audit system can be used to ensure that fixity checks complete at the desired intervals.
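A sketch of what such an audit might look like, combining the two points above: give each collection a target check interval that reflects how often it changes, and flag any object whose last successful check is older than its target. The collection names, intervals, and record format are all invented.

```python
import time
from dataclasses import dataclass

DAY = 86_400

# Hypothetical policy: collections that change often get checked more often.
TARGET_INTERVALS = {
    "weekly-web-crawls": 7 * DAY,
    "quarterly-institutional-dumps": 90 * DAY,
}


@dataclass
class FixityRecord:
    object_id: str
    collection: str
    last_successful_check: float  # UNIX timestamp


def overdue(records: list[FixityRecord], now: float) -> list[FixityRecord]:
    """Return records whose last successful check is older than the collection's target."""
    return [r for r in records
            if now - r.last_successful_check > TARGET_INTERVALS[r.collection]]


if __name__ == "__main__":
    now = time.time()
    records = [
        FixityRecord("crawl-2013-01", "weekly-web-crawls", now - 10 * DAY),
        FixityRecord("etd-dump-04", "quarterly-institutional-dumps", now - 30 * DAY),
    ]
    for r in overdue(records, now):
        print(f"OVERDUE: {r.object_id} (collection: {r.collection})")
```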

The SafeArchive system proposes to help with scheduling and verifying digital preservation actions like fixity checks, but I'm not involved with the project, nor do I have any direct experience using it for this specifically (yet).

Comments