Guarding data against corruption or loss during file organisation/curation

What measures, technologies, or techniques are applicable to guarding information against damage/loss during organisation?

Often, before I archive data, I also need to de-duplicate and organise it. I'm aware that this is a risky time for the data: human, software, or mechanical error might leave some of it corrupted or lost without my even knowing about it.

So there are two questions:

  1. How can I keep an eye on the data so that I can audit where it was versus where it is now, and see if any data has been deleted or changed during organisation?

    • An obvious strategy is to have some sort of checksumming going on (a minimal sketch follows this list)
  2. How can I reduce human error while organising the data?

    • Example: a filesystem that simply refuses to let you delete a file if it is unique, or something like ZFS.
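For concreteness, here is a minimal sketch of the checksumming idea from point 1, in Python. The directory and file names are illustrative, and a real workflow would need error handling around unreadable files:

    import hashlib
    import os

    def build_manifest(root):
        """Walk a directory tree and record a SHA-256 digest for every file."""
        manifest = {}
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                digest = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        digest.update(chunk)
                manifest[os.path.relpath(path, root)] = digest.hexdigest()
        return manifest

    # Snapshot the collection before organising; a second snapshot afterwards
    # lets you audit exactly which files were deleted or changed.
    before = build_manifest("incoming")  # hypothetical source folder
    with open("manifest-before.txt", "w") as out:
        for rel_path, digest in sorted(before.items()):
            out.write(f"{digest}  {rel_path}\n")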

occulus


Answer by Nicholas Webb

As you mention in your post, regularly calculating checksums should be the first step in ensuring continued integrity.

There are a number of tools that can help you manage files, check for duplicates and calculate checksums, including CINCH, the Duke Data Accessioner and Karen's Directory Printer. If you're concerned about getting rid of duplicate files without accidentally deleting non-duplicates, one strategy might be to output a file list with checksums, use it to detect and delete duplicates, and compare the resulting list with the original to confirm that no unique checksums were lost. You could save both outputs as evidence of the process. I don't know if any tools do this automatically but it shouldn't be difficult to script.
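The comparison step could look something like the following sketch. It assumes two listings in a "digest  path" format (the file names here are made up) and only checks that every unique checksum from the original listing survived de-duplication:

    def read_digests(manifest_path):
        """Read a 'digest  relative/path' listing into a set of digests."""
        with open(manifest_path) as f:
            return {line.split(maxsplit=1)[0] for line in f if line.strip()}

    before = read_digests("manifest-before.txt")  # listing taken before de-duplication
    after = read_digests("manifest-after.txt")    # listing taken after de-duplication

    lost = before - after
    if lost:
        print(f"WARNING: {len(lost)} unique checksum(s) vanished during de-duplication:")
        for digest in sorted(lost):
            print("  ", digest)
    else:
        print("Every unique checksum from the original listing is still present.")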


Answer by wizzard0

Overall, I have had a number of devastating data losses during organization, so here are some rules chosen empirically:

  1. Never "move". Always copy. If you need to re-organize something, create a fresh repository (folder, database, whatever) and gradually copy the data over to it (a copy-and-verify sketch follows this list). The hardware costs nothing compared to the time lost on restoration.
  2. Create a separate list of "processed" items and compare it to the "original" list to see whether you are done, preferably with a tool.
  3. If new data arrives during the re-org, add it to BOTH the new and the old repositories.
  4. Keep the original copy for at least a year after the re-org.
  5. Have a person other than the one who did the re-org take a fresh look over the data (this is surprisingly effective).
  6. De-duplication should also be done with a tool, not manually. Unless the files are bit-for-bit identical, it is usually better to keep both copies (e.g. a photo in both TIFF and JPEG format), selecting one as the primary. For example, when deduplicating books you might mistake different editions for the same work and lose the differences; keeping all translations is even more important if you have many.
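Here is that copy-and-verify sketch for rule 1. The paths are made up, and a real tool would also log each copy:

    import hashlib
    import shutil
    from pathlib import Path

    def sha256(path, chunk_size=1 << 20):
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def copy_and_verify(source, destination):
        """Copy a file into the new repository and confirm the copy is bit-identical."""
        destination = Path(destination)
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, destination)  # copy only; the original is never touched
        if sha256(source) != sha256(destination):
            raise IOError(f"Checksum mismatch copying {source} -> {destination}")
        return destination

    # Hypothetical usage:
    # copy_and_verify("old-repo/photos/img001.tif", "new-repo/2013/img001.tif")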


Answer by Trevor Owens

There are some emerging practices in archives that might be helpful:

  1. Using write blockers so you don't damage data by accident
  2. It's probably a good idea to generate a manifest (even just a full file listing) that you hang onto. Something like Bagger would let you do that and generate fixity information (i.e. checksums).

The idea would be to capture a copy of what you want to preserve in its most pristine condition, document that, and then make decisions about what you actually want to keep.
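If you would rather script the manifest/fixity step than use the Bagger GUI, one option (my suggestion, not something the answer names) is the bagit-python library, which writes BagIt-style checksum manifests and can re-verify them later. The directory name below is made up:

    import bagit  # pip install bagit  (Library of Congress bagit-python)

    # Turn a directory of received files into a BagIt "bag": the payload is moved
    # into a data/ subdirectory and checksum manifests are written alongside it.
    bag = bagit.make_bag(
        "incoming-transfer",                          # hypothetical directory
        {"Source-Organization": "Example Partner"},   # optional bag-info metadata
        checksums=["sha256"],
    )

    # Later, or on another machine, re-check the fixity information:
    bag = bagit.Bag("incoming-transfer")
    if bag.is_valid():
        print("All files still match their recorded checksums.")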

I think the most concise info I've seen on this is the technical section of the free OCLC report You've Got to Walk Before You Can Run: First Steps for Managing Born-Digital Content Received on Physical Media (PDF)

More broadly, the Open Archival Information System (OAIS) reference model suggests thinking about submission information packages (what you start with or are given), archival information packages (what you keep for the long haul), and dissemination information packages (what you make available in a given context). What's useful about this framework, in my mind, is that it frames the organization process as getting the submission together and then creating the archival copy. You want to do the things mentioned at the top to make sure you get the submission right before you start turning it into the archival package you plan on keeping; documenting what you got and what you kept is an important part of that.


Answer by Paul Wheatley

As you suggest, checksums are definitely going to play a useful role in meeting this challenge. This Stack Q&A has some initial suggestions on approaches and tools, although your specific needs here may call for a slightly different approach.

If the data changes as part of the organisation process, checksums suddenly become a lot less useful. This is a very simple but effective tool for matching source and destination filenames, and may be of some help here.
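Whatever tool the answer has in mind, the underlying idea can be sketched quite simply: when the content changes (say, through format conversion), you match on names rather than checksums. The directory names below are illustrative, and the sketch assumes one file per name stem:

    import os

    def name_stems(root):
        """Map each file's extension-less relative path to its full relative path."""
        stems = {}
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                stem, _ext = os.path.splitext(rel)
                stems[stem] = rel
        return stems

    source = name_stems("originals")      # hypothetical directories: files before
    migrated = name_stems("converted")    # and after a format migration

    for stem in sorted(set(source) - set(migrated)):
        print("No destination counterpart for", source[stem])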

There are many possible approaches for dealing with the different flavours of de-duplication. These are some descriptions of duplication challenges and some experiences of solving them, which might be of use.

As things stand, I think there is still a need for a more comprehensive organisation or curatorial tool to tackle this challenge effectively. It needs to enable all the potentially catastrophic changes (such as delete and rename) while allowing those changes to be undone, and it should capture a change log or detailed provenance metadata. From my experience of working with digital preservation practitioners in libraries and archives, this is a pretty common use case.


Answer by Aaron Rubinstein

I would emphasize the importance of workflow planning and documentation.

A Submission Information Package (SIP), as described by the OAIS reference model, can develop out of multiple steps: the initial transfer of data, possibly moving that transfer from a dropbox to a workspace, format conversion, and metadata creation/extraction. Understanding each step of the process will help clarify at what points the data is being manipulated and potentially damaged, and thus where it makes sense to check for damage before the process has moved too far ahead.


Answer by Chris Adams

This is not a formal answer, but as [primarily] a software developer I feel the need to note that I commonly use Git for this, because it's a version control system that can be used completely locally and uses strong checksums to track file contents.

In practice, this means that I can work with incoming files like this (e.g. I've had to make technical corrections to partner-provided metadata, much of it in languages I cannot read):

  1. git init the directory
  2. git add . && git commit to record the initial received version
  3. After each round of changes, repeat step 2 (a small wrapper for this is sketched below)
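The per-round commit in step 3 is easy to automate. The following is just a convenience sketch that shells out to the same git commands shown above; the commit message is whatever you choose:

    import subprocess

    def commit_round(message, repo="."):
        """Stage everything in the working tree and commit it as one round of changes."""
        subprocess.run(["git", "add", "."], cwd=repo, check=True)
        subprocess.run(["git", "commit", "-m", message], cwd=repo, check=True)

    # Hypothetical usage after a batch of metadata corrections:
    # commit_round("Normalise partner-supplied dates, round 2")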

The whole workflow is very lightweight: it provides full history and integrity checks, and you can easily synchronize copies, with full history, to other locations. It works on Linux, Mac, Windows, etc. (GUI tools are available for all of these) and has very little friction once you've installed the software and learned a few basic operations.

The downside is that the approach is slow for very large binary files or for operations that alter many tens of thousands of files. There are tools like bup which use the same format but are optimized for binary content, and I would recommend investigating them if you are faced with that challenge.


Answer by Greg Jansen

Addressing this problem was a core design driver for the Curator's Workbench software at UNC. We needed to allow re-arrangement while leaving all original sources intact, i.e. read-only. This is done by capturing the structure in a METS manifest; all subsequent manipulation edits only the manifest, not the data.

This also has the benefit of allowing staging and checksumming to proceed in the background while you perform appraisal and arrangement.
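This isn't the Curator's Workbench METS format itself, but a simplified illustration of the "edit the manifest, not the data" idea, with made-up paths and a truncated checksum:

    import json

    # The arrangement is pure metadata: logical paths in the arranged collection
    # point at untouched originals (plus their recorded checksums). Re-arranging
    # means rewriting this mapping; the read-only source files are never moved.
    arrangement = {
        "Correspondence/1969/letter-001.pdf": {
            "source": "/readonly/accession-42/DSC_0001.pdf",  # hypothetical path
            "sha256": "9f2c...",                               # recorded at accession
        },
    }

    def refile(arrangement, old_path, new_path):
        """Re-file an item by editing the manifest entry only."""
        arrangement[new_path] = arrangement.pop(old_path)

    refile(arrangement,
           "Correspondence/1969/letter-001.pdf",
           "Correspondence/1968/letter-001.pdf")

    with open("arrangement.json", "w") as f:
        json.dump(arrangement, f, indent=2)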

For more information: http://www.lib.unc.edu/blogs/cdr/index.php/about-the-curators-workbench/
