
What tools can we use to appraise content before digital preservation?

Short-term storage seems cheap, but long-term storage for digital preservation is expensive. Part of the solution to this problem is using archival appraisal to identify what content should be preserved and what content can be discarded, but how do we appraise gigabytes or terabytes of data?

Visualization tools like WinDirStat and SpaceSniffer let you scan a folder structure quickly, helping you spot potentially redundant data (e.g. system files) and core content (e.g. My Documents). Other tools like C3PO let you survey technical metadata, giving rough counts of format types so you can see which formats a creator used most. Are there other tools that can be used to appraise data quickly?
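
To make the kind of survey I mean concrete, here is a minimal sketch (not one of the tools above, just a hypothetical illustration) that walks a directory tree and tallies file counts and total bytes per extension. Extension is only a crude proxy for format; C3PO and similar tools work from proper format-identification output instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

public class ExtensionSurvey {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args[0]);
        Map<String, Long> counts = new TreeMap<>();
        Map<String, Long> bytes = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                String name = p.getFileName().toString();
                int dot = name.lastIndexOf('.');
                String ext = dot > 0 ? name.substring(dot + 1).toLowerCase() : "(no extension)";
                counts.merge(ext, 1L, Long::sum);
                try {
                    bytes.merge(ext, Files.size(p), Long::sum);
                } catch (IOException e) {
                    // unreadable file: count it, but skip its size
                }
            });
        }
        counts.forEach((ext, n) -> System.out.printf(
                "%-12s %8d files %14d bytes%n", ext, n, bytes.getOrDefault(ext, 0L)));
    }
}
```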

Nick Krabbenhoeft


Answer by jweise

One possibility is Archivematica. "Archivematica uses a micro-services design pattern to provide an integrated suite of software tools that allows users to process digital objects from ingest to access in compliance with the ISO-OAIS functional model."

A second approach is described in "Automating Digital Processing at the Bentley Historical Library", a paper presented by Michael Shallcross and Nancy Deromedi at iPRES 2012. They assembled a Windows-based processing workflow called "AutoPro", comprising numerous off-the-shelf tools and custom batch scripts to facilitate appraisal. Their second slide lists the tools they use under "4. Digital Processing." I am not replicating the list here because the three documents they provide are concise and it would be a shame to lose the context.


Answer by Greg Jansen

The Curator's Workbench is useful for appraisal in some scenarios. It doesn't report much technical metadata, but it does help you capture the file structure and create a new arrangement of items. It stages files and calculates checksums while you work with the folder structure and names. For more information, see http://www.lib.unc.edu/blogs/cdr/index.php/about-the-curators-workbench/
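
The Workbench's own code isn't shown here, but as a rough sketch of the fixity step it performs during staging (assuming SHA-256 and Java 17's HexFormat; the Workbench may use a different algorithm), this walks a staging directory and records a checksum for each file:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.stream.Stream;

public class StagingChecksums {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args[0]);
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                try {
                    // checksum followed by the path relative to the staging root
                    System.out.println(sha256(p) + "  " + root.relativize(p));
                } catch (Exception e) {
                    System.err.println("FAILED  " + p + ": " + e.getMessage());
                }
            });
        }
    }

    // Stream each file through SHA-256 and return the hex digest.
    static String sha256(Path p) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(p)) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                md.update(buf, 0, n);
            }
        }
        return HexFormat.of().formatHex(md.digest());
    }
}
```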


Answer by Andy Jackson

For the web archives I work with, we use Apache Tika to extract properties of interest as well as the text (for search indexing), plus a few extensions of our own. Tika works well from Java and on streamed data, which suits our HDFS-hosted WARC files very well.
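
Our WARC/HDFS pipeline itself isn't reproduced here, but as a minimal sketch of the Tika calls involved, this parses a single stream with AutoDetectParser and pulls out both the metadata properties and the text (the file argument is a placeholder; any InputStream works, e.g. one record of a WARC file):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaExtract {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no text-length limit
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        // technical metadata detected by Tika
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
        // extracted text, suitable for a search index
        System.out.println("--- extracted text ---");
        System.out.println(handler.toString());
    }
}
```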


Answer by mopennock

A few prototypes along these lines have been hacked together at SPRUCE mashups and, like Andy's work with the web archives, they used Apache Tika. This one seems most pertinent: "Extracting and aggregating metadata with Apache Tika". The datasets in question were mainly text and PDF files.
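
The SPRUCE prototype code isn't reproduced here, but as a minimal sketch of the aggregation idea (using Tika's org.apache.tika.Tika facade, which the original doesn't name), this walks a collection, detects each file's MIME type, and tallies the results into the kind of format profile that supports appraisal:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

import org.apache.tika.Tika;

public class TikaTypeTally {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        Map<String, Long> tally = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(Paths.get(args[0]))) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                try {
                    // detect() inspects magic bytes and names, not just extensions
                    tally.merge(tika.detect(p.toFile()), 1L, Long::sum);
                } catch (Exception e) {
                    tally.merge("(detection failed)", 1L, Long::sum);
                }
            });
        }
        tally.forEach((type, n) -> System.out.printf("%8d  %s%n", n, type));
    }
}
```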
