What tools are recommended for characterising, assessing or appraising digital content acquired for preservation by an archive?

I'm particularly interested in positive (or negative!) experiences of using particular tools to extract basic technical or content focused information which can then be used to inform decisions around selection, collection management or preservation planning.

Paul Wheatley

preservation
digital-collections
digital-preservation
digital-archiving

Comments

Andy Jackson: Great question. I'd be particularly interested if anyone knows of any reliable tools for spotting DRM, and in experiences from people running JHOVE2.
Paul Wheatley: I wonder if the DRM issue would warrant it's own question? Might go beyond ready to apply tools, to suggest existing code that might be adapted for digital preservation focused DRM identification. My guess is ready to apply solutions are thin on the ground. Ditto on JHOVE2!

Answer by Ed Summers

Format Identification for Digital Objects (FIDO) is a Python command-line tool and re-usable library for identifying file formats of digital objects. It runs wherever you have Python (Windows, OS X, Linux, etc), is memory efficient, speedy (about as fast as the venerable Unix file utility), and uses PRONOM's repository of file signatures to recognize file formats. FIDO is opensource, and the code is on Github. There are other tools out there like JHOVE, but I'm kind of partial to Python. Having a command line tool should make it pretty easy to integrate into whatever environment you happen to be working in though.

Here's the usage from the command line:

ed@taylor:~/Projects/fido/fido$ python fido.py ~/Documents/learning-go.pdf 
FIDO v1.1.0 (formats-v60.xml, container-signature-20110204.xml, format_extensions.xml)
OK,360,fmt/18,"Acrobat PDF 1.4 - Portable Document Format","PDF 1.4",1594545,"/home/ed/Documents/learning-go.pdf","application/pdf","signature"
FIDO: Processed      1 files in 500.00 msec,  2 files/sec

The "fmt/18" in the output there is the PRONOM identifier, which you can use to look up information about the format. You can also use the PRONOM identifier in their prototype linked data service for automated lookups. So fmt/18 maps to this URL:

http://test.linkeddatapronom.nationalarchives.gov.uk/doc/file-format/18

You will see at the top right that there are other machine readable formats such as RDF, XML and JSON available at similar URLs like:

http://test.linkeddatapronom.nationalarchives.gov.uk/doc/file-format/18.json

Comments

anarchivist: I'm extremely partial to FIDO since it's so lightweight. At Yale, we integrated FIDO with fiwalk to identify files within disk images that we acquired off of media. Our code can be found on Github.

Answer by AaronC

The human brain is probably a must have tool in this software stack. Tools are good at extracting technical and preservation metadata, creating and monitoring things like file manifests with checksum values, and so on but they are not that great at "appraising" the actual content. In other words, tools are good for inspecting the packaging or carrier for digital content--inspecting the content could potentially require subject experts, users, catalogers, specialists (imaging experts, audio engineers, etc).

The FITS (File Information Tool Set) is of note here, mainly because it bundles a whole bunch of other useful tools--each of which could be used on their own:

JHOVE (note, note JHOVE2)
Exiftool
National Library of New Zealand Metadata Extractor (NLNZ)
File Utility
DROID
FFIdent
FileInfo
XmlMetadata

All of the individual tools have their uses too, so check them out. My experience is with JHOVE & JHOVE2, though I've also used FITS.

Note that JHOVE is also included as a microservice in the Archivematica beta suite.

Comments

Joe: Not to be confused with FITS (flexible image transport system) which is a scientific file format used by the astronomical (and occasionally the preservation) community for storing images and other multi-dimensional arrays.
Ed Summers: Tools like FITS make me wonder if more metadata that you don't actually use, is better than less metadata that you do use. I'm inclined to think that the answer is no, and do some hand waving in the direction of the software development principles of Less is More or Worse is Better in my defense :-)
Joe: @Ed: It depends on the case. I know of projects where so much metadata was in the files that it broke some programs that tried to read them. And where the metadata was presented in multiple ways but with a minor inconsistancy and you had no way to tell which were derived vs. original. But I have way more examples where the metadata is just not there. (I know what size the image is ... but when it was taken? from which instrument? observing mode? No clue other than knowing how the filenames were encoded ... and the filename wasn't stored in the metadata, just the OS.)

Answer by Nick Krabbenhoeft

We have two strategies for appraising content: what are the technical characteristics and what are the content characteristics. Many of the other answers address the technical characterization. I'm more interested in the content characterization.

1) In addition to the various utilities to extract technical characteristics, visualization tools are invaluable assistant in appraising ingests of large digital collections. General tools like disk space analyzers WinDirStat and SpaceSniffer offer a great foothold into exploring and appraising large, unknown collections by creating a visual comparison of the size of folders. Both of the tools I mentioned are stand-alone, portable and very easy to use.

2) Technical metadata tools can help explore the contents of a collection textually. Duke Data Accessioner comes with plug-ins for JHOVE and DROID. It's not as comprehensive as FITS, but it has a GUI and is more user friendly. Ben Goldman from the American Heritage Center has built a workflow using DDA and explains some of his justification here. http://e-records.chrisprom.com/guest-post-ben-goldman/

DDA transfers information off of storage media to a workspace and generates a METS file with the identified file formats and folder structure. When I was testing it, it was much easier to read the METS in an XML editor like Oxygen (costs \~\$500) so that I could zoom into and out of the folders as was necessary.

3) On the horizon are tools that take extracted technical metadata and turn it into human-readable information. C3PO is a project out of Tuwein that visualizes FITS output. Look at the demo video starting at 3:20 for the results. These are the kind of tools that will be essential to appraising large collections like the digital photography collection discussed on this site. Unfortunately, I haven't had the time to experiment with C3PO myself.

Edit 25-01-2013: Rearranged to incorporate more information about visualization tools.

Comments

Answer by Andy Jackson

We've been using Apache Tika fairly extensively. It is pure Java and runs fairly easily under Apache Hadoop. It identifies significantly more file formats than DROID, although it does not currently identify specific versions of formats, so works best in combination with DROID if you care about that kind of thing. It also extracts properties and metadata, and it is pretty easy to extend or modify in order to extract more detailed information. It also converts many formats to plain text, which is useful for indexing. This means I can perform feature extraction and indexing in one pass, which is handy.

Tika has a GUI too, and can be set up as a command line tool pretty easily as well (see my Homebrew config as an example). Of course, being Java, there's a bit of a startup cost, so Fido may be a better bet there.

This is an example of applying Tika to identify file formats and extract some basic characteristics from data in ISO CD images at the Bodleian Library, Oxford.

Comments

Answer by amy

I'll put in a plug for our CINCH (Capture, Ingest, and CHecksum) tool. You can read more about it here: http://cinch.nclive.org. And, you can test our hosted version or download it here: http://slnc-dimp.github.com/Cinch/. We developed it for small and mid-sized institutions, so it is lightweight and easy to use.

Comments

Answer by johan

As far as identification is concerned, I also wouldn't overlook good old Unix File. It covers a very large number of formats, it's fast, and because of its command-line interface it can be integrated in workflows fairly easily. If you want to use it on a Windows-based system I'd suggest to install Cygwin, which includes the most recent version (5.11) of File.

As for feature/properties extraction, one of my personal favourites is ExifTool, which supports a large number of formats (with focus on raster graphics). It is also fast, and has various options for fine-tuning the output (hint: use the -X switch!).

Incidentally both of the above tools are also included in the FITS tool (described above), but I'd recommend checking out these tools on their own as well.

Comments

Answer by johan

In addition to my answer above: within the EU-funded SCAPE project we recently did an intercomparison of different identification tools. See the following 2 links for more information:

http://www.openplanetsfoundation.org/blogs/2011-09-21-evaluation-identification-tools-first-results-scape

Note that this report is partially outdated! In particular most (or even all!) of the issues on FIDO have been fixed since this report was released.

Here's a follow-up on this work that some colleagues did more recently:

http://www.openplanetsfoundation.org/blogs/2012-02-23-identification-tools-evaluation

Finally, the National Library of Australia just released a very interesting report on characterisation tools that encompasses both identification and feature extraction. I've only given this a fairly cursory glance so far, but the results and recommendations make lots of sense to me. They also included some commercially available tools, and think it complements the SCAPE work really well:

http://www.openplanetsfoundation.org/blogs/2012-08-12-file-characterisation-tools-report-testing-project-conducted-national-library

Comments

Answer by Ross Spencer

As well as general purpose utilities there are more specific utilities to complement them within our workflows. Two tools which I speak quite highly of are MediaInfo and Jyplyzer.

MediaInfo has two versions featuring a GUI and a CLI. The tool can characterize a large number of video formats and components that you might find in streams - such as subtitles. It will happily read a single file or produce a compound output with a full analysis of all the video you've pointed the tool at. It supports many different output forms - HTML, XML, plain text etc. It would be my first point of call for any video I am receiving: http://mediainfo.sourceforge.net/en

Jpylyzer was written by johan van der Knijff and is hosted by the Open Planets Foundation and it does a similar thing to MediaInfo in reading the various components of a JPEG2000 image, extracting them and exporting them to an XML file for further analysis. Jpylyzer will tell you important information like the compression type used on an image, existence of a color space, etc. Importantly, Jpylyzer also checks the validity of a JPEG2000 file. All of this information can be analysed and used for collection management and making preservation decisions: http://www.openplanetsfoundation.org/software/jpylyzer