Zombse

The Zombie Stack Exchanges That Just Won't Die


Why hasn't the library community settled on using hashes as unique universal ids for digital content?

It seems to me that the ability to compute something like an MD5 hash has amazing potential as a unique identifier system for digital content. Anyone can generate and publish them, and they are more unique than fingerprints. Is there any major reason why hashes aren't published as unique identifier metadata for digital objects?
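
For concreteness, here is a minimal sketch (in Python) of the kind of identifier computation I have in mind; the file name is just a placeholder:

    import hashlib

    def content_id(path, algorithm="md5"):
        """Hex digest of a file's bytes, as a candidate identifier."""
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    # "report.pdf" is a hypothetical file; any digital object works the same way.
    print(content_id("report.pdf"))            # 32-character MD5 hex digest
    print(content_id("report.pdf", "sha256"))  # a stronger alternative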

Trevor Owens

Comments

Answer by ksclarke

Beyond the fact that getting everyone to agree on a single scheme is a very difficult (probably impossible) task, one answer might be that people sometimes want to be able to group digital objects using an identifier scheme.

I'm assuming you're suggesting hashing the digital object itself, rather than metadata about the digital object (which might change as authorized headings change over time)...

Imagine a dataset that has been submitted to an institutional repository. You could hash an uploaded spreadsheet, but what do you do when the author realizes there was a mistake in it and wants to upload a corrected version? You could hash the modified version, but then you have two objects that have a relationship, with no indication of that relationship in their identifiers (which some would argue is a good thing, granted).
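
A quick sketch of what that looks like in practice (the spreadsheet contents here are invented for illustration):

    import hashlib

    # Two versions of the "same" dataset; the values are made-up placeholders.
    version_1 = b"year,count\n2011,42\n2012,57\n"
    version_2 = b"year,count\n2011,42\n2012,75\n"  # author fixed a transposed digit

    print(hashlib.md5(version_1).hexdigest())
    print(hashlib.md5(version_2).hexdigest())
    # The two digests share no visible structure, so nothing in the identifiers
    # says "version 2 supersedes version 1"; that relationship has to live
    # somewhere else (metadata, a versioned DOI, a structured URI, ...).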

Some folks want to be able to embed this sort of relationship in the identifier. One example is how Dryad (a data repository for scientific datasets) uses DOIs:

http://wiki.datadryad.org/DOI_Usage

Another example would be people who choose to use Cool URIs as their identifiers. They may create URIs that have a structure indicating relationships between different items.

I think you can definitely argue that opaque identifiers are better, but the fact that some people want to embed information in their identifiers is one example of why hashes wouldn't work for everyone.

Comments

Answer by Joe

Hashes only really work to identify a particular packaging of the content, not the content itself.

So, take for instance an MS Word document and a Pages file with the exact same content, same metadata, etc. -- they're going to be written to disk differently, and thus generate different hashes. (MS Word is particularly tricky, as it stores the content out of order so it can handle undo.) Yet it's perfectly normal to assign a single DOI that takes you to a page asking what format you want to download the item in (typically PDF plus something else).

PDF is another problematic one, as it's more of a layout language ... this word gets placed in a particular spot on a page (which is why, when you try to select a block of text laid out in columns, the selection runs across multiple columns) ... you could have multiple files that generate an identical printed page. The whole thing might be an image, or an image plus tagged text, or text laid out any number of ways.

The stuff that I work with is multi-dimensional scientific data. We have lots of problems regarding serializations that may or may not be the 'same' content (e.g., if you convert integer data to floats, so there's no loss of precision or addition of error, is it still the 'same'?). But even without that, we have column-major form vs. row-major form. E.g.:

1 2 3
4 5 6
7 8 9

You can store this as either [1 2 3] [4 5 6] [7 8 9] or [1 4 7] [2 5 8] [3 6 9], and so long as you have the correct metadata to explain the serialization, you can get back to the 'same' content.
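
A small sketch of that problem, using NumPy to serialize the array above both ways (assuming the values shown):

    import hashlib
    import numpy as np

    a = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]], dtype=np.int32)

    row_major = a.tobytes(order="C")  # [1 2 3] [4 5 6] [7 8 9]
    col_major = a.tobytes(order="F")  # [1 4 7] [2 5 8] [3 6 9]

    print(hashlib.md5(row_major).hexdigest())
    print(hashlib.md5(col_major).hexdigest())
    # Same logical array, two byte-level serializations, two unrelated digests.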

Another problem I've run into is people suggesting that we include checksums of some sort in data citations (e.g., Micah Altman's 2007 paper in D-Lib), which doesn't work for massive datasets that are hundreds of TB.

... and for those wondering why I kept putting 'same' in quotes: the terms 'same' and 'unique' get used a lot, but you need to define what properties / characteristics you care about when you define 'same'ness. We typically care about 'equivalent' rather than 'same' or 'unique', and use of that term acknowledges that two things may not be 100% identical.

Comments

Answer by Joe Atzberger

See also the reasons not to use MD5 in the first place (it is no longer considered collision-resistant).

But assuming you had an adequate algorithm, the more fundamental problem others spoke to is defining and maintaining "sameness" for digital content. Consider how rapidly the file formats themselves change (and break!), as typified by early PDFs and the ongoing evolution of various audio/video codecs.

In general a hash is still very useful metadata for content, if only for verifying integrity after retrieval or transfer. I'd enjoy seeing it more widely employed for identification as well. Maybe when you get your whole library's e-content store up on p2p....

Comments

Answer by Matt Stephenson

MD5 is a hash function designed for use in cryptography. Wikipedia has a good overview of cryptographic hash functions and the properties desirable in them. Most relevant to your question:

It should be impossible for an adversary to find two messages with substantially similar digests; or to infer any useful information about the data, given only its digest. Therefore, a cryptographic hash function should behave as much as possible like a random function while still being deterministic and efficiently computable.

Put differently, MD5 will give two inputs with very similar but not identical content (say, the 1st and 2nd editions of the same work) very dissimilar hashes. The amount of information you can determine about the input from its hash is as low as possible, by design.
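
A quick way to see this (the 'editions' here are just toy byte strings):

    import hashlib

    # Two "editions" differing by a single character yield unrelated digests.
    first_edition  = b"Call me Ishmael. Some years ago, never mind how long."
    second_edition = b"Call me Ishmael. Some years ago, never mind how long!"

    print(hashlib.md5(first_edition).hexdigest())
    print(hashlib.md5(second_edition).hexdigest())
    # Roughly half the output bits flip for a one-byte change (the avalanche
    # effect), so nothing in the digests hints that the inputs are nearly
    # identical.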

This doesn't bode well for anyone trying to learn something about a set of inputs from their hashes, and it shouldn't.

Comments

Answer by Simon Spero

MD5 is officially considered broken, as is SHA-1. SHA-2 is ok for now, and the winner of the SHA-3 contest is due to be chosen soon.

Peer-to-peer systems that use distributed hash tables for locating items use cryptographic hashes of content to generate the identifiers that serve as lookup keys. For archival purposes, this is the approach I wanted to use for SCHMEER (Several Copies Help Make Everything Eventually Reachable). It's robust and scalable.
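
The core idea can be sketched in a few lines; this is a toy in-memory content-addressed store standing in for the distributed network, not SCHMEER itself:

    import hashlib

    # Toy content-addressed store: the key *is* the SHA-256 digest of the bytes,
    # the same idea DHT-based peer-to-peer systems apply at network scale.
    store = {}

    def put(content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        store[key] = content
        return key

    def get(key: str) -> bytes:
        content = store[key]
        # Retrieval is self-verifying: recompute the digest and compare.
        assert hashlib.sha256(content).hexdigest() == key
        return content

    key = put(b"some archived object")
    print(key, get(key))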

Comments