Digital preservation: questions of format and volume for a digital photography collection

Our archives has been offered over 15TB of digital photography, which we are motivated to take. Here are some of the pertinent details:

The files consistently include two versions of every image:
- A RAW file in Nikon's proprietary NEF format
- A high-quality JPG, which include some minor editing by the photographers (contrast, color, etc., adjustments). A typical JPG that I was sent as a sample is 8MB.
The images are of interest for their content, not their artistic merit. These are the event photographs, official portraits, campus shots, etc. that are familiar to any institutional archives. Important elements of our institutional history, but generally not of artistic or scientific interest.
We anticipate that this is just the first of what will be an ongoing series of accessions; we will probably be adding over a TB per year with the current formats.
The metadata for these images is very, very scant. They are organized by shoot, and we may have some basic date and event information, but image-level metadata is limited to what was automatically created by the camera. So there will be, for example, a lot of photographs of people at events, but no identification of who the people are.
Our library is in the process of building a (probably) Hydra-based Fedora repository that will be used to store these and other born-digital materials. We're thinking this would be a good project to build some of the ingest and preservation functionality around.

So here's a set of questions I'm pondering, and to which I welcome discussion:

Do we want the NEF files? Heresy, I know, but a question I need to ask since based on some preliminary reading I've done on the subject, it looks safe to estimate that the NEF files make up about 2/3 of the storage space. What would be the potential loss to the archives and to the historical record if we only preserved the JPGs?
If we take the NEF files, do we convert them immediately to DNG, TIFF or another format?
What about some kind of heuristically-guided, content-based weeding? In other words, we keep both versions of every image we're keeping, but we are selective about which images we keep. At 15 TB, we would have to develop some really big sorting buckets, but we could make decisions based on event type, keeping only a limited number of shots from portrait sittings, etc. The department is very eager for us to take these and is willing to follow our guidance on how to proceed, and also has student labor to throw at a project if needed.
15 TB would be the largest set of materials we've accessioned, by a very long shot. For those of you who work in this scale routinely, what computational or other challenges will this volume pose that we probably haven't encountered before?

I'd appreciate SOPs for libraries/collection managers in this type of situation. Anecdata backed up by canonical sources is acceptable.

save4use

digital-preservation
preservation
digital-archiving
archives
archiving

Comments

Ashley Nunn: Welcome to Libraries, save4use! A few notes about your question - you seem to be asking a number of questions all at once. Generally, Stack Exchange sites work best when you focus on one issue at a time in your questions - and there is no harm in asking multiple questions around the same topic! Also, we tend to frown on discussion, as we are not a forum (see our [FAQ] for more details), but it seems like you can still get some good answers with your question, although you should consider breaking it into smaller, more concise single question posts.
jonsca: I agree with Ashley, so I modified the language of your request a bit to match the goals of the site and added the banner notice for anyone who may want to answer it in the future. I think the answers are so far so good.
save4use: Thank you both for the info/assistance.

Answer by Chris R

What you do about the NEFF files should depend a lot on how important the information in these images is to you. If the images represent a more-or-less casual record, then the JPEGs may be sufficient (they contain sufficient information for most display and printing purposes). But the NEFF files will definitely contain the most information; they will allow more flexible further post-processing (for example, images may be 12 or 14 bits per channel rather than 8 bits for JPEG; there will be information on white balance settings etc, that can be adjusted). If you want to ensure you retain that information for future use, retain the NEFF files. Converting to TIFF will lose (camera-oriented) information; converting to DNG can retain all information provided you embed the original raw, ie the DNG is then bigger (although it may be more future proof). Not certain how good Nikon and others have been in keeping backwards compatibility with earlier versions of NEFF.

Comments

Joe: I thought NEF was based on TIFF. What information would be lost?
Joe: And it looks like NEF is based on TIFF/EP, which is not the same as the Adobe TIFF. (not sure if it's a version thing, or completely unrelated at this point)
Jay Gattuso: @Joe The NEF file types generally use the Tagged Image File Format as the container for the raw data. They are usually proprietary, not standardised, and can change quite rapidly between versions (even if you can establish what version you have). There are a number of standard TIF format versions, each supporting different/newer features. TIF has become 'shorthand' for a TIFF containered file, but its not always the case.

Answer by Euan

As Chris says the NEFF files will likely contain a lot more information than the JPEG files. As the new Lytro camera shows (though that is a special case), there is often a lot of value that can be gained from the data embedded in image files. Having said that, if you select an appropriate format you may see great space gains with very little information lost if you move to, e.g. JPEG-2000.

As far as content-based weeding goes, you may want to try something like this, however automated content based indexing tools are still fairly inaccurate so any weeding based on content will mostly have to be done manually. One option might be to put up the jpegs online and have the public, (or some students) tag all the images to aid in automated assessment of which you want to keep.

You might also be able to script/automate reverse image searches via google reverse image search or tineye, and add metadata that way to help aid in automated weeding/disposal. For example if you got any hits for an image of a person, that might help you to identify the person from the metadata associated with the related images, and then you may be in a better position to decide if you want to keep that image or not. It also looks like google's reverse image search can also do some object identification now, though I am not sure how accurate it is.

As far as working with 15TB goes, its mostly a matter of scaling up all processes you have used in the past or developing more efficient processes if you cant afford to scale those ones. You also often come across I/O bottlenecks when moving/copying large volumes of data that you will have to account for in any projections around time-scales etc. Having processes that do not completely fail if one file processed fails can also help as when e.g. running overnight batch ingest processes, you can lose a lot of time from a single file which can be costly when you are dealing with such volume.

Comments

Answer by Joe

I admit, I don't deal with these formats specifically (most of my data are telescope images in FITS), but I do have some experience with photographer as my dad's family has a number of portrait & commercial photographers. So here's my take on the questions asked:

Do we want the NEF files?

I'm going to look at a bit differently than most -- what you have are two different Works -- effectively, there's the taking of the picture, but there's also the correction of the picture by the photographer, which is not in the NEF files.

If you're trying to archive the artist's finished work, then the NEF files are useless. If you're trying to show his creative process, then they might have value.

The discussions of bits per channel is misleading -- if we were to talk about books, think of the NEF file as a high resolution scan of an early draft of a written work, and the JPEG has a lower resolution scan of the published work. Yes, one's of higher resolution, but does it actually capture something that's worth keeping? (and it might, it's going to vary for each collection ingested)

What would be the potential loss to the archives and to the historical record if we only preserved the JPGs?

You're losing provenance information. If might be possible to reconstruct a higher-fidelity facsimile of the images represented in the JPEGs, but that would require some significant reconstructive efforts that would likely not be appropriate for most archives to do except for very special cases.

If we take the NEF files, do we convert them immediately to DNG, TIFF or another format?

I've done a little bit of reading up on NEF, and as it's based on TIFF/EP, I'd suggest reading the Library of Congress's notes on TIFF/EP for preservation, and specifically the link to What's in a Raw File? which talks about NEF to DNG conversion.

I'm not an expert on any of these file formats, but I'd recommend when possible, selecting an open format that's in wide use right now for the best chance of being likely to read it again in the future.

What about some kind of heuristically-guided, content-based weeding? In other words, we keep both versions of every image we're keeping, but we are selective about which images we keep.

Are there corresponding DNG files for every JPEG? If so, that might be something to eliminate. (eg, the equivalent of a photographer not making prints from every negative). And as you mention, similarity to other photos may be the reason, if the photographer selected what they considered 'best' ... but depending on the purpose of the archive, someone doing future research may be interested in what the photographer considered to be sub-par.

(depending on the type of work they did, it's possible that they weren't making the selections, eg, if it were wedding photos or something else where a client might decide which ones they wanted, and then the additional processing was done)

And remember, you have options besides not accepting the image -- it may be that you reduce the individual files -- either lossy compression or rescaling so that you preserve every image, but possibly not at the full fidelity that you received them. You may want to identify how much storage you can devote to this, and then come up with various scenarios:

keep 50% of the images at original size, discard the rest
keep 40% of the images at original size, the rest reduced to 1/6 their size
keep 30% of the images at original size, the rest reduced to \~1/4 their size
keep 40% of the images in the main repository, the rest on a lower class of storage.

Obviously, this could get really complex ... the first analysis pass may not identify enough candidates for weeding, and so you may have to expend more resources for additional analysis, making it cheaper to adjust the strategy or allocate more storage..

For those of you who work in this scale routinely, what computational or other challenges will this volume pose that we probably haven't encountered before?

If you have to reprocess the data, there might be some computational issues, but it might be possible to use cloud computing for this. The main issue that I run into is access -- we don't want to serve the full resolution file if there is some smaller variant that will suffice.

Eg, if you have 20 megapixel images, most people won't need that resolution if they're just browsing through the images. You might be able to serve images at 1/2 or 1/4 scale (1/4 or 1/16 the pixel count) for most people. Those that want to zoom in to see detail could then request the higher fidelity image if the browse image won't suffice for their needs.

Ideally, you could use something like JPEG2000 and a JPIP server so that you don't have to maintain multiple processed forms of the images, but support isn't currently baked into most web browsers, so you have to have some sort of intermediary service do the conversion between JPEG2000 and JPEG. I don't know what the current status is for this sort of thing in conventional archives, as we have some abnormal use cases and interface with a bunch of ad-hoc services.

I guess the only real consideration that I can think of is how you're going to be getting the files. Depending on the schedule that you'll be ingesting, over the network may not be efficient. For large, bulk transfers (eg, the 15TB initial load), we physically transport the disks. (USB or firewire drives + FedEx or UPS if it's long distance)

Personally, I find that 15TB isn't really the important number, the important issues are the maximum image size (can we open them to look at them, or are they too large for most of our current desktops?) and the number of discrete files (it takes just as long to catalog a 50kB image as a 50MB one, once you get past the process of opening the file)

Comments

Answer by Jay Gattuso

Do we want the NEF files?

It really depends on the collection, and what the collection represents. A nice way to consider the position is to flip this straight to an analogue paradigm. Its not identical, but it works close enough. Assuming there is a 1:1 relationship between the NEF and JPG files, the NEF are the print negatives and the JPG are the print positive. Now how does that fit with your existing collection strategy?

it looks safe to estimate that the NEF files make up about 2/3 of the storage space.

Find out. This is one of your most urgent tasks - to know what to do, you must first understand what you have. Find out the relationship between NEF:TIF instances of the same intellectual object. Find out the number of each. Find out the average sizes of each type (plus standard deviation so you can see how close the set is to the average) - use a file listing application for this. Find out the resolutions for the two sets - (NEF should be relatively easy - close cohesion to the average file size? very likely all the same resolution, jpg? use exiftool and pull out just the h x w resolution, h/w resolution dpi, and possibly bit depth from the exif - throw that into a db / spread sheet, and look through the range of sizes you. You might be able to throw away Lo Res jpgs from the off).

What would be the potential loss to the archives and to the historical record if we only preserved the JPGs

A very broad question, with lots of missing parts of the story for us to judge on. Assuming the JPGs are all acceptably high quality versions of the NEF files, arguably not a great deal.... unless you are interested in documenting the photographic process - example - what is the relationship between the NEF & JPEG? cropping? tonal shifts? touch up? de-skew/rotation?

From a technical angle, the NEF will offer significantly higher informational density for the image, but you should consider what that informational density represents. The Jpeg compression works by 'throwing away' higher frequency information. What does the jpeg compression thats been applied do to the resulting image? a good implementation of a jpeg compressor, set to a light compression ratio can be visually imperceptible in an image. A poor one is very noticeable. At this point, its a trade off. If you wanted, you could profile the difference between the two "identical" images in a couple of ways - computationally you can find the RMSe noise ratio between two images, do some experiments so you understand what the RMSe value is telling you, or your could open them both, and 'subtract' one from the other, looking at the 'difference' channel. This will show you where there is a difference between the tow images. Do this a few times, and start to understand the representational difference between the two types.

From a storage angle... What access do you have to storage? is your solution going to gracefully scale to fulfill this requirement? Whats the slightly longer term perspective (being cognisant of Moore's and Kryder's Laws...). Are these files being kept somewhere safe while you explore/research and decision make? What risks are associated with their current storage carrier?

From an access angle... who is going to be the primary user of this collection. Will they be happy being offered large NEF files that they need to interpret somehow to see the images? Will you / do you currently offer access copies - could these be the JPGs and save you a processing step elsewhere in your lifecycle?

Moving on.

If we take the NEF files, do we convert them immediately to DNG, TIFF or another format?

For me, I would be exploring this very quickly, probably 2nd priority. What version of NEF are they? what tools currently exist to interact with them. Is there a current viable format I could move them too? or am I just kicking the can a little way down the road... TIFF does not offer the same capability as the NEF file, so I would argue its not a lossless transformation, and thus it doesn't really belong in this discussion, its a 'middle' format between NEF and jpg. Higher visual fidelity is achievable, but not all the features of the NEF can be captured. Repeat the tests between the JPG and NEF to get a feel for any informational loss. (if a lossy compression is used).

Ask yourself, is DNG stable enough? supported enough with render tools and other methods of interacting with DNG file types? Do you have the ability and capability to convert the numbers of files you are facing?

What about some kind of heuristically-guided, content-based weeding?

What methods exist to identify content suitably? You could for example use a face detector (such as the one found in picassa v3 onwards) to locate images with faces that are large enough for it to trigger a match. other than that, I'm not sure there are many options open to you on the image content side. Can you do anything with creation/last modified dates? are there any file structures that you can use to filter? Could you get a quick tagger written that would allow you to quickly flag a low res version as a "keeper" (or perhaps use tagging functions in other tools like Adobe Lightroom or Picassa.

In other words, we keep both versions of every image we're keeping, but we are selective about which images we keep. At 15 TB, we would have to develop some really big sorting buckets, but we could make decisions based on event type, keeping only a limited number of shots from portrait sittings, etc.

It does sound like it would be useful to make some lower res'ed versions (jpgs would be easiest to wrangle) so you can throw around sets to decision makers to explore / tag. Your storage requirements change at this point, as you can keep the large res versions where they are, and only throw the low res versions only costly spinning (shared?) storage when you need to assess the content.

The department is very eager for us to take these and is willing to follow our guidance on how to proceed, and also has student labor to throw at a project if needed.

It sounds like you have the person power, and the direction from your business unit... it looks like its all about the technical challenges....

15 TB would be the largest set of materials we've accessioned, by a very long shot. For those of you who work in this scale routinely, what computational or other challenges will this volume pose that we probably haven't encountered before?

Firstly, throughput. What are the bottle necks in your workflow? human or otherwise. Test out some volume ingests and see where and what breaks. Are you grabbing fixity data on the way through? doing a virus check? what access do you have to CPU? what about your network for shared resources? (app servers, storage locations etc). What error checking do you have in place? What happens if one file breaks your workflow? how do you handle exceptions? What other systems are involved in the workflow (CMS/ILS? etc)

You are going to want as much automated as you possibly can. Have you got access to accurate file manifests / lists that you can feed a tool? is the collection easy to break down into more manageable logical lumps? How will you what you've completed, and what you've yet to complete?

I'm sure its all stuff you thought about, but worth reflecting on.

Comments

Jim DeLaHunt: Your third block quote repeats, "it looks safe to estimate that the NEF files make up about 2/3 of the storage space." Did you mean this quote to be instead, "What would be the potential loss to the archives and to the historical record if we only preserved the JPGs"?
Jay Gattuso: Huh. You are correct. Thank you! not sure how I missed that!
Chris Adams: The negative:positive comparison is great - the only other detail I'd add is that, if possible, it'd be fantastic to get the projects settings if a non-destructive tool like Adobe Lightroom or Apple Aperture was used to generate those JPEGs as that would allow things like generating consistent JPEG, TIFF, JP2, etc. if the decision is made not to keep the NEFs
Jay Gattuso: @ChrisAdams - Project settings is an interesting idea. To a point it should be possible to get back to the (encoder) setting from the technical exif data in the jpeg/jp2 etc, however it would be really useful to have that 'baked' into the file as a block of data rather than having to churn through the image metadata and make some inferences. The other aspect I didn't go into is the post processing 'middle step'... You may have the NEF source, and the jp2 output, but you may not have the working (layers, touchup etc) that went into the making of the final jp2. What do you keep then?
Chris Adams: Jay: That was actually most of my interest - the NEF files are great for having the maximum original intention but some photographers extensively tune their images and I can easily imagine those decisions being of high interest to future researchers.