The Zombie Stack Exchanges That Just Won't Die


Is digitisation on topic?

There have already been a couple of questions focused on digitisation rather than digital preservation. I would assume that these are actually off topic and should be closed? The boundary is of course a little fuzzy.

I would propose that this community should be involved in decisions related to the results of a digitisation initiative: for example, the file formats and metadata schemas used, where and how the results are stored, and so on. However, questions focused on how to engage in digitisation, such as what types of scanners to use or what resolution to digitise at, would be off topic.

This question is about OCR, although it is phrased in the title as a more general digitisation question. Is this in scope? It's a tough call.

As many of us in the digitisation community are well aware, confusing digital preservation with digitisation is a common mistake. I'd suggest adding some clear scoping detail on this to the "What kind of questions should I not ask here?" section of the FAQ. Examples may well be necessary to keep things clear.

Paul Wheatley


Answer by Donald.McLean

(updated to reflect a more careful consideration of the subject)

It occurred to me that the answer should be a conditional yes, rather than an unqualified yes.

If someone is just interested in digitizing audio, video, or photographs, they should probably be asking in Audio and Video Production or Photography.

If someone is asking about digitizing as part of a digital preservation project or with the intention of digitally preserving their material, then I think that it should definitely be on-topic.

It seems to me that if someone wants to digitally preserve something, and they have a legitimate question about the digitization process, especially if it relates to the preservation aspect, then we shouldn't be turning them away. They're a legitimate part of the constituency of this site and we should be taking their questions seriously.


Answer by Courtney C. Mumma

I agree that the mechanics of digitization are not in scope for this site, but I think that the preservation of the results is well within scope. That includes the formats that one creates in the course of digitization projects, since those formats may determine the course of preservation activities.


Answer by warren

If you're not digitizing, how can you be doing digital preservation?

Seems like it has to be on-topic.


Answer by wizzard0

I think digitization of analog objects with the goal of digitally preserving them, or of keeping them easily preservable in the future, is on topic.

Digitization by itself, however, may better be discussed elsewhere.


Answer by Nicholas Webb

As Donald and Courtney have pointed out, there's a continuum of relevance here. "What's the best scanner to use" and "what DPI should I scan at" seem obviously out of scope, "should I save as TIFF or JPG" less so, and "how should I organize and describe the files once I've scanned them" is clearly on-topic.

My experience on archives listservs and discussion groups is that basic scanning guidelines are a perennial topic of inquiry from small LAM institutions trying to bootstrap a digitization program. Perhaps the answer is to link to some good introductory digital imaging resources from the FAQ, with a newbie-friendly explanation of why these types of questions are out of scope.

A distinction that might be useful: is the question about creating an authentic representation of an analog object ("will this file format capture all the details of my book/manuscript/photograph?"), or is it about ensuring the sustainability of the resulting digital object ("will my institution be able to read this file format in ten years?"). This might be a workable guideline for determining relevance.


Answer by Ross Spencer


The phrasing is key, however, and it should be focused on the results of digitization: the formats, the associated metadata, the storage, etc., and beyond that, errors that appear in the data stream.

I would be interested in seeing questions about the digitization process. Sean Martin's work at the BL, for example, comes to mind regarding the difference lenses make between scans of the same document in the same position. DPI questions are another example. These are questions that affect the value of the digital data being created and impact whether or not this data becomes an asset that we do want to preserve.

It might not start as a digital preservation question but it will become one. Best to start with preservation in mind. Some records are not kept after digitization - that makes these new digital records quasi-born digital.

The compromise would be to discuss just the results of the digitization process.

Incidentally, in my mind it is clear that the OCR question referenced in the OP is NOT a digital preservation question. It belongs on the libraries Stack Exchange. My rationale is simple: OCR can be done over and over again. The OCR output is data associated with the digital object we're interested in, almost a second layer of metadata. The error is an artifact of the current OCR process that doesn't impact preservation.

A few naive OCR DP questions (not having handled OCR much in the past) might be:

  1. What standard formats are there for storing OCR data?
  2. What long term metrics should I maintain about my OCR that will be useful to future users of the data?
  3. How do I tie my OCR data to my digital images for long term access?

Those may not be perfect, so I'm happy to discuss in the comments below. I'm eager to find the boundary here too.

I think the site has to be carefully moderated. It is a Q&A site, so ensuring the same digitization question isn't repeated over and over will be important, and moderators, maybe via the discussion forum, should have a clear FAQ about what constitutes an acceptable digitization question before closing anything.

Hope that helps.


Answer by Bill Lefurgy

Let's think about it in narrow terms as creation of digital content from a preservation perspective. Questions about optimal design of a prospective website in connection with formats, metadata and architecture presumably would be in scope.

Someone undertaking digitization could ask the same kinds of questions, as long as they were focused on how best to create preservable output. The same could be true of any digital creation activity--photographs, videos, data sets and so on. Anything that doesn't relate to creating content for clear-cut preservation purposes (equipment, throughput, QC) would be out of scope.