Zombse

The Zombie Stack Exchanges That Just Won't Die

View the Project on GitHub anjackson/zombse

What preservation risks are associated with the PDF file format?

If the aim is to preserve access to intellectual content encoded in a PDF file, I'm interested in what risks might prevent (or increase the difficulty of) access to that content now or in the future.

I am also interested in tools and/or approaches for identifying or mitigating these risks in specific PDF files.

Paul Wheatley

Comments

Answer by dsalo

Some variants of PDF fall under documented ISO standards (see also); these variants incur less format-obsolescence/unreadability risk. Whenever possible, it is obviously wise to produce PDFs that conform to an ISO standard!

Because Adobe can't resist cramming more bells and whistles into the format, though, it's eminently possible and quite common to produce non-standards-compliant PDFs. I offer a list of potential preservation risks in such PDFs, with the understanding that any given PDF may incur all, some, or none of them:

Multimedia risk

Embedding multimedia (audio, video) in PDFs is usually a bad idea. Should the embedded multimedia format become obsolescent, you'll have to can-open the PDF to yank it out and migrate it, afterwards trying to reconstitute the PDF. Complex objects are generally easier to preserve when different media types are kept in separate files (as with most HTML-based web pages).

Font-loss risk

If you don't embed fonts in your PDF, glyphs in those fonts may display wrongly or not at all on computers that don't have that font loaded. Over time, entire fonts can be lost or become unusable on modern equipment; I've heard from electronic thesis and dissertation managers that ETDs without fonts embedded have indeed suffered loss of information content.

Text non-computability risk

Scanning a print book page to PDF does not create computable text; it creates a picture of text. This is a common example of non-computable (non-indexable, non-searchable, non-text-minable, non-copy-and-pastable) digital "text," though hardly the only one. Some word-processing and page-layout programs produce PDFs in which (for example) the page layout is incorrectly hinted such that trying to copy-and-paste across two columns of text produces sentence salad! Hyphenation and page breaks also distort text for purposes of indexing.

Overzealous security risk

Many software packages that produce PDFs offer various kinds of "security measures" such as passworded file access or restricted printing. As with any technique that intentionally distorts or restricts access to files, such measures pose a serious risk to preservation. Do not use them on a PDF you hope to preserve.

Loss of text-structure risk

Computers are very good at producing nice typography, much less good at interpreting it. If the structure of a text (section divisions, headings, lists, etc) is important information, PDF is a poor choice to preserve that text, since (unless very carefully produced) it destroys structural information, retaining only appearance information.

(Conversely, if text appearance is vital, XML is usually a ridiculous preservation choice.)

Bad configuration risk

PDF is one of those million-option formats! Choosing software and PDF-production configuration poorly incurs any number of information-loss risks, from inappropriately low-resolution images to lost fonts to OS dependence to... you name it, really.

Comments

Answer by johan

A couple of years I wrote a report on this, which was originally targeted at my colleagues at the KB. As I think it may be interesting to wider audience, I just published it on the Open Planets website. The link below points to a blog post that links to the document (and also addresses some its shortcomings):

PDF - Inventory of long-term preservation risks

Comments

Answer by Chris Adams

This started out as a comment on @dsalo's excellent answer above but rather quickly expanded beyond 500 characters:

PDF is a container format: a single PDF file has metadata and one or more content streams, conceptually similar to a ZIP archive containing multiple files. The core PDF format is based on a subset of PostScript, which is a programming language designed to produce graphics, and common graphics formats, but over time the format was expanded to allow streams to contain any type of data.

  1. The PDF format is very complicated and pulls in several other complex specifications. In practice, the vast majority of PDF files were only validated by testing whether Adobe Acrobat can display them as intended and it is quite common to have PDF encoders generate output which breaks the standard in ways which Acrobat tolerates, leaving the problem to be detected only when the file is first used with other tools.

  2. While the subset of PostScript supported in PDF is not as capable as full PostScript (fortunately, as the latter which is Turing Complete), it is still the case that what you actually have is executable program code and thus the only way to display PDF content is to execute each PostScript command in order:

    /Times findfont 100 scalefont setfont
    10 10 moveto
    .5 .5 .5 setrgbcolor
    (Hello World) true charpath fill
    showpage
    

    This fragment uses only a subset of the language exposes the key areas of concern for simple PDF display:

    1. Since this is program code, implementation details can affect the output. As a simple, hopefully purely hypothetical example, consider how processor or compiler-specific differences in floating-point rounding could affect a complex document after many operations cause display problems such as lines which are supposed to appear joined to have visible space.

      As the full language is far more complicated than the subset above, there are many variations on this theme. Fortunately the mainstream implementations have generally converged on reliable inter-operability but you are still likely to need a copy of Acrobat if you receive content from a wide range of sources.

      This was particularly a problem in the past with older “print to PDF” drivers which simply took the raw PostScript which they would have sent directly to a printer and wrapped in a PDF container.

    2. Font choices are specified by name. The corresponding font file may be embedded within the PDF file but as system fonts are also supported, it's quite easy for authors to use special fonts and forget to embed them until the first time the PDF is opened on a system which does not have those fonts installed.

      e seen this somewhat frequently with academic journal articles which were created using LaTex and use its fonts to display mathematical symbols. A Google search confirms that this is not an uncommon mistake as it will only be a problem when documents circulate outside the significant portion of scientific users who have the LaTeX fonts installed: https://www.google.com/search?q=%2B%22Cannot+find+or+create+the+font%22

      Additionally, the TrueType and particularly OpenType font formats are by necessity quite complex to deal with the range of human writing systems. Again, this is an area of potentially significant difference between implementations and, particularly for complex scripts like Arabic or Devangari, the failures can potentially lead to the text being incorrect. Fonts are versioned, so it's possible to have text which would be displayed correctly if the operating system's version of a font is used instead of the embedded version or vice versa. The more obscure the languages you work with, the more you need to have some sort of system to check for correctness.

  3. For simple images, PDF writers are allowed to use a number of encodings and over the years various image formats have been added, all of which have require full software support:

    http://en.wikipedia.org/wiki/PDF#Adobe.27s_versions

  4. Over the years, Adobe has also added many other types of rich content: audio, video, 3D imagery, etc. All of these include the full set of challenges for preserving their respective formats.

  5. Primarily for business users, Adobe has added several types of interactive forms, which rely on several complicated specifications and have in my experience been far less supported by third-party implementations, particularly the open-source community.

  6. In PDF 1.2, support was added for JavaScript as part of the forms specification. Since JavaScript is a full programming language, this means that the only way to process those actions is requires executing code in a manner consistent with the original implementation. Fortunately, this is likely to be uncommon in most preservation scenarios.

  7. The specification includes varying levels of encryption. It is possible to brute-force weak passwords and the older encryption algorithms but that might be possible and the software to do so might be difficult or illegal to obtain.

In practice, many of the concerns are manageable with several precautions. If your content is not supposed to include the various rich media features the best place to start is by requiring the restricted subsets of PDF which have been developed to avoid many of these issues: PDF/A, intended for preservation, and PDF/X, intended for reliable graphics exchange, which do not allow the more complex features and dramatically simplify the problem. If your goal is to archive general PDFs, however, you'll need to develop a more nuanced approach to audit the various complex features to check that a document does not include content which you are unprepared for (e.g. if your content includes embedded video, your auditing script could verify that the video stream uses a long-term viable codec).

Here are some features which you might want to audit:

  1. All fonts in the file are the standard core PDF fonts or embedded within the file.
  2. All images are in the subset of formats which you are prepared to support and decode without errors.
  3. All content streams are checked against a whitelist of supported types
  4. The PDF is unencrypted or at least that the password is known and the file decrypts successfully

Comments

Answer by johan

In addition to @chris-adams'excellent suggestion to audit files for font embedding, encoding, encryption etc:

If you have Adobe Acrobat Pro, you can use its 'Preflight' module (which you can find in the 'Advanced' menu) to validate a PDF against a number of compatibility profiles. If you select one of the PDF/A profiles here, it will check against all of the features mentioned by Chris (and more). Even if your PDF's are not of the actual PDF/A variety this is still useful, as it will immediately show you any features that are problematic for preservation (e.g. non-embedded fonts).

Obviously this is not a practical solution if you need to check very large numbers of files. There are a number of commercially available PDF/A validator tools out there that are better suited to this. In 2009 German company PDFLib did a comparison of a number of such tools. The results are given in the following report:

Bavaria Report on PDF/A Validation Accuracy.

On a final note I haven't tried out any of the tools mentioned in the report myself, but would be interested in having a go at this at some point.

Comments