Converting invalid PDFs or not for digital preservation?

I have been coming across a number of PDFs that have been created by known PDF programs (including various versions of Adobe Acrobat) that are marked as invalid by JHOVE2. They display fine in every piece of PDF-reading software I tried.

I can run them through pdftk, which fixes them, but that changes the document, which always carries some risk -- possibly unnecessarily in this case. Since these are lengthy documents, actually looking at everything to make sure nothing was damaged or changed is not practical (they are looked at, but not every word). In the case of these documents, there is no fancy 3D rendering, embedded multimedia, or anything like that. Typically I also convert them to PDF/A at the same time when practical.

Is it better to just accept the documents as delivered or to convert them? In other words, which is the bigger risk: conversion, or keeping files that JHOVE2 thinks are invalid?

Incidentally for a related blog post by Gary McGath on TIFF files, see http://fileformats.wordpress.com/2012/09/18/conformity/

ecorrado

digital-preservation
pdf

Comments

dsalo: Is it feasible to keep both versions, original and converted?

Answer by Andy Jackson

I would not perform any migration without understanding why the PDFs are failing validation. Please keep in mind that Adobe Acrobat is the reference implementation of PDF. Adobe bugs have been known, of course, but frankly the default assumption should be that this is caused by a bug in JHOVE, not Acrobat.

For example, JHOVE hasn't been updated to cover the latest versions of the spec...

...Someone tell them JHOVE knows nothing about PDF 1.7... https://twitter.com/GaryM03062/status/236101776372273152

...so it is possible that more recent features are confusing it.

Comments

Answer by Ross Spencer

To reiterate what Andy has said, I would not perform any migration without understanding why the PDFs are failing validation. The answers here could well change if you could provide that information.

On top of this, I would advocate a solution where you DO migrate but maintain the originals. One way is to keep two digital objects. This has its own technical implications beyond the scope of this answer. Another mechanism, which a colleague of mine adopted for digital objects, is to maintain a binary diff of the old and new objects using bsdiff. With this approach one can apply bspatch to the new file to re-create the old one bit for bit without having to keep the original around, thus saving on storage. Again, this approach has its own technical implications, and you should study the output to understand the size of the patch files that bsdiff creates. It is most beneficial for files where the binary changes are small and for files that you are unlikely to migrate again for a while.
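A minimal sketch of the bsdiff idea, assuming the `bsdiff` and `bspatch` command-line tools are installed and take their arguments in the order `oldfile newfile patchfile`. All function names here are hypothetical. The patch is built from the migrated file back to the original, and a SHA-256 checksum taken beforehand confirms that the reconstruction really is bit for bit.

```python
import hashlib
import subprocess
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reverse_patch_command(migrated, original, patch):
    """Command that writes a patch turning `migrated` back into `original`."""
    return ["bsdiff", str(migrated), str(original), str(patch)]


def restore_command(migrated, restored, patch):
    """Command that rebuilds the original from `migrated` plus the patch."""
    return ["bspatch", str(migrated), str(restored), str(patch)]


def bytes_saved(original, patch):
    """Storage saved by keeping the patch instead of the original (may be negative)."""
    return Path(original).stat().st_size - Path(patch).stat().st_size


def patch_round_trips(original, migrated, patch, restored):
    """Create the patch, replay it, and check the result against the original.

    Only if this returns True would it be reasonable to even consider
    discarding the original file.
    """
    expected = sha256_of(original)
    subprocess.run(reverse_patch_command(migrated, original, patch), check=True)
    subprocess.run(restore_command(migrated, restored, patch), check=True)
    return sha256_of(restored) == expected
```

Running `bytes_saved` over a sample of real files is one way to study whether the patches are small enough to be worth the extra complexity.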

Further, you need to think about controlling your migration and what output format you are selecting. For example, why have you chosen PDF/A over a like-for-like migration, say PDF 1.7 (broken) -> PDF 1.7 (fixed)? You can study the bitstreams in both cases to understand what changes have been made during the migration. I would hypothesize that the changes would be more conservative in the latter, more in keeping with the original, and therefore more suitable for your needs.

More information on bsdiff/bspatch: http://www.daemonology.net/bsdiff/

Comments

Answer by Paul Wheatley

Incorrect interpretation of JHOVE and/or JHOVE2 seems to be quite common in this community and can lead to unnecessary format migration that could well do more damage than it prevents. For example, see this poster presented at IDCC2013. In my opinion, this migration activity is unnecessary and possibly damaging.

The JHOVE tools assess a particular file's adherence to its respective format specification. They do not specifically report on long term preservation risks that the PDF might contain. The two might overlap, but in many cases they are different, so further (and considered) interpretation of what these tools report is essential. The answers to this stack question should be useful in making this interpretation.

So what defines a file format? In theory the file format specification and perhaps also (as Andy mentions) a reference implementation (eg. JasPer for JPEG2000 or Acrobat for PDF). However, particular files themselves are defined by one thing, and one thing only: the creating application. How closely the creating application adheres to a file format specification varies from application to application. Many don't adhere particularly closely and as a result, rendering applications tend to be very tolerant of common breaches of the relevant specifications (whether these breaches are due to poor implementation of the creating application, or alternative interpretations of ambiguous format specifications). And of course file format specifications and reference implementations don't always match up. In summary, a file format is to an extent a fuzzy concept, rather than the precise description implied by a "file format specification" document.

So given this somewhat confusing picture, what do we need to worry about when we acquire a PDF that we want to preserve for the long term?

There appears to be a growing consensus in the community (I only have anecdotal evidence for this) that if a file is judged invalid in relation to its file format specification, but renders completely without problem in most common renderers, then there is no significant long term risk. This is of course arguable. The counter argument would suggest that rebuilding a renderer from the specification at a future time when the format had become obsolete would not lead ultimately to successful rendering of the invalid file. In my opinion, this scenario seems rather unlikely when we have open source code (eg. xpdf) that does tolerate invalid files, and should provide some insurance against "catastrophic obsolescence" scenarios.

If action were to be taken in migrating an invalid PDF to a valid PDF, there are some key issues to think about. Any format migration is potentially very dangerous and could result in damage to the content. It's therefore important to keep the original, or consider keeping a diff of the transform (as Ross suggested), and perform some kind of quality check on the new file. Given the difficulty in performing this check in an automated manner, the case for not migrating to address invalid-ness becomes stronger.

Comments

Ross Spencer: It would be interesting to understand from the open source solutions if their flexibility always existed or if it came about through the discovery of exceptions and then 'bug-fixing' as a result. It's a good point though.
Paul Wheatley: Yes very interesting thought. Anyone fancy investigating and adding the results here?

Answer by Jay Gattuso

It very much depends on what you're seeing in JHOVE as an error - different issues would/could/should be addressed in different ways.

Without knowing the type of error it's difficult to assess the impact of the error, and then further to know what action to take...

Some errors, I would argue, are completely ignorable. Anything of the "warning" type is likely to be largely inconsequential (N.B. - likely), e.g. the DateTime field is not completed using the expected separator.

Some errors have a significant impact on the intellectual 'essence', e.g. non-embedded fonts leading to substitution with a sub-par font, resulting in the visual loss of special characters (e.g. macronised characters such as "ā" or "ō").

Finally, some errors do result in an inability to render the file (not the case here, but worth acknowledging), e.g. missing EOF maybe because of an incomplete file transfer resulting in a partial file that is missing binary data.
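The truncated-file case is the easiest of the three to screen for without a full validator. A rough sketch, assuming the common reader convention of looking for the `%PDF-` header near the start of the file and the `%%EOF` marker within the last 1024 bytes. The function name and the return labels are hypothetical, and passing this check says nothing about whether the file is valid.

```python
def quick_pdf_screen(data, window=1024):
    """Classify raw PDF bytes by header and trailer markers only.

    Returns "not-pdf" if no %PDF- header is found in the first `window`
    bytes, "possibly-truncated" if no %%EOF marker is found in the last
    `window` bytes, and "markers-present" otherwise.
    """
    if b"%PDF-" not in data[:window]:
        return "not-pdf"
    if b"%%EOF" not in data[-window:]:
        return "possibly-truncated"
    return "markers-present"
```

A file reported as "possibly-truncated" is worth comparing against the depositor's copy (size or checksum) before any thought is given to repair.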

From a practical perspective, we've been encountering this issue for a number of years as we ingest files into our preservation system. Initially we noted every instance that we encountered. This resulted in some nice data showing that this issue (of JHOVE failing PDFs) is found across all PDF types that get ingested, at an anecdotal rate of about 1 in every ~250 being flagged by JHOVE as having an issue of some sort.

We currently programmatically ignore the flag from JHOVE if we can render the file. If we cannot render the file, we assume complete failure and look into the file specifically. When we ignore the flag, a record is kept that the file was given an error by JHOVE, but the file is allowed to proceed unhindered into permanent storage. By ignoring the flag, we are essentially making a statement of intent that says "We acknowledge there is a potential risk with this file - we are currently unable to isolate the specific reason for the error, nor diagnose any specific long term risk for the content of the file. We will return at some point to the issue when either (1) we are sufficiently equipped with tools to address the challenge, (2) we have raised the generic risk against the pdf file type high enough that it becomes our priority action to explore/fix/mitigate or (3) we have run out of other things to do...".
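As an illustration only, the render-first rule could be sketched like this. It assumes the intended policy is that a file which renders is accepted with the JHOVE flag recorded, and a file which does not render is treated as a failure and examined individually. The names `IngestDecision` and `decide` are hypothetical and are not taken from any actual preservation system.

```python
from dataclasses import dataclass


@dataclass
class IngestDecision:
    accept: bool       # allowed to proceed into permanent storage
    record_flag: bool  # keep a note that the validator reported an error
    investigate: bool  # someone needs to look at this file specifically


def decide(validator_flagged, renders):
    """Apply the render-first policy to a single file."""
    if not renders:
        # Assumption: a file that cannot be rendered is a complete failure,
        # whatever the validator said about it.
        return IngestDecision(accept=False, record_flag=validator_flagged, investigate=True)
    # Renders fine: accept it, but remember any validator complaint so the
    # file can be revisited when better tools or more time are available.
    return IngestDecision(accept=True, record_flag=validator_flagged, investigate=False)
```

Keeping `record_flag` separate from `accept` is what makes the later "return to the issue" step possible: the flagged files can be listed again without re-running validation.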

As a principle, we would always retain the original file as it was first deposited, regardless of the ability to create a 'cleaner' version.

It's also worth thinking about what the constraining factor here actually is.

What is the threshold test for creating a 'valid' digital object? On the one hand we have "Conforms to the exact specification" (JHOVE), on the other "Functions as expected across a representative set of object renderers" (renderability test).

This notion of validity is at the very heart of what we are trying to achieve here. Personally, I think validity is not a binary thing (either valid or not valid): there is a huge grey area in the middle that we work in on a daily basis. Often we have to err on one side or the other because of systematic/programmatic/circumstantial factors, when sometimes the decision might be better left on hold to allow knowledge and experience to influence it when it's crunch time...