The Zombie Stack Exchanges That Just Won't Die
Assuming a static website (or a dynamic website that I'm choosing to archive as a static site), what is the best file format or standard to use to capture all of the dependencies (embedded images, CSS, JavaScript, iframes, etc.) and HTTP metadata (last-modified headers, content type, etc.)? Ideally, I'd like a file format with an open specification and multiple implementations for reading the file.
Peter Murray
The problem is that most of the 'solutions' are browser specific, and they're not really intended for archiving:
MHTML is based on RFC 2557 (MIME encapsulation of multiple documents), and SingleFile is based on re-writing other elements as data: URIs per RFC 2397.
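To make the data: URI approach concrete, here is a minimal sketch of the RFC 2397 encoding step in Python. The function name `to_data_uri` is my own and is not taken from SingleFile or any other tool; a real page rewriter would also have to find every reference in the HTML and CSS and substitute the result.

```python
import base64


def to_data_uri(payload: bytes, media_type: str = "application/octet-stream") -> str:
    """Encode raw bytes as a base64 data: URI in the RFC 2397 form.

    to_data_uri is a hypothetical helper name used for illustration only.
    """
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{media_type};base64,{encoded}"


if __name__ == "__main__":
    # Inline a tiny stylesheet the way an embedded resource would be rewritten.
    print(to_data_uri(b"body { color: red; }", "text/css"))
```

The trade-off is size: base64 inflates each embedded resource by roughly a third, and the original URL and HTTP headers are lost unless they are recorded somewhere else.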
MHTML likely has the widest support for reading, but if I were to do it, I'd personally be inclined to grab the whole site (rather than a single page) and re-write it using similar logic as wget --convert-links, then pack the whole thing using BagIt w/ appropriate metadata (HTTP headers & checksums).
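As a rough illustration of the packing step, the sketch below copies an already-mirrored site directory into a BagIt-style layout (a `data/` payload directory, a `bagit.txt` declaration, and a `manifest-sha256.txt` of checksums) using only the Python standard library. The name `make_bag` is hypothetical, the version string assumes BagIt 1.0, and it does not store HTTP headers; those would need to be saved separately, for example as extra files captured at download time.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(src_dir: str, bag_dir: str) -> Path:
    """Copy src_dir into a minimal BagIt-style bag (hypothetical helper).

    Assumes bag_dir does not exist yet and that the BagIt 1.0 layout is wanted.
    """
    src, bag = Path(src_dir), Path(bag_dir)
    data = bag / "data"
    shutil.copytree(src, data)  # payload files live under data/

    # Bag declaration: version and tag-file encoding.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path relative to bag>" line per file.
    lines = []
    for path in sorted(p for p in data.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag).as_posix()}\n")
    (bag / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag
```

Calling `make_bag("mirror/example.org", "example.org-bag")` on the output directory of a mirroring run would produce a bag whose checksums can be re-verified later; both paths here are placeholders.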
(and my definition of 'site' isn't necessarily 'all web pages on a given server' ... it's more those that share the same images/javascript/css and are highly inter-linked)
... and for reference, the Internet Archive uses a file format they call ARC (more legible alternative, if you bump the font size up), based on 100MB groups of files. The Web Curator Tool also uses ARC.
I believe you're looking for ISO 28500:2009, aka WARC, the Web ARChive file format.
The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block into one long file. The WARC format is an extension of the ARC File Format [ARC] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web.
-- ISO 28500:2009, final draft
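To show what "simple text headers and an arbitrary data block" looks like in practice, here is a minimal sketch that builds a single WARC `resource` record by hand. The function name `warc_record` is my own; it assumes WARC/1.0 framing (CRLF-terminated headers, a blank line, the block, then two CRLFs) and leaves out optional fields such as digests, so for real archives an established tool is the safer choice.

```python
import uuid
from datetime import datetime, timezone


def warc_record(target_uri: str, payload: bytes,
                content_type: str = "application/octet-stream") -> bytes:
    """Build one WARC/1.0 'resource' record (hypothetical helper, minimal fields)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF; a blank line separates headers from the block.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLFs, so records can be concatenated.
    return head + payload + b"\r\n\r\n"


if __name__ == "__main__":
    record = warc_record("http://example.com/", b"hello", "text/plain")
    print(record.decode("utf-8"))
```

Writing several such records one after another into a single file gives the "one long file" the specification describes.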
The development version of wget can output to WARC.
There is a sponsored project to leverage support for the WARC archiving format in JHOVE2 and NetarchiveSuite.
The Internet Archive has many of their tools published on GitHub:
warc-tools is a Python-based suite of tools for working with WARC files, though not technically a "viewer".
Archive Team has a writeup of wget with WARC output, including links to the final draft of ISO 28500:2009 and implementation guidelines from the WARC usage task force.
The United States Library of Congress has some analysis of the WARC format with respect to its suitability as a sustainable digital format.