What is the best format/standard for archiving static web pages? What software generates that format?

Assuming a static website (or a dynamic website that I'm choosing to archive as a static site), what is the best file format or standard to use to capture all of the dependencies (embedded images, CSS, JavaScript, iframes, etc.) and http metadata (last modified headers, content type, etc.). Ideally, I'd like a file format with an open specification and multiple implementations for reading the file.

Peter Murray

software
preservation
web-archiving
digital-preservation

Comments

Answer by Joe

The problem is that most of the 'solutions' are browser specific, and they're not really intended for archiving:

Apple's WebArchive
Microsoft's MHTML
Mozilla extension MAFF
Chrome extension SingleFile

MHTML is based on RFC 2557 (MIME encapsulation of multiple documents), and SingleFile is based on re-writing other elements as data: URIs per RFC 2397

MHTML likely has the widest support for reading, but if I were to do it, I'd personally be inclined to grab the whole site (rather than a single page) and re-write it using similar logic as wget --convert-links, then pack the whole thing using Bag It w/ appropriate metadata (HTTP headers & checksums).

(and my definition of 'site' isn't necessary 'all web pages on a given server' ... it's more those that share the same images/javascript/css and are highly inter-linked)

... and for reference, the Internet Archive uses a file format they call ARC (more legible alternative, if you bump the font size up), based on 100MB groups of files. The Web Curator Tool also uses ARC.

Comments

Andy Jackson: Note that the Internet Archive use WARC instead of ARC these days and the Web Curator Tool supports both the ARC and WARC formats.

Answer by jeff

I believe you're looking for ISO 28500:2009, aka WARC, the Web ARChive file format.

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block into one long file. The WARC format is an extension of the ARC File Format [ARC] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web.

-- ISO 28500:2009, final draft

Software

The development version of wget can output to WARC.

There is a sponsored project to leverage support for the WARC archiving format in JHOVE2 and NetarchiveSuite

The Internet Archive has many of their tools published on Github:

"Heritrix, the Internet Archive web crawler
"draintasker", which is used to pull data in WARC format from a crawler and transmit it to longer term storage.
"warc", a Python library for reading and writing WARC files
"wayback", the infamous Wayback Machine, a de-facto WARC "viewer"

warc-tools is a Python-based suite of tools for working with WARC files, though not technically a "viewer".

Additional Reading

Archive Team has a writeup of wget with WARC output, including links to the final draft of ISO 28500:2009 and implementation guidelines from the WARC usage task force.

The United States Library of Congress has some analysis of the WARC format with respect to its suitability as a sustainable digital format.

Comments

dsalo: heh. beat me to it. :)
Ed Summers: beat me too :-)
dsalo: Is there a viewer for WARC besides wayback? Seems like that would be a useful thing; when I've archived websites (via DSpace), they've wanted dissemination as well.
Peter Murray: Thanks @jeff; this is the sort of thing I'm looking for. I'm also interested in @dsalo's question about other viewers from the perspective of unpacking at the server level and sending the components back to the browser.