Zombse

The Zombie Stack Exchanges That Just Won't Die


What is the best format/standard for archiving static web pages? What software generates that format?

Assuming a static website (or a dynamic website that I'm choosing to archive as a static site), what is the best file format or standard to use to capture all of the dependencies (embedded images, CSS, JavaScript, iframes, etc.) and HTTP metadata (Last-Modified headers, content type, etc.)? Ideally, I'd like a file format with an open specification and multiple implementations for reading the file.

Peter Murray

Answer by Joe

The problem is that most of the 'solutions' are browser specific, and they're not really intended for archiving:

MHTML is based on RFC 2557 (MIME encapsulation of aggregate documents), and SingleFile is based on rewriting embedded resources as data: URIs per RFC 2397.
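To make the data: URI approach concrete, here is a minimal Python sketch of the RFC 2397 encoding step. It is not how SingleFile itself is implemented; the function names `to_data_uri` and `inline_image` are made up for illustration, and the naive string replacement assumes the `src` attribute appears exactly as given.

```python
import base64
import mimetypes


def to_data_uri(payload: bytes, media_type: str) -> str:
    """Encode a resource as an RFC 2397 data: URI using base64."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{media_type};base64,{encoded}"


def inline_image(html: str, src: str, payload: bytes) -> str:
    """Replace one src="..." reference with an inlined data: URI.

    Hypothetical helper: a real tool would parse the HTML rather than
    rely on plain string replacement.
    """
    media_type = mimetypes.guess_type(src)[0] or "application/octet-stream"
    return html.replace(f'src="{src}"', f'src="{to_data_uri(payload, media_type)}"')
```

The result is a single self-contained HTML file, at the cost of losing the original URLs and HTTP headers of the embedded resources.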

MHTML likely has the widest support for reading, but if I were to do it, I'd personally be inclined to grab the whole site (rather than a single page), rewrite it using logic similar to wget --convert-links, then pack the whole thing using BagIt with appropriate metadata (HTTP headers & checksums).

(and my definition of 'site' isn't necessarily 'all web pages on a given server' ... it's more those that share the same images/javascript/css and are highly inter-linked)
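As a rough sketch of the packing step, the following Python writes a minimal BagIt-style directory: payload files under `data/`, a `bagit.txt` declaration, and a SHA-256 manifest. The function name `make_bag` and the choice of SHA-256 are assumptions for illustration; an existing BagIt library would normally be preferable to hand-rolling this.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir: Path, files: dict[str, bytes]) -> None:
    """Write payload files under data/ plus the two required BagIt tag files."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_path, payload in sorted(files.items()):
        target = data_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)
        # One manifest line per payload file: checksum, then path relative to the bag root.
        digest = hashlib.sha256(payload).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel_path}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
```

Captured HTTP headers could be passed in as extra payload entries next to each page (for example a hypothetical `index.html.headers.json`), so that they are covered by the same checksums.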

... and for reference, the Internet Archive uses a file format they call ARC (more legible alternative, if you bump the font size up), based on 100MB groups of files. The Web Curator Tool also uses ARC.

Answer by jeff

I believe you're looking for ISO 28500:2009, aka WARC, the Web ARChive file format.

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block into one long file. The WARC format is an extension of the ARC File Format [ARC] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web.

-- ISO 28500:2009, final draft
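To illustrate the "simple text headers plus arbitrary data block" layout described in the quote, here is a minimal Python sketch that serializes a single WARC 1.0 `resource` record by hand. The function name `build_warc_record` is made up for this example, and it only emits the mandatory fields plus a target URI; real tools also write `warcinfo`, `request`, and `response` records, and usually gzip each record.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, block: bytes, content_type: str) -> bytes:
    """Serialize one WARC 'resource' record: version line, headers, blank line, block."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(block))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{name}: {value}\r\n" for name, value in headers)
    # A blank line separates headers from the block; two CRLFs terminate the record.
    return head.encode("utf-8") + b"\r\n" + block + b"\r\n\r\n"
```

Because every record is self-delimiting via `Content-Length`, a WARC file is just records concatenated one after another, e.g. `b"".join(records)`.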

Software

The development version of wget can output to WARC.
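As a sketch of what that looks like in practice, the Python below builds and runs a wget invocation with WARC output. It assumes a wget build that includes WARC support (the `--warc-file` option); the function names and the example URL are placeholders.

```python
import shutil
import subprocess


def wget_warc_command(url: str, warc_name: str) -> list[str]:
    """Build a wget command that fetches a page plus its requisites into a WARC file."""
    return ["wget", "--page-requisites", f"--warc-file={warc_name}", url]


def capture(url: str, warc_name: str) -> None:
    """Run the capture; raises if wget is missing or exits with an error."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget not found on PATH")
    subprocess.run(wget_warc_command(url, warc_name), check=True)
```

For example, `capture("http://example.com/", "example")` would archive that page and the images, CSS, and scripts it needs, using `example` as the WARC filename prefix.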

There is a sponsored project to leverage support for the WARC archiving format in JHOVE2 and NetarchiveSuite.

The Internet Archive has many of their tools published on GitHub:

warc-tools is a Python-based suite of tools for working with WARC files, though not technically a "viewer".

Additional Reading

Archive Team has a writeup of wget with WARC output, including links to the final draft of ISO 28500:2009 and implementation guidelines from the WARC usage task force.

The United States Library of Congress has some analysis of the WARC format with respect to its suitability as a sustainable digital format.
