Zombse

The Zombie Stack Exchanges That Just Won't Die


web archiving tools that produce WARC + directory tree

An important feature of any web archiving tool is the ability to specify that one would like to pull down embedded resources ("page requisites" in the parlance of wget) that are hosted on a domain other than the site that is the target of the crawl. Such cases arise not only when people "hot-link" other people's images, but are encountered heavily on sites that serve assets from cloud-based content delivery networks, such as Tumblr. Including such assets in a crawl is absolutely necessary when one's aim is to achieve a complete mirror of a site.

Heritrix has an option for including such assets in the crawl scope: https://webarchive.jira.com/wiki/display/Heritrix/unexpected+offsite+content

As does HTTrack, using the --near flag:

http://www.httrack.com/html/fcguide.html

But of course, HTTrack does not offer WARC output.
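
For reference, a minimal HTTrack invocation along those lines might look like the sketch below (example.com and the ./example-mirror output directory are placeholder values):

    # Mirror example.com, also fetching "near" non-HTML files (images, CSS,
    # etc.) even when they are hosted on other domains.
    httrack "http://example.com/" -O ./example-mirror --near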

Wget has the -H flag, allowing one to "span hosts" (in other words, hit sites with domain names other than that of the starting URL), but it lacks the ability to specify that the crawler should span hosts only for page requisites, and so it tries to download the entire web if one combines an infinitely recursive crawl with -H. There are some hacky ways of getting around this, but they aren't pretty or reliable. The great thing about Wget, though, is that it allows the user to output both a WARC and a directory tree of the crawled site, thus not locking one into WARC completely.
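
As an illustration of one such workaround, the sketch below allows host-spanning but whitelists the permitted domains with -D, which only helps when the CDN hostnames are known in advance (example.com and cdn.example-assets.net are placeholders):

    # Recurse without limit, fetch page requisites, and write a WARC alongside
    # the usual directory tree. --span-hosts is constrained by --domains, so
    # the crawl only leaves example.com for the listed CDN host.
    wget --recursive --level=inf --page-requisites \
         --span-hosts --domains=example.com,cdn.example-assets.net \
         --warc-file=example-crawl \
         http://example.com/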

Are there any tools of the trade that allow one to conduct the comprehensive type of crawl mentioned above, but that also have the affordance of outputting both a WARC and a directory tree?

Ben Fino-Radin

Comments

Answer by Andy Jackson

I'd say you have two options. You could crawl using Heritrix and generate the WARC files, and then pop the contents of those files out into a directory tree. You could do this via one of these two warctozip scripts 1, 2 (disclaimer: I wrote the latter one; it's just a rough prototype and should probably be merged with the former).

The other option is to use an existing tool to enumerate all the URLs and then pass that list to wget. This is what my flashfreeze tool does, using a PhantomJS script to determine all the dependencies of a specific page, and then using wget to grab them as a WARC file.
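
The general shape of that approach, sketched below with a hypothetical list-resources.js standing in for the PhantomJS part (the actual flashfreeze script will differ):

    # Hypothetical script: loads the page in PhantomJS and prints every URL it
    # requests, one per line.
    phantomjs list-resources.js http://example.com/page.html > urls.txt

    # Feed that list to wget and record everything it fetches into a WARC.
    wget --input-file=urls.txt --warc-file=page-snapshot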

Comments