
Archiving the Dynamic Web

Introduction

Of the many challenges facing those of us who attempt to preserve the web, one of the most pressing is the highly dynamic nature of many modern websites. The Wikipedia SOPA blackout banner case study shows how this can prevent both crawling and playback. That example is bad enough, but unfortunately there are many more cases like it, where almost all of the artefacts presented on a page are pulled down dynamically via JavaScript. Occasionally, even relatively simple constructs like conditional comments have caused problems (c.f. the gov.uk rendering thread). In general, as the level of dynamic sophistication increases, reliably reconstructing the intended interpretation outside the context of a browser becomes ever more error prone.

Fortunately, recent web browser developments, and their use on mobile platforms, have led to some very nice, compact, embeddable rendering engines. In particular, the WebKit engine has been very successful, and has led to a number of embedded forms:

  • WebView and WebEngine - WebKit-based rendering components, originally part of JavaFX but now a standard part of the Java 7 JRE.
  • PhantomJS - a headless WebKit browser with a JavaScript-based scripting interface.
  • Ghost.py - an embedded WebKit browser with a Python interface that also supports plugins (e.g. Flash).

This kind of tool is usually intended to aid development and regression testing of web software, but it can also be used to help web archiving. We have been experimenting with PhantomJS, and this document outlines how our crawler architecture might develop over time.

Architectures

Current Architecture

Currently, for every seed, we render the root page via an embedded browser, as outlined below.

Asynchronous Rendering

As the crawl proceeds, every time Heritrix3 starts processing a new seed URL, that URL is also POSTed to a RabbitMQ message queue using our AsynchronousMQExtractor. An entirely separate WebTools service polls the message queue and runs a PhantomJS script that collects the URLs of the resources required to perform the rendering. It then saves the rendering as a screenshot, and pushes the list of URLs into Heritrix3’s ‘action directory’, which instructs the crawler to download those URLs promptly.
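As a rough illustration of the message-passing side of this (not the production configuration; the broker location, queue name and message format below are assumptions), a seed URL can be published to RabbitMQ from Java along these lines:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

/**
 * Minimal sketch of the kind of message-passing the AsynchronousMQExtractor
 * performs: each seed URL is published to a RabbitMQ queue for a separate
 * rendering service to pick up later.
 */
public class SeedPublisherSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker location
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // Durable queue so messages survive a broker restart.
        String queue = "render-requests"; // hypothetical queue name
        channel.queueDeclare(queue, true, false, false, null);

        String seedUrl = "http://www.example.com/";
        channel.basicPublish("", queue, null, seedUrl.getBytes("UTF-8"));

        channel.close();
        connection.close();
    }
}
```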

This has worked reasonably well, but suffers in a couple of areas.

We treat every known host and every manually annotated URL as a seed, and so around four million URLs will be rendered, out of a total of around 1.3 billion URLs. (TBA Something on tunable rendered fraction). Those four million screenshot processes take some time, and as we are only running one WebTools service at the moment, the time that elapses between the two processes can be rather long. This means the URLs downloaded by Heritrix may not be valid by the time the download is attempted.

A more serious problem is that this approach downloads everything twice, from two independent contexts, with different sessions, user agents, etc. This means not only that we are downloading the same things too often, but also that the two sets of required URLs can be inconsistent. In particular, it is common to bundle up CSS files on a per-session basis, and so the list of URLs required to render a page in the embedded browser may not be the same as the list seen by Heritrix3. As Heritrix3 downloads the main page, and the embedded renderer is only used to determine transcluded resources, the composite archive is inconsistent.

Synchronous Rendering

One way to solve the time-lag problem would be by having a larger array of WebView services consuming the message queue. We will probably experiment with that first.

However, it may instead be preferable to bring the two closer together, and perform the rendering synchronously, during the crawl, thus guaranteeing that the results would be as timely as possible.

Synchronous Rendering

This architecture would mean that the overall crawl would proceed more slowly, particularly at first. However, based on our experience so far, we do not believe this would be a major problem. The embedded renderers are very efficient, and during the first domain crawl we successfully collected the entire set of asynchronous screenshots long before the crawl drew to a close.

Taking this even further, it would be possible to bring the rendering process right inside Heritrix3.

Synchronous Embedded Rendering

Given the existence of the WebEngine component, it should be possible to do this using only pure Java 7 code rather than calling out to PhantomJS. Having said that, one issue with running a full rendering engine is that it exposes the service to possible browser-based exploits. This is not all that likely when running an unusual, embedded browser on a non-mainstream platform (ours run on a flavour of Linux), but precautions should still be taken to ensure that the machine can be rebuilt easily if compromised. Due to the large amount of complex state managed by the current Heritrix3 architecture, however, this would not be an advisable approach.
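For reference, a minimal sketch of driving WebEngine from plain Java (outside of any crawler) might look like the following. This is illustrative only; a real integration would need careful thread and lifecycle management, and the target URL is a placeholder:

```java
import javafx.application.Platform;
import javafx.beans.value.ChangeListener;
import javafx.beans.value.ObservableValue;
import javafx.concurrent.Worker;
import javafx.embed.swing.JFXPanel;
import javafx.scene.web.WebEngine;

/**
 * Rough sketch of using the JavaFX WebEngine directly from Java, as an
 * alternative to calling out to PhantomJS.
 */
public class WebEngineRenderSketch {
    public static void main(String[] args) {
        // Creating a JFXPanel initialises the JavaFX runtime without a full Application.
        new JFXPanel();
        Platform.runLater(new Runnable() {
            @Override
            public void run() {
                final WebEngine engine = new WebEngine();
                engine.getLoadWorker().stateProperty().addListener(
                        new ChangeListener<Worker.State>() {
                    @Override
                    public void changed(ObservableValue<? extends Worker.State> observable,
                                        Worker.State oldState, Worker.State newState) {
                        if (newState == Worker.State.SUCCEEDED) {
                            // At this point the DOM is available for link extraction.
                            System.out.println("Loaded: " + engine.getLocation());
                            Platform.exit();
                        }
                    }
                });
                engine.load("http://www.example.com/"); // hypothetical target page
            }
        });
    }
}
```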

Army of Ghosts

The other problem we mentioned above was that of the doubly-downloaded, inconsistent content. One solution is to send the full request headers to the WebView service, thus ensuring the session and other information are the same. This is a sensible step, but it still seems unsatisfactory that every URL is being downloaded twice. To avoid that, a much larger change to our architecture would be required.

WARC Writing Proxy

In this scheme, we still use Heritrix3 (essentially as a crawl frontier queue management engine), but the WARC writing (including de-duplication) is all handled via a dedicated proxy (such as warcprox or LAP). This architecture is much easier to scale out, e.g. using SQUID as a load balancer.

It also allows us to use a single WARC backend for a wide range of manual or automated archiving processes, and keeps all of the deduplication logic close to the writers and out of the crawlers. (TBA More about how this opens up scaling out on both sides, i.e. army of ghosts)
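To give a flavour of the client side of this arrangement, the sketch below shows an ordinary Java HTTP fetch routed through a WARC-writing proxy. The proxy host and port are assumptions, and the proxy itself (e.g. warcprox) would handle the WARC writing and de-duplication:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

/**
 * Sketch of how any HTTP client (a crawler, a manual capture tool, an
 * embedded browser) can be pointed at a WARC-writing proxy.
 */
public class ProxyFetchSketch {
    public static void main(String[] args) throws Exception {
        // Assumed proxy endpoint; the WARC writing happens on the proxy side.
        Proxy warcWritingProxy = new Proxy(Proxy.Type.HTTP,
                new InetSocketAddress("localhost", 8000));

        URL url = new URL("http://www.example.com/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(warcWritingProxy);
        conn.setRequestProperty("User-Agent", "example-crawler/0.1"); // hypothetical UA

        // Consume the response; content handling would happen elsewhere.
        InputStream in = conn.getInputStream();
        byte[] buffer = new byte[8192];
        while (in.read(buffer) != -1) {
            // discard
        }
        in.close();
        System.out.println("Status: " + conn.getResponseCode());
    }
}
```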

Quality Assurance

Crawl-level QA

Monitrix

Near Real-Time Playback QA

OpenWayback could be hooked in to evaluate playback as soon as possible after capture.

Preserving the Render

It would be good to package the screenshots into WARCs in some standard form. It would also be good to capture the extracted URL lists and the final DOM trees, and store those too. This would make it easier to perform QA on future preservation actions.

http://docs.oracle.com/javafx/2/api/javafx/scene/web/WebEngine.html#getDocument()
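As a sketch of the DOM-capture idea, the getDocument() method linked above returns a standard org.w3c.dom.Document once a page has loaded, which could be serialised to text along the following lines and then stored alongside the screenshots. How it would be wrapped into a WARC record is left open here:

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

/**
 * Sketch of one way the final DOM could be captured for preservation:
 * the Document returned by WebEngine.getDocument() is serialised to a
 * string using the standard JAXP transformer.
 */
public class DomCaptureSketch {

    public static String serialise(Document document) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }
}
```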

The Elastic Crawler

To spec.

Other problems

http://www.dlib.org/dlib/november13/kelly/11kelly.html

Components

  • Crawl Metadata, User Agent etc.
  • Seeds (text file)
  • DecideRules
  • FrontierPreparer
  • QuotaEnforcer
  • Canonicalisation
  • uriUniqFilter
  • IP/Robots.txt info validity/timeout.
  • Fetchers
  • Extractors
  • PersistLog
  • FetchHistory
  • Sheets
  • Our custom stuff, IP sheets etc.
  • WARC Writers, disk space monitoring, etc.

Heritrix3 queues, per host. This enforces delay? Queue rotation behaviour?

Coping with millions of queues?

But do we need millions of queues? Do they only exist as a means of helping enforce the crawl rate?

When taking a snapshot, Heritrix renames crawl.log.

Frontier queue budgets BdbFrontier

SQUID + CARP (Cache Array Routing Protocol) and N × LAP/warcprox, OR CrawlBolts that keep track of proxies via ZooKeeper and route on a host hash. Is this really necessary? An alternative is simply to store the duplicate info in Cassandra. In principle, the Archiving Proxies could do much of this:

  • Crawl-delay
  • Retries? Probably too complex. Perhaps better to have a separate Cassandra table like:
  • ( ((hash), SURT, UUID), WARC-Record-Type, WARC, Offset )
  • Use this to decide duplicates, store or write re-visit, etc.

URIs table used for crawl/frontier:

  • ( ((host), SURT, crawl_time), URI, hash?)
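Purely as an illustration of the notation above, these two tables might be expressed in CQL (here via the DataStax Java driver) roughly as follows. The keyspace, table and column names are assumptions that simply follow the notes:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

/**
 * Hedged sketch of the two tables described in the notes: a de-duplication
 * table partitioned on content hash, and a crawl/frontier history table
 * partitioned on host.
 */
public class CrawlStateSchemaSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("crawl"); // assumed keyspace

        // De-duplication table: partition on hash, cluster on SURT + UUID.
        session.execute(
            "CREATE TABLE IF NOT EXISTS captures ("
          + "  hash text, surt text, record_uuid uuid,"
          + "  warc_record_type text, warc text, warc_offset bigint,"
          + "  PRIMARY KEY ((hash), surt, record_uuid))");

        // URIs table for the crawl/frontier: partition on host,
        // cluster on SURT + crawl time.
        session.execute(
            "CREATE TABLE IF NOT EXISTS uris ("
          + "  host text, surt text, crawl_time timestamp,"
          + "  uri text, hash text,"
          + "  PRIMARY KEY ((host), surt, crawl_time))");

        cluster.close();
    }
}
```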

http://commoncrawl.org/common-crawl-move-to-nutch/ A WARC writer added to Nutch: https://github.com/Aloisius/nutch/commit/3ef169ad5402cee35346f566c85c237b5d128495 versus https://github.com/commoncrawl/commoncrawl-crawler

See also StormScraper, and associated slides, e.g. this topology overview.

It is not clear that complex features like queue rotation are strictly required, so it may make more sense to start with something simpler. In fact, it probably makes sense to think a bit more radically about the whole thing, and investigate the possibility of switching over to true ‘continuous crawling’.

For example, using Storm or something similar makes long-running processes easier to manage. It also allows the need to ‘crawl right now’ to sit alongside ‘has not been crawled since X and is due’ and the ‘big crawl’ more easily.

The balance between the queues and the state store (e.g. Cassandra/HBase) is less clear now.

  • IN: system, high-priority and low-priority queues of URLs.
  • SPOUT: Watch the queues, in priority order, occasionally skipping down to avoid total queue backlog.
  • DISTRIBUTE: URLs grouped on Host.
  • BOLT: Pause based on crawl delay (STATE NEEDED)
    • URL comes in.
    • Look up last known Crawl-Delay and last known crawl timestamp.
    • Wait until time has passed before emitting.
  • BOLT: Download resource (embedded browser OR old-style GET):
    • BOLT: Record success in State.
    • BOLT: Append to WARC (if old-style).
      • Can’t do that, as it involves serialising the whole response.
    • BOLT: Extract links:
      • BOLT: Check links against State, and DecideRules, and Enqueue if crawl is due.
        • BOLT: Record link discovery in State.

As most of the process is serial and required, there is little benefit in having large numbers of bolts for the sake of it. However, it makes sense to use separate bolts for processes with different I/O dependencies (e.g. coupling with HBase), so that these can be tuned independently.

  • IN
    • RabbitMQ:
      • system action queue (highest priority for processing)
      • high-priority URL queue
      • standard-priority URL queues by host
      • Initially filled with seeds by a separate process.
    • HBase or Cassandra:
      • (‘State’/’State DB’)
      • Contains the URL crawl history.
  • SPOUT:
    • Watch the queues, in priority order, occasionally skipping down to avoid total queue backlog.
    • So, spout is doing the quotas and queue rotation.
    • Also a command queue that can pause things etc.
    • ACTIONS:
      • Start
      • Pause
      • Shutdown
      • Enqueue(url,priority)
      • PauseHost(host)
    • BOLT: Crawl URL
      • DISTRIBUTE: URLs grouped on Host. +isSeed
      • Look up last known Crawl-Delay and last known crawl timestamp for Host.
      • Wait until time has passed before emitting.
      • Download resource (embedded browser OR old-style):
      • IF old-style: append Records to WARC file.
      • Extract links.
      • BOLT: Update Crawl History
        • DISTRIBUTE: Random
        • Record outcome in State DB.
        • If successful, includes WARC+offset ID.
      • BOLT: Enqueue(url, priority)
        • DISTRIBUTE: Random
        • Check links against State, and DecideRules, and Enqueue if crawl is due.
        • Record link discovery + decision in State.
      • ACK now that we’ve finished all processing.

See also

Evolving Heritrix

Looking at options for evolving Heritrix. Aside from unexpected (breaking) changes upstream, our main problems appear to be around the use of BDB JE for persistence. We have absolutely massive state directories, and restarting from checkpoints seems brittle and opaque. Sometimes it goes wrong, and if it does, it’s impossible to work out what’s going on.

Looking through the code, it seems the BDB state engine is used in three roles: crawl state (like the frontier and cookie store), crawler caches (like the server cache), and deduplication support. See https://docs.google.com/spreadsheets/d/1oafpoY5AxBA1OlloKe2bOOYk1FjRvxyXp64YrfOOYlU/edit#gid=0

There is Kristinn’s Lucene-based deduplication engine, and there also appears to be an HBase version inside the H3 codebase (see HBasePersistProcessor; it is not clear whether this code is in use).

For the frontier, it would be possible to switch to an off-the-shelf queue, like RabbitMQ or Kafka (e.g. RabbitMQ’s QueueingConsumer). However, the Frontier is also responsible for implementing the crawl-delay, via per-host queues that are ‘snoozed’ until the crawl-delay expires (see org.archive.crawler.frontier.WorkQueueFrontier.snoozeQueue(WorkQueue, long, long)). Having millions of distinct queues is not supported by many queue systems (e.g. Kafka), and generally seems to be a rather cumbersome approach.

The Frontier also handles unique-URI filtering which, it turns out, is not handled by a Bloom filter but by a full Set implementation, so this may be part of the reason why the state gets so large. As the crawl history we need for deduplication is held in the same BDB database as the Frontier, we end up having to maintain all this state between crawls.

So, it would be desirable to separate the Frontier functionality out into the unique-URI filtering (which could be done via a Trie if a Bloom filter is unacceptable), the queues, and the crawl-delay implementation. This may be possible by extending WorkQueueFrontier, which implements the snoozing part and allows a pluggable unique-URI filter. However, because this system ties the queues to the crawl-delays, the underlying implementation would have to maintain the illusion of a queue per host. This is likely to be workable, but would require testing to make sure, and tuning of the QoS setting (i.e. the number of RabbitMQ messages that are allowed to be ‘in flight’, meaning in use but not yet ACKed).
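For reference, the QoS setting in question is set via basicQos in the RabbitMQ Java client. A minimal consumer sketch (queue name and prefetch count are illustrative) looks like this:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.QueueingConsumer;

/**
 * Sketch of the RabbitMQ QoS behaviour: basicQos limits how many
 * unacknowledged ("in flight") messages the broker will deliver to a
 * consumer at once, so un-ACKed URLs act as a natural back-pressure limit.
 */
public class FrontierConsumerSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker location
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        channel.queueDeclare("frontier", true, false, false, null); // hypothetical queue
        channel.basicQos(50); // at most 50 un-ACKed messages in flight

        QueueingConsumer consumer = new QueueingConsumer(channel);
        channel.basicConsume("frontier", false, consumer); // manual ACKs

        while (true) {
            QueueingConsumer.Delivery delivery = consumer.nextDelivery();
            String url = new String(delivery.getBody(), "UTF-8");
            // ... fetch the URL, honouring the crawl-delay for its host ...
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        }
    }
}
```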
