Zombse

The Zombie Stack Exchanges That Just Won't Die


How can I mass-archive Ajax-heavy websites that archive.org can't preserve?

These days, the Internet is becoming more and more dominated by Ajax, and Ajax-heavy sites are a pain for archivers to preserve.

E.g. answer pages on quora.com and the "expand more" button on Facebook. One can write scripts that click the "expand more" button repeatedly, but when there are 1,000 "expand more" buttons, it gets to the point where Firefox/Chrome simply can't handle all the RAM anymore.

InquilineKea

Comments

Answer by lechlukasz

You are taking the wrong approach by trying to preserve the user interface rather than the content. The two used to be the same in the early days of the Internet, but that is no longer the case.

First of all, what you describe is only the content visible to users. Bots are served the text content, to make the site searchable by Google. But that content is not the thing to preserve, because of heavy data duplication.

The correct approach is to archive the data dumps provided by the site (or, failing that, RSS feeds etc.). If a fully Ajax-driven site doesn't provide such dumps, it usually means its operators are interested only in financial gain and not in the value of the content generated by users, so you should avoid such sites.

Stack Exchange, for example, values user contributions: all content is under an open licence and data dumps are published on a regular basis.
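
As an aside, even when a site offers no dump, periodically saving its RSS or Atom feed captures the underlying content rather than the interface. A minimal Node.js sketch, assuming a hypothetical feed URL:

    // save-feed.js -- run with: node save-feed.js
    var http = require('http');
    var fs = require('fs');

    var feedUrl = 'http://example.com/feed.rss';   // placeholder feed URL

    http.get(feedUrl, function (res) {
        // Write the raw feed to a timestamped file for later ingest.
        res.pipe(fs.createWriteStream('feed-' + Date.now() + '.xml'));
    });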

Comments

Answer by Andy Jackson

In general, the best option I am aware of is to use something like PhantomJS to view the website, and use that to determine all the URLs the rendering process depends upon, which can then be archived by using something like wget. This is what my flashfreeze prototype does.
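
For illustration, here is a minimal PhantomJS sketch (not the actual flashfreeze code) that loads a page and prints every URL the rendering process requests, so the list can be handed to wget; the target URL is a placeholder:

    // list-resources.js -- run as: phantomjs list-resources.js <url>
    var page = require('webpage').create();
    var system = require('system');
    var url = system.args[1] || 'http://example.com/';   // placeholder URL

    // Log every resource the page requests while rendering (CSS, scripts,
    // images, Ajax calls, ...) so they can be fetched later with wget.
    page.onResourceRequested = function (requestData) {
        console.log(requestData.url);
    };

    page.open(url, function (status) {
        // Allow a few seconds for late Ajax requests before exiting.
        setTimeout(function () {
            phantom.exit(status === 'success' ? 0 : 1);
        }, 5000);
    });

Something like phantomjs list-resources.js http://example.com/ | sort -u > urls.txt followed by wget -i urls.txt would then give a crude capture of those dependencies.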

However, in your specific example, you would need to go further and explicitly script PhantomJS so that it repeatedly simulates mouse clicks on all the 'expand more' buttons until it runs out of buttons.
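
A sketch of that kind of script, where the '.expand-more' selector and the URL are stand-ins for whatever the real site uses:

    // expand-all.js -- run as: phantomjs expand-all.js
    var page = require('webpage').create();
    var passes = 0;
    var maxPasses = 1000;   // safety limit so the script always terminates

    function expandMore() {
        // Simulate a click on every matching button currently in the DOM
        // and report how many were found.
        var found = page.evaluate(function () {
            var buttons = document.querySelectorAll('.expand-more');
            for (var i = 0; i < buttons.length; i++) {
                var ev = document.createEvent('MouseEvents');
                ev.initMouseEvent('click', true, true, window, 1, 0, 0, 0, 0,
                                  false, false, false, false, 0, null);
                buttons[i].dispatchEvent(ev);
            }
            return buttons.length;
        });
        passes += 1;
        if (found > 0 && passes < maxPasses) {
            // Give the Ajax responses time to render, then look again.
            setTimeout(expandMore, 2000);
        } else {
            phantom.exit(0);
        }
    }

    page.open('http://example.com/long-thread', function (status) {
        if (status !== 'success') { phantom.exit(1); }
        expandMore();
    });

Note that a very long page will still grow the DOM, so memory use remains a concern even in a headless browser.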

Comments

Answer by Chris Adams

First, you'll have to accept that this process is going to be even more volatile than traditional web archiving and factor that into your decisions regarding quality review, access, etc. One alternative tactic might be contacting the site owner to see whether they provide other ways for robots to crawl their content.

The key part is using a real browser engine – i.e. something which executes JavaScript and follows the expected browser behaviour for the DOM, Ajax, etc. – and budgeting time for some scripting in situations where content isn't loaded except in response to user action, where logins are required to get around quotas for unregistered users (e.g. Quora), etc.

One approach might be to use a capturing proxy like liveweb, which records a browser's activity into a WARC file; this allows you to record from any browser or application which can be configured to use a proxy.
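
For instance, assuming the capturing proxy is listening on 127.0.0.1:8080 (an assumption; check your liveweb configuration), a scripted PhantomJS visit can be routed through it so that every request it triggers ends up in the WARC file:

    // visit.js -- run as: phantomjs --proxy=127.0.0.1:8080 visit.js <url>
    var page = require('webpage').create();
    var url = require('system').args[1] || 'http://example.com/';   // placeholder

    page.open(url, function (status) {
        // Leave a grace period so late Ajax requests are also recorded by
        // the proxy before the process exits.
        setTimeout(function () {
            phantom.exit(status === 'success' ? 0 : 1);
        }, 5000);
    });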

I would also second Andy's suggestion of considering a WebKit-based system like PhantomJS, perhaps in conjunction with a toolkit like pjscrape or casperjs. The web QA community has also developed a number of tools around Selenium, which is harder to get started with than PhantomJS but does have the advantage of supporting other browsers if you find yourself needing to archive something like a site which only works in Internet Explorer.
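
If you go the Selenium route, a minimal sketch using the selenium-webdriver Node bindings might look like the following; the URL and the button selector are placeholders, and switching forBrowser('firefox') to another browser name is all it takes to drive Chrome, Internet Explorer, etc.:

    // expand-with-selenium.js -- run with: node expand-with-selenium.js
    // (requires: npm install selenium-webdriver, plus the browser's driver)
    const { Builder, By } = require('selenium-webdriver');

    (async function expandAndCapture() {
        const driver = await new Builder().forBrowser('firefox').build();
        try {
            await driver.get('http://example.com/long-thread');   // placeholder
            // Keep clicking "expand more" buttons until none are left,
            // pausing briefly so each Ajax response can arrive.
            let buttons = await driver.findElements(By.css('.expand-more'));
            while (buttons.length > 0) {
                await buttons[0].click();
                await driver.sleep(1000);
                buttons = await driver.findElements(By.css('.expand-more'));
            }
        } finally {
            await driver.quit();
        }
    })();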

Comments