Zombse

The Zombie Stack Exchanges That Just Won't Die


Are there any libraries that are trying to preserve information in Ajax-heavy websites that archive.org can't preserve?

Archive.org used to be far more useful, but there is now a major problem: it cannot archive the many Ajax-heavy websites that make up so much of today's Internet.

InquilineKea

Comments

Answer by dsalo

The short answer is "yes, some are trying... but it's a remarkably difficult problem."

See David Rosenthal's "Harvesting and Preserving the Future Web" for a brief discussion, and also The Signal blog post "The Web Archive in Today's World".

Comments

Answer by Ed Summers

While not directly related to digital preservation, there is some evidence that GoogleBot has been executing JavaScript for some time now.

At the 2012 International Internet Preservation Consortium meeting there was some talk about the Internet Archive using PhantomJS as part of its Quality Assurance processes. But so far I haven't heard any news about the Internet Archive (or other web archiving outfits) integrating a headless browser into their archiving process.

I have been seeing open-source projects like pjscrape that let you crawl the web while executing JavaScript, and give you access to the DOM with jQuery. It might be interesting/useful if these tools allowed you to serialize the DOM as a WARC file for use in web archiving contexts...
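
As a rough sketch of what that could look like: the snippet below renders a page in a headless browser, waits for its JavaScript to settle, and writes the resulting DOM into a WARC response record. Playwright and warcio are my own choices for illustration, not anything pjscrape or the Internet Archive actually uses; a PhantomJS-based pipeline would have the same shape.

```python
from io import BytesIO

from playwright.sync_api import sync_playwright   # headless browser (assumption)
from warcio.warcwriter import WARCWriter          # WARC serialization (assumption)
from warcio.statusandheaders import StatusAndHeaders


def archive_rendered_dom(url, warc_path):
    """Render a page with its JavaScript executed, then store the
    serialized DOM as a WARC 'response' record."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let Ajax requests settle
        html = page.content()                     # the DOM *after* scripts ran
        browser.close()

    with open(warc_path, "wb") as fh:
        writer = WARCWriter(fh, gzip=True)
        http_headers = StatusAndHeaders(
            "200 OK",
            [("Content-Type", "text/html; charset=utf-8")],
            protocol="HTTP/1.1",
        )
        record = writer.create_warc_record(
            url,
            "response",
            payload=BytesIO(html.encode("utf-8")),
            http_headers=http_headers,
        )
        writer.write_record(record)


archive_rendered_dom("https://example.com/", "rendered.warc.gz")
```

Note the trade-off: this captures the rendered state rather than the original server responses, so a replay tool would see a flattened page, not a re-playable interaction.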

Comments

Answer by Denis Petrov

Archive.is can store AJAX pages, such as Twitter's or Google Maps'.

But there are some limitations:

- It saves only a snapshot of the AJAX-generated page. Interactive elements are lost after saving, because they need the original AJAX server to interact with.

- It uses WebKit to execute JavaScript and then renders WebKit's state back to HTML and CSS (see the sketch after this list). As a result, the layout of a saved page may be broken in non-WebKit-based browsers. In most cases it is fine in Firefox and Opera, but complex saved pages may look broken in MSIE.

- Performance is not good at the moment, due to the heavy CPU load of the browser engine: only 3-4 pages per second (20 Mbit/s of incoming traffic) per server.
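
Archive.is's internals aren't public, so the following is only a sketch of the general render-then-flatten idea, assuming Playwright for Python and its WebKit engine (both my choices, not anything Archive.is has confirmed): load the page, then bake each element's computed style into an inline style attribute so the snapshot no longer depends on the original stylesheets. Engine-specific computed values get baked in too, which is one reason such snapshots can render oddly in other browsers.

```python
from playwright.sync_api import sync_playwright  # assumption: Playwright for Python

# Run inside the page: copy every element's computed style into an inline
# style attribute. The snapshot then renders without the original CSS, but
# it inherits whatever engine-specific values WebKit computed.
INLINE_STYLES = """() => {
  for (const el of document.querySelectorAll('*')) {
    const cs = getComputedStyle(el);
    let text = '';
    for (let i = 0; i < cs.length; i++) {
      text += cs[i] + ':' + cs.getPropertyValue(cs[i]) + ';';
    }
    el.setAttribute('style', text);
  }
}"""


def flatten_page(url):
    """Return a static, self-styled HTML snapshot of the rendered page."""
    with sync_playwright() as p:
        browser = p.webkit.launch()  # WebKit, as the answer describes
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        page.evaluate(INLINE_STYLES)
        html = page.content()
        browser.close()
    return html


print(flatten_page("https://example.com/")[:200])
```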

Comments

Answer by Andy Jackson

There is a newly released tool called fantomas that attempts to simulate user events to ensure more page dependencies are caught. It uses the PhantomJS WebKit engine under the hood, but also walks the DOM to fire the events it finds, and even does some simulated (random) user clicking. When finished, it outputs links to all the resources it found, which can then be passed to a web archiving tool.
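
I don't have fantomas code to hand, but the event-firing idea can be sketched. The snippet below uses Playwright for Python as a stand-in for PhantomJS (my substitution, not what fantomas uses): it dispatches click events across the DOM while recording every URL the page requests, producing the kind of resource list that could be handed to a crawler.

```python
from playwright.sync_api import sync_playwright  # assumption: stand-in for PhantomJS


def discover_resources(url):
    """Fire click events across the page and collect every resource URL
    it requests, mimicking the approach described above."""
    found = set()
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("request", lambda req: found.add(req.url))  # log each fetch
        page.goto(url, wait_until="networkidle")
        # Dispatch clicks on elements that commonly hide Ajax-triggering handlers.
        for el in page.query_selector_all("a, button, [onclick]"):
            try:
                el.dispatch_event("click")
            except Exception:
                pass  # elements detached by earlier clicks are fine to skip
        page.wait_for_timeout(2000)  # give any triggered Ajax calls time to land
        browser.close()
    return sorted(found)


for resource in discover_resources("https://example.com/"):
    print(resource)
```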

Comments