Zombse

The Zombie Stack Exchanges That Just Won't Die


Converting BagIt files to WARC

In Italy, as part of an ongoing digital legal deposit experiment, national libraries are receiving BagIt files from publishers (usually harvested from their own FTP or HTTP sites).

The idea is to convert these bags into WARC files and store them in a common repository of web archives (accessed with the Wayback Machine).

How do you suggest this would be best managed?

The ideal solution would be a RESTful bag server that exposes the bag content, which we could then crawl with Heritrix 3.

Until an implementation of the bag server exists, we could unpack the bag on a static web server (with a stable URL), give it some namespace, and crawl it later.

Another idea would be to read the entries of the bag and write a WARC file directly with warc, a Python library for working with WARC files.
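To make this second idea concrete, here is a minimal sketch that walks the payload directory of an unpacked bag and writes each file as a WARC/1.0 `resource` record. It is an illustration under assumptions, not a tested production converter: it uses only the Python standard library instead of the warc library mentioned above, the function names `resource_record` and `bag_to_warc` are hypothetical, the MIME type is guessed from the file extension, and the `base_uri` prefix is whatever namespace you decide to give the bag.

```python
import mimetypes
import uuid
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import quote


def resource_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialise one WARC/1.0 'resource' record (no HTTP headers involved)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {now}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A record ends with two CRLFs after the content block.
    return head + payload + b"\r\n\r\n"


def bag_to_warc(bag_dir: str, base_uri: str, warc_path: str) -> int:
    """Write every file under <bag_dir>/data as a resource record.

    Returns the number of records written. base_uri is an assumed
    namespace chosen by the archive, e.g. "http://example.org/bag1".
    """
    data_dir = Path(bag_dir) / "data"
    count = 0
    with open(warc_path, "wb") as out:
        for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
            rel = path.relative_to(data_dir).as_posix()
            # Extension-based guess only; this can be wrong for some formats.
            guessed, _ = mimetypes.guess_type(path.name)
            out.write(resource_record(
                base_uri.rstrip("/") + "/" + quote(rel),
                path.read_bytes(),
                guessed or "application/octet-stream",
            ))
            count += 1
    return count
```

The sketch takes the target URI prefix as a parameter and does not settle whether it should be an HTTP URL or a URN. It also writes an uncompressed WARC and omits the `warcinfo` record and payload digests that a real workflow would probably want.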

According to the WARC specification, a resource can be identified by any URI. Is it safe to give it a URL, or is it better to use something like a URN?

Furthermore, do you know whether Wayback can handle resources whose identifiers are not URLs?

raffaele messuti

Comments

Answer by Morgan Johnson

Scalable WARC manipulation API, "WSDK": http://webarchivingbucket.com/

Comments

Answer by Andy Jackson

I'm not convinced that bundling all sorts of content into WARC files is a terribly good idea. We do this when the resources are on the web but, for some reason, we need to transfer the content directly as files rather than download it all from the site (e.g. large content, or servers that are difficult to crawl). However, even in this case, migration to WARC is awkward and error-prone, because to rebuild the HTTP playback you need to accurately mock up the server headers. Little things, like getting the MIME type right for formats that are hard to identify automatically, such as CSS and JavaScript, cause us a lot of headaches.
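As a rough illustration of where that fragility comes from, the sketch below builds the kind of synthetic HTTP response that would go into the content block of a WARC `response` record (whose WARC-level Content-Type is `application/http; msgtype=response`). The function name `mock_http_response` is hypothetical, and the MIME type is simply guessed from the file extension, which is an assumption that will not always hold. Everything the original server would have said has to be invented here.

```python
import mimetypes


def mock_http_response(filename: str, payload: bytes) -> bytes:
    """Build a made-up HTTP/1.1 200 response for a file that was never fetched.

    Only Content-Type and Content-Length are reconstructed; any other
    headers the real server sent (dates, encodings, redirects) are lost.
    """
    # Extension-based guess; a missing or misleading extension gives a
    # wrong or generic type, which then affects playback.
    guessed, _ = mimetypes.guess_type(filename)
    content_type = guessed or "application/octet-stream"
    head = (
        "HTTP/1.1 200 OK\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + payload
```

A stylesheet saved without a `.css` extension, for example, would fall through to `application/octet-stream` here, and a browser replaying the archived page would most likely refuse to apply it.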

Therefore, the idea of taking files that were never on the web and mocking up the entire thing makes me very nervous! I would not be comfortable discarding the BagIt SIP, because you would be taking the files out and discarding any filesystem metadata, the original inventory, etc. Your QA procedure will also need to be very tight, up to and including verifying playback via Wayback. I'm pretty sure that, by default, Wayback won't support arbitrary URNs, so you'll have a lot of setup to do.

What's wrong with just keeping the bag anyway? Can't this all be tested out for a while as an access layer first? That is, transform the bag content on access but keep the SIP as your AIP for now?
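If the bag is kept as the SIP, its manifest remains available as the original inventory, and one small QA step could be to check the payload against it before any transformation on access. The following is a minimal sketch, assuming an unpacked bag with a `manifest-sha256.txt` at its root in the usual BagIt layout (one `checksum path` pair per line). The function name `verify_manifest` is hypothetical, and the sketch does not cover tag manifests, percent-encoded paths, or files present on disk but missing from the manifest.

```python
import hashlib
from pathlib import Path


def verify_manifest(bag_dir: str, algorithm: str = "sha256") -> list:
    """Return the manifest paths that are missing or whose checksum differs."""
    bag = Path(bag_dir)
    manifest = bag / f"manifest-{algorithm}.txt"
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each line is "<hex digest> <path relative to the bag root>".
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        target = bag / rel_path
        if not target.is_file():
            failures.append(rel_path)
            continue
        actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if actual != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty list means every file listed in the manifest is present and matches its recorded digest. It says nothing about whether playback through Wayback works, which would still need to be checked separately.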

Comments