Zombse

The Zombie Stack Exchanges That Just Won't Die


Characterization of WARC files contents?

I am looking for a tool that, given a WARC file, produces and presents a characterization of the contents of that archive.

The goal is to have a quick broad characterization of a collection of web resources.

ssn

Comments

Answer by raffaele messuti

I bookmarked these some months ago, but I've not checked recent updates on their development:

Comments

Answer by Peter Cliff

You might want to take a look at some of the UK Web Archive code. There are examples of converting a WARC to a ZIP and of mounting one as a file system using FUSE-J, but these are not used at the moment (the former because the WARC-to-ZIP conversion is not finished, the latter because it is too slow for our purposes). Our current preference is to run the WARC characterisation on Hadoop, so we use code specific to that platform to dissect the WARC. We also started writing extensions to Apache Tika so that it can parse WARC files, but these have not yet been polished up and made easily available.

However, if you can unpack the WARC, you can use any characterisation tool to investigate its content, such as C3PO or one of the Tika wrappers from SPRUCE, or roll your own using FITS or stand-alone Apache Tika.
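As a rough illustration of the "roll your own" route, here is a minimal Python sketch that walks the records of a WARC file and tallies the media type declared in each HTTP response. It is not taken from the UK Web Archive code or from any of the tools named above; the function names are made up for this example, it assumes well-formed WARC/1.0 records with CRLF line endings, and it only reports the Content-Type the server declared, whereas a tool such as Tika or FITS would inspect the bytes themselves.

```python
import gzip
from collections import Counter


def iter_warc_records(stream):
    """Yield (headers, block) for each record in a decompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def http_content_type(block):
    """Return the media type declared in an HTTP response block, or 'unknown'."""
    http_head = block.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    for line in http_head.split("\r\n")[1:]:  # skip the HTTP status line
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-type":
            return value.split(";")[0].strip().lower()
    return "unknown"


def characterize_stream(stream):
    """Count declared media types over all 'response' records in the stream."""
    counts = Counter()
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") == "response":
            counts[http_content_type(block)] += 1
    return counts


def characterize_file(path):
    """Open a plain or gzip-compressed WARC file and tally its responses."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return characterize_stream(stream)
```

Calling `characterize_file("example.warc.gz")` (a placeholder file name) would return a `Counter` mapping media types to record counts. To go beyond declared types, the loop in `characterize_stream` is the place where each payload could be handed to an identification tool instead.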

Comments

Answer by raffaele messuti

How unsound would it be to characterize the content payload while it is being downloaded?

Sure, it would be a slow process, but with wget-lua it should not be difficult to write some rules that spawn an external process (FITS, JHOVE, whatever) and save the result back inside the WARC.

The output should be saved in a resource record, under a namespace or a self-defined URI.
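To make the last step concrete, here is a minimal Python sketch of how such a resource record might be serialized. It does not use wget-lua and does not call any characterization tool; it simply takes the tool's output as bytes. The function name and the `urn:x-characterization:` URI prefix in the usage note are hypothetical choices for this example, and `WARC-Concurrent-To` is included only as one possible way to point back at the record that was characterized.

```python
import uuid
from datetime import datetime, timezone


def build_resource_record(target_uri, payload, content_type, concurrent_to=None):
    """Serialize one WARC/1.0 'resource' record as bytes.

    target_uri    -- the self-defined URI under which the output is stored
    payload       -- the characterization output, as bytes
    content_type  -- media type of the payload, e.g. 'application/xml'
    concurrent_to -- optional WARC-Record-ID of the record that was characterized
    """
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    if concurrent_to:
        headers.append(("WARC-Concurrent-To", concurrent_to))
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % h for h in headers) + "\r\n"
    # A record is its header block, the payload, and two trailing CRLFs.
    return head.encode("utf-8") + payload + b"\r\n\r\n"
```

For example, `build_resource_record("urn:x-characterization:fits:http://example.org/", fits_xml, "application/xml")` would produce bytes that can be appended to an uncompressed WARC file, where `fits_xml` stands for whatever the external process wrote to its output.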

Comments

Answer by Nicholas Clarke

Those 2 links to sbforge.org are not exactly up to date yet (unfortunately).

The JHove2 WARC(/ARC/GZip) modules were part of a one year project funded by the IIPC.

The WARC module which Andy linked to should be production ready. Only the minor details remain: adding some unit tests and merging with the main JHove2 repository.

The latest binary is available here: https://bitbucket.org/nclarkekb/jhove2-iipc/downloads

The following link should include some documentation on configuring JHove2 with WARC on your system: https://sbforge.org/display/NAS/JHove2+modules+-+configuration

To be clear, the JHove2 WARC modules offer the characterization you are looking for, but only to the extent that there are JHove2 modules to cover the formats. So JHove2 will identify and characterize your WARC files and their content, but only for the formats which are supported.

As regards JWAT-Tools, those are more lightweight tools that validate only the WARC(/ARC/GZip) files themselves, not their content. The payload is validated only insofar as digests and other headers present in the WARC header are checked against it.
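To illustrate what checking a digest header against a record involves, here is a small Python sketch. It is not JWAT code and says nothing about how JWAT-Tools does this internally. It assumes the digest is a labelled value of the form `algorithm:value`, as found in headers such as WARC-Block-Digest and WARC-Payload-Digest, and it accepts either Base32 or hexadecimal encoding because both appear in practice. The function name and the list of supported algorithms are choices made for this example.

```python
import base64
import hashlib

# Algorithms this sketch knows how to recompute (an arbitrary short list).
SUPPORTED = {"md5", "sha1", "sha256", "sha512"}


def check_digest(labelled_digest, data):
    """Compare a labelled digest such as 'sha1:ABC...' with the digest of data.

    Returns True or False when the algorithm is supported, and None when it
    is not, so that callers can tell 'mismatch' apart from 'cannot check'.
    """
    algorithm, _, expected = labelled_digest.partition(":")
    algorithm = algorithm.strip().lower().replace("-", "")
    if algorithm not in SUPPORTED:
        return None
    digest = hashlib.new(algorithm, data)
    candidates = {
        base64.b32encode(digest.digest()).decode("ascii"),  # Base32 form
        digest.hexdigest().upper(),                         # hexadecimal form
    }
    return expected.strip().upper() in candidates
```

A block digest would be checked against the whole record block, and a payload digest against the payload only (for an HTTP response, the bytes after the HTTP headers).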

If you have any other questions I will be happy to clarify the WARC work I have done for the IIPC.

Comments