Zombse

The Zombie Stack Exchanges That Just Won't Die

View the Project on GitHub anjackson/zombse

What are some ways to automatically generate descriptive metadata for WARCs?

What are some ways to automatically generate descriptive metadata for WARCs, or what are the best tools for parsing WARCs?

I'm looking to generate as much descriptive metadata (Dublin Core) as possible for a given crawl, to then be ingested into a repository.

I've come across the Internet Archive's warc Python library, and warc-tools, another Python library.

warc looks like it can put out a fair bit of what could be used as descriptive metadata. But what about parsing actual HTML tags (e.g., <title>foo</title>)?
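For pulling a field like <title> out of the HTML itself, the standard library's html.parser is enough for a first pass. A minimal sketch (the TitleParser class and extract_title helper are my own names, not part of the warc library) that could be applied to the decoded payload of each response record:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Capture the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and self.title is None:
            self.title = data.strip()

def extract_title(html):
    """Return the contents of the first <title> tag, or None if absent."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title

print(extract_title("<html><head><title>foo</title></head></html>"))  # foo
```

A value like this could then be mapped onto dc:title when building the repository record.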

ruebot

Comments

Answer by raffaele messuti

Refer to this discussion on library.stackexchange: Characterization of WARC files contents?

Comments

Answer by raffaele messuti

I played a bit with warc. With the following Python script (it's quick and dirty) you can analyse all response records with Tika and save the JSON output in a directory (files named record-uuid.json).

For HTML content the result is good, but images are recognized as application/octet-stream. I guess that's because record.payload also includes the HTTP headers.

import warc
import subprocess
import sys

if len(sys.argv) < 2:
    sys.exit('Usage: %s warcfile' % sys.argv[0])

warcfile = sys.argv[1]

f = warc.open(warcfile)
for record in f:
    if record.header.type == "response":
        # Record IDs look like <urn:uuid:...>; strip the scheme and trailing '>'
        uuid = record.header.record_id.split(":")[2][:-1]

        process = subprocess.Popen(["tika", "-m", "-j"],
                                   stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        # communicate() feeds the payload and drains stdout in one step,
        # avoiding the pipe deadlock a manual write/read can cause
        metadata, _ = process.communicate(record.payload.read())

        with open("metadata/{}.json".format(uuid), "wb") as out:
            out.write(metadata)
        print(uuid)
f.close()
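Since record.payload is the full HTTP response, headers included, Tika ends up sniffing the header text rather than the body, which would explain the application/octet-stream results for images. A sketch of one possible fix (the strip_http_headers helper is mine): split the payload on the blank line that terminates the HTTP header block, and pipe only the body to Tika.

```python
def strip_http_headers(payload):
    """Split a raw HTTP response into (headers, body).

    HTTP separates the header block from the body with a blank line
    (CRLF CRLF). If no separator is found, treat the whole payload
    as the body.
    """
    headers, sep, body = payload.partition(b"\r\n\r\n")
    if not sep:
        return b"", payload
    return headers, body

raw = b"HTTP/1.1 200 OK\r\nContent-Type: image/png\r\n\r\n\x89PNG..."
headers, body = strip_http_headers(raw)
# body now starts with the PNG magic bytes, so Tika can sniff the real type
```

In the script above, this would mean passing strip_http_headers(record.payload.read())[1] to communicate() instead of the raw payload.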

Comments