Zombse

The Zombie Stack Exchanges That Just Won't Die


What data format and tools exist for archiving tweets and similar microblogging?

I wonder whether there is a data format for archiving messages from Twitter and similar microblogging sites, including metadata. The Library of Congress receives all messages from Twitter, but I found no information on how they store them or how they would make them available to researchers. There are several tools to back up your tweets, but in most cases you don't get a download that includes all the data. One can also access the Twitter API, but then everyone stores their archive in a different format, so there is no easy way to exchange and combine archived messages.

On the other hand, there is interest in analyzing microblogging, and tweets are analyzed by companies and researchers. But are there open, standard tools to collect, archive, and analyze tweets? And which collections of archived microblogs exist in addition to the LoC's? I don't want to propagate "MARC for Twitter", but I'd like at least something more precise than "just use CSV/JSON". Relying on the custom format of one particular provider (Twitter) at one particular point in time does not look like a reliable solution for long-term archiving.

P.S.: Ed Summers gave a brief overview of the format used by the LoC for archiving Twitter. There are some open questions, such as how to use the format for other services (e.g. Google+) and for particular selections of messages (documented somehow in bag-info.txt). I'd like to see tools to create your own archives in the same format - for instance, all tweets by some user accounts, or all tweets with a given hashtag like #VenusTransit in a given time range - and tools to read and analyze these archive files. There are several closed web applications that do so, but they don't provide import/export in such a defined standard format, do they?
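
To make that concrete, here is a rough sketch of the "create your own archive" half, mirroring the bag layout described in Ed's answer below. It assumes the Python bagit library and a list of tweets already fetched from the Twitter API by whatever means; the function, directory and label names are made up for illustration.

    import gzip
    import json
    import os

    import bagit  # BagIt library from the Library of Congress (pip install bagit)

    def bag_tweets(tweets, bag_dir, label):
        """Write tweets (dicts as returned by the Twitter API) as gzipped,
        line-oriented JSON, then turn the directory into a BagIt bag."""
        os.makedirs(bag_dir, exist_ok=True)
        payload = os.path.join(bag_dir, label + "_tweets.gz")
        with gzip.open(payload, "wt") as f:
            for tweet in tweets:
                f.write(json.dumps(tweet) + "\n")
        # make_bag moves the payload into data/ and writes bagit.txt,
        # bag-info.txt and the checksum manifests
        return bagit.make_bag(bag_dir, {"External-Description": label})

Called as, say, bag_tweets(results, "venustransit_bag", "20120605_venustransit"), this would produce the same general shape as the LoC example below: bagit.txt, bag-info.txt, the manifests and a data/ directory with one gzipped file of JSON tweets.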

Jakob

Comments

Answer by dr0i

I would use RDF and serialize it in Turtle or, better yet, JSON-LD.

Why? Tweets often contain links, hashtags and mentions; every tweet has a URI, e.g. https://twitter.com/jindrichmynarz/status/176326368701853696 ; and all in all it's about graphs: "social" graphs. Retweeting is " :_personX twitter:retweets :_tweetY ". Liking is linking (see https://headtoweb.posterous.com/liking-is-linking), and so on.
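
As a rough illustration of what that could look like in practice, here is a small sketch using the Python rdflib library. The twitter: vocabulary URI and the property names are invented for the example; no agreed ontology for tweets is assumed.

    from rdflib import Graph, Literal, Namespace, URIRef

    TW = Namespace("http://example.org/twitter-vocab#")  # hypothetical vocabulary

    g = Graph()
    g.bind("twitter", TW)

    # Example triples only; not the real content of this tweet.
    tweet = URIRef("https://twitter.com/jindrichmynarz/status/176326368701853696")
    author = URIRef("https://twitter.com/jindrichmynarz")
    someone = URIRef("https://twitter.com/someoneelse")  # made-up account

    g.add((author, TW.posted, tweet))
    g.add((tweet, TW.hashtag, Literal("#LinkedData")))
    g.add((someone, TW.retweets, tweet))
    g.add((someone, TW.likes, tweet))

    print(g.serialize(format="turtle"))

Serialized as Turtle, the same data is easy to read and to merge with other graphs; recent rdflib versions can also serialize to JSON-LD (format="json-ld").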

Comments

Answer by Ed Summers

I think some public information about how the Twitter data is being stored at LC could be helpful. One of the reasons for the lack thereof is probably that it's quite uninteresting (at the moment). In the interests of transparency I can provide some rudimentary information about how the data is archived, as I have been involved in the little bit of software development LC has done around the Twitter data to date. These remarks are not meant to be an official statement from the Library of Congress; they are merely the reflections of a software developer who has worked on the project.

LC currently receives the Twitter data from a third-party data provider, Gnip. Gnip packages up each hour's tweet and delete activity using BagIt; each bag is tarred up and made available on Amazon S3. The structure of an example bag looks like:

2012050105
|-- bag-info.txt
|-- bagit.txt
|-- data
|   |-- 2012050105_deletes.gz
|   `-- 2012050105_tweets.gz
|-- manifest-md5.txt
`-- manifest-sha1.txt

A simple custom Python/Django application at the Library of Congress periodically polls S3 for new bags to download. When it finds a new one it downloads the tar file, untars it, counts the number of tweets and deletes, verifies the bag, and uses an internal data transfer application to inventory and copy the bag to archival storage, after which the bag is deleted from the local filesystem. This process runs 24/7 in order to keep up.
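
To give a feel for what such a harvester involves, here is a hypothetical sketch of the same polling-and-verifying workflow. It is not LC's code: the bucket name, working directory and the hand-off to archival storage are placeholders, and it assumes the boto3 and bagit Python libraries plus the hourly file layout shown above.

    import gzip
    import os
    import tarfile

    import bagit   # BagIt creation/validation (pip install bagit)
    import boto3   # AWS S3 client (pip install boto3)

    BUCKET = "example-twitter-feed"   # placeholder, not the real bucket
    WORKDIR = "/var/tmp/twitter"

    def count_lines(path):
        """Each gzipped payload file is assumed to hold one JSON activity per line."""
        with gzip.open(path, "rt") as f:
            return sum(1 for _ in f)

    def process_new_bags():
        os.makedirs(WORKDIR, exist_ok=True)
        s3 = boto3.client("s3")
        # Bookkeeping of already-processed bags is omitted for brevity.
        for obj in s3.list_objects_v2(Bucket=BUCKET).get("Contents", []):
            key = obj["Key"]                       # e.g. "2012050105.tar"
            local_tar = os.path.join(WORKDIR, os.path.basename(key))
            s3.download_file(BUCKET, key, local_tar)

            with tarfile.open(local_tar) as tar:   # untar the hourly bag
                tar.extractall(WORKDIR)
            bag_dir = local_tar[:-len(".tar")]
            hour = os.path.basename(bag_dir)

            bagit.Bag(bag_dir).validate()          # checksums and completeness

            tweets = count_lines(os.path.join(bag_dir, "data", hour + "_tweets.gz"))
            deletes = count_lines(os.path.join(bag_dir, "data", hour + "_deletes.gz"))
            print(hour, tweets, "tweets,", deletes, "deletes")

            # inventory, copy to archival storage and local clean-up would go here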

The tweet and delete data look like the JSON that the Twitter API itself emits. We have made no effort (to date) to normalize the data using another format for several reasons:

  1. LC has received 94 billion tweets so far, and is currently receiving (on average) 13,750,000 tweets per hour...and the rate is still increasing. It would require a significant amount of computing power (more than current budgetary and policy constraints allow for) to re-process this data.
  2. When researchers are given access to the data it is likely that they will already be familiar with the Twitter JSON format and its documentation, so reformatting it into something else will likely get in the way of understanding.
  3. In the interests of the archival principle of original order it seems wise to archive the data that we receive, since it seems closest to capturing the essence and context of a tweet.
  4. Like you, I am not sure what format would be better than the tweet JSON, or why.
  5. JSON itself is unversioned, and according to its creator, will not change...ever. This is kinda nice, in theory, for preservation purposes.
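
One practical upside of leaving the data in the raw Twitter JSON is that working with it stays simple. A minimal sketch, assuming the hourly payload files are line-oriented JSON (one activity per line) and using field names from Twitter's own API documentation of the time:

    import gzip
    import json

    # Pull a few familiar fields out of one hour of raw tweet activity.
    with gzip.open("2012050105_tweets.gz", "rt") as f:
        for line in f:
            tweet = json.loads(line)
            hashtags = [h["text"] for h in tweet.get("entities", {}).get("hashtags", [])]
            print(tweet["created_at"], tweet["user"]["screen_name"], hashtags)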

I hope this brief on-the-ground description of the Twitter archiving activity provides some guidance, and doesn't do a disservice to any of the folks at LC that have been involved in the effort.

Comments

Answer by johan

Although your question is mostly about formats, the title suggests you're interested in tools as well. If so, maybe T, a command-line interface to the Twitter API, could be of some interest here:

http://sferik.github.com/t/

I'm not sure how useful this is for professional archiving, and I must say that I don't have any hands-on experience with the tool myself. However, I've spoken to some people who appear to be impressed with it (although they were mainly using it for personal backups). Possibly worth a look.

Comments