Zombse

The Zombie Stack Exchanges That Just Won't Die


What data format and tools exist for archiving tweets and similar microblogging?

I wonder whether there is a data format for archiving messages from Twitter and similar microblogging sites, including metadata. The Library of Congress receives all messages from Twitter, but I found no information on how they store them or how they would make them available to researchers. There are several tools to back up your tweets, but in most cases you don't get a download that includes all the data. One can also access the Twitter API, but then everyone stores their archive in a different format, so there is no easy way to exchange and combine archived messages.

On the other hand, there is interest in analyzing microblogging, and tweets are analyzed by companies and researchers. But are there open, standard tools to collect, archive, and analyze tweets? And which collections of archived microblogs exist in addition to the LoC's? I don't want to propagate "MARC for Twitter", but I'd like at least something more precise than "just use CSV/JSON". Relying on the custom format of one particular provider (Twitter) at one particular point in time does not look like a reliable solution for long-term archiving.

P.S.: Ed Summers gave a brief overview of the format used by the LoC for archiving Twitter. There are some open questions, such as how to use the format for other services (e.g. Google+) and for particular selections of messages (documented somehow in bag-info.txt). I'd like to see tools to create your own archives in the same format - for instance, all tweets by some user accounts, or all tweets with a given hashtag like #VenusTransit in a given time range - and tools to read and analyze these archive files. There are several closed web applications that do so, but they don't provide import/export in such a defined standard format, do they?
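
To make that concrete, here is a rough sketch of the "create your own archive" half, mirroring the bag layout described in Ed's answer below. It assumes the Python bagit library and a list of tweets already fetched from the Twitter API by whatever means; the function, directory and label names are made up for illustration.

    import gzip
    import json
    import os

    import bagit  # BagIt library from the Library of Congress (pip install bagit)

    def bag_tweets(tweets, bag_dir, label):
        """Write tweets (dicts as returned by the Twitter API) as gzipped,
        line-oriented JSON, then turn the directory into a BagIt bag."""
        os.makedirs(bag_dir, exist_ok=True)
        payload = os.path.join(bag_dir, label + "_tweets.gz")
        with gzip.open(payload, "wt") as f:
            for tweet in tweets:
                f.write(json.dumps(tweet) + "\n")
        # make_bag moves the payload into data/ and writes bagit.txt,
        # bag-info.txt and the checksum manifests
        return bagit.make_bag(bag_dir, {"External-Description": label})

Called as, say, bag_tweets(results, "venustransit_bag", "20120605_venustransit"), this would produce the same general shape as the LoC example below: bagit.txt, bag-info.txt, the manifests and a data/ directory with one gzipped file of JSON tweets.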

Jakob

Comments

Answer by dr0i

I would use RDF and serialize it in Turtle or, better yet, JSON-LD.

Why? Tweets often contain links, hashtags and mentions; every tweet has a URI, e.g. https://twitter.com/jindrichmynarz/status/176326368701853696 ; and all in all it's about graphs: "social" graphs. Retweeting is " :_personX twitter:retweets :_tweetY ". Liking is linking (see https://headtoweb.posterous.com/liking-is-linking), and so on.
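
As a rough illustration of what that could look like in practice, here is a small sketch using the Python rdflib library. The twitter: vocabulary URI and the property names are invented for the example; no agreed ontology for tweets is assumed.

    from rdflib import Graph, Literal, Namespace, URIRef

    TW = Namespace("http://example.org/twitter-vocab#")  # hypothetical vocabulary

    g = Graph()
    g.bind("twitter", TW)

    # Example triples only; not the real content of this tweet.
    tweet = URIRef("https://twitter.com/jindrichmynarz/status/176326368701853696")
    author = URIRef("https://twitter.com/jindrichmynarz")
    someone = URIRef("https://twitter.com/someoneelse")  # made-up account

    g.add((author, TW.posted, tweet))
    g.add((tweet, TW.hashtag, Literal("#LinkedData")))
    g.add((someone, TW.retweets, tweet))
    g.add((someone, TW.likes, tweet))

    print(g.serialize(format="turtle"))

Serialized as Turtle, the same data is easy to read and to merge with other graphs; recent rdflib versions can also serialize to JSON-LD (format="json-ld").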

Comments

Answer by Ed Summers

I think some public information about how the Twitter data is being stored at LC could be helpful. One of the reasons for the lack thereof is probably that it's quite uninteresting (at the moment). In the interests of transparency I can provide some rudimentary information about how the data is archived, as I have been involved in the little bit of software development LC has done around the Twitter data to date. These remarks are not meant to be an official statement from the Library of Congress; they are merely the reflections of a software developer who has worked on the project.

LC currently receives the Twitter data from a third-party data provider, Gnip. Gnip packages up each hour's tweet and delete activity using BagIt; each bag is tarred up and made available on Amazon S3. The structure of an example bag looks like:

2012050105
|-- bag-info.txt
|-- bagit.txt
|-- data
|   |-- 2012050105_deletes.gz
|   `-- 2012050105_tweets.gz
|-- manifest-md5.txt
`-- manifest-sha1.txt

A simple custom Python/Django application at the Library of Congress periodically polls S3 for new bags to download. When it finds a new one it downloads the tar file, untars it, counts the number of tweets and deletes, verifies the bag, and uses an internal data transfer application to inventory and copy the bag to archival storage, after which the bag is deleted from the local filesystem. This process runs 24/7 in order to keep up.
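
To give a feel for what such a harvester involves, here is a hypothetical sketch of the same polling-and-verifying workflow. It is not LC's code: the bucket name, working directory and the hand-off to archival storage are placeholders, and it assumes the boto3 and bagit Python libraries plus the hourly file layout shown above.

    import gzip
    import os
    import tarfile

    import bagit   # BagIt creation/validation (pip install bagit)
    import boto3   # AWS S3 client (pip install boto3)

    BUCKET = "example-twitter-feed"   # placeholder, not the real bucket
    WORKDIR = "/var/tmp/twitter"

    def count_lines(path):
        """Each gzipped payload file is assumed to hold one JSON activity per line."""
        with gzip.open(path, "rt") as f:
            return sum(1 for _ in f)

    def process_new_bags():
        os.makedirs(WORKDIR, exist_ok=True)
        s3 = boto3.client("s3")
        # Bookkeeping of already-processed bags is omitted for brevity.
        for obj in s3.list_objects_v2(Bucket=BUCKET).get("Contents", []):
            key = obj["Key"]                       # e.g. "2012050105.tar"
            local_tar = os.path.join(WORKDIR, os.path.basename(key))
            s3.download_file(BUCKET, key, local_tar)

            with tarfile.open(local_tar) as tar:   # untar the hourly bag
                tar.extractall(WORKDIR)
            bag_dir = local_tar[:-len(".tar")]
            hour = os.path.basename(bag_dir)

            bagit.Bag(bag_dir).validate()          # checksums and completeness

            tweets = count_lines(os.path.join(bag_dir, "data", hour + "_tweets.gz"))
            deletes = count_lines(os.path.join(bag_dir, "data", hour + "_deletes.gz"))
            print(hour, tweets, "tweets,", deletes, "deletes")

            # inventory, copy to archival storage and local clean-up would go here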

The tweet and delete data look like the JSON that the Twitter API itself emits. We have made no effort (to date) to normalize the data using another format for several reasons:

  1. LC has received 94 billion tweets so far, and is currently receiving (on average) 13,750,000 tweets per hour...and the rate is still increasing. It would require a significant amount of computing power (more than current budgetary and policy constraints allow for) to re-process this data.
  2. When researchers are given access to the data it is likely that they will already be familiar with the Twitter JSON format and its documentation, so reformatting it into something else will likely get in the way of understanding.
  3. In the interests of the archival principle of original order it seems wise to archive the data that we receive, since it seems closest to capturing the essence and context of a tweet.
  4. Like you, I am not sure what format would be better than the tweet JSON, or why.
  5. JSON itself is unversioned, and according to its creator, will not change...ever. This is kinda nice, in theory, for preservation purposes.
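
One practical upside of leaving the data in the raw Twitter JSON is that working with it stays simple. A minimal sketch, assuming the hourly payload files are line-oriented JSON (one activity per line) and using field names from Twitter's own API documentation of the time:

    import gzip
    import json

    # Pull a few familiar fields out of one hour of raw tweet activity.
    with gzip.open("2012050105_tweets.gz", "rt") as f:
        for line in f:
            tweet = json.loads(line)
            hashtags = [h["text"] for h in tweet.get("entities", {}).get("hashtags", [])]
            print(tweet["created_at"], tweet["user"]["screen_name"], hashtags)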

I hope this brief on-the-ground description of the Twitter archiving activity provides some guidance, and doesn't do a disservice to any of the folks at LC that have been involved in the effort.

Comments

Answer by johan

Although your question is mostly about formats, the title suggests you're interested in tools as well. If so, maybe T, a command-line interface to the Twitter API, could be of some interest here:

http://sferik.github.com/t/

I'm not sure how useful this is for professional archiving, and I must say that I don't have any hands-on experience with the tool myself. However, I've spoken to some people who appear to be impressed with it (although they were mainly using it for personal backups). Possibly worth a look.

Comments