Zombse

The Zombie Stack Exchanges That Just Won't Die

View the Project on GitHub anjackson/zombse

What approaches are there for archiving and preserving databases and what tradeoffs are there between different approaches?

Is it ever a good idea to keep databases in their original format? Is it best to try and migrate data to newer kinds of databases? Or, is it a good idea to try and flatten the data as much as possible and make it into a CSV dump or something? I am curious to know what the trade offs are between different approaches and if there are any current standards or best practice documentation on the topic.

Trevor Owens

Comments

Answer by dsalo

SQL dumps are pretty common, I believe. You have to be careful which settings you set (as usual with anything complex), but SQL is handy-dandily text-based and can reconstruct the entire database in the old DBMS or in a new one -- and if there's a hitch on import, the SQL can be massaged until it works.

I've known people who put Access .mdb files in institutional repositories. I think this... unwise, but at least it's a fairly popular format.

Comments

Answer by Ed Summers

SQL is much preferrable to native data dumps. For example, Wikipedia use them quite a bit for making their mediawiki databases available to others. Of course you can create your own custom serialization in XML, JSON or what have you, but it can be tricky/tedious to preserve the relational aspect of the database: primary keys, foreign keys, etc.

RDF has some characteristics that make it fairly nice for serializing relational databases (identifiers cooked in, entity resource orientation), and there have been some attempts such as D2RQ for providing a RDF translation between RDBMS. This work informed the W3C RDB2RDF Working Group which has resulted in two working drafts that are headed towards recommendation status soonish: A Direct Mapping of Relational Data to RDF and R2RML: RDB to RDF Mapping Language. These may not be for the faint of heart, but it is encouraging to see people are working on serialization of relational databases in formats other than SQL. SQL is a standard yes, but some database vendors have their own helpful proprietary extensions which can make it a bit tricky at times.

Comments

Answer by Paul Wheatley

There are a number of database archiving initiatives, mainly focusing on migration/dump to XML, that came out of various European organisations or initiatives. In particular: SIARD (developed by the Swiss National Archives as part of Planets), RODA DBML from KEEPS in Portugal, and MIXED from DANS in the Netherlands. There's an OPF database archiving tools page with links to those and a few others.

Following an OPF hackathon focused on database archiving in early 2012, a database archiving group was formed. I believe there was some interest in bringing together the best elements of some of these solutions. The group has a useful wiki page that describes which organisations are using what tools, and which also lists database archiving formats.

Comments

Answer by nik

An additional approach to those mentioned above is to use emulation/virtualisation. One of the main benefits of this approach is that you end up with a complete system, often including the frontend. In not-so-brilliantly-designed applications this can contain fairly critical business logic that makes sense of the data in the database.

There's an interesting case study on OPF here where some MSSQL dbs were migrated onto virtualised hardware.

Comments