How can I sort large amounts of data?

I've spent the past year backing up old personal media. I've moved floppy disks, CDs, DVDs, hard drives, Zip disks and Jaz drives all to hard-drive storage. All of this data was collected by me over the past 10-15 years and currently amounts to roughly 20TB. (Yes, I am a modern-day digital hoarder: I collect almost everything I view online and have done so since I was about 12 years old.)

The problem is that I've never been particularly good at organising files and folders or keeping track of data. A key part of preserving data is being able to find it afterwards (otherwise it may as well be lost). So I'm looking for a method to sort through multi-terabyte archives of files so that I can then back them up at my discretion.

Issues I'm currently facing:

Wide variety of file types, Audio/Visual/documents/databases/programs/etc
Large amount of data (Approx 20TB)
No sorting previously done on files
Multiple duplicates of files
Multiple versions of duplicate files
Some sets of files (e.g. databases) span more than 3-4 TB, which means they are split across multiple drives; these drives get mixed up and files get 'lost'

What I have tried:

De-duping files using various programs
- To no avail: there are multiple versions of similar files with different dates/checksums. It also takes an inordinate amount of time to scan and compare every file.
Sorting some files manually
- Is taking too long, especially when reviewing individual files (documents and databases) whose contents have changed
- Have had limited success with visual sorting (images and videos don't change much and de-duping works)

What I want to do:

How can I minimize the amount of space that my files are taking up currently?
How can I organize the files I have into some kind of structure that's logical (to a human)?
How can I index and search quickly through a data set of this size?
Is it possible to automatically generate 'meta-data' for things like documents, for indexing or similar? I mean metadata about the contents of the files rather than the existing file metadata.

If I need to create some kind of 'baseline' that takes several days to accomplish, that's acceptable if it produces usable results.

Hopefully someone has some good ideas. Thanks!

D3C4FF

storage
digital-archivist

Comments

lechlukasz: I think your question is too broad and too interdisciplinary. The part about organising could go to library SE, while the indexing and finding duplicates, once split by particular content type, would fit on StackOverflow and SuperUser.
D3C4FF: Would it be better if I broke it down into separate questions instead?
mopennock: This is a records management issue, not digital preservation.
walker: I think breaking this down into separate questions would work well. Organization and access strikes me as a key part of preservation, and the problem you're describing is not dissimilar to a state archives receiving a dump of heterogeneous electronic documents from an agency and being tasked with preserving them. That said, @lechlukasz's point is well taken, certain parts of your question may be out of scope for the site - but that will be easier to evaluate once you break the problem down.

Answer by lechlukasz

Finding almost-duplicates is a very challenging task, as opposed to exact duplicates, which can be found by checksum indexing. But for multimedia content, an approach that could be helpful in your task is MPEG-7 indexing.
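For the exact-duplicate half, checksum indexing can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: the function names (`file_digest`, `find_exact_duplicates`) are made up for illustration, SHA-256 is just one reasonable choice of checksum, and grouping by file size first is an optional shortcut that avoids hashing files which cannot possibly match.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root):
    """Map digest -> sorted list of paths, for content that occurs more than once."""
    # Pass 1: group regular files by size (cheap, no file contents read).
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only files that share their size with another file.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have an exact duplicate
        for path in paths:
            by_digest[file_digest(path)].append(path)

    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}
```

On a 20TB archive the hashing pass will still take a long time, so in practice you would probably want to store the digests (for example in a small database) rather than recompute them on every run.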

In short, MPEG-7 defines a set of descriptors used to index multimedia content and find "similar" images, audio or video. I once had a project based on the dominant colors descriptor, which was very effective at finding pictures taken in similar settings. After calculating the descriptors, it's much easier to compare each-with-each (though this still requires a lot of computing power) to find the almost-duplicate candidates. These candidates could then be processed with other algorithms that check whether, for example, two pictures are the same picture, only cropped, or two music files are the same file with a few seconds removed. In the worst case, you can check such duplicate candidates manually.
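To illustrate the "compute a descriptor, then compare each-with-each" idea, here is a rough Python sketch. Note that this is not an MPEG-7 implementation: it uses a coarse color histogram as a simplified stand-in for a real descriptor, and the function names, the number of quantisation levels and the threshold are all assumptions chosen for illustration. Pixels are passed in as plain `(r, g, b)` tuples; in practice they would come from an imaging library.

```python
from itertools import combinations


def color_descriptor(pixels, levels=4):
    """Normalised histogram over levels**3 coarse RGB bins (stand-in descriptor)."""
    hist = [0.0] * (levels ** 3)
    count = 0
    for r, g, b in pixels:
        # Quantise each 0-255 channel down to `levels` steps.
        qr, qg, qb = (c * levels // 256 for c in (r, g, b))
        hist[(qr * levels + qg) * levels + qb] += 1
        count += 1
    return [v / count for v in hist] if count else hist


def distance(d1, d2):
    """L1 distance between two descriptors (0 = identical, 2 = no overlap)."""
    return sum(abs(a - b) for a, b in zip(d1, d2))


def near_duplicate_candidates(descriptors, threshold=0.2):
    """Compare each-with-each and return name pairs closer than `threshold`."""
    pairs = []
    for (n1, d1), (n2, d2) in combinations(sorted(descriptors.items()), 2):
        if distance(d1, d2) < threshold:
            pairs.append((n1, n2))
    return pairs
```

The pairwise loop is quadratic in the number of files, which is the "lot of computing power" part; the pairs it returns are only candidates and still need a closer check, automatic or manual.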

The descriptors could also help you organise multimedia content into groups, but that would require other algorithms, such as centroids.
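"Centroids" here refers to centroid-based clustering, of which k-means is the usual example. A minimal sketch in plain Python, assuming each file has already been reduced to a fixed-length list of numbers (its descriptor). The function name, the seeding from the first `k` points and the fixed iteration count are simplifications made for illustration; a real library implementation would choose starting centroids more carefully.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on lists of floats; returns (centroids, labels)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: seed with the first k points so the result is deterministic.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels
```

Each resulting label is a group number, so files sharing a label have similar descriptors. You have to pick `k` (the number of groups) yourself, and the groups are only as meaningful as the descriptor that produced the numbers.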

Comments

D3C4FF: Can you suggest any programs or resources? Also, what are centroids? MPEG-7 sounds awesome...
lechlukasz: Centroid algorithm or K-Means clustering algorithm: http://en.wikipedia.org/wiki/K-means_clustering Sorry, I don't know any tools. I've only done a small project at uni with a single MPEG-7 descriptor, which is how I know about those descriptors and that they can be used for such tasks.