Zombse

The Zombie Stack Exchanges That Just Won't Die


How can I sort large amounts of data?

I've spent the past year backing up old personal media. I've moved floppy disks, CDs, DVDs, hard drives, Zip disks and Jaz drives all to hard-drive storage. All of this data was collected by me over the past 10-15 years and currently amounts to roughly 20TB. (Yes, I am a modern-day digital hoarder: I collect almost everything I view online and have done since I was about 12 years old.)

The problem is that I've never been particularly good at organising files and folders or keeping track of data. A key part of preserving data is being able to find it afterwards (otherwise it may as well be lost). So I'm looking for a method to sort through multi-terabyte archives of files so that I can then back them up at my discretion.

Issues I'm currently facing:

What I have tried:

What I want to do:

If I need to create some kind of 'baseline' that takes several days to build, that's acceptable as long as it produces usable results.

Hopefully someone has some good ideas. Thanks!

D3C4FF


Answer by lechlukasz

Finding near-duplicates is a very challenging task, unlike finding exact duplicates, which can be done by checksum indexing. For multimedia content, though, one approach that could help with your task is MPEG-7 indexing.
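As a rough illustration of the checksum indexing mentioned above, here is a minimal Python sketch. The function names (`file_sha256`, `find_exact_duplicates`), the choice of SHA-256 and the size-based pre-filter are my own assumptions for the example, not something prescribed by any particular tool.

```python
import hashlib
import os
from collections import defaultdict


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Return lists of paths under root whose contents are byte-identical."""
    # Cheap first pass: only files of equal size can be identical.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Expensive second pass: hash only the files that share a size.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_hash[file_sha256(path)].append(path)

    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Grouping by size first means most files on a multi-terabyte archive are never read at all, since a file with a unique size cannot have an exact duplicate.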

In short, MPEG-7 is a set of descriptors used to index multimedia content and find "similar" images, audio or video. I once worked on a project based on the dominant color descriptor, which was very effective at finding pictures taken in similar settings. Once the descriptors are calculated, it is much easier to compare every item with every other (though this still requires a lot of computing power) to find near-duplicate candidates. Those candidates can then be processed with other algorithms that check whether, for example, two pictures are the same picture with one of them cropped, or two music files are the same file with a few seconds removed. In the worst case, you can check the duplicate candidates manually.
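To make the "compare each with each" step concrete, here is a small Python sketch. Note that it does not implement the actual MPEG-7 dominant color descriptor: as a stand-in it uses a coarse RGB histogram, and the function names, the L1 distance and the `threshold` value are all assumptions chosen for illustration. It expects pixels as already-decoded `(r, g, b)` tuples, so reading image files is left out.

```python
from itertools import combinations


def color_descriptor(pixels, bins_per_channel=4):
    """Normalised coarse RGB histogram for an iterable of (r, g, b) tuples (0-255)."""
    step = 256 // bins_per_channel
    top = bins_per_channel - 1
    histogram = [0.0] * (bins_per_channel ** 3)
    count = 0
    for r, g, b in pixels:
        # Clamp so bin indices stay valid when 256 is not divisible by the bin count.
        ri, gi, bi = min(r // step, top), min(g // step, top), min(b // step, top)
        histogram[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1.0
        count += 1
    if count == 0:
        return histogram
    return [value / count for value in histogram]


def descriptor_distance(a, b):
    """L1 distance between two descriptors: 0.0 is identical, 2.0 is disjoint."""
    return sum(abs(x - y) for x, y in zip(a, b))


def near_duplicate_candidates(descriptors, threshold=0.2):
    """Compare every descriptor with every other and return the close pairs."""
    candidates = []
    items = sorted(descriptors.items())
    for (name_a, desc_a), (name_b, desc_b) in combinations(items, 2):
        distance = descriptor_distance(desc_a, desc_b)
        if distance <= threshold:
            candidates.append((name_a, name_b, distance))
    # Closest pairs first, so manual review starts with the likeliest matches.
    return sorted(candidates, key=lambda item: item[2])
```

The pairwise loop is quadratic in the number of files, which is where the computing power goes; the descriptors themselves are tiny compared with the media they summarise.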

The descriptors could also help you organise multimedia content into groups, but that would require other algorithms, such as centroid-based clustering.
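One common centroid-based method is k-means, so a grouping step might look roughly like the sketch below. Picking k-means specifically, the function names and the parameters (`k`, `iterations`, `seed`) are assumptions on my part; the input is any list of equal-length descriptor vectors.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    size = len(vectors)
    return [sum(column) / size for column in zip(*vectors)]


def kmeans(vectors, k, iterations=50, seed=0):
    """Plain k-means: returns (centroids, labels) for a list of vectors."""
    rng = random.Random(seed)
    # Start from k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(vector, centroids[c]))
            for vector in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the result is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return centroids, labels
```

Each resulting label can then be used as a provisional folder or tag for the files whose descriptors landed in that cluster. The number of groups `k` has to be chosen up front, which is a known limitation of this method.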
