Zombse

The Zombie Stack Exchanges That Just Won't Die

Best practice for digitally redacting information from born-digital documents?

For spreadsheets, databases, and other lengthy content, printouts may not be practical, or the organization may be working with a remote patron online. Other born-digital formats just aren't conducive to printing, no matter the size.

What's the recommended approach to digitally redacting information from digital documents so that the removed information is not available anywhere in the bitstream? Most file formats do not have a one-to-one correspondence between bytes and displayable characters, so my concern is with embedded metadata or data that records previous states of the document.
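
To make the bitstream concern concrete, here is a minimal sketch (my own illustration, not from any of the tools mentioned here) of a check that scans the raw bytes of a redacted file for terms that should be gone. The function names and the list of encodings are assumptions. It is a naive search: it will not see inside compressed or encrypted streams, so a clean result is not proof that the information is gone.

```python
from pathlib import Path

# Encodings to try are an assumption; extend the tuple for the formats you handle.
ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "latin-1")


def find_leaks(data: bytes, terms: list[str]) -> dict[str, list[int]]:
    """Return {term: [byte offsets]} for every term still present in data."""
    leaks: dict[str, list[int]] = {}
    for term in terms:
        offsets = set()
        for enc in ENCODINGS:
            try:
                needle = term.encode(enc)
            except UnicodeEncodeError:
                continue  # term cannot be represented in this encoding
            start = data.find(needle)
            while start != -1:
                offsets.add(start)
                start = data.find(needle, start + 1)
        if offsets:
            leaks[term] = sorted(offsets)
    return leaks


def check_file(path: str, terms: list[str]) -> dict[str, list[int]]:
    """Hypothetical convenience wrapper: scan a file on disk."""
    return find_leaks(Path(path).read_bytes(), terms)
```

Calling check_file("redacted.doc", ["Jane Doe"]) (file name hypothetical) returns an empty dict when none of the terms are found as plain byte sequences. Anything it does return points at leftover metadata or earlier document states worth inspecting.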

walker

Answer by Adrian Brown

Try the UK National Archives' redaction toolkit.

A summary of their recommendations for digitally redacting:

Windows Bitmap and .csv are selected as the intermediary formats because they contain no provision for storing metadata, so edits made before or in these formats should not be recoverable once the document reaches the third (final) format.
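
For tabular data, the round trip through .csv could look roughly like the sketch below. This is an illustration under my own assumptions, not code from the toolkit: the function names, the "[REDACTED]" marker, and the choice to overwrite whole columns are all hypothetical. The point is that only the values you write reach the output, so nothing from the source file's metadata or edit history is carried over.

```python
import csv
import io

REDACTION_MARK = "[REDACTED]"  # placeholder text is an assumption


def redact_rows(rows, sensitive_columns):
    """Yield copies of rows (dicts) with the sensitive columns overwritten."""
    for row in rows:
        yield {
            key: (REDACTION_MARK if key in sensitive_columns else value)
            for key, value in row.items()
        }


def write_redacted_csv(rows, fieldnames, sensitive_columns, out):
    """Write redacted rows to a text stream as plain CSV."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in redact_rows(rows, sensitive_columns):
        writer.writerow(row)


# Usage: rows would normally come from the source spreadsheet or database.
rows = [{"name": "Jane Doe", "dept": "Finance"}]
buf = io.StringIO()
write_redacted_csv(rows, ["name", "dept"], {"name"}, buf)
print(buf.getvalue())
```

The redacted .csv can then be converted to whatever format is handed to the patron.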

Answer by anarchivist

It depends on the motivation for redaction. Adrian's reference to the UK National Archives' redaction toolkit is great, but in our case we've been curious to see whether there is a way to redact only personally identifiable information in things like databases or textual records.

I haven't tried it personally, but you may want to look at the MITRE Identification Scrubber Toolkit, which was originally written to redact electronic health records.
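
To show what redacting only personally identifiable information in textual records means in practice, here is a toy pattern-based scrubber. It is not how the MITRE toolkit works and makes no claim about its API. The two patterns (e-mail addresses and US-style Social Security numbers) and the bracketed labels are assumptions chosen for illustration, and real records need far broader coverage, including names.

```python
import re

# Illustrative patterns only; both are assumptions, not a complete PII definition.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scrub(text):
    """Replace each match with a bracketed label; return (text, counts per label)."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts


clean, counts = scrub("Contact jane@example.org, SSN 123-45-6789.")
print(clean)
print(counts)
```

The per-label counts give a reviewer a quick sanity check on how much was removed. Anything the patterns miss still needs a human pass.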

Answer by Euan

The commercial service available from Pingar seems quite effective. It can automatically redact digital documents. A demo is available here.

I am not a representative of theirs, though, and I recommend searching for alternatives as well; there are a few out there.

Answer by Jenn Riley

The BitCurator toolkit includes a utility (currently bulk_extractor) for automatically identifying personally identifiable information in documents.

Unfortunately I don't know if these tools would address your concern about previous states of information, but there's lots of information on the BitCurator wiki, so perhaps you'd find the answer there.
