Zombse

The Zombie Stack Exchanges That Just Won't Die

Should general-purpose Unicode/UTF text files include a BOM?

When saving text in a Unicode encoding, how should one decide whether to include a byte-order mark (BOM) to indicate that the text is Unicode (and in which encoding)?

Are there any inherent advantages, from a digital preservation point of view, to simply adding a UTF-8 BOM to a pure 7-bit US-ASCII file, over not doing so? What are the potential drawbacks? Keep in mind that the binary representation of 7-bit US-ASCII text and of the corresponding UTF-8 data is exactly the same.

Michael Kjörling

Answer by Dmitry Brant

I don't see any need to add a BOM to existing 7-bit ASCII files. Technically, as you mention, 7-bit ASCII files can already be treated as UTF-8 without any modification (this was one of the key design goals of UTF-8). So, if an application claims to support UTF-8, it should be able to read these files by definition.

Edit: Brain-fart... as the other answer states, you shouldn't ever need a BOM with UTF-8, because UTF-8 is a sequence of single bytes and has no byte order to mark.
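
The point about ASCII and UTF-8 being byte-identical can be checked directly. The following is a minimal sketch, not part of the original answer; the sample string and the helper name `is_ascii` are made up for illustration. It relies on Python's `utf-8-sig` codec, which prepends the UTF-8 BOM when encoding.

```python
import codecs

text = "plain 7-bit text\n"  # illustrative sample, pure ASCII

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf8_bom_bytes = text.encode("utf-8-sig")  # "utf-8-sig" prepends the BOM

# For pure ASCII content the two encodings give byte-identical output.
same_without_bom = ascii_bytes == utf8_bytes

# With a BOM the data grows by three bytes (EF BB BF).
bom_prefix = utf8_bom_bytes[: len(codecs.BOM_UTF8)]


def is_ascii(data: bytes) -> bool:
    """Hypothetical helper: True if every byte is valid 7-bit ASCII."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


print(same_without_bom)          # identical without a BOM
print(bom_prefix.hex())          # the three BOM bytes
print(is_ascii(utf8_bom_bytes))  # the BOM bytes are outside the ASCII range
```

One consequence is that adding a BOM turns a file that any ASCII-only tool can read into one that such a tool may reject or display with stray characters at the start.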

Answer by db48x

UTF-8 files should never have a byte-order mark (BOM). A BOM is only needed in UTF-16 and UTF-32, where it differentiates between the big-endian and little-endian forms; UTF-8 has a single byte order, so there is nothing to mark. Also, don't forget that there may be higher-level requirements on the contents of files, and adding a character at the beginning may make them invalid. JavaScript files, for example, can break in engines or tools that do not expect a leading BOM and reject it as not being a valid part of the JavaScript grammar.
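
Both halves of this answer can be illustrated with a short sketch. It is only an illustration under stated assumptions: the helper names `utf16_byte_order` and `parses_as_json` are hypothetical, the first one assumes its input is known to be UTF-16, and JSON stands in here for "a format with higher-level requirements" (CPython's `json.loads` rejects a string that starts with U+FEFF).

```python
import codecs
import json


def utf16_byte_order(data: bytes) -> str:
    """Report the byte order announced by a leading UTF-16 BOM, if any.

    Assumes the data is UTF-16; it does not try to rule out UTF-32.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    return "unknown"


# The same character in both byte orders: only the BOM tells them apart.
le = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")  # FF FE 41 00
be = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")  # FE FF 00 41


def parses_as_json(text: str) -> bool:
    """Hypothetical helper: True if the text is accepted by json.loads."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(utf16_byte_order(le), utf16_byte_order(be))
print(parses_as_json('{"a": 1}'))        # accepted
print(parses_as_json('\ufeff{"a": 1}'))  # a leading U+FEFF is rejected
```

The generic `utf-16` codec reads the BOM and decodes both `le` and `be` to the same string, which is exactly the job the BOM was designed for.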

Answer by lechlukasz

Many contemporary tools and libraries don't work well with a BOM in UTF-8 files. I have often had to remove the BOM using tools like Notepad++. I've never encountered a UTF-8 file with a BOM in which the byte order was inverted, anyway.
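
Removing a UTF-8 BOM does not need an editor. As a rough sketch (the function name `strip_utf8_bom` is made up for this example), it is enough to drop the three leading bytes when they are present:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "hello\n".encode("utf-8")
print(strip_utf8_bom(with_bom))      # BOM removed
print(strip_utf8_bom(b"hello\n"))    # data without a BOM is returned unchanged
```

When the goal is only to read the text, decoding with Python's `utf-8-sig` codec has the same effect: it skips a BOM if there is one and otherwise behaves like plain UTF-8.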

But when it comes to preservation, the issue in 50 years won't be the BOM itself, but rather determining the format. The BOM is a kind of marker that the file can be UTF-8, so, paradoxically, it could help future digital archaeologists determine the character encoding. Removing the BOM is trivial.
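
The idea of using the BOM as a format hint could look roughly like the sketch below. The function name `guess_encoding_from_bom` and the table `_BOMS` are invented for this illustration; a match is only a hint, and the absence of a BOM says nothing about the encoding.

```python
import codecs
from typing import Optional

# Longer signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so it has to be tested before it.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding_from_bom(data: bytes) -> Optional[str]:
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(guess_encoding_from_bom("hi".encode("utf-8-sig")))  # BOM present
print(guess_encoding_from_bom(b"hi"))                     # no BOM, no hint
```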

When storing the data, I would add a README file in ASCII describing which character encoding is used and whether the files have a BOM. More important than whether or not to use a BOM is telling future generations that the data is UTF-8 at all, to save them from guessing and testing various candidates.
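
Such a README could be generated rather than written by hand. This is a minimal sketch under the assumption that every file is already known to be UTF-8; the function name `readme_text`, the wording of the note and the example file names are all invented here.

```python
import codecs
from typing import Dict


def readme_text(files: Dict[str, bytes]) -> str:
    """Build an ASCII note recording, per file, whether a UTF-8 BOM is present.

    Assumes all files are UTF-8; it does not detect other encodings.
    """
    lines = ["Character encoding of the files in this directory", ""]
    for name in sorted(files):
        has_bom = files[name].startswith(codecs.BOM_UTF8)
        bom_note = "with BOM" if has_bom else "without BOM"
        lines.append(f"{name}: UTF-8, {bom_note}")
    return "\n".join(lines) + "\n"


# Hypothetical example content standing in for files read from disk.
example = readme_text({
    "notes.txt": "hello\n".encode("utf-8"),
    "export.csv": "a,b\n".encode("utf-8-sig"),
})
print(example)
```

Keeping the note itself in plain ASCII means it stays readable even if nothing about the other files' encoding is known.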
