Should general-purpose Unicode/UTF text files include a BOM?

When saving text in a Unicode encoding, how should one choose whether or not to include a byte-order mark to indicate the text being Unicode (and what encoding thereof)?

Are there are inherent advantages from a digital preservation point of view to simply adding a UTF-8 BOM to a pure 7-bit-US-ASCII file, over not doing so? What are the potential drawbacks? Keeping in mind that the binary format of 7-bit US ASCII and the corresponding UTF-8 data is exactly the same.

Michael Kjörling

file-formats
text
unicode

Comments

Answer by Dmitry Brant

I wouldn't see any necessity for adding a BOM to existing 7-bit ASCII files. Technically, as you mention, 7-bit ASCII files can already be treated as UTF-8 without any modification (this was one of the key motivations behind Unicode). So, if an application claims to be compatible with Unicode, it should be able to read these files by definition.

Edit: Brain-fart... as the other answer states, you shouldn't ever need a BOM with UTF-8 encoding. (because it's UTF-8!)

Comments

Answer by db48x

UTF-8 files should never have a byte-order mark (BOM). The BOM is only used in the UTF-16LE and UTF-16BE encodings to differentiate between the big-endian and little-endian forms. Also, don't forget that there may be higher-level requirements on the contents of files, and adding a character at the beginning may make them invalid. Javascript files in particular will break if they contain a BOM, because it's not a valid part of the Javascript grammar.

Comments

Michael Kjörling: Actually, I believe UTF-32 also benefits from a BOM. But I wasn't thinking so much about computer program source code (such as Javascript that you mention) but more along the lines of plain text files intended for consumption by human beings.
Michael Kjörling: Also, even when technically not necessary the BOM can serve as **metadata**: you are unlikely to encounter the byte sequence `EF BB BF` (hexadecimal) at the beginning of a random file, so the existence thereof would be a pretty good indication that the following content is UTF-8 encoded Unicode. UTF-8 can be detected reasonably reliably by analyzing the byte sequence, but e.g. US-ASCII cannot (at least if you want to be able to differentiate it from other 7-bits-in-8-bits encodings).
db48x: If you want to record what encoding a file is in, make another file and write it down explicitly, rather than modifying the original document.
Michael Kjörling: Good point at ingestion into an archive. Not necessarily equally valid if you have the choice at creation time.

Answer by lechlukasz

Most of contemporary tools and libraries doesn't work well with BOMs for UTF-8 files. I had often to remove the BOM using tools like Notepad++. I've never encountered UTF-8 file with BOM in which the byte order would be inverted, anyway.

But when it comes to preservation, in 50 years the BOM itself wouldn't be issue, but rather determining the format. BOM is some kind of marking that the file CAN be UTF-8, so paradoxically, it could help future digiarcheologis to determine the character encoding format. Removing the BOM is trivial.

When storing the data, I would add the README file with ASCII description, what type of character encoding is used and if it is with BOM or not. More important then using BOM or not is to inform the future generations that it is UTF-8 at all, to save them guessing and testing various matches.

Comments

The Diamond Z: DOS Edit (yep it still exists on 32 bit installs of Xp/Win7/w2k8) is a quick way to see and remove the BOM from files.
lechlukasz: Will it be still possible to run in 2050? ;)
Michael Kjörling: UTF-8 is byte ordered, so technically the byte order mark does not function *as such*. However, it *does* function as a rather unique identifier indicating that *the file in question is UTF-8 encoded Unicode data.*