Re: UTF-16, the BOM, and media types

Tim Bray wrote:

> Section 4.3.3 of XML 1.0 says
>  "Entities encoded in UTF-16 must begin with the Byte Order Mark described
>   by ISO/IEC 10646 Annex E and Unicode Appendix B (the ZERO WIDTH NO-BREAK
>   SPACE character, #xFEFF)."

That describes entities encoded in the charset called "UTF-16".  It says
nothing about entities encoded in the charsets "UTF-16BE" and "UTF-16LE",
or for that matter in the charset "x-focs".
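
To illustrate the difference between the three labels, here is a small
Python sketch.  It assumes Python's built-in codec names "utf-16",
"utf-16-be" and "utf-16-le" as stand-ins for the three charsets, and it
only shows how the BOM is treated under each label, not what an XML
processor is required to do.

```python
import codecs

text = "<a/>"

# "utf-16": the encoder prepends a BOM (byte order is platform-native).
with_bom = text.encode("utf-16")
has_bom = with_bom.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))

# "utf-16-be" / "utf-16-le": the byte order is part of the label,
# so the encoder writes no BOM.
be = text.encode("utf-16-be")
le = text.encode("utf-16-le")

# Decoding under "utf-16" consumes the BOM as a signature...
roundtrip = with_bom.decode("utf-16")

# ...whereas under "utf-16-be" a leading FE FF is just the character U+FEFF.
kept = (codecs.BOM_UTF16_BE + be).decode("utf-16-be")

print(has_bom, be[:2], le[:2], repr(kept[0]))
```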

> It is good practice, whenever you store anything in UTF-16, to
> put a BOM in,

I don't deny it.  But the letter of the specification permits UTF-16[BL]E,
as long as an explicit encoding declaration is present.
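
A rough sketch of that reading, in Python.  The function name classify()
and its return strings are made up for illustration; the BOM-less byte
patterns are the ones XML 1.0 Appendix F lists for "<?" in 16-bit code
units.  It encodes the interpretation argued here, which is of course the
point in dispute.

```python
import re

BOM_BE, BOM_LE = b"\xfe\xff", b"\xff\xfe"

# Matches an encoding declaration inside a decoded XML/text declaration.
ENC_DECL = re.compile(
    r'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def classify(data: bytes) -> str:
    """Hypothetical helper: classify the start of an entity."""
    if data.startswith(BOM_BE) or data.startswith(BOM_LE):
        return "UTF-16 (BOM present)"
    # No BOM: infer the byte order from "<?" in 16-bit code units.
    if data.startswith(b"\x00<\x00?"):
        codec, label = "utf-16-be", "UTF-16BE"
    elif data.startswith(b"<\x00?\x00"):
        codec, label = "utf-16-le", "UTF-16LE"
    else:
        return "not UTF-16 family, or no declaration"
    head = data[:200]
    head = head[: len(head) - len(head) % 2]  # keep whole code units only
    decl = head.decode(codec, errors="replace")
    m = ENC_DECL.match(decl)
    if m and m.group(1).upper() == label:
        return label + " (declared, no BOM)"
    return "BOM-less 16-bit data without a matching declaration"

doc = '<?xml version="1.0" encoding="UTF-16BE"?><a/>'
print(classify(doc.encode("utf-16-be")))
print(classify(BOM_BE + doc.encode("utf-16-be")))
```

Only the first case is what "the letter of the specification" is claimed
to permit; the last return value marks the case that the quoted sentence
from 4.3.3 would rule out under either reading.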

> and XML makes that good practice compulsory,

I am not convinced.

> Martin Duerst, a smart guy whom I respect, invested several hours in
> trying to convince me that the 16[BL]E variants with forbidden-BOM had
> some real-world justification, but I forget what it is...

The main issue is that people actually do use them.  Charset names, like media
types, are intended to permit labeling of what actually exists.  The existence
of a charset name does not mean that anybody thinks the corresponding charset
is a Good Thing.


Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@xxxxxxxxxxxxxxxxx>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)