UTF-16, the BOM, and media types

At 02:57 PM 3/22/00 -0500, John Cowan wrote:
>> UTF-16le and UTF-16be cannot be used for XML.  XML mandates
>> the BOM for utf-16.  Meanwhile, utf-16le and utf-16be cannot
>> have the BOM.  For more about this, see RFC 2781.
>I do not understand this from the text of XML 1.0.  Clause 4.3.3 only says
>that if there is no encoding declaration, then either:

Section 4.3.3 of XML 1.0 says
 "Entities encoded in UTF-16 must begin with the Byte Order Mark described 
  by ISO/IEC 10646 Annex E and Unicode Appendix B (the ZERO WIDTH NO-BREAK 
  SPACE character, #xFEFF)."
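
To illustrate what that leading BOM gives a reader, here is a minimal
Python sketch that reports the byte order signalled by the first two bytes
of an entity. The function name utf16_byte_order is made up for this
example, and the sketch assumes the input is either UTF-16 or not; it does
not try to tell a UTF-16 BOM apart from a UTF-32 one.

```python
import codecs


def utf16_byte_order(data: bytes):
    """Return "be" or "le" according to a leading UTF-16 BOM, else None.

    Hypothetical helper for illustration only. It looks at the first two
    bytes and nothing else, so a UTF-32LE BOM (FF FE 00 00) would be
    reported as "le" as well.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "be"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "le"
    return None                                # no BOM present
```

An entity written with Python's "utf-16-be" or "utf-16-le" codec carries no
BOM, so this helper returns None for it.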

Thus in my view the RFC is correct, and 16BE and 16LE are not useful
for XML.  It is good practice, whenever you store anything in UTF-16, to 
put a BOM in, and XML makes that good practice compulsory.  That is pretty 
painless, since it seems that virtually all software that writes UTF-16 does 
so anyhow.  The cost of a BOM is zilch.  The benefit, in data survival in the 
face of stupid byte order tricks (yes, they still happen), is immense.
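
A small Python sketch of the survival argument, under stated assumptions:
swap_bytes is a made-up helper that simulates a byte-order mishap by
swapping every pair of bytes, and the sample text is arbitrary. Python's
"utf-16" codec writes a BOM and honours it when decoding, while
"utf-16-le" writes none.

```python
text = "<doc/>"

with_bom = text.encode("utf-16")       # BOM followed by platform byte order
without_bom = text.encode("utf-16-le") # little-endian, no BOM


def swap_bytes(data: bytes) -> bytes:
    """Swap each pair of bytes (hypothetical helper; expects even length)."""
    swapped = bytearray(data)
    swapped[0::2], swapped[1::2] = data[1::2], data[0::2]
    return bytes(swapped)


# The BOM is swapped along with the data, so a decoder that reads it
# still picks the right byte order and recovers the original text.
recovered = swap_bytes(with_bom).decode("utf-16")

# Without a BOM, the decoder has to trust the label, and the label is
# now wrong: the result is a string of unrelated characters.
garbled = swap_bytes(without_bom).decode("utf-16-le")
```

Here recovered equals the original text and garbled does not.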

Martin Duerst, a smart guy whom I respect, invested several hours in
trying to convince me that the 16[BL]E variants, with their forbidden BOM,
had some real-world justification, but I forget what it was... and I remain
convinced that they are simply not suitable for use with XML. -Tim