UTF-16, the BOM, and media types
At 02:57 PM 3/22/00 -0500, John Cowan wrote:
>> UTF-16le and UTF-16be cannot be used for XML. XML mandates
>> the BOM for utf-16. Meanwhile, utf-16le and utf-16be cannot
>> have the BOM. More about this, see RFC 2781.
>I do not understand this from the text of XML 1.0. Clause 4.3.3 only says
>that if there is no encoding declaration, then either:
Section 4.3.3 of XML 1.0 says
"Entities encoded in UTF-16 must begin with the Byte Order Mark described
by ISO/IEC 10646 Annex E and Unicode Appendix B (the ZERO WIDTH NO-BREAK
SPACE character, #xFEFF)."
Thus in my view the RFC is correct, and so UTF-16BE and UTF-16LE are not
useful for XML. It is good practice, whenever you store anything in UTF-16,
to put a BOM in, and XML makes that good practice compulsory. That is pretty
painless, since it seems that virtually all software that writes UTF-16 does
so anyhow. The cost of a BOM is zilch. The benefit, in data survival in the
face of stupid byte order tricks (yes, they still happen), is immense.
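
To make the point concrete, here is a minimal sketch in Python (an
illustration only, not taken from the XML spec or RFC 2781; the helper
name detect_utf16_byte_order is made up for this example). It relies on
the behaviour of Python's standard codecs: "utf-16" writes a BOM, while
"utf-16-le" and "utf-16-be" do not.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Hypothetical helper: report the byte order signalled by a leading BOM.

    Only the two UTF-16 BOMs are checked; other encodings are out of scope.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return "none"


text = "<doc/>"

# The generic "utf-16" codec prepends a BOM (in the platform's native order).
with_bom = text.encode("utf-16")

# The explicitly ordered codecs emit no BOM at all.
le_no_bom = text.encode("utf-16-le")
be_no_bom = text.encode("utf-16-be")

# With a BOM, a reader recovers the text whatever order the writer used.
recovered = (codecs.BOM_UTF16_BE + be_no_bom).decode("utf-16")

# Without one, a wrong guess about byte order silently yields garbage.
garbled = le_no_bom.decode("utf-16-be")
```

Here recovered equals the original string, while garbled does not, and
nothing in the BOM-less bytes tells the reader which guess was the right
one.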
Martin Duerst, a smart guy whom I respect, invested several hours in
trying to convince me that the 16[BL]E variants, in which the BOM is
forbidden, had some real-world justification, but I forget what it was...
and I remain convinced that they are simply not suitable for use with XML. -Tim