Re: Some text that may be useful for the update of RFC 2376
MURATA Makoto wrote:
> UTF-16le and UTF-16be cannot be used for XML. XML mandates
> the BOM for utf-16. Meanwhile, utf-16le and utf-16be cannot
> have the BOM. More about this, see RFC 2781.
I do not see how this follows from the text of XML 1.0. Clause 4.3.3 only says
that if there is no encoding declaration, then either:
a BOM is present, and the encoding is UTF-16, or
no BOM is present, and the encoding is UTF-8.
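As a rough sketch of that default rule (not a full detector), the Python below picks the encoding for an entity that is already known to carry no encoding declaration. The function name default_encoding is made up for illustration, and the check for a declaration is assumed to have been done by the caller.

```python
# Hypothetical helper: applies only when no encoding declaration is present.
UTF16_BOMS = (b"\xfe\xff", b"\xff\xfe")  # big-endian and little-endian BOM

def default_encoding(data: bytes) -> str:
    """Return the encoding implied by the presence or absence of a UTF-16 BOM."""
    if data[:2] in UTF16_BOMS:
        return "UTF-16"
    return "UTF-8"

with_bom = b"\xff\xfe" + "<doc/>".encode("utf-16-le")
without_bom = b"<doc/>"
```

Here default_encoding(with_bom) gives "UTF-16" and default_encoding(without_bom) gives "UTF-8".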
If a proper encoding declaration is present, then any charset may be
used; however, parsers are only required to handle UTF-8 and UTF-16.
(In practice, all parsers known to me also accept US-ASCII and ISO-8859-1.)
For example, a file beginning with the characters
<?xml version='1.0' encoding='x-focs'?>
encoded in Finagle's Own Character Set is perfectly legal, and will be parsed
successfully by any parser with an x-focs conversion table. This is true even if
x-focs is a multi-byte character set.
I see absolutely no reason why UTF-16BE and UTF-16LE should be excluded from
the list of acceptable charsets. It is true that Appendix F claims that a text
beginning with the bytes 00 3C 00 3F or 3C 00 3F 00 is "strictly speaking,
in error", but Appendix F is marked "non-normative", and this text is
qualified in E44 anyhow.
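To make those two byte patterns concrete, here is a minimal Python sketch that recognizes a BOM-less UTF-16 entity from its first four bytes and decodes it. The function name sniff_bomless_utf16 is hypothetical, and the sketch assumes the entity begins with "<?"; it covers only the two patterns mentioned above, not the rest of Appendix F.

```python
from typing import Optional

def sniff_bomless_utf16(data: bytes) -> Optional[str]:
    """Return a Python codec name if data starts with '<?' in BOM-less UTF-16."""
    head = data[:4]
    if head == b"\x00\x3c\x00\x3f":   # '<' '?' with the high byte first
        return "utf-16-be"
    if head == b"\x3c\x00\x3f\x00":   # '<' '?' with the low byte first
        return "utf-16-le"
    return None                        # some other encoding; not handled here

text = "<?xml version='1.0' encoding='UTF-16BE'?><doc/>"
raw = text.encode("utf-16-be")         # this codec writes no BOM
codec = sniff_bomless_utf16(raw)
decoded = raw.decode(codec)
```

With these inputs codec is "utf-16-be" and decoded is equal to text, so the declaration and the document can be read without any BOM.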
> I see no reasons for preserving byte sequences. We only have to
> preserve XML information sets.
Almost, since strictly speaking the charset is part of the information set.
> Existing programming languages do not support Unicode very well, as
> I see it.
Except Java, JavaScript, Ada 95, Dylan ....
--
Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@xxxxxxxxxxxxxxxxx>
Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
Denn er genoss vom Honig-Tau, || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies. -- Coleridge (tr. Politzer)