Re: Some text that may be useful for the update of RFC 2376

MURATA Makoto wrote:

> UTF-16le and UTF-16be cannot be used for XML.  XML mandates
> the BOM for utf-16.  Meanwhile, utf-16le and utf-16be cannot
> have the BOM.  More about this, see RFC 2781.

I do not see how this follows from the text of XML 1.0.  Clause 4.3.3 says
only that if there is no encoding declaration, then either:

	a BOM is present, and the encoding is UTF-16, or

	no BOM is present, and the encoding is UTF-8.

If a proper encoding declaration is present, then any charset may be
used; however, parsers are only required to handle UTF-8 and UTF-16.
(In practice, all parsers known to me also accept US-ASCII and ISO-8859-1.)

For example, a file beginning with the characters

	<?xml version='1.0' encoding='x-focs'?>

encoded in Finagle's Own Character Set is perfectly legal, and will be parsed
successfully by any parser with an x-focs conversion table.  This is true even if
x-focs is a multi-byte character set.
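
The reading of clause 4.3.3 given above can be sketched in Python roughly as
follows.  This is only an illustration under stated assumptions: the function
name guess_xml_encoding is hypothetical, and the regular expression assumes
the declaration itself is written in an ASCII-compatible way, which need not
hold for every charset.

```python
import codecs
import re

# Hypothetical helper, not taken from any parser; names are illustrative.
DECLARATION = re.compile(
    rb"<\?xml[^>]*?encoding\s*=\s*['\"]([A-Za-z][A-Za-z0-9._-]*)['\"]"
)


def guess_xml_encoding(data: bytes) -> str:
    """Guess the charset of an XML entity from its leading bytes."""
    # A byte order mark is present: the encoding is UTF-16.
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "UTF-16"
    # An encoding declaration is present: any charset may be named.
    # Assumption: the prolog is readable as ASCII-compatible bytes.
    match = DECLARATION.match(data)
    if match:
        return match.group(1).decode("ascii")
    # No BOM and no declaration: the encoding is UTF-8.
    return "UTF-8"


print(guess_xml_encoding(b"<?xml version='1.0' encoding='x-focs'?>"))
print(guess_xml_encoding(b"<doc/>"))
```

Whether a parser can then actually decode the named charset is a separate
matter; the sketch only decides which label applies.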

I see absolutely no reason why UTF-16BE and UTF-16LE should be excluded from
the list of acceptable charsets.  It is true that Appendix F claims that a
text beginning with the bytes 00 3C 00 3F or 3C 00 3F 00 is "strictly speaking,
in error", but Appendix F is marked "non-normative", and this text is
qualified in E44 anyhow.
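
For reference, the two byte patterns in question are simply the characters
"<?" with no BOM in front, in each byte order.  A small Python check, using
the codec names utf-16-be and utf-16-le from Python's standard library (the
variable names are just for illustration):

```python
# "<?" is the start of an XML declaration: U+003C followed by U+003F.
big_endian = "<?".encode("utf-16-be")      # no BOM is written
little_endian = "<?".encode("utf-16-le")   # no BOM is written

print(big_endian.hex(" "))      # 00 3c 00 3f
print(little_endian.hex(" "))   # 3c 00 3f 00
```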

> I see no reasons for preserving byte sequences.  We only have to
> preserve XML information sets.

Almost, since strictly speaking the charset is part of the information set.

> Existing programming languages do not support Unicode very well, as
> I see it.

Except Java, JavaScript, Ada 95, Dylan ....

Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@xxxxxxxxxxxxxxxxx>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)