Re: XML Guidelines -04
Hello Tim,
At 11:08 02/06/04 -0700, Tim Bray wrote:
In section 5.1 "Character Sets and Encodings"
=========================================
The last para "recommends, for simplicity, that only UTF-8 be
allowed." New news: as of yesterday, the W3C TAG disagreed with a
(clearly related) recommendation in the W3C Charmod draft that a single
encoding be used. See
http://lists.w3.org/Archives/Public/www-tag/2002Jun/0020.html on this.
I still have to look at the above mail in detail, and I can obviously
not speak for the I18N WG here, but I would just like to note that the
Character Model, and that recommendation in particular, is not only
addressed at XML, but any other (potential) formats, too.
In particular, since protocols are going to be read by an XML processor,
and since an XML processor is going to have to be able to read UTF-8 and
UTF-16, the requirement to handle only one of these two actually imposes
extra work - and it's actually hard to see where in the protocol chain
you'd efficiently do that work. Presumably the easy way to design a
protocol is to feed the bits on the wire to an XML processor and deal with
it through SAX or DOM or CLR or some such; are you going to put a filter
in front of the processor to check the char encoding? Or are you going to
ask the processor what encoding it was in so that you can toss it (after
it's been successfully parsed) because you don't like the encoding?
Checking that the first two bytes in the input stream are not
FFFE or FEFF to reject UTF-16 can obviously be done very efficiently,
in various places. Checking that the input is in UTF-8 is a bit more
difficult, but a very simple finite state machine does the job.
Another alternative is to use an open-source parser and just very
slightly hack it so that you can ask it for the encoding that came in.
And you don't have to do that after parsing everything, you can do
that at the very start.
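To make the two checks above concrete, here is a minimal sketch in Python.
The function names are made up for illustration, and the state machine
assumes the current four-byte definition of UTF-8 (no overlong forms, no
surrogates, nothing above U+10FFFF). It checks the encoding only, not XML
well-formedness.

```python
def starts_with_utf16_bom(data: bytes) -> bool:
    """Return True if the first two bytes are FE FF or FF FE."""
    return data[:2] in (b"\xfe\xff", b"\xff\xfe")


def is_valid_utf8(data: bytes) -> bool:
    """Validate UTF-8 one byte at a time with a small state machine."""
    remaining = 0          # continuation bytes still expected
    lo, hi = 0x80, 0xBF    # allowed range for the next continuation byte
    for b in data:
        if remaining == 0:
            if b <= 0x7F:                 # ASCII, single byte
                continue
            elif 0xC2 <= b <= 0xDF:       # two-byte sequence
                remaining = 1
            elif b == 0xE0:               # three bytes, exclude overlongs
                remaining, lo = 2, 0xA0
            elif b == 0xED:               # three bytes, exclude surrogates
                remaining, hi = 2, 0x9F
            elif 0xE1 <= b <= 0xEF:       # other three-byte sequences
                remaining = 2
            elif b == 0xF0:               # four bytes, exclude overlongs
                remaining, lo = 3, 0x90
            elif b == 0xF4:               # four bytes, stop at U+10FFFF
                remaining, hi = 3, 0x8F
            elif 0xF1 <= b <= 0xF3:       # other four-byte sequences
                remaining = 3
            else:                         # C0, C1, F5..FF, stray 80..BF
                return False
        else:
            if not lo <= b <= hi:
                return False
            remaining -= 1
            lo, hi = 0x80, 0xBF           # only the first one is restricted
    return remaining == 0                 # reject truncated sequences
```

A receiver could run starts_with_utf16_bom() on the first two bytes and
is_valid_utf8() on each chunk as it arrives (carrying the state across
chunks would need a small extension of this sketch).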
Even so, while you have a point with respect to parsing, there are
other aspects that are important:
- The IETF has a clear preference for UTF-8 over UTF-16. UTF-8 is
core to RFC 2277, and is a draft standard (and on its way to
an IETF standard). UTF-16 is only an informational RFC.
- In many cases, people want to do other things than just parse
the XML. Using telnet to do debugging,... The chances that this
works with UTF-8 are much higher than for UTF-16.
- There are some places where XML could be used where ASCII-compatibility
is crucial. Imagine using a small piece of XML in an http-like header.
- While XML is quite clear about how it addresses endianness problems,
they may still raise their ugly head somewhere. (In particular, you
cannot look at the middle of an ongoing stream of bytes and know
what's going on, something which may sometimes be necessary.)
- During the creation of XML, originally only UTF-8 was required.
But then there was very strong pressure to also include UTF-16.
As far as I'm aware, there is now considerably less
such pressure, if indeed there is still any.
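Two of the points above (ASCII compatibility and looking at the middle
of a byte stream) can be illustrated with a small Python sketch. The
sample string and the function names are invented for the example; this
is only meant to show the byte-level difference, not any particular
protocol.

```python
def is_ascii_compatible(text: str, encoding: str) -> bool:
    """True if pure-ASCII text encodes to the same bytes as in ASCII."""
    return text.encode(encoding) == text.encode("ascii")


def next_utf8_boundary(data: bytes, pos: int) -> int:
    """Skip UTF-8 continuation bytes (80..BF) to reach a character start."""
    while pos < len(data) and 0x80 <= data[pos] <= 0xBF:
        pos += 1
    return pos


def misaligned_utf16(text: str) -> str:
    """Decode big-endian UTF-16 starting one byte late, keeping one unit."""
    raw = text.encode("utf-16-be")
    return raw[1:3].decode("utf-16-be")


header = '<status code="200"/>'                 # hypothetical header value
print(is_ascii_compatible(header, "utf-8"))     # same bytes as ASCII
print(is_ascii_compatible(header, "utf-16-le")) # NUL bytes interleaved
print(repr(misaligned_utf16("<a")))             # wrong character, no error
```

In UTF-8, a reader dropped into the middle of a stream can find the next
character start by skipping at most three continuation bytes. In UTF-16,
a reader that is off by one byte gets plausible-looking but wrong code
units and has no local way to notice.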
This seems like a really egregious violation of "being liberal in what you
accept".
Well, 'being liberal in what you accept' could be interpreted much
more liberally, e.g. accept all kinds of encodings. And given that
for most parsers, it's as difficult (or easy) to instruct them to
take only UTF-8 as it is to instruct them to take exactly UTF-8
and UTF-16, that may be where your argument is heading. But it's
very clear that this doesn't contribute to interoperability, which
is the final goal. If everybody sends UTF-8, that goal is met.
'be liberal in what you accept' is not really XML's motto either,
for very good reasons.
While you have mostly looked at the receiving end, do you think
there is any major reason why 'only UTF-8' would put a significant
burden on the sending side?
Note that popular XML parsers, e.g. expat, give the programmer UTF-8
anyhow regardless of how the input showed up.
Just for the record, many others, and in particular DOM-based ones, give
the programmer UTF-16 anyhow, regardless of how the input showed up :-).
In summary, I think that all the arguments given above together very
clearly support the current wording on character sets and encodings.
Regards, Martin.