Re: XML Guidelines -04




Hello Tim,


At 11:08 02/06/04 -0700, Tim Bray wrote:

In section 5.1 "Character Sets and Encodings"

=========================================

The last para "recommends, for simplicity, that only UTF-8 be allowed." New news: as of yesterday, the W3C TAG disagreed with a (clearly related) recommendation in the W3C Charmod draft that a single encoding be used. See http://lists.w3.org/Archives/Public/www-tag/2002Jun/0020.html on this.

I still have to look at the above mail in detail, and I can obviously not speak for the I18N WG here, but I would just like to note that the Character Model, and that recommendation in particular, is not only addressed at XML, but any other (potential) formats, too.


In particular, since protocols are going to be read by an XML processor, and since an XML processor is going to have to be able to read UTF-8 and UTF-16, the requirement to handle only one of these two actually imposes extra work - and it's actually hard to see where in the protocol chain you'd efficiently do that work. Presumably the easy way to design a protocol is to feed the bits on the wire to an XML processor and deal with it through SAX or DOM or CLR or some such; are you going to put a filter in front of the processor to check the char encoding? Or are you going to ask the processor what encoding it was in so that you can toss it (after it's been successfully parsed) because you don't like the encoding?

Checking that the first two bytes in the input stream are not FFFE or FEFF to reject UTF-16 can obviously be done very efficiently, in various places. Checking that the input is in UTF-8 is a bit more difficult, but a very simple finite state machine does the job. Another alternative is to use an open-source parser and just very slightly hack it so that you can ask it for the encoding that came in. And you don't have to do that after parsing everything, you can do that at the very start.
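
To make the two checks above concrete, here is a minimal sketch in Python. The function names are hypothetical and chosen only for illustration, and the validator follows the present-day definition of UTF-8 (at most four bytes per character, no overlong forms, no surrogates), which is an assumption on my side rather than something the draft spells out.

```python
def has_utf16_bom(data: bytes) -> bool:
    """Return True if the input starts with a UTF-16 byte order mark."""
    return data[:2] in (b"\xfe\xff", b"\xff\xfe")


def is_utf8(data: bytes) -> bool:
    """Validate UTF-8 with a small finite state machine.

    The state is the number of continuation bytes still expected,
    plus the range allowed for the next continuation byte (this is
    what rejects overlong forms, surrogates and values > U+10FFFF).
    """
    remaining = 0
    lo, hi = 0x80, 0xBF
    for b in data:
        if remaining:
            if not lo <= b <= hi:
                return False
            remaining -= 1
            lo, hi = 0x80, 0xBF      # later continuation bytes: full range
        elif b <= 0x7F:
            continue                 # plain ASCII
        elif 0xC2 <= b <= 0xDF:
            remaining = 1            # two-byte sequence
        elif 0xE0 <= b <= 0xEF:
            remaining = 3 - 1        # three-byte sequence
            if b == 0xE0:
                lo = 0xA0            # exclude overlong encodings
            elif b == 0xED:
                hi = 0x9F            # exclude surrogates
        elif 0xF0 <= b <= 0xF4:
            remaining = 3            # four-byte sequence
            if b == 0xF0:
                lo = 0x90            # exclude overlong encodings
            elif b == 0xF4:
                hi = 0x8F            # exclude values above U+10FFFF
        else:
            return False             # 0x80-0xC1 and 0xF5-0xFF never start a character
    return remaining == 0            # input must not end mid-character
```

The first check needs only the first two bytes, so it can run before anything is handed to the parser; the second processes the input byte by byte and can therefore sit in front of the parser as a streaming filter.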

Even so, while you have a point with respect to parsing, there are
other aspects that are important:

- The IETF has a clear preference for UTF-8 over UTF-16. UTF-8 is
  core to RFC 2277, and is a draft standard (and on its way to
  an IETF standard). UTF-16 is only an informational RFC.

- In many cases, people want to do other things than just parse
  the XML. Using telnet to do debugging,... The chances that this
  works with UTF-8 are much higher than for UTF-16.

- There are some places where XML could be used where ASCII-compatibility
  is crucial. Imagine using a small piece of XML in an http-like header.

- While XML is quite clear about how it addresses endianness problems,
  they may still rear their ugly heads somewhere. (In particular, you
  cannot look at the middle of an ongoing stream of bytes and know
  what's going on, something which may sometimes be necessary.)

- During the creation of XML, originally only UTF-8 was required.
  But then there was very strong pressure to also include UTF-16.
  As far as I'm aware, there is now considerably less
  such pressure, if there is indeed still any.


This seems like a really egregious violation of "being liberal in what you accept".

Well, 'being liberal in what you accept' could be interpreted much more liberally, e.g. accept all kinds of encodings. And given that for most parsers, it's as difficult (or easy) to instruct them to take only UTF-8 as it is to instruct them to take exactly UTF-8 and UTF-16, that may be where your argument is heading. But it's very clear that this doesn't contribute to interoperability, which is the final goal. If everybody sends UTF-8, that goal is met. 'be liberal in what you accept' is not really XML's motto either, for very good reasons.

While you have mostly looked at the receiving end, do you think
there is any major reason why 'only UTF-8' would put a significant
burden on the sending side?


Note that popular XML parsers, e.g. expat, give the programmer UTF-8 anyhow regardless of how the input showed up.

Just for the record, many others, and in particular DOM-based ones, give the programmer UTF-16 anyhow, regardless of how the input showed up :-).


In summary, I think that all the arguments given above together very clearly support the current wording on character sets and encodings.

Regards, Martin.