
Re: XML Guidelines -04




Martin Duerst wrote:


Hmm... Martin's got some good points. I don't think this is a drop-dead issue either way.

I still have to look at the above mail in detail, and I can obviously
not speak for the I18N WG here, but I would just like to note that the
Character Model, and that recommendation in particular, is not only
addressed at XML, but any other (potential) formats, too.

Oops, yes, the TAG comment should have been clear that it was talking about the XML case.


Checking that the first two bytes in the input stream are not
FFFE or FEFF to reject UTF-16 can obviously be done very efficiently,
in various places. Checking that the input is in UTF-8 is a bit more
difficult, but a very simple finite state machine does the job.
Another alternative is to use an open-source parser and just very
slightly hack it so that you can ask it for the encoding that came in.

Actually, since encoding is in the infoset, a parser that doesn't tell you is arguably nonconformant.
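
A minimal sketch of the two checks Martin describes above: rejecting input whose first two bytes are a UTF-16 byte order mark, and validating UTF-8 with a small state machine. The function names are hypothetical, and the byte ranges assume the RFC 3629 definition of UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF), which is an assumption not stated in this thread.

```python
def starts_with_utf16_bom(data: bytes) -> bool:
    """True if the first two bytes are FE FF or FF FE (a UTF-16 byte order mark)."""
    return data[:2] in (b"\xfe\xff", b"\xff\xfe")


def is_valid_utf8(data: bytes) -> bool:
    """Validate UTF-8 with a small state machine.

    State is the number of continuation bytes still expected, plus the
    allowed range for the next continuation byte (narrowed after certain
    lead bytes to exclude overlong forms, surrogates, and > U+10FFFF).
    """
    remaining = 0
    lo, hi = 0x80, 0xBF
    for b in data:
        if remaining == 0:
            if b <= 0x7F:
                continue                      # plain ASCII
            elif 0xC2 <= b <= 0xDF:
                remaining = 1                 # 2-byte sequence
            elif b == 0xE0:
                remaining, lo = 2, 0xA0       # exclude overlong 3-byte forms
            elif b == 0xED:
                remaining, hi = 2, 0x9F       # exclude surrogates
            elif 0xE1 <= b <= 0xEF:
                remaining = 2                 # other 3-byte sequences
            elif b == 0xF0:
                remaining, lo = 3, 0x90       # exclude overlong 4-byte forms
            elif b == 0xF4:
                remaining, hi = 3, 0x8F       # exclude values above U+10FFFF
            elif 0xF1 <= b <= 0xF3:
                remaining = 3                 # other 4-byte sequences
            else:
                return False                  # 80..C1 and F5..FF cannot start a sequence
        else:
            if not lo <= b <= hi:
                return False
            remaining -= 1
            lo, hi = 0x80, 0xBF               # later continuation bytes use the full range
    return remaining == 0                     # input must not end mid-sequence
```

A receiver that wants only UTF-8 could call `starts_with_utf16_bom` on the first bytes it sees and run `is_valid_utf8` over the rest. In practice, asking the parser for the encoding it detected is likely simpler.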


- The IETF has a clear preference for UTF-8 over UTF-16. UTF-8 is
  core to RFC 2277, and is a draft standard (and on its way to
  an IETF standard). UTF-16 is only an informational RFC.

As a (mostly) C programmer, I also have a clear preference for UTF-8, and for a variety of reasons I agree with the IETF. However, the Java programmers have some pain here (yes, I know that a Java char isn't really a UTF-16 char, but most programmers can pretend it is without causing breakage).
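
To illustrate the parenthetical: a Java char is a 16-bit UTF-16 code unit, so a character outside the Basic Multilingual Plane takes two of them (a surrogate pair). The sketch below counts code units in Python rather than Java; the helper name is hypothetical.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-be")) // 2


def utf8_bytes(text: str) -> int:
    """Number of bytes needed to encode text as UTF-8."""
    return len(text.encode("utf-8"))


# 'A' (U+0041), e-acute (U+00E9), and MUSICAL SYMBOL G CLEF (U+1D11E)
samples = ["A", "\u00e9", "\U0001D11E"]
for s in samples:
    print(f"U+{ord(s):04X}: {utf16_code_units(s)} UTF-16 unit(s), {utf8_bytes(s)} UTF-8 byte(s)")
```

The pretence only breaks for the last kind of character, where one code point occupies two code units.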


UTF-16 is *not* going away.

- There are some places where XML could be used where ASCII-compatibility
  is crucial. Imagine using a small piece of XML in an http-like header.

Well, if it's got non-ASCII chars you're toast anyhow :) "XML could be used" and "ASCII-compatibility is crucial" feel to me like objectives that are strongly in conflict.


- While XML is quite clear about how it addresses endianness problems,
  they may still raise their ugly head somewhere. (In particular, you
  cannot look at the middle of an ongoing stream of bytes and know
  what's going on, something which may sometimes be necessary.)

If you have to look into the middle of an XML stream, endianness is going to be one of the smaller problems :)
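
A small sketch of the mid-stream ambiguity Martin mentions: a slice of UTF-16 taken after the byte order mark decodes without error under either byte order, so the bytes alone do not say which is right. The sample text is made up for illustration. By contrast, a UTF-8 slice can be resynchronized by skipping leading continuation bytes; the helper name is hypothetical.

```python
# UTF-16BE bytes for "<doc>": 00 3C 00 64 00 6F 00 63 00 3E
encoded = "<doc>".encode("utf-16-be")

# A slice from the middle of the stream, with no byte order mark in sight.
fragment = encoded[2:8]                    # 00 64 00 6F 00 63

as_big_endian = fragment.decode("utf-16-be")     # the intended text
as_little_endian = fragment.decode("utf-16-le")  # also decodes, to different characters


def resync_utf8(fragment: bytes) -> bytes:
    """Drop leading continuation bytes (80..BF) so the slice starts on a character boundary."""
    i = 0
    while i < len(fragment) and 0x80 <= fragment[i] <= 0xBF:
        i += 1
    return fragment[i:]


# Cut a UTF-8 stream in the middle of the two-byte sequence for e-acute.
utf8_fragment = "\u00e9<a/>".encode("utf-8")[1:]
recovered = resync_utf8(utf8_fragment).decode("utf-8")
```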


- During the creation of XML, originally only UTF-8 was required.
  But then there was very strong pressure to also include UTF-16.
  As far as I'm aware, there is now considerably less
  such pressure, if there is indeed still any.

I completely disagree both with the history and the assertion about current trends, but I'm not sure this is relevant.


Well, 'being liberal in what you accept' could be interpreted much
more liberally, e.g. accept all kinds of encodings. And given that
for most parsers, it's as difficult (or easy) to instruct them to
take only UTF-8 as it is to instruct them to take exactly UTF-8
and UTF-16, that may be where your argument is heading. But it's
very clear that this doesn't contribute to interoperability, which
is the final goal. If everybody sends UTF-8, that goal is met.
'be liberal in what you accept' is not really XML's motto either,
for very good reasons.

Actually, XML is *very* liberal in the particular case of character encodings. This seems to be a popular choice. I do *not* believe that, in the context of XML, debarring UTF-16 has any significant effect on interoperability.


While you have mostly looked at the receiving end, do you think
there is any major reason that 'only UTF-8' would put any significant
burdens on the sending side?

Yes, for Java programmers. I know the UTF-8 handling is much better than it used to be, but UTF-8 was still a 2nd-class Java citizen last time I looked. I'd be glad to hear I'm wrong; I've been working in C the last couple of years.


In summary, I think that all the arguments given above together very
clearly support the current wording on character sets and encodings.

I can see both sides of it. But at the moment, saying UTF-8/16 seems like a win on cost-benefit. -Tim