Re: XML Guidelines -04
Martin Duerst wrote:
Hmm... Martin's got some good points. This isn't a drop dead issue
either way I think.
I still have to look at the above mail in detail, and I can obviously
not speak for the I18N WG here, but I would just like to note that the
Character Model, and that recommendation in particular, is not only
addressed at XML, but any other (potential) formats, too.
Oops, yes, the TAG comment should have been clear that it was talking
about the XML case.
Checking that the first two bytes in the input stream are not
FFFE or FEFF to reject UTF-16 can obviously be done very efficiently,
in various places. Checking that the input is in UTF-8 is a bit more
difficult, but a very simple finite state machine does the job.
Another alternative is to use an open-source parser and just very
slightly hack it so that you can ask it for the encoding that came in.
Actually, since encoding is in the infoset, a parser that doesn't tell
you is arguably nonconformant.
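For illustration, the two byte-level checks mentioned above (the BOM test
and the UTF-8 state machine) might look roughly like the sketch below. This
is not code from the thread; the function names starts_with_utf16_bom and
is_valid_utf8 are made up for the example, and the byte ranges follow the
UTF-8 definition in RFC 3629 (overlong forms and surrogates are rejected).

```python
def starts_with_utf16_bom(data: bytes) -> bool:
    """True if the first two bytes are FF FE or FE FF (a UTF-16 BOM)."""
    return data[:2] in (b"\xff\xfe", b"\xfe\xff")


def is_valid_utf8(data: bytes) -> bool:
    """Check well-formedness with a small finite state machine.

    State is (remaining, lo, hi): how many continuation bytes are still
    expected, and the range allowed for the next one.
    """
    remaining = 0
    lo, hi = 0x80, 0xBF
    for b in data:
        if remaining == 0:
            if b <= 0x7F:
                continue                    # ASCII
            elif 0xC2 <= b <= 0xDF:
                remaining = 1               # two-byte sequence
            elif b == 0xE0:
                remaining, lo = 2, 0xA0     # exclude overlong forms
            elif b == 0xED:
                remaining, hi = 2, 0x9F     # exclude surrogates
            elif 0xE1 <= b <= 0xEF:
                remaining = 2               # other three-byte sequences
            elif b == 0xF0:
                remaining, lo = 3, 0x90     # exclude overlong forms
            elif b == 0xF4:
                remaining, hi = 3, 0x8F     # stay at or below U+10FFFF
            elif 0xF1 <= b <= 0xF3:
                remaining = 3               # other four-byte sequences
            else:
                return False                # C0, C1, F5..FF, stray 80..BF
        else:
            if not lo <= b <= hi:
                return False
            lo, hi = 0x80, 0xBF             # later bytes use the full range
            remaining -= 1
    return remaining == 0                   # reject truncated sequences
```

A receiver that only wants UTF-8 could call starts_with_utf16_bom on the
first bytes and is_valid_utf8 on the whole entity before handing it to a
parser. Note that a two-byte BOM test alone does not catch UTF-16 sent
without a BOM.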
- The IETF has a clear preference for UTF-8 over UTF-16. UTF-8 is
core to RFC 2277, and is a draft standard (and on its way to
an IETF standard). UTF-16 is only an informational RFC.
As a (mostly) C programmer, I also have a clear preference for UTF-8,
and for a variety of reasons I agree with the IETF. However, the Java
programmers have some pain here (yes I know that a Java char isn't
really a UTF-16 char, but most programmers can pretend it is without
causing breakage).
UTF-16 is *not* going away.
- There are some places where XML could be used where ASCII-compatibility
is crucial. Imagine using a small piece of XML in an http-like header.
Well, if it's got non-ASCII chars you're toast anyhow :) "XML could be
used" and "ASCII-compatibility is crucial" feel to me like objectives
that are strongly in conflict.
- While XML is quite clear about how it addresses endianness problems,
they may still raise their ugly head somewhere. (In particular, you
cannot look at the middle of an ongoing stream of bytes and know
what's going on, something which may sometimes be necessary.)
If you have to look into the middle of an XML stream, endian-ness is
going to be one of the smaller problems :)
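To make the mid-stream point concrete, here is a small sketch (not from
the thread; resync_utf8 is a hypothetical helper name). In UTF-8 a reader
dropped into the middle of a stream can find the next character boundary
by skipping continuation bytes, while the same two bytes of BOM-less
UTF-16 decode to different characters depending on the assumed byte order.

```python
def resync_utf8(chunk: bytes) -> bytes:
    """Skip leading continuation bytes (10xxxxxx) to reach a character start."""
    i = 0
    while i < len(chunk) and (chunk[i] & 0xC0) == 0x80:
        i += 1
    return chunk[i:]


# UTF-8: cut the stream in the middle of a two-byte character and recover.
stream = "na\u00efve caf\u00e9".encode("utf-8")
tail = stream[3:]                       # starts inside the encoding of U+00EF
recovered = resync_utf8(tail).decode("utf-8")

# UTF-16 without a BOM: the same byte pair is ambiguous.
pair = b"\x4e\x00"
as_big_endian = pair.decode("utf-16-be")      # U+4E00
as_little_endian = pair.decode("utf-16-le")   # U+004E, the letter N
```

This only shows the encoding-level difference. As the reply above says,
a reader starting in the middle of an XML stream has larger problems
than byte order, such as not knowing the element context.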
- During the creation of XML, originally only UTF-8 was required.
But then there was very strong pressure to also include UTF-16.
As far as I'm aware, there is now considerably less
such pressure, if there is indeed still any.
I completely disagree both with the history and the assertion about
current trends, but I'm not sure this is relevant.
Well, 'being liberal in what you accept' could be interpreted much
more liberally, e.g. accept all kinds of encodings. And given that
for most parsers, it's as difficult (or easy) to instruct them to
take only UTF-8 as it is to instruct them to take exactly UTF-8
and UTF-16, that may be where your argument is heading. But it's
very clear that this doesn't contribute to interoperability, which
is the final goal. If everybody sends UTF-8, that goal is met.
'be liberal in what you accept' is not really XML's motto either,
for very good reasons.
Actually, XML is *very* liberal in the particular case of character
encodings. This seems to be a popular choice. I do *not* believe that,
in the context of XML, debarring UTF-16 has any significant effect on
interoperability.
While you have mostly looked at the receiving end, do you think
there is any major reason that 'only UTF-8' would put any significant
burdens on the sending side?
Yes, for Java programmers. I know the UTF-8 handling is much better
than it used to be, but UTF-8 was still a 2nd-class Java citizen last
time I looked. I'd be glad to hear I'm wrong; I've been working in C
the last couple of years.
In summary, I think that all the arguments given above together very
clearly support the current wording on character sets and encodings.
I can see both sides of it. But at the moment saying UTF-8/16 seems like a
win on cost-benefit. -Tim