
Re: charset use of UTF-16 vs. utf-8




Hello Chris,


I'm sorry that I have to disagree with your logic here.

At 21:06 02/06/05 +0200, Chris Lilley wrote:

Why would the protocol allow only UTF-8?

b) the protocol uses XML. If it mandates UTF-8 only, then the XML
parser has to be specially modified to be non-compliant to the XML
specification so that it can, specially, throw a well-formedness error
if the content is in UTF-16.

No, not really. Assuming that the UTF-16 data passes the lower layers on which the protocol is built (which is maybe one of the things that Larry is referring to) and arrives well-formed, the XML parser would indeed be totally wrong to throw a well-formedness error.

But as Tim Bray has pointed out, the document information item
of the infoset (http://www.w3.org/TR/xml-infoset/#infoitem.document)
contains a [character encoding scheme] property. So the application
can very simply check this and reject it if it's not UTF-8.
This is not that much different from e.g. rejecting a document
that contains two attributes on the same element that the
spec says cannot appear together. Such a document is still
well-formed, and (DTD-)valid, but the application nevertheless
is allowed to reject it.
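
To make the separation of layers concrete, here is a rough sketch in Python. It is only an illustration under stated assumptions, not how any particular protocol implementation does it: the names `sniff_encoding`, `accept` and `UnsupportedEncoding` are made up for this example, and because `xml.etree.ElementTree` does not expose the infoset's [character encoding scheme] property, the sketch approximates it by looking at the byte order mark and the XML declaration. The point is that the unmodified parser decides well-formedness, and the application then applies its own, narrower rule.

```python
import codecs
import re
import xml.etree.ElementTree as ET


class UnsupportedEncoding(Exception):
    """Application-level rejection; this is NOT a well-formedness error."""


def sniff_encoding(data: bytes) -> str:
    # Hypothetical helper: a rough stand-in for the infoset's
    # [character encoding scheme] property. It handles only a BOM or an
    # ASCII-compatible XML declaration, not every case an XML processor must.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    m = re.match(
        rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']',
        data,
    )
    # Without a BOM or an encoding declaration, XML defaults to UTF-8.
    return m.group(1).decode("ascii").lower() if m else "utf-8"


def accept(data: bytes, allowed=("utf-8",)):
    # Step 1: the unmodified XML processor checks well-formedness.
    # It raises ET.ParseError only for documents that are not well-formed.
    root = ET.fromstring(data)
    # Step 2: the application applies its own restriction on top.
    encoding = sniff_encoding(data)
    if encoding not in allowed:
        raise UnsupportedEncoding(encoding)
    return root
```

With the default `allowed=("utf-8",)`, a well-formed UTF-16 document is parsed without complaint and then refused by the application. Passing `allowed=("utf-8", "utf-16")` gives the other policy discussed in this thread, under which a well-formed iso-8859-1 document is refused in exactly the same way.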

Also, what if we said that IETF XML protocols must accept
exactly UTF-8 and UTF-16 (and nothing else)? Assume that a document
arrived in iso-8859-1, and assume that the XML processor understands
iso-8859-1 (although XML processors are not required to do that,
the chance that they do is very high). Would an implementer
have to modify the XML processor so that it throws a
well-formedness error for a document that is very clearly
well-formed, just because the content is in iso-8859-1?


The XML specification requires XML *processors* to accept both UTF-8 and UTF-16. It doesn't require XML *applications* to accept these encodings. For example, I'm rather sure that in Japan, there are some applications where all participants agree that all data is exchanged in Shift_JIS. [I don't think that's a good idea, but that's not the issue here.] Can you claim that they are using non-well-formed XML? Surely you can't. Can you claim that they reject some well-formed documents? Yes, but then all applications do that, all the time.


I'm not seeing the use case whereby the IETF would see a mandated
deviation from the XML specification as a good thing.

Unless there is something in the XML spec that says "all applications are required to accept every well-formed document", I don't yet see how the IETF would actually deviate from the XML spec in this case.


Regards, Martin.