Re: charset use of UTF-16 vs. utf-8
Hello Chris,
I'm sorry that I have to disagree with your logic here.
At 21:06 02/06/05 +0200, Chris Lilley wrote:
> Why would the protocol allow only UTF-8?
> b) the protocol uses XML. If it mandates UTF-8 only, then the XML
> parser has to be specially modified to be non-compliant to the XML
> specification so that it can, specially, throw a well-formedness error
> if the content is in UTF-16.
No, not really. Assuming that the UTF-16 data passes the lower
layers on which the protocol is built (which is maybe one of the
things that Larry is referring to) and arrives well-formed, the XML
parser would indeed be totally wrong to throw a well-formedness
error.
But as Tim Bray has pointed out, the document information item
of the infoset (http://www.w3.org/TR/xml-infoset/#infoitem.document)
contains a [character encoding scheme] property. So the application
can very simply check this and reject it if it's not UTF-8.
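A minimal sketch of such an application-level check, in Python. The names
`sniff_encoding`, `accept` and `UnsupportedEncoding` are made up for this
illustration, and the encoding detection is a simplified stand-in for the
[character encoding scheme] property (it only looks at a byte order mark
and the encoding declaration). The point is that the XML processor is used
unmodified and the rejection is a separate error, not a well-formedness
error.

```python
import re
import xml.etree.ElementTree as ET


class UnsupportedEncoding(Exception):
    """Application-level rejection; the document may still be well-formed."""


# Encoding declaration at the very start of the document (ASCII-compatible encodings).
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def sniff_encoding(data: bytes) -> str:
    """Simplified detection: BOM first, then the declaration, else UTF-8."""
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return "UTF-16"                       # UTF-16 with byte order mark
    if data[:4] in (b"<\x00?\x00", b"\x00<\x00?"):
        return "UTF-16"                       # UTF-16 without byte order mark
    if data[:3] == b"\xef\xbb\xbf":
        data = data[3:]                       # skip a UTF-8 byte order mark
    m = _DECL.match(data)
    return m.group(1).decode("ascii").upper() if m else "UTF-8"


def accept(data: bytes, allowed=("UTF-8",)):
    """Parse with a stock XML processor, then apply the protocol's own rule."""
    root = ET.fromstring(data)                # ET.ParseError only if not well-formed
    encoding = sniff_encoding(data)
    if encoding not in allowed:
        raise UnsupportedEncoding(encoding)   # protocol rule, not an XML error
    return root
```

A protocol that allowed exactly UTF-8 and UTF-16 would call
`accept(data, allowed=("UTF-8", "UTF-16"))`; the processor stays the same
in either case.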
This is not that much different from e.g. rejecting a document
that contains two attributes on the same element that the
spec says cannot appear together. Such a document is still
well-formed, and (DTD-)valid, but the application nevertheless
is allowed to reject it.
Also, what if we said that IETF XML protocols must accept
exactly UTF-8 and UTF-16 (and nothing else)? Assume that a document
arrived in iso-8859-1, and assume that the XML processor understands
iso-8859-1 (although XML processors are not required to do that,
the chance that they do is very high). Would an implementer
have to modify the XML processor so that it throws a
well-formedness error for a document that is very clearly
well-formed, just because the content is in iso-8859-1?
The XML specification requires XML *processors* to accept both
UTF-8 and UTF-16. It doesn't require XML applications to accept
these encodings. For example, I'm rather sure that in Japan,
there are some applications where all participants agree that
all data is exchanged in Shift_JIS. [I don't think that's
a good idea, but that's not the issue here.] Can you claim that
they are using non-well-formed XML? Surely you can't. Can you claim
that they reject some well-formed documents? Yes, but then all
applications do that, all the time.
> I'm not seeing the use case whereby the IETF would see a mandated
> deviation from the XML specification as a good thing.
Unless there is something in the XML spec that says
"all applications are required to accept every well-formed
document", I don't yet see how the IETF would actually
deviate from the XML spec in this case.
Regards, Martin.