
Re: Additional syntactic restrictions




At 20:20 02/06/17 -0700, Tim Bray wrote:


Martin Duerst wrote:

If a protocol restricts itself to UTF-8, then it's not the parser,
but the application, that must enforce the restriction.

Which is actually nontrivial and there's no standardized way to do it if you're using a standard XML processor. I believe you can tell expat that it has to try to use a particular encoding and catch the error condition when this doesn't work, but it's going to be very difficult to distinguish an instance that is in a forbidden encoding from one that actually has broken syntax. -Tim

Well, yes, but: assume a protocol is defined as accepting only UTF-8 and UTF-16 (I understand that that's what you and Chris would prefer). There may be some XML parsers that understand exactly these two and nothing else, but your average XML parser understands more character encodings, starting with iso-8859-1. And as you say above, there is no standard way to enforce the restriction to UTF-8 and UTF-16: you may be able to tell a particular parser which encoding to use, but then you can't distinguish a forbidden encoding from broken syntax.

So whether a protocol says 'UTF-8 only' or 'only UTF-8 and UTF-16',
the enforcement problem is just the same.
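
To illustrate what enforcing the restriction in the application might look like, here is a minimal sketch in Python. Everything in it is an assumption made for illustration, not something proposed in this thread: the names ALLOWED, ForbiddenEncoding, sniff_encoding and parse_restricted are hypothetical, and the detection is deliberately simplified (it looks only at a byte order mark and an ASCII-compatible XML declaration, and does not handle EBCDIC or BOM-less UTF-16/UTF-32 input). The idea is to inspect the raw bytes before the parser sees them, so that a forbidden encoding and broken syntax surface as two different errors.

```python
import codecs
import re
import xml.etree.ElementTree as ET

# Hypothetical protocol policy for this sketch: only these encodings are permitted.
ALLOWED = {"utf-8", "utf-16"}

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of an ASCII-compatible byte stream.
_ENCODING_DECL = re.compile(
    rb"""<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


class ForbiddenEncoding(ValueError):
    """The entity uses an encoding that the protocol does not permit."""


def sniff_encoding(data: bytes) -> str:
    """Guess the encoding from the BOM or the XML declaration (simplified)."""
    # Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no encoding declaration: XML 1.0 defaults to UTF-8.
    return "utf-8"


def parse_restricted(data: bytes) -> ET.Element:
    """Reject forbidden encodings first, then hand the bytes to the parser."""
    encoding = sniff_encoding(data)
    if encoding not in ALLOWED:
        raise ForbiddenEncoding(f"encoding {encoding!r} is not permitted")
    # Any failure from here on is a parser error (ET.ParseError),
    # not a policy violation.
    return ET.fromstring(data)
```

With this arrangement, a caller can catch ForbiddenEncoding separately from ET.ParseError. Note that the check has to be written by the application against the raw bytes; it does not rely on any standard parser feature, and adding or removing UTF-16 from ALLOWED does not change how much work is involved.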

Regards, Martin.