Re: XML Guidelines -04
On Wed, 5 Jun 2002, Martin Duerst wrote:
>
> has to be reissued as utf-8, but not in the case of
>
> <?xml version='1.0' ?> ? [even though, given no external info,
>
> the latter has to be processed as UTF-8 as well.
>
It can be even worse: the declaration only says that the current document
is in the UTF-8 encoding.
If a program uses an XML parser for its input/output handling, nothing
contained in the incoming document connects the input encoding to the
output encoding. Let us say that there is a parser which uses UTF-16 by
default. It receives UTF-8 or some other encoding on input, parses the
data and passes it to the application; the application makes some changes
and uses the same parser for serialization. If the application author
does not take care to remember the original encoding, the processor is
free to emit the data in either UTF-16 or UTF-8, because either way the
output is perfectly well-formed XML. You can of course specify the output
encoding through your API, but this is something the programmer must
remember, and it is very easy to forget (in most cases it does not
matter which encoding you are using).
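To make the point concrete, here is a minimal sketch in Python using the
standard xml.etree.ElementTree module. The function name add_request_id,
the sample document and the ID value are assumptions made up for this
illustration; they do not come from any real protocol. The sketch only
shows that the parsed tree does not carry the input encoding along, so
the output encoding is whatever the serializing call is told to use.

```python
import xml.etree.ElementTree as ET


def add_request_id(document: bytes, request_id: str, output_encoding: str) -> bytes:
    """Parse a document, add an id attribute to the root, serialize it again.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # The parser detects and decodes the input encoding on its own ...
    root = ET.fromstring(document)
    root.set("id", request_id)
    # ... but the tree does not remember it, so the caller has to choose
    # the output encoding explicitly when serializing.
    return ET.tostring(root, encoding=output_encoding, xml_declaration=True)


# Sample input: a UTF-8 document with two Chinese characters as content.
incoming = (
    "<?xml version='1.0' encoding='UTF-8'?><request>\u4e2d\u6587</request>"
).encode("utf-8")

as_utf8 = add_request_id(incoming, "42", "utf-8")
as_utf16 = add_request_id(incoming, "42", "utf-16")

# Both results are well-formed and carry the same information,
# but the byte streams on the wire are different.
same_content = ET.fromstring(as_utf8).text == ET.fromstring(as_utf16).text
same_bytes = as_utf8 == as_utf16
```

A receiver that accepts only UTF-8 would reject the second result even
though any conforming XML parser reads both of them identically.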
At this moment the possibility of such a program sitting in the way may
seem academic and hypothetical, but I can imagine several use cases.
For instance, in the near future there may be ubiquitous domain-specific
XML protocols. A junior admin is given a task: every time a request for
blabla is received (protocol RFC Foo3), add an ID attribute to the root
element so that we can keep track of requests. There are just a few
requests per hour, so performance is not important. XSLT is a very
convenient solution for this task, as it can be implemented in no time
(you just call one extension function to get the ID and the rest is done
in 10 lines of XSLT stylesheet). The XML processor currently in use
outputs UTF-8 by default, so nobody suspects a problem. Next year the
server is upgraded and another XML processor is installed, but this one
outputs UTF-16 by default. How long will it take to locate the problem?
I am afraid that if there is a good match between the quality of the
program and the skill of the new programmer, it can take quite a few
days :)
In my opinion it is really not a good idea to permit rejection of
well-formed documents (and rejection of UTF-16 XML documents is such a
case).
>
> >2) there can be a substantial penalty for Asian and other
> >communities not using ASCII related sets. I have seen an estimation that
> >an average Chinese text uses about 3 bytes per one UTF-8 character
> >and so the size of data to be transmitted can rise by 50% just by using
> >UTF-8 instead of UTF-16, and I suppose that this penalty may be much worse
> >for some other language groups. As I expect that XML protocols will be
> >often used for transfer of textual data, which can be quite large, this
> >can be a very important criterion.
>
> The penalty is 3 bytes or 50% (when compared to UTF-16, but 200%
> when compared to legacy encodings) for scripts such as Thai, Georgian,
> Devanagari,..., and 4 bytes or 0% (when compared to UTF-16; potentially
> 300% when compared to imaginary legacy encodings) for scripts such as
> Old Italic, Deseret, and very rare ideographs.
>
> Assuming that in an IETF-defined protocol, the element and attribute
> names and quite a bit of the attribute values are ASCII, my expectation
> is that the average 'XML Protocol' will easily have an ASCII content
> of around or above 50% even if it's e.g. purely Chinese. Because the
> penalty for ASCII is 100% when moving from UTF-8 to UTF-16, there is
> nothing much to be gained from using UTF-16 in such cases.
>
> But this of course depends on the nature of the protocol.
>
We are talking about the future of XML, and some protocols may be used to
transfer large quantities of data.
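The size figures discussed above are easy to check. The following is a
rough sketch in Python; the helper name encoded_sizes and the sample
strings are assumptions chosen for illustration, and byte order marks are
left out of the count by using the "utf-16-le" codec.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# 100 Chinese characters: 3 bytes each in UTF-8, 2 bytes each in UTF-16.
chinese_only = encoded_sizes("\u4e2d" * 100)   # (300, 200)

# 100 ASCII characters: 1 byte each in UTF-8, 2 bytes each in UTF-16.
ascii_only = encoded_sizes("a" * 100)          # (100, 200)

# A short message with ASCII markup around Chinese content
# (29 ASCII characters of markup, 20 Chinese characters of text).
mixed = encoded_sizes('<message lang="zh">' + "\u4e2d" * 20 + "</message>")
```

With little markup and a lot of text the UTF-16 form is smaller; with a
lot of markup and short text values the UTF-8 form wins. Which case
applies depends on the nature of the protocol.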
And if you have only ASCII files, you can write something like: "for
efficiency reasons it is recommended that the UTF-8 encoding is used",
but the very useful XML well-formedness concept should stay intact.
> Regards, Martin.
>
--
******************************************
<firstName> Miloslav </firstName>
<surname> Nic </surname>
<mail> nicmila@xxxxxxxxxxxx </mail>
<support> http://www.zvon.org </support>