
Re: XML Guidelines -04



On Wed, 5 Jun 2002, Martin Duerst wrote:
> 
> has to be reissued as utf-8, but not in the case of
> 
> <?xml version='1.0' ?> ? [even though, given no external info,
> 
> the latter has to be processed as UTF-8 as well.
> 

It can be even worse: the declaration only says that the current document
is in the UTF-8 encoding.

So if a program uses an XML parser for its input/output handling, nothing
in the incoming document ties the encoding of the output to the encoding
of the input. Let us say there is a parser which uses UTF-16 by default.
It receives UTF-8 or some other encoding on input, parses the data and
passes it to the application; the application makes some changes and uses
the same parser for serialization. If the application author does not
take care to remember the original encoding, the processor is free to
emit data in either UTF-16 or UTF-8, because either way the output is
perfectly well-formed XML. You can of course specify the output encoding
in your API, but this is something the programmer must remember, and it
is so easy to forget (in most cases it does not matter which encoding you
are using).
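
A minimal sketch of this situation, using Python's standard
xml.etree.ElementTree. The sample request and the helper name
add_request_id are made up for illustration; the point is only that the
output encoding comes from an argument and never from the incoming
document:

```python
import xml.etree.ElementTree as ET

# Hypothetical incoming request, declared and encoded as UTF-8.
incoming = (
    "<?xml version='1.0' encoding='UTF-8'?><request>blabla</request>"
).encode("utf-8")


def add_request_id(data, request_id, output_encoding):
    """Parse the input, add an id attribute to the root, serialize again.

    The parser has already decoded the input, so nothing here remembers
    which encoding the incoming document used.
    """
    root = ET.fromstring(data)
    root.set("id", request_id)
    return ET.tostring(root, encoding=output_encoding)


as_utf8 = add_request_id(incoming, "42", "utf-8")
as_utf16 = add_request_id(incoming, "42", "utf-16")

# The byte streams differ, yet both are well-formed XML with the same content.
assert as_utf8 != as_utf16
for out in (as_utf8, as_utf16):
    reparsed = ET.fromstring(out)
    assert (reparsed.tag, reparsed.get("id"), reparsed.text) == (
        "request",
        "42",
        "blabla",
    )
```

A receiver that accepts only UTF-8 would reject the second result even
though the application logic is identical in both calls.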

At this moment the possibility of such a program sitting in the way may
seem academic and hypothetical, but I can imagine several use cases.

For instance, in the near future there may be ubiquitous domain-specific
XML protocols. A junior admin is given a task: every time a request for
blabla is received (protocol RFC Foo3), add an ID attribute to the root
element so we can keep track of requests. There are just a few requests
per hour, so performance is not important. XSLT is a very convenient
solution for this task, as it can be implemented in no time (you just call
one extension function to get the ID and the rest is done in 10 lines of
XSLT stylesheet). The currently used XML processor outputs UTF-8 by
default, and so nobody suspects a problem. Next year there is an upgrade
of the server and another XML processor is installed, but it outputs
UTF-16 by default. How long will it take to locate the problem? I am
afraid that if there is a good match between the quality of the program
and the skill of the new programmer, it can take quite a few days :)

In my opinion it is really not a good idea to permit rejection of
well-formed documents (and rejection of UTF-16 XML documents is such a
case).



> 
> >2) there can be a substantial penalty for Asian  and other
> >communities not using ASCII related sets. I have seen an estimation that
> >an average Chinese text uses about  3 bytes per one UTF-8 character
> >and so the size of data to be transmitted can rise by 50% just by using
> >UTF-8 instead of UTF-16, and I suppose that this penalty may be much worse
> >for some other language groups. As I expect that XML protocols will be
> >often used for transfer of textual data, which can be quite large, this
> >can be a very important criterion.
> 
> The penalty is 3 bytes or 50% (when compared to UTF-16, but 200%
> when compared to legacy encodings) for scripts such as Thai, Georgian,
> Devanagari,..., and 4 bytes or 0% (when compared to UTF-16; potentially
> 300% when compared to imaginary legacy encodings) for scripts such as
> Old Italic, Deseret, and very rare ideographs.
> 
> Assuming that in an IETF-defined protocol, the element and attribute
> names and quite a bit of the attribute values are ASCII, my expectation
> is that the average 'XML Protocol' will easily have an ASCII content
> of around or above 50% even if it's e.g. purely Chinese. Because the
> penalty for ASCII is 100% when moving from UTF-8 to UTF-16, there is
> nothing much to be gained from using UTF-16 in such cases.
> 
> But this of course depends on the nature of the protocol.
> 


We are talking about the future of XML, and some protocols may be used to
transfer large quantities of data.

And if you have only ASCII files, you can write something like "for
efficiency reasons it is recommended that the UTF-8 encoding be used",
but the very useful XML well-formedness concept should stay intact.
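
To put rough numbers on the size trade-off discussed above, here is a
small Python sketch. The sample strings are invented for illustration and
the BOM is left out of the UTF-16 count; real protocol traffic will of
course have its own mix of markup and text:

```python
def encoded_sizes(text):
    """Return (UTF-8 bytes, UTF-16 bytes) for text, not counting a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Illustrative samples only.
ideographs = "中文文本" * 100        # common CJK ideographs, all in the BMP
markup = "<request id='1'/>" * 100  # pure ASCII, as protocol markup often is

for label, sample in (
    ("ideographs", ideographs),
    ("ASCII markup", markup),
    ("mixed", ideographs + markup),
):
    utf8_bytes, utf16_bytes = encoded_sizes(sample)
    print(f"{label}: UTF-8 {utf8_bytes} bytes, UTF-16 {utf16_bytes} bytes")
```

Which encoding wins for a given protocol depends on the ratio of ASCII
markup to non-ASCII text, which is why I would keep this a recommendation
and not a restriction.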



> Regards,    Martin.
> 

-- 
******************************************
<firstName> Miloslav </firstName>
<surname>   Nic      </surname>

<mail>    nicmila@xxxxxxxxxxxx    </mail>
<support> http://www.zvon.org  </support>