
Re: XML Guidelines -04




At 11:07 02/06/05 +0200, Miloslav Nic wrote:



Just a short comment about UTF-8 vs. UTF-16:

In my personal opinion I would consider
the possibility of not supporting UTF-16 unfortunate.

I can see these problems:

1) Let's say there is a communication where the end points are UTF-8
restricted processors. But there is no guarantee that there is not
an intermediary on the way which uses a standard XML
processor, accepting the message and re-emitting it, maybe enhanced
with some other information. If there is no XML declaration specifying
the encoding, the processor can choose which one to use for the re-emitted result.

There is no spec for an 'intermediary XML processor'. In particular, it is completely unclear to me where you got the 'if there is no XML declaration specifying the encoding' from. Would you say that something starting with

<?xml version='1.0' encoding='utf-8' ?>

has to be reissued as utf-8, but not in the case of

<?xml version='1.0' ?>

[even though, given no external info, the latter has to be processed as UTF-8 as well]?
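
To illustrate the point about the two declarations, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The element name and sample content are made up for illustration. The assumption is that there is no external encoding information and no byte order mark, in which case both documents should be decoded as UTF-8.

```python
import xml.etree.ElementTree as ET

# U+4E2D encoded as UTF-8 (three bytes); the element name 'a' is hypothetical.
payload = b"<a>\xe4\xb8\xad</a>"

# Same content, once with an explicit encoding declaration and once without.
with_decl = b"<?xml version='1.0' encoding='utf-8' ?>" + payload
without_decl = b"<?xml version='1.0' ?>" + payload

text_with = ET.fromstring(with_decl).text
text_without = ET.fromstring(without_decl).text

# Both inputs are UTF-8, so a processor sees the same character data.
print(text_with == text_without)
```

Since both inputs decode to the same character data, a processor re-emitting the message has no more reason to change the encoding in one case than in the other.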


2) There can be a substantial penalty for Asian and other
communities not using ASCII-related sets. I have seen an estimate that
an average Chinese text uses about 3 bytes per character in UTF-8,
and so the size of data to be transmitted can rise by 50% just by using
UTF-8 instead of UTF-16, and I suppose that this penalty may be much worse
for some other language groups. As I expect that XML protocols will
often be used for transfer of textual data, which can be quite large, this
can be a very important criterion.

The penalty for UTF-8 is 3 bytes per character, or 50% when compared to UTF-16 (but 200% when compared to single-byte legacy encodings), for scripts such as Thai, Georgian, Devanagari, ..., and 4 bytes per character, or 0% when compared to UTF-16 (potentially 300% when compared to imaginary legacy encodings), for scripts such as Old Italic, Deseret, and very rare ideographs.
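
The per-character numbers can be checked with a few lines of Python. This is only a sketch: the function name and the sample characters are my own choice for illustration, one character per script, and "utf-16-be" is used so that no byte order mark is counted.

```python
def utf8_penalty_percent(ch):
    """Return (UTF-8 bytes, UTF-16 bytes, UTF-8 penalty in percent) for one character."""
    u8 = len(ch.encode("utf-8"))
    u16 = len(ch.encode("utf-16-be"))  # big-endian variant, no BOM added
    return u8, u16, (u8 - u16) * 100 // u16

# One sample character per script (chosen for illustration only).
samples = {
    "ASCII": "a",
    "Thai": "\u0e01",
    "Georgian": "\u10d0",
    "Devanagari": "\u0915",
    "Old Italic": "\U00010300",
    "Deseret": "\U00010400",
}

for script, ch in samples.items():
    print(script, utf8_penalty_percent(ch))
```

A negative percentage for ASCII means UTF-8 is the smaller encoding there, which is the other side of the trade-off.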

Assuming that in an IETF-defined protocol, the element and attribute
names and quite a bit of the attribute values are ASCII, my expectation
is that the average 'XML Protocol' will easily have an ASCII content
of around or above 50% even if it's e.g. purely Chinese. Because the
penalty for ASCII is 100% when moving from UTF-8 to UTF-16, there is
nothing much to be gained from using UTF-16 in such cases.

But this of course depends on the nature of the protocol.
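
As a rough sketch of the mixed case, the following Python compares the total size of a message with ASCII markup and Chinese text in both encodings. The markup string and the function name are hypothetical, and the ratios tried are arbitrary. With n ASCII characters and m Chinese characters, UTF-8 needs n + 3m bytes and UTF-16 needs 2n + 2m bytes, so the two are equal when n = m.

```python
def message_sizes(ascii_markup, cjk_text):
    """Return (UTF-8 bytes, UTF-16 bytes) for markup plus text content."""
    # Only the total size matters here, so the parts are simply concatenated.
    message = ascii_markup + cjk_text
    return len(message.encode("utf-8")), len(message.encode("utf-16-be"))

# Hypothetical ASCII-only markup of an imaginary protocol message.
markup = '<msg lang="zh"><body></body></msg>'

# Try half as many, as many, and twice as many Chinese characters as ASCII ones.
for cjk_count in (len(markup) // 2, len(markup), len(markup) * 2):
    u8, u16 = message_sizes(markup, "\u4e2d" * cjk_count)
    print(cjk_count, u8, u16)
```

With equal character counts the sizes match; UTF-16 only wins once the Chinese text clearly outnumbers the ASCII markup.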

Regards, Martin.