
Re: XML Guidelines -04




Martin Duerst wrote:


2) there can be a substantial penalty for Asian and other
communities not using ASCII related sets.
...

In UTF-8, the cost is 3 bytes per character, a penalty of 50% when
compared to UTF-16 (but 200% when compared to legacy encodings), for
scripts such as Thai, Georgian, Devanagari, ..., and 4 bytes per
character, a penalty of 0% when compared to UTF-16 (potentially 300%
when compared to imaginary legacy encodings), for scripts such as
Old Italic, Deseret, and very rare ideographs.

Assuming that in an IETF-defined protocol, the element and attribute
names and quite a bit of the attribute values are ASCII, my expectation
is that the average 'XML Protocol' will easily have an ASCII content
of around or above 50% even if it's e.g. purely Chinese. Because the
penalty for ASCII is 100% when moving from UTF-8 to UTF-16, there is
nothing much to be gained from using UTF-16 in such cases.
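
The arithmetic in the quoted paragraphs can be checked with a short sketch. The sample characters and the helper names (encoded_sizes, utf8_penalty_percent) are illustrative choices, not part of the original message, and sizes are counted without a byte-order mark.

```python
from typing import Tuple


def encoded_sizes(text: str) -> Tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text, with no BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


def utf8_penalty_percent(text: str) -> float:
    """Size increase of UTF-8 relative to UTF-16, in percent (negative = saving)."""
    u8, u16 = encoded_sizes(text)
    return 100.0 * (u8 - u16) / u16


# One illustrative character per script mentioned in the thread.
samples = {
    "ASCII": "a",
    "Thai": "\u0e01",
    "Georgian": "\u10d0",
    "Devanagari": "\u0915",
    "CJK ideograph": "\u4e2d",
    "Deseret": "\U00010400",
}

for name, ch in samples.items():
    u8, u16 = encoded_sizes(ch)
    print(f"{name:14} UTF-8: {u8}  UTF-16: {u16}  "
          f"penalty: {utf8_penalty_percent(ch):+.0f}%")

# A string that is half ASCII and half CJK by character count:
# the two encodings come out the same size, which is the break-even point.
half_ascii = "abcd" + "\u4e2d\u6587\u5b57\u7b26"
print(encoded_sizes(half_ascii))
```

With a fraction a of ASCII characters and the rest 3-byte characters, UTF-8 needs 3 - 2a bytes per character on average against 2 for UTF-16, so the two are equal at a = 0.5 and UTF-8 is smaller above that.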

First, as Martin points out, UTF-16 wins for content that is mostly Asian, and UTF-8 wins for content that is mostly ASCII.


Martin predicts that IETF protocol content will be "mostly ASCII", which implies that (a) it's in a European language, or (b) the density of markup is very high, and markup names are constrained to be in European languages. This seems like a really risky prediction to me as I look around the world, but maybe I'm missing something. Is the reason obvious to others?

-Tim