Martin Duerst wrote:
...2) there can be a substantial penalty for Asian and other communities not using ASCII related sets.
UTF-8 takes 3 bytes per character, a penalty of 50% (when compared to UTF-16, but 200% when compared to legacy encodings), for scripts such as Thai, Georgian, Devanagari,..., and 4 bytes per character, a penalty of 0% (when compared to UTF-16; potentially 300% when compared to imaginary legacy encodings), for scripts such as Old Italic, Deseret, and very rare ideographs.
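For anyone who wants to check these figures, here is a minimal sketch in Python. The sample characters and the helper names (`encoded_sizes`, `overhead_percent`) are my own choices for illustration, and TIS-620 merely stands in for "a legacy encoding" for Thai; none of this comes from the thread itself.

```python
# One sample character per script (arbitrary picks, for illustration only).
samples = {
    "ASCII": "A",
    "Thai": "\u0e01",            # THAI CHARACTER KO KAI
    "Georgian": "\u10d0",        # GEORGIAN LETTER AN
    "Devanagari": "\u0915",      # DEVANAGARI LETTER KA
    "Old Italic": "\U00010300",  # OLD ITALIC LETTER A
    "Deseret": "\U00010400",     # DESERET CAPITAL LETTER LONG I
}


def encoded_sizes(ch):
    """Return (UTF-8 bytes, UTF-16 bytes) for one character, no BOM."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


def overhead_percent(ch):
    """Extra size of UTF-8 relative to UTF-16, in percent."""
    u8, u16 = encoded_sizes(ch)
    return (u8 - u16) * 100 // u16


for script, ch in samples.items():
    u8, u16 = encoded_sizes(ch)
    print(f"{script:11} UTF-8: {u8}  UTF-16: {u16}  "
          f"overhead: {overhead_percent(ch)}%")

# Legacy comparison for Thai, assuming TIS-620 as the legacy encoding:
# one byte per character, so UTF-8 is 200% larger.
thai_legacy = len(samples["Thai"].encode("tis_620"))
```

The BMP scripts come out at 3 bytes against 2, and the supplementary-plane scripts at 4 bytes in both encodings, since UTF-16 needs a surrogate pair there.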
First, as Martin points out, UTF-16 wins for content that is mostly Asian, UTF-8 for content that is mostly ASCII.
Yes. Reading my text more carefully, you can also interpret it as saying 'if you really are concerned about the length of Asian texts, UTF-8 and UTF-16 alone may not be enough'. This applies in particular to the Indian subcontinent and adjacent countries.
Martin predicts that IETF protocol content will be "mostly ASCII", which implies that (a) it's in a European language, or (b) the density of markup is very high, and markup names are constrained to be in European languages. This seems like a really risky prediction to me as I look around the world, but maybe I'm missing something. Is the reason obvious to others?
I think both Larry and Gavin have supported my opinion. Anyway, I would be fine with some text that said: Make it UTF-8 only if it's mostly data, but allow UTF-16 (or also other encodings) if you expect a lot of text.
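To make the "mostly data" versus "a lot of text" distinction concrete, here is a rough sketch that compares the encoded size of a whole message. The function name `smaller_encoding` and both sample payloads are invented for illustration; they are not taken from any actual protocol.

```python
def smaller_encoding(text):
    """Return (name of the smaller encoding, UTF-8 size, UTF-16 size).

    Sizes are in bytes without a BOM; ties go to UTF-8.
    """
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-be"))
    name = "utf-8" if u8 <= u16 else "utf-16"
    return name, u8, u16


# Hypothetical payload that is mostly ASCII markup around two Thai letters.
markup_heavy = '<msg id="1" lang="th">\u0e01\u0e02</msg>'

# Hypothetical payload that is mostly Thai text with very little markup.
text_heavy = "\u0e01" * 100 + "<br>"

print(smaller_encoding(markup_heavy))
print(smaller_encoding(text_heavy))
```

With these two payloads the markup-heavy message is smaller in UTF-8 and the text-heavy one is smaller in UTF-16, which is the trade-off the proposed wording tries to capture.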