Re: printable wide character (was "multibyte") encodings



>As for UTF-2...I suggest that this WG define two 10646/Unicode charsets:
>
>1) "flat": canonical form is to transmit each n-bit character as n/8
>octets, in order from most significant octet first to least significant
>octet last.
>
>2) "UTF-2":  canonical form is a UTF-2 stream.

Well, one is always better than two, of course, but in general this
strikes me as plausible.

However, I think charset (1) may need a bit more thought.  Is n variable
from character to character?  How do you know the value of n?  My impression
is that the "flat" 10646 codes are not self-describing in any way, so you
need external information to know how many octets constitute a character.
There are at least two values of n -- 16 and 32 -- which are likely to be
either popular or politically required:  16 because it will be what almost
everyone will use, 32 because technically 10646 is a 32-bit standard and
not everyone is happy with the first 16-bit plane's contents (aka Unicode).
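
To make the ambiguity concrete, here is a minimal Python sketch of the
"flat" form as described in the proposal: each character is written as
n/8 octets, most significant octet first. The function names
(encode_flat, decode_flat) and the sample text are illustrative
assumptions, not part of any proposal. The point is only that the same
octet stream decodes to different characters depending on which n the
receiver assumes.

```python
def encode_flat(code_points, n_bits):
    """Hypothetical helper: write each code point as n_bits/8 octets,
    most significant octet first."""
    if n_bits % 8 != 0:
        raise ValueError("n_bits must be a multiple of 8")
    width = n_bits // 8
    return b"".join(cp.to_bytes(width, "big") for cp in code_points)


def decode_flat(octets, n_bits):
    """Hypothetical helper: split the stream into n_bits/8-octet units.
    Nothing in the stream itself says what n_bits should be."""
    width = n_bits // 8
    if len(octets) % width != 0:
        raise ValueError("stream length is not a multiple of the unit size")
    return [
        int.from_bytes(octets[i:i + width], "big")
        for i in range(0, len(octets), width)
    ]


# "Hi" sent as two 16-bit characters.
stream = encode_flat([0x0048, 0x0069], 16)

as_16 = decode_flat(stream, 16)  # two characters, as the sender intended
as_32 = decode_flat(stream, 32)  # one unrelated 32-bit value
```

Both decodes succeed without error, which is the problem: the value of n
has to come from somewhere outside the stream, such as a charset
parameter.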

                                         Henry Spencer at U of Toronto Zoology
                                          henry@zoo.toronto.edu   utzoo!henry