Re: printable wide character (was "multibyte") encodings
>As for UTF-2...I suggest that this WG define two 10646/Unicode charsets:
>
>1) "flat": canonical form is to transmit each n-bit character as n/8
>octets, in order from most significant octet first to least significant
>octet last.
>
>2) "UTF-2": canonical form is a UTF-2 stream.
Well, one is always better than two, of course, but in general this
strikes me as plausible.
However, I think charset (1) may need a bit more thought. Is n variable
from character to character? How do you know the value of n? My impression
is that the "flat" 10646 codes are not self-describing in any way, so you
need external information to know how many octets constitute a character.
There are at least two values of n -- 16 and 32 -- which are likely to be
either popular or politically required: 16 because it will be what almost
everyone will use, 32 because technically 10646 is a 32-bit standard and
not everyone is happy with the first 16-bit plane's contents (aka Unicode).
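
To make the ambiguity concrete, here is a minimal sketch, assuming "flat" means exactly what the proposal says: each character sent as n/8 octets, most significant octet first. The function names `encode_flat` and `decode_flat` are hypothetical, made up for this illustration, and are not part of any proposal or standard. The point is that the same octet stream decodes without complaint under either value of n, so nothing in the stream itself tells the receiver which one was meant.

```python
def encode_flat(code_points, n_bits):
    """Encode each code point as n_bits/8 octets, most significant octet first.

    Hypothetical helper: a sketch of the "flat" charset under the assumption
    that n is fixed for the whole stream.
    """
    if n_bits % 8 != 0:
        raise ValueError("n must be a multiple of 8")
    width = n_bits // 8
    # int.to_bytes raises OverflowError if a code point does not fit in width octets.
    return b"".join(cp.to_bytes(width, "big") for cp in code_points)


def decode_flat(octets, n_bits):
    """Decode a flat stream, given n supplied from outside the stream."""
    if n_bits % 8 != 0:
        raise ValueError("n must be a multiple of 8")
    width = n_bits // 8
    if len(octets) % width != 0:
        raise ValueError("stream length is not a multiple of n/8 octets")
    return [
        int.from_bytes(octets[i:i + width], "big")
        for i in range(0, len(octets), width)
    ]


# Two characters, 0x0041 and 0x0042, sent with n = 16.
stream = encode_flat([0x0041, 0x0042], 16)

# A receiver that knows n = 16 recovers both characters.
as_16 = decode_flat(stream, 16)

# A receiver that assumes n = 32 sees one different character, with no error.
as_32 = decode_flat(stream, 32)
```

Here `as_16` holds the two original code points, while `as_32` holds the single value 0x00410042. Both decodes succeed, which is why the value of n has to come from somewhere outside the data, such as the charset name.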
Henry Spencer at U of Toronto Zoology
henry@zoo.toronto.edu utzoo!henry