Re: printable wide character (was "multibyte") encodings
>Issues of nationalistic pride aside, one real strength of Unicode
>(the 16-bit subset of 10646) is that it strives in large part to
>map characters one-to-one with code points *without* "out of
>band" codes such as ... character-set-switching escape sequences ...
>The importance of a one-to-one mapping (in the case of Unicode,
>between characters and 16-bit quantities) becomes apparent when
>additional processing steps are imposed. It's nice always to be
>able to know where the individual character boundaries are, and
>not to misinterpret partial bytes which aren't full characters...
(Steve's later discussion makes it clear that he knows more about
UTF-2 than this would suggest, but for the benefit of others...)
UTF-2 does not use escape sequences. It is merely a variable-width
encoding of 10646, as opposed to the obvious fixed-width encoding.
Moreover, unlike the original UTF encoding, UTF-2 is completely
unambiguous -- there is *no* uncertainty about the location of
character boundaries, and no doubt about whether you are seeing a
full character or partial bytes.
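To make the boundary claim concrete, here is a minimal sketch. It assumes UTF-2 has the same byte layout as the encoding Python exposes under the codec name "utf-8", where every continuation octet has the form 10xxxxxx. The names char_starts and resync are made up for this illustration.

```python
def char_starts(data: bytes) -> list[int]:
    """Return the offsets at which a character begins."""
    # Assumption: continuation octets look like 10xxxxxx, so any octet
    # that does not match that pattern starts a new character.
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


def resync(data: bytes, pos: int) -> int:
    """From an arbitrary offset, skip forward to the next character start."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# 'a', '<', a two-octet character, and a three-octet character.
sample = "a<\u00e9\u65e5".encode("utf-8")
starts = char_starts(sample)
```

A reader dropped into the middle of such a string only has to look at the top two bits of each octet to find the next boundary; no state from earlier in the stream is needed.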
Note that there is already considerable experience with use of UTF-2
in Bell Labs' "Plan 9" operating system, and it has generally been
quite favorable. Processing is not a problem. A paper on the topic
is appearing at the Usenix conference at the end of this month.
>Richtext only barely meshes with ISO-2022-JP because 2022-JP is
>sometimes 8 bits per character and sometimes 16. Since a
>richtext parser isn't likely to understand that distinction, it
>can get confused when an 8-bit half of a 16-bit character happens
>to match the bit pattern for '<'. The solution, as Rhys
>Weatherly has proposed, is to further encode an 8-bit half with
>that value as <lt>...
An easier solution is to use UTF-2, in which this situation never
occurs, by design. The bit pattern for '<' occurs in UTF-2 only as
the character '<'. This is not an accident.
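A small sketch of the difference, again assuming UTF-2 matches what Python calls "utf-8". The fixed-width 16-bit form ("utf-16-be") stands in here for the "8-bit half of a 16-bit character" case; it is not ISO-2022-JP itself, and the code point 0x3C41 was picked only because its high octet equals the octet for '<'. The helper names are hypothetical.

```python
LT = 0x3C  # the octet for '<'


def lt_offsets(data: bytes) -> list[int]:
    """Offsets where a naive byte-oriented parser would see '<'."""
    return [i for i, b in enumerate(data) if b == LT]


# An arbitrary code point whose high octet is 0x3C.
ch = chr(0x3C41)
fixed = ch.encode("utf-16-be")   # two octets: 0x3C 0x41
variable = ch.encode("utf-8")    # three octets, all with the high bit set


def multibyte_is_ascii_free() -> bool:
    """Check that no non-ASCII character below the surrogate range
    encodes to any octet in the ASCII range."""
    for cp in range(0x80, 0xD800):
        if any(b < 0x80 for b in chr(cp).encode("utf-8")):
            return False
    return True
```

In the fixed-width form the parser sees a spurious '<' at offset 0; in the variable-width form it sees none, so no <lt>-style re-encoding of character halves is needed.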
I used to take a very dim view of >8-bit character sets, since I
couldn't see a graceful transition path that didn't involve a lot of
pain and incompatibility. Many of Steve's comments about problems
with encodings, *in general*, make considerable sense. But none of
his objections apply to UTF-2 in particular. This particular encoding,
and no other, is what has convinced me that the transition can be made
without massive pain, without massive duplication of code, without
massive incompatibility. It is, purely and simply, *well designed*.
I see no reason why the failings of poorly-designed encodings should
be generalized into a desire to avoid encodings. It seems eminently
reasonable to me to state that only well-designed encodings merit
support and consideration. I do not think we should avoid this approach
because later consideration of poorly-designed encodings might cause
problems; the solution to that is not to consider them.
It is a mistake to view UTF-2 as a content-transfer-encoding. It is
an alternate way to represent 10646 characters, a variable-width form
as contrasted to the fixed-width form that 10646 uses. It is a mistake
to think of it as something layered on top of 10646. 10646 could just
as easily have defined the UTF-2 representations as the canonical form,
with a fixed-width alternate form to simplify some kinds of processing.
They are properly thought of as peers, and it is reasonable and proper
to consider something like "10646-UTF-2" to be a character set, one
which has a somewhat clumsy and misleading name for historical reasons.
The tremendous advantage of the 10646-UTF-2 character set is that it
breaks almost nothing in our existing software corpus. Apart from the
obvious issues of displaying non-ASCII characters, which have to be
dealt with regardless, only software that expects characters to be
one-to-one with octets (by, e.g., counting characters by counting octets,
or indexing into a character array by indexing into an octet array) is
likely to need attention. Software that merely moves character strings
around without having to analyze their structure -- and this covers an
awful lot of software -- can remain ignorant of the change. Even many
kinds of superficial analysis, e.g. richtext parsing, should work without
change.
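As a rough illustration of which code needs attention and which does not, the sketch below assumes UTF-2 behaves like Python's "utf-8" codec. The sample string and the count_chars helper are invented for the example.

```python
text = "na\u00efve caf\u00e9"      # 10 characters, two of them non-ASCII
octets = text.encode("utf-8")      # 12 octets


def count_chars(data: bytes) -> int:
    """Count characters by counting octets that start a character."""
    # Assumption: continuation octets have the form 10xxxxxx.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


# Code that equates characters with octets gets the wrong answer:
naive_count = len(octets)          # counts octets, not characters

# Code that only moves strings around, or splits on an ASCII delimiter,
# needs no change: the delimiter octet never appears inside a character.
words = octets.split(b" ")
rejoined = b" ".join(words)
```

Only naive_count is wrong here; the split and join operate on octets without knowing anything about the encoding and still produce whole, valid characters.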
Although there are several possible paths out of the 8-bit-character world,
this one is *overwhelmingly* the path of least resistance. This is the
one people are going to use. Certainly it is the only one I would take
seriously for my own software; the alternatives are just too painful.
Henry Spencer at U of Toronto Zoology