Re: Unicode...
>Very sorry you're so overwhelmingly negative on Unicode. ...
In a probably fruitless attempt to be a voice of moderation on this
one...
(1) The timing of the public emergence of the Unicode proposal relative
to the DIS balloting is *very* unfortunate. I don't think it is part of
a conspiracy.
(2) There seems to be general agreement that neither is what one would
call a perfect proposal. It may be that perfection is too much to hope
for in this area.
(3) 10646 comes, to a considerable extent, out of a "codes and character
sets" and data communications community, and a portion of that community
that is influenced by ISO procedures, membership and voting rules, etc.
It shows. That is not necessarily bad.
(4) Unicode comes, to a considerable extent, out of typographic
(according to the notes that I dug out last night, the first version
originated at Xerox) and programming languages communities, and portions
of those communities that are influenced by the workings of multivendor
consortia. And it shows. That is not necessarily bad either.
In some ways, the two are complementary, not competitive. Different
primary objectives came into play when the compromises had to be made.
In the email context, all other things being equal (and ignoring, for
the moment, the Han [non]unification problem), I'd rather deal with
10646 if I were writing an MTA, and would rather deal with Unicode if I
were doing a UA or the related tools.
Typographic folks often prefer overstrike sequences to "more
characters". Makes for smaller fonts, and, "if it looks the same on the
page, it is the same, isn't it?". And, for some purposes, it is.
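
A minimal sketch of the overstrike point, assuming present-day Python
and its standard unicodedata module (both postdate this message). The
code values U+00E9 and U+0301 come from today's Unicode charts, and the
function names are hypothetical. It shows one letter written as a
single code and as a base letter plus an overstruck accent: the two
differ code by code, yet match once both are folded to a composed form.

```python
import unicodedata

# Assumption: code values taken from the current Unicode charts.
precomposed = "\u00e9"   # one code position: e with acute accent
overstruck = "e\u0301"   # two code positions: e, then a combining acute


def same_codes(a, b):
    """Code-by-code equality, as a naive string compare would do it."""
    return a == b


def same_on_page(a, b):
    """Equality after composing base letters with their overstrikes (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(precomposed), len(overstruck))
print(same_codes(precomposed, overstruck))
print(same_on_page(precomposed, overstruck))
```

Which of the two comparisons is the right one depends on the purpose,
which is the sense of "for some purposes, it is" above.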
Speaking as a programming language person (former convenor of
JTC1/SC22/WG11 if the credentials are necessary), I *loathe* the idea of
a variable-length character set. 7 bits is fine. 8 bits is fine. 16
is ok, too. If it has to be 32, I hope storage is real cheap. But
variable length and shifts are a catastrophe--they foul up everything I
want to know about how to allocate storage, overlaying things, insertion
into strings, equality comparisons and so on. From that point of view,
Unicode started out as a candidate for the next logical member of the
series that starts with ISO646 IRV and moves through 8859 (one more
bit); Unicode is 8 more bits than that. Fixed character set, one code
position equals one character. The biggest argument against it from
that standpoint is that, because of the overstriking, it fails in that
objective. But not by very much. (The "how would you like to write C
when every other character is null" question is bogus--a Unicode-
supporting C, properly defined, would just add, e.g., "long char" and
handle all of those as 16 bit quantities, so the string-terminating
"null" might suddenly become 0x0000. If you don't ask them to invent
automatic conversion rules between Volkswagens and apples, compiler
writers are very good at this sort of thing.)
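
The parenthetical above can be sketched as follows. This is an
illustration, not anything taken from a real compiler: Python's struct
module stands in for a C "long char" array, little-endian byte order is
an assumption, and all function names are hypothetical. It contrasts
what a byte-oriented length routine sees in a 16-bit string with what a
16-bit-aware one sees, and shows that with fixed-width units character
k always sits at byte offset 2*k.

```python
import struct

UNIT = 2  # bytes per character in this sketch (16-bit code units)


def encode_fixed16(codes):
    """Pack code values as 16-bit units followed by a 0x0000 terminator.

    Assumption: little-endian byte order, chosen only for the example.
    """
    units = list(codes) + [0x0000]
    return struct.pack("<%dH" % len(units), *units)


def byte_strlen(buf):
    """What a byte-oriented strlen sees: it stops at the first zero byte."""
    return buf.index(0)


def unit_strlen(buf):
    """What a 16-bit-aware strlen sees: it stops at the first 0x0000 unit."""
    n = 0
    while struct.unpack_from("<H", buf, n * UNIT)[0] != 0x0000:
        n += 1
    return n


def unit_at(buf, k):
    """Fixed width: character k always starts at byte offset k * UNIT."""
    return struct.unpack_from("<H", buf, k * UNIT)[0]


hi = encode_fixed16([0x0048, 0x0069])  # "Hi"
print(byte_strlen(hi), unit_strlen(hi), hex(unit_at(hi, 1)))
```

The byte-oriented routine gives up after the first byte, which is the
"every other character is null" worry; the routine that treats the
string as 16-bit quantities finds both characters and the 0x0000
terminator.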
I could imagine contexts in which I'd be happy using son-of-DIS10646 for
communications between machines and then converting to daughter-of-
Unicode for local processing. Converting back would, of course, require
that I know the context for dealing with the mapping of a single
Unicode character position into (typographically) non-unique 10646
characters. Mostly, I suspect I'd know.
I would expect, for example, that someone in Japan trying to print a
document that contained (in Kanji and kana) a translation and commentary
on a traditional Chinese text (with the Han characters appearing in the
text) would really prefer Unicode. Or would use 10646 and cheat. It
really depends on the application.
My earlier personally-expressed objections to one are also objections
to the other:
-- they are moving targets
-- common, garden-variety, terminals and UAs are not set up to handle
character environments involving > 8 bits, partially because the
programming languages in which they are written aren't.
The latter will change, and the targets will stop moving. But I'd
prefer to be able to send and receive mail with "international"
characters sooner than I expect the change to stop and the programming
environments to be widely available.
--john
-------