
Re: Unicode...



>Very sorry you're so overwhelmingly negative on Unicode.  ...

In a probably fruitless attempt to be a voice of moderation on this 
one...

(1) The timing of the public emergence of the Unicode proposal relative
to the DIS balloting is *very* unfortunate.  I don't think it is part of 
a conspiracy.

(2) There seems to be general agreement that neither is what one would 
call a perfect proposal.  It may be that perfection is too much to hope 
for in this area.

(3) 10646 comes, to a considerable extent, out of a "codes and character
sets" and data communications community, and a portion of that community
that is influenced by ISO procedures, membership and voting rules, etc. 
It shows.  That is not necessarily bad. 

(4) Unicode comes, to a considerable extent, out of typographic 
(according to the notes that I dug out last night, the first version 
originated at Xerox) and programming languages communities, and portions
of those communities that are influenced by the workings of multivendor
consortia.  And it shows.  That is not necessarily bad either.

In some ways, the two are complementary, not competitive.  Different 
primary objectives came into play when the compromises had to be made.  
In the email context, all other things being equal (and ignoring, for 
the moment, the Han [non]unification problem), I'd rather deal with 
10646 if I were writing an MTA, and would rather deal with Unicode if I 
were doing a UA or the related tools.

Typographic folks often prefer overstrike sequences to "more 
characters".  Makes for smaller fonts, and, "if it looks the same on the 
page, it is the same, isn't it?".  And, for some purposes, it is.
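To make the overstrike point concrete, here is a minimal sketch in C.  It 
assumes 16-bit code positions held in uint16_t and uses the present-day 
Unicode values U+00E9 (precomposed e-acute) and U+0065 + U+0301 (letter e 
followed by a combining acute accent).  The function and array names are 
illustrative only, not taken from either proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare two 0x0000-terminated 16-bit strings, one code position
   at a time.  Returns 1 if identical, 0 otherwise. */
int same_code_positions(const uint16_t *a, const uint16_t *b)
{
    assert(a != NULL && b != NULL);
    while (*a != 0x0000 && *a == *b) {
        a++;
        b++;
    }
    return *a == *b;
}

/* "e with acute" as a single precomposed code position. */
const uint16_t precomposed[] = { 0x00E9, 0x0000 };

/* The same mark on the page, built as a base letter followed by an
   overstruck (combining) accent: two code positions. */
const uint16_t overstruck[] = { 0x0065, 0x0301, 0x0000 };
```

The two arrays would print identically, but same_code_positions() reports 
them as different, so "it looks the same on the page" holds for the printer 
and not for a naive comparison.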

Speaking as a programming language person (former convenor of 
JTC1/SC22/WG11 if the credentials are necessary), I *loathe* the idea of 
a variable-length character set.  7 bits is fine.  8 bits is fine.  16 
is ok, too.  If it has to be 32, I hope storage is real cheap.  But 
variable length and shifts are a catastrophe--they foul up everything I 
want to know about how to allocate storage, overlaying things, insertion 
into strings, equality comparisons and so on.  From that point of view, 
Unicode started out as a candidate for the next logical member of the 
series that starts with ISO646 IRV and moves through 8859 (one more 
bit); Unicode is 8 more bits than that.  Fixed character set, one code 
position equals one character.  The biggest argument against it from 
that standpoint is that, because of the overstriking, it fails in that 
objective.  But not by very much. (The "how would you like to write C 
when every other character is null" question is bogus--a Unicode-
supporting C, properly defined, would just add, e.g., "long char" and 
handle all of those as 16 bit quantities, so the string terminating 
"null" might suddenly become 0x0000.  If you don't ask them to invent 
automatic conversion rules between Volkswagens and apples, compiler 
writers are very good at this sort of thing.)
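A rough sketch of what that might look like, written in ordinary C rather 
than in any real extended dialect.  The type name long_char and the lc_ 
helpers are hypothetical stand-ins (here just a typedef for uint16_t); the 
point is only that length, equality, and insertion stay simple index 
arithmetic when every character is one fixed-width unit and the terminator 
is 0x0000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 16-bit "long char" type. */
typedef uint16_t long_char;

/* Number of characters before the 16-bit 0x0000 terminator. */
size_t lc_len(const long_char *s)
{
    size_t n = 0;
    while (s[n] != 0x0000)
        n++;
    return n;
}

/* Element-by-element equality; no shift state to track. */
int lc_equal(const long_char *a, const long_char *b)
{
    while (*a != 0x0000 && *a == *b) {
        a++;
        b++;
    }
    return *a == *b;
}

/* Insert c before character position pos.  cap is the buffer size in
   elements, terminator included.  Returns 0 on success, -1 if the
   buffer is too small. */
int lc_insert(long_char *buf, size_t cap, size_t pos, long_char c)
{
    size_t len = lc_len(buf);
    assert(pos <= len);
    if (len + 2 > cap)
        return -1;
    /* Shift the tail, terminator included, up by one element. */
    memmove(&buf[pos + 1], &buf[pos], (len - pos + 1) * sizeof buf[0]);
    buf[pos] = c;
    return 0;
}
```

Storage for n characters is always (n + 1) * sizeof(long_char), and 
position pos is always buf[pos]; neither holds once shifts or 
variable-length sequences are involved.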

I could imagine contexts in which I'd be happy using son-of-DIS10646 for 
communications between machines and then converting to daughter-of-
Unicode for local processing.  Converting back would, of course, require 
that I know the context for dealing with the mapping of a single 
Unicode character position into (typographically) non-unique 10646 
characters.  Mostly, I suspect I'd know.

I would expect, for example, that someone in Japan trying to print a 
document that contained (in Kanji and kana) a translation and commentary
on a traditional Chinese text (with the Han characters appearing in the 
text) would really prefer Unicode.  Or would use 10646 and cheat.  It 
really depends on the application.

My earlier personally-expressed objections to one are also objections 
to the other:
  -- they are moving targets
  -- common, garden-variety, terminals and UAs are not set up to handle 
character environments involving > 8 bits, partially because the 
programming languages in which they are written aren't.
  The latter will change, and the targets will stop moving.  But I'd 
prefer to be able to send and receive mail with "international" 
characters sooner than I expect the change to stop and the programming 
environments to be widely available.

    --john
-------