Re: Unicode...



I regret that I have to be overwhelmingly negative once again.

> In case people didn't catch this one last week in the NY Times business
> section, see the attached.  Unicode is a pretty serious contender to
> replace ASCII 'round the world...

I've seen the proposal.  It's not "serious" in any hitherto defined
meaning of the term.

> === attachment ===

> San Francisco, Feb. 19 -- A group of leading computer companies today
> announced an ambitious effort to develop a lingua franca for the
> electronics age, a universal digital code that could be used by
> computers to represent letters and characters in all the world's
> languages.

There is another one called ISO DIS 10646.

> A consortium has been formed to develop and promote the new code, known
> as Unicode.  Its 12 members include many top computer companies that
> are often fierce rivals, like I.B.M., Apple Computer, Microsoft, Sun
> Microsystems and Xerox.

Anything IBM does with character sets is goddamn _EVIL_.  I got a new
printer today, and it has 9 different IBM "standard" character sets,
but nothing remotely similar to any international standard.  I fried
a whole array of salescreep, technical personnel, the Scandinavian
importer, European headquarters, and managed to produce a letter to
IBM and all the people I talked to, giving a few pointers to how things
should be done, and that they were pretty dumb not to do it that way.

None of them KNEW about ISO 8859-1, or ISO Latin 1, or any other coded
character sets.  A _long_ list of references to ISO standards was
appended to the letter.  (It looked just as nice as I thought it would
look, so the printer is OK.)  Unicode is not going to hit the market.
Its 16-bit-ness is going to be too much for vendors and users.

> If the code becomes a worldwide standard, it would be easier for people
> in different countries to communicate by electronic mail.  The code
> would also make it easier for software companies to develop programs
> that can work in different languages.

Yeah, right.  Vendors are not prepared to produce 65536 glyphs in a
terminal to accommodate all the symbols of Unicode, and there are no
easily defined subsets.  The "different languages" stuff is misleading.

> Right now, for instance, an American computer often cannot understand
> the codes used by a French computer to represent accented characters,
> so a message sent electronically from France to the United States might
> arrive without the accents or with mistaken characters.

A gross oversimplification, and strictly speaking it's even wrong.

> But with the new code, any computer anywhere could understand and
> display everything from French accents to Chinese ideographs, not to
> mention letters in Bengali, Hebrew, Arabic and other languages.

Right, 65536 glyphs in your VT200000 terminal.

> The most widespread system, the American Standard Code for Information
> Interchange, or Ascii, was approved as a standard in 1967.  Ascii
> (pronounced AS-kee) represents each letter and symbol as a sequence of
> eight zeros and ones.  The letter Y, for instance, is represented by
> the sequence 10111001.

Y, incidentally, is 1011001, in 7 bits.
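
A quick check of that arithmetic, as a minimal Python sketch.  The
variable names are mine and purely illustrative; the only facts used
are that ASCII assigns "Y" the code 89 and that the article printed
the sequence 10111001.

```python
# ASCII is a 7-bit code; "Y" has code 89.
y = ord("Y")

seven_bit = format(y, "07b")   # the 7 binary digits of the code
eight_bit = format(y, "08b")   # same code padded to a full octet

# The sequence printed in the article, read as a binary number.
article_value = int("10111001", 2)

print(seven_bit)                  # 1011001
print(eight_bit)                  # 01011001
print(article_value, "vs", y)     # 185 vs 89
```

The article's sequence is a different number altogether, outside the
7-bit range, so it cannot be an ASCII code for anything.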

> But the International Business Machines Corporation has used a
> different code in some computers, in which Y is represented by
> 11101000.  That means messages sent from a non-I.B.M. computer to an
> I.B.M. machine must be translated.

An understatement par excellence!
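
For what it's worth, the bit pattern quoted for the IBM code does
check out against EBCDIC.  The sketch below assumes code page 037
(Python's "cp037" codec), which is only one of many EBCDIC variants;
the article does not say which one it means.

```python
# Assumption: cp037 stands in for "the IBM code" in the article.
ebcdic_y = "Y".encode("cp037")
ascii_y = "Y".encode("ascii")

ebcdic_bits = format(ebcdic_y[0], "08b")
ascii_bits = format(ascii_y[0], "08b")

# What happens when EBCDIC bytes are read without translation,
# here by a receiver that assumes ISO 8859-1.
message = "YES"
misread = message.encode("cp037").decode("latin-1")
recovered = misread.encode("latin-1").decode("cp037")

print(ebcdic_bits)   # 11101000
print(ascii_bits)    # 01011001
print(repr(misread), "->", repr(recovered))
```

Every letter lands on a different code, so the translation is a
full table lookup in both directions, not a matter of fixing up a
few accents.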

> And because Ascii cannot handle special characters used in other
> languages, other countries have had to design their own codes.  Europe
> has its own 8-bit code, and Asian countries like Japan and China have
> their own codes to represent the thousands of different characters in
> their languages.

I'm not aware of any "European" 8-bit code.  The Japanese character sets
follow ISO 2022, unlike one major vendor's character sets.

> Ascii cannot be used to represent characters in all these languages
> because there are only 256 different 8-bit sequences of zeros and
> ones.

ASCII is 7 bits.  You could have states in the character code and
achieve an infinite number of characters with only 8-bit sequences
as the basic unit.  By the way, Unicode is not strictly 16 bits wide.
Diacritics are floating _behind_ the letter, and the set of combinations
is open-ended.  Please be kind to your favorite vendor, he's about to
die of shock.
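
To make the point about states concrete, here is a small Python
sketch using ISO-2022-JP.  That is just one profile of ISO 2022,
picked because Python ships a codec for it ("iso2022_jp"); the sample
text and names are my own.

```python
ESC = b"\x1b"

# "abc" followed by the two kanji for "Japan".
text = "abc\u65e5\u672c"
encoded = text.encode("iso2022_jp")

# Every byte stays in the 7-bit range; the escape sequences switch
# the decoder's state between character sets.
all_seven_bit = all(b < 0x80 for b in encoded)
escape_count = encoded.count(ESC)

print(encoded)
print(all_seven_bit, escape_count)
```

The kanji travel as pairs of ordinary 7-bit bytes between an escape
sequence that shifts into the Japanese set and one that shifts back
to ASCII.  Add more escape sequences and you add more character
sets, without ever widening the basic unit.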

> The proposed standard, Unicode, would represent letters and symbols by
> a sequence of 16 zeroes and ones, instead of eight, allowing for 65,536
> different combinations.  That is enough to give each character used in
> all the living languages of the world its own unique sequence, with
> enough combinations left over to eventually include obsolete scripts
> like cuneiform and hieroglyphs as well.

There are upwards of 190 000 distinct glyphs in use in the world today.

> With each character having a unique code, software programmers would no
> longer have to worry about which standard was being used or have to
> translate from one system to another.  That would make it easier, for
> instance, to develop a word-processing program that works in many
> languages.

Except for the problem with the floating diacritics, which for some
Greek letters result in 64-bit characters.  The problem is even worse in
Arabic.  The contender, ISO 10646, encodes these Greek letters in
16 bits uniformly, or varying between 8 and 24 bits.
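
The 64-bit figure is easy to reproduce.  The sketch below builds a
Greek alpha with three floating diacritics and counts 16-bit units.
Note the assumptions: Python's unicodedata module reflects today's
Unicode tables, not the draft under discussion, and the particular
letter is my own choice of example.

```python
import unicodedata

# alpha + psili + oxia + ypogegrammeni, written as a base letter
# followed by three floating (combining) diacritics.
decomposed = "\u03b1\u0313\u0301\u0345"

units = len(decomposed.encode("utf-16-be")) // 2   # 16-bit units
bits = units * 16

# The same letter as a single precomposed code, for comparison.
composed = unicodedata.normalize("NFC", decomposed)
composed_bits = len(composed.encode("utf-16-be")) * 8

print(units, bits)        # 4 64
print(composed_bits)      # 16
```

One letter on the page, four code units in memory, and software that
assumes "one character = 16 bits" is wrong by a factor of four.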

> He said the American computer and software companies were able to put
> aside their differences to work on the standard because all of them see
> their overseas markets becoming more important.  "How many things are
> there in the world where you can get Sun, I.B.M., Microsoft and Apple
> to agree?" he asked.

Anything which is sufficiently stupid.

> Unicode has been under development since 1989 by an informal group, and
> the proposed standard, which now includes sequences for 27,000
> characters, is expected to be completed this spring.

"An informal group"!  Couldn't have been said better.

> Unicode's developers said they have done extensive research and
> consulted with linguists.  One challenge was trying to represent all
> the Chinese ideographs that are also used in Korean and Japanese.

That's another understatement!

> But Unicode researchers found that the languages have more than 11,000
> symbols that are the same, allowing Unicode to represent all the
> Chinese ideographs with only 18,739 unique characters, instead of the
> more than 31,000 that would otherwise have been required.

Except that the Chinese (mainland _and_ Taiwan), the Korean and the
Japanese oppose the unification and will never use it.  Great work!

> The proposed code still faces hurdles.  The consortium needs to attract
> support from foreign companies, and an international standards
> organization is developing a competing code.  Using 16 bits instead of
> eight to represent each character would also mean that computers would
> require more memory and disk-storage capacity.

Well, kind of a 100% increase in everything, and even beyond that if you
have heavily accented letters which require more than one 16-bit unit.
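
The arithmetic, as a rough Python sketch.  It assumes ISO 8859-1 as
the 8-bit baseline and uses the "utf-16-be" codec to stand in for a
fixed 16-bit code; the sample strings are mine.

```python
text = "Unicode is a pretty serious contender"

eight_bit_size = len(text.encode("latin-1"))
sixteen_bit_size = len(text.encode("utf-16-be"))
increase_percent = 100 * (sixteen_bit_size - eight_bit_size) // eight_bit_size

# An accented letter: one byte in ISO 8859-1, but two 16-bit units
# when the accent floats behind the base letter.
precomposed = "\u00e9"   # e with acute, single code
floating = "e\u0301"     # e followed by a combining acute

latin1_bytes = len(precomposed.encode("latin-1"))
floating_bytes = len(floating.encode("utf-16-be"))

print(eight_bit_size, sixteen_bit_size, increase_percent)
print(latin1_bytes, floating_bytes)
```

Plain text doubles; the accented letter in this sketch goes from one
byte to four.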
I like the reference to "AN International Standards Organization".

> Problems could also arise in achieving compatibility between computers
> using Unicode and those using older codes.

He said, and ran away.  I mean, "could"?  How does _your_ C software
handle a string in which every other byte is a NUL?
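
If you want to see it without writing any C, the sketch below uses
Python's ctypes to look at a 16-bit string the way a NUL-terminated
C routine would.  The helper name c_view is made up for this
example, and "utf-16-be"/"utf-16-le" stand in for a plain 16-bit
code in either byte order.

```python
import ctypes

text = "Unicode"
big_endian = text.encode("utf-16-be")
little_endian = text.encode("utf-16-le")

def c_view(data: bytes) -> bytes:
    """Return the bytes a NUL-terminated C string routine would see."""
    # .value stops at the first NUL byte, like strlen() or strcpy().
    return ctypes.create_string_buffer(data).value

nul_count = big_endian.count(0)

print(big_endian)
print(nul_count, "NUL bytes in", len(big_endian))
print(c_view(big_endian))      # b''
print(c_view(little_endian))   # b'U'
```

Depending on byte order, the string is either empty or one character
long as far as existing C code is concerned.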

I'm not entirely in favor of the current ISO DIS 10646 for many reasons,
but Unicode is not worth considering seriously.  It's not a contender,
and won't be.  That IBM is behind this shit will just make it harder
for the good solution to win.  ASCII won over EBCDIC, and ISO 10646
is going to win over any IBM-product, even the PC "character set" crap.

Yes, I'm quite irritated at the dilution of effort that Unicode means,
as it will entail several more years of confused customers, several
millions of dollars in bad investments, and another schizophrenic world.

[Erik]