Re: restrictions when defining charsets



In <>, John Klensin writes:
> ...is it not the case that
>   (i) By making the decision that ideographic characters that "look the
> same" (i.e., have the same glyphs) are coded the same way, IS 10646
> becomes a "glyph standard" not a "character standard" for the subset of
> languages involved?
>   (ii) By maintaining a distinction between code positions for
> characters with the same appearance in most alphabetic languages, IS
> 10646 really is a "character [set] standard" for those languages...

We could probably argue forever over shades of meaning of "glyph"
and "character" (and I gather that the ISO working groups on
character sets do so), but I would hope that for the purposes of
IETF we could duck a few of those issues and accept the fruits of
the Unicode and ISO10646 "unification" efforts.

I'm no expert on international character sets, and all I know
about Unicode is what I read in the book [1].  On the other hand,
that gives me an external and (I hope) less biased perspective,
which I here offer, for what it's worth.

Unicode represents an excellent unification effort.  Nevertheless,
it is tempered by pragmatism, which in some cases resulted in
decisions being made which might seem to be at odds with the
theoretical ideal of complete unification.  For example, U+0041
LATIN CAPITAL LETTER A, U+0391 GREEK CAPITAL LETTER ALPHA, and
U+0410 CYRILLIC CAPITAL LETTER A retain distinct code points even
though their glyphs are largely identical.  I will not attempt to
justify such decisions: they weren't mine, and although I have a
few suspicions as to their rationale, dragging them out here now
would serve no other purpose than inevitably to reopen some
tedious discussion.
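
The distinction can be checked mechanically.  The following is a
minimal sketch in Python, assuming the standard unicodedata module
(which reflects a later Unicode version than the 1.0 book cited
here); the names LOOKALIKES and describe are illustrative only.

```python
import unicodedata

# Three capital letters whose glyphs are largely identical but
# which keep separate code points.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]

def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))

for ch in LOOKALIKES:
    print(describe(ch))
```

All three compare unequal as strings, so software that wants to
treat them as "the same letter" has to say so explicitly.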

I gather that the complaints still hovering around Unicode,
several of which have recently cropped up on this list and in the
Usenet newsgroup comp.std.internat, involve individuals who feel
that their particular language, country, and/or culture has been
slighted by one of the asymmetrical unification decisions.  That
is, a text containing the character U+0B85 TAMIL LETTER A is
almost certainly written in the Tamil language, and a text
containing U+0391 GREEK CAPITAL LETTER ALPHA is probably written
in Greek (unless it's a technical usage...), but a text
containing one of the unified Chinese/Japanese/Korean ideographs
might be written (obviously) in Chinese, Japanese, or Korean, a
text containing U+00C4 LATIN CAPITAL LETTER A DIAERESIS might be
written in one of several European languages, and a text
containing U+0041 LATIN CAPITAL LETTER A could be written in
almost anything.

If the complaints have to do with unification having been done at
all (rather than with its having perhaps been done less than
impartially), the Unicode standard itself provides an excellent
defense for the process:

	There is some concern that unifying the Han characters
	can lead to confusion because they are sometimes used
	differently in the three languages [Chinese, Japanese,
	and Korean].  Computationally, Han character unification
	presents no more problems than having a single character
	set for the Roman alphabet that is used to write
	languages as different as English and Vietnamese.
	Programmers do not expect the characters "c" "h" "a" and
	"t" alone to tell us whether "chat" is a French word for
	"cat" or an English word meaning "informal talk."
	Likewise, we depend on context to identify the American
	hood (of a car) with the British bonnet.  Few computer
	users are confused by the fact that ASCII can also be
	used to represent such words as the Welsh word "ynghyd,"
	which are strange looking to English eyes.  Although it
	would be convenient to identify words by language for
	programs such as spell-checkers, it is neither practical
	nor productive to encode a separate Latin character set
	for every language which uses it.
	[1, sec. 3.4, p. 112]

Whatever shape/character/meaning information Unicode characters
do convey, language shouldn't be thought to be part of it, and
the fact that some assumptions about language can be inferred
from some of the characters should be viewed as an accident.  If
an interchange standard needs to transmit language information,
it should not rely on the character set, but should instead use
an explicit field in a header of some kind.
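
As a rough illustration of carrying language beside the text
rather than inside the character codes, here is a minimal Python
sketch.  Everything in it is an assumption made for the example:
the TaggedText type, the two-letter language codes, and the
"X-Text-Language" header name are hypothetical and are not taken
from any standard discussed in this thread.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaggedText:
    language: str  # hypothetical tag, e.g. "fr" or "en"
    text: str      # plain characters, carrying no language information

def serialize(item):
    """Emit the text behind a hypothetical language header field."""
    return "X-Text-Language: %s\n\n%s" % (item.language, item.text)

french = TaggedText("fr", "chat")   # the French word for "cat"
english = TaggedText("en", "chat")  # the English "informal talk"

print(serialize(french))
print(serialize(english))
```

The two items hold identical character data and differ only in
the explicit field, which is where a spell-checker or renderer
would look.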

For those who have use for an "even more unified" Unicode, perhaps
in order to display Unicode characters on equipment which
provides a single glyph which is assumed to be suitable for Latin
and Cyrillic A as well as Greek Alpha, I am preparing a set of
typographical equivalence tables which I will be distributing
once they're finished.
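
To give an idea of what such a table might look like, here is a
minimal Python sketch.  It is not the table being prepared; the
two entries and the names GLYPH_EQUIVALENTS and fold_glyphs are
hypothetical, chosen only to cover the A/Alpha example above.

```python
# Hypothetical typographical equivalences: each character is mapped
# to a Latin letter assumed to share its glyph on limited equipment.
GLYPH_EQUIVALENTS = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
}

_TABLE = str.maketrans(GLYPH_EQUIVALENTS)

def fold_glyphs(text):
    """Replace characters with their assumed glyph equivalents."""
    return text.translate(_TABLE)

print(fold_glyphs("\u0391\u0410A"))
```

The folding is lossy by design: once applied, the original script
of each letter can no longer be recovered from the output.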

In <>, Keld Simonsen writes:
> I am not an expert on Han characters, but I believe that the
> unification is done at the character level. This means that Chinese
> (PRC), Taiwanese, Japanese and Korean character sets have been
> tabled and characters having the same origin and almost the same
> appearance have been said to be equivalent, and each of these
> characters relates to a single Unihan character. This relation is
> defined in ISO 10646 (at least it was in the DIS2).
>
> So it is still the meaning that is coded, it is not the shape.
> The shape may be different for different languages, there may be
> a Chinese Unihan font and a Japanese Unihan font which may
> differ significantly in many places.

This is essentially my understanding as well, although I would
not go so far as to say that it is absolute meaning which is
encoded, "meaning" being an impossible-to-define as well as (in
the present context) emotionally laden term.  Suffice it to say
that the Han unification was not done capriciously; the
ideographs which have been unified possess demonstrable and
documented aspects in common (involving derivation, shape, and
usage) which warrant that unification.

It is also worth noting that

	The validity of [the ideographic unification] effort was
	verified in 1991 by an independent team of East Asian
	experts at the University of Toronto.  (See the Unicode
	CJK Unification Verification Project Final Report, Kazuko
	Nakajima, Project Leader, Associate Professor, Department
	of East Asian Studies, University of Toronto.)
	[1, sec. 3.4, p. 115]

In <>, Masataka Ohta writes:
> So, I want character code be informative enough so that I can produce
> state-of-the-art quality shape of Japanese characters and Chinese
> characters without requiring external profiling information.

There is no question but that Unicode does not attempt to support
this goal.  My own feeling is that it is not the purpose of a
character set to do so, and that where language-specific
processing is desired, an explicit indication (if that means
"external profiling information," so be it) of language is both
necessary and appropriate.

> CAUTION: Don't be confused by the fact that Unicode gives unique mapping
> of a byte stream to glyphs of almost all *EUROPEAN* languages without
> requiring external profiling information.

This is a curious usage of "almost all."  If we want a code point
out of a character set to convey language information, about the
only European language for which Unicode does so is Greek.  In
particular, all of the languages which use variants of the Latin
alphabet have been unified (via ASCII and the ISO8859 variants)
for far longer than Unicode has been around, and distinctions
between languages which are coded in these scripts are completely
demolished.

If the concern is merely that the display fonts being used by
Chinese and Japanese speakers tend to differ more significantly
than those used for, say, English and German, this seems like a
comparatively minor issue.

There may be some nuance of Masataka Ohta's complaint which I am
missing, but I know I am not completely insensitive to the issue.
Unicode, like ISO 8859-1 before it, assigns a single code point,
U+00F6 LATIN SMALL LETTER O DIAERESIS, to o with diaeresis.  One
cannot tell based on the character set alone whether it is
o-diaeresis used in English or o-umlaut used in German.  This is
not an academic concern; I
am working on software to display multinational characters on
possibly-restricted equipment (Markus Kuhn, and doubtless many
others, are working on similar projects), and the appropriate
transliterations end up depending on language.  On equipment with
a limited character set repertoire, lacking diacritics, the
English word coo"perate should be rendered as cooperate, but the
German word scho"n can be rendered as schoen.  For this reason, I
would like to see some means of explicitly specifying language,
which would help to address Masataka's concern as well, but
that's a proposal for another day.
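
The dependence on language can be made concrete with a minimal
Python sketch.  The rule table, the "en" and "de" codes, and the
transliterate function are assumptions made for illustration, not
a description of any existing software; only the two rules for
U+00F6 come from the example above.

```python
# Hypothetical per-language fallbacks for equipment lacking diacritics.
RULES = {
    "en": {"\u00f6": "o"},   # English diaeresis: drop the mark
    "de": {"\u00f6": "oe"},  # German umlaut: expand to "oe"
}

def transliterate(text, language):
    """Rewrite text using the fallback rules for the given language."""
    table = str.maketrans(RULES[language])
    return text.translate(table)

print(transliterate("co\u00f6perate", "en"))
print(transliterate("sch\u00f6n", "de"))
```

The same code point yields different output depending solely on
the language argument, which the character data cannot supply.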

						Steve Summit
						scs@adam.mit.edu

[1] The Unicode Consortium, The Unicode Standard -- Worldwide
Character Encoding -- Version 1.0, Volume 1, Addison-Wesley,
1990, 1991, ISBN 0-201-56788-1.