From: Clive D.W. Feather (clive@demon.net)
Date: Fri May 24 2002 - 09:59:43 CDT
Charles Lindsey said:
> OK, I have prefixed the 2nd paragraph with
>
> "In order to allow newsgroup-names containing Non-ASCII characters,
> this section relies heavily on the provisions of the Unicode standard.
> All references to "Unicode" mean [UNICODE 3.1] or any ..."
>
> Observe the removal of mention of "references to the latest version of the
> Unicode Standard", since we did not, in fact, make any such references
> anywhere.
And today [:-(] I discovered that the current version of the Unicode
standard is Unicode 3.2, not 3.1.
There are two significant changes. Firstly, the definition of UTF-8 is
tightened so that an implementation (other than a special cleaner-upper)
MUST NOT accept illegal sequences, but must instead raise an error. The
first Note of 4.4.1 definitely needs rewording.
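As an illustration of the tightened rule, here is a minimal Python sketch of a check that rejects ill-formed UTF-8 rather than repairing it. The helper name is_strict_utf8 is hypothetical and not taken from the draft; the sketch relies on Python 3's built-in strict decoder, which refuses overlong forms and encoded surrogates.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        # errors="strict" raises instead of silently substituting or skipping.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_strict_utf8("comp.lang.c".encode("utf-8")))  # True
print(is_strict_utf8(b"\xc0\xaf"))       # False: overlong encoding of "/"
print(is_strict_utf8(b"\xed\xa0\x80"))   # False: encoded surrogate U+D800
```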
Secondly, they have introduced a concept of "variation selector". This is
used *in certain situations* to force a specific appearance on a character.
For example, the character U+2268 (less than but not equal to) would
normally be written in the form on the left, but if followed by U+FE00
(variation selector 1) MUST be written in the form on the right:
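[ASCII-art figure, layout lost in archiving: two renderings of U+2268, each
a less-than sign above a double horizontal bar. Left: the bars are crossed
by a slanted stroke. Right: the bars are crossed by a vertical stroke.]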
My first, second, and third thoughts are to recommend that we forbid the
use of variation selectors in newsgroup names. However, there are two
concerns.
(1) There is no easy way to determine what is and isn't a variation
selector (they are all category Mn and combining class 0, but so are other
characters). Thus we would have to enumerate them.
(2) There are two kinds of variation selectors. Firstly, there are 16
generic selectors (U+FE00 to U+FE0F), with the only sanctioned uses so far
being various maths symbols followed by U+FE00. Secondly, there are three
"Mongolian Free Variation Selectors" (U+180B to U+180D). If I'm
understanding this correctly, some Mongolian letters have up to four
different forms, and which form to use in any given word is a matter of
knowledge (that is, it is *NOT* determined by grammatical or algorithmic
rules, but is more akin to the English question of when to use "ie" or "ei"
in words like "fiend" and "seize", or when to use "c", "k", or "ck" to
indicate a /k/ sound). To give software a fighting chance of getting this
right, such letters are followed by nothing or one of the three Free
Variation Selectors.
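Concern (1) can be seen with a short Python sketch. It uses the standard unicodedata module, whose tables follow whatever Unicode version ships with the interpreter (not necessarily 3.2), and the helper name is_mn_ccc0 is made up for this example. The test "category Mn and combining class 0" accepts both kinds of selector, but also an unrelated mark such as U+0901 DEVANAGARI SIGN CANDRABINDU.

```python
import unicodedata


def is_mn_ccc0(cp: int) -> bool:
    """True if the code point is category Mn with combining class 0."""
    ch = chr(cp)
    return unicodedata.category(ch) == "Mn" and unicodedata.combining(ch) == 0


# U+FE00: generic variation selector; U+180B: Mongolian free variation
# selector; U+0901: an ordinary Devanagari mark, not a selector at all.
for cp in (0xFE00, 0x180B, 0x0901):
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)), is_mn_ccc0(cp))
```

All three code points pass the test, so the properties alone cannot single out the selectors.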
I would recommend that, in 5.5 numbered item 1, we explicitly exclude the
16 Variation Selectors but not the Mongolian ones.
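A rough Python sketch of what that exclusion might look like in a checker follows. The constant and function names are hypothetical, and this is only one possible reading of the recommendation, not wording from the draft: the 16 generic selectors are enumerated by range and rejected, while the three Mongolian ones are left alone.

```python
# Enumerated by code point, since no character property singles them out.
GENERIC_VARIATION_SELECTORS = frozenset(
    chr(cp) for cp in range(0xFE00, 0xFE10)  # U+FE00 to U+FE0F inclusive
)
MONGOLIAN_FREE_VARIATION_SELECTORS = frozenset(
    chr(cp) for cp in range(0x180B, 0x180E)  # U+180B to U+180D inclusive
)


def contains_generic_variation_selector(name: str) -> bool:
    """True if name uses any of the 16 generic variation selectors."""
    return any(ch in GENERIC_VARIATION_SELECTORS for ch in name)


print(contains_generic_variation_selector("\u2268\ufe00"))  # True
print(contains_generic_variation_selector("\u1820\u180b"))  # False
```

The second call pairs a Mongolian letter with U+180B, which this sketch deliberately permits.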
-- 
Clive D.W. Feather  | Work:  <clive@demon.net>   | Tel:    +44 20 8371 1138
Internet Expert     | Home:  <clive@davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
Thus plc            |                            | NOTE: fax number change