Re: Unicode and draft 07

From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Mon May 27 2002 - 08:23:08 CDT


In <20020524145943.GG74758@finch-staff-1.demon.net> "Clive D.W. Feather" <clive@demon.net> writes:

>And today [:-(] I discovered that the current version of the Unicode
>standard is Unicode 3.2, not 3.1.

>There are two significant changes. Firstly, the definition of UTF-8 is
>tightened so that an implementation (other than a special cleaner-upper)
>MUST NOT accept illegal sequences, but must instead raise an error. The
>first Note of 4.4.1 definitely needs rewording.

Eh? That Note in 4.4.1 still seems correct. The sequences excluded in that
Note are exactly the ones now excluded by Unicode 3.2 (and also those
excluded in the upcoming RFC 2279bis). (If RFC 2279bis is accepted by the
time our draft is finally published, the references to RFC 2279 could be
changed to point to it.)

What might have needed changing is section 2.4.2, but fortunately the
wording there concerning not generating and not interpreting the invalid
sequences is already as strong as that in Unicode 3.2.
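
As a rough illustration of what "not interpreting the invalid sequences" means in practice, the sketch below feeds a few ill-formed byte sequences to Python's strict UTF-8 decoder. The sample sequences and the helper name `is_valid_utf8` are assumptions chosen for illustration; they are not taken from the draft or from the Note in 4.4.1.

```python
# Ill-formed sequences that a strict UTF-8 decoder is expected to reject.
# These samples are illustrative choices, not a list quoted from the draft.
ILLEGAL_SEQUENCES = {
    "overlong encoding of '/'": b"\xc0\xaf",
    "overlong encoding of NUL": b"\xe0\x80\x80",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "lone continuation byte": b"\x80",
    "truncated sequence": b"\xe2\x89",
}


def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data decodes as well-formed UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for label, raw in ILLEGAL_SEQUENCES.items():
        print(f"{label}: accepted={is_valid_utf8(raw)}")
    # A well-formed sequence for comparison: U+2268 encoded as UTF-8.
    print("U+2268:", is_valid_utf8("\u2268".encode("utf-8")))
```

The point of raising an error, rather than silently repairing the input, is that two different byte strings can never be treated as the same name.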

>Secondly, they have introduced a concept of "variation selector". This is
>used *in certain situations* to force a specific appearance on a character.
>For example, the character U+2268 (less than but not equal to) would
>normally be written in the form on the left, but if followed by U+FE00
>(variation selector 1) MUST be written in the form on the right:

>          /                    /
>        /                    /
>      /                    /
>    /                    /
>  <                    <
>    \                    \
>      \                    \
>        \                    \
>          \                    \
>         /                   |
>  ------/----          -----+-----
>       /                     |
>  ----/------          -----+-----
>     /                       |
>My first, second, and third thoughts are to recommend that we forbid the
>use of variation selectors in newsgroup names. However, there are two
>concerns.

>(1) There is no easy way to determine what is and isn't a variation
>selector (they are all category Mn and combining class 0, but so are other
>characters). Thus we would have to enumerate them.

Well, you can look in the PropList, where you will see that all of them
have the property "Other_Default_Ignorable_Code_Point".
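
The point about category and combining class can be checked with a short sketch. Python's standard `unicodedata` module does not expose PropList properties, so the code below simply enumerates the code point ranges given elsewhere in this thread (U+FE00 to U+FE0F and U+180B to U+180D); the names `VARIATION_SELECTORS` and `is_variation_selector` are hypothetical.

```python
import unicodedata

# Ranges as stated in this thread; enumerated by hand because the
# standard library offers no lookup for PropList properties.
VARIATION_SELECTORS = frozenset(
    list(range(0xFE00, 0xFE10))      # 16 generic selectors
    + list(range(0x180B, 0x180E))    # 3 Mongolian Free Selectors
)


def is_variation_selector(ch: str) -> bool:
    """Return True if ch is one of the 19 selectors listed above."""
    return ord(ch) in VARIATION_SELECTORS


def describe(ch: str) -> str:
    """Summarise general category and combining class for one character."""
    return (
        f"U+{ord(ch):04X} category={unicodedata.category(ch)} "
        f"combining={unicodedata.combining(ch)} "
        f"selector={is_variation_selector(ch)}"
    )


if __name__ == "__main__":
    # U+0902 is an unrelated mark that is also Mn with combining class 0,
    # so category and class alone cannot identify a selector.
    for ch in ("\ufe00", "\u180b", "\u0902"):
        print(describe(ch))
```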

>(2) There are two kinds of variation selectors. Firstly, there are 16
>generic selectors (U+FE00 to U+FE0F), with the only sanctioned uses so far
>being various maths symbols followed by U+FE00. Secondly, there are three
>"Mongolian Free Selectors" (U+180B to U+180D). If I'm understanding this
>correctly, some Mongolian letters have up to four different forms, and it
>is a matter of knowledge which form to use in any given word (that is, it
>is *NOT* grammatical or algorithmic rules, but more akin to the English
>question of when to use "ie" or "ei" in words like "fiend" and "seize", or
>when to use "c", "k", or "ck" to indicate a /k/ sound). To give software
>a fighting chance of getting this right, such letters are followed by
>nothing or one of the three Free Selectors.

>I would recommend that, in 5.5 numbered item 1, we explicitly exclude the
>16 Variation Selectors but not the Mongolian ones.

The "Other_Default_Ignorable_Code_Point" property seems to suggest that
either they all be ignored, or none of them.

However, the 16 generic selectors are only allowed to occur in designated
places, which up till now are all in connection with mathematical symbols.
Since we have already deprecated all Symbols, I don't really think it
matters whether we do anything with them or not. As for the Mongolian
ones, I think it better to leave it to the administrators of the Mongolian
hierarchy, since they are more likely to make a correct decision than we
are.
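
For comparison, the two treatments discussed above (excluding only the 16 generic selectors, or treating every selector alike) could be sketched as a single check with a switch. This is a hypothetical illustration only: the function name `selector_violations` and the `allow_mongolian` flag are assumptions, and nothing like this appears in the draft.

```python
GENERIC_SELECTORS = range(0xFE00, 0xFE10)    # U+FE00..U+FE0F
MONGOLIAN_SELECTORS = range(0x180B, 0x180E)  # U+180B..U+180D


def selector_violations(name: str, *, allow_mongolian: bool = True) -> list:
    """Return the offsets of selectors that the chosen policy forbids.

    With allow_mongolian=True only the generic selectors are flagged;
    with allow_mongolian=False every selector is flagged.
    """
    offsets = []
    for index, ch in enumerate(name):
        cp = ord(ch)
        if cp in GENERIC_SELECTORS:
            offsets.append(index)
        elif cp in MONGOLIAN_SELECTORS and not allow_mongolian:
            offsets.append(index)
    return offsets


if __name__ == "__main__":
    print(selector_violations("x\u2268\ufe00"))
    print(selector_violations("\u1820\u180b"))
    print(selector_violations("\u1820\u180b", allow_mongolian=False))
```

Leaving the Mongolian decision to the hierarchy's administrators corresponds to making that switch a per-hierarchy setting rather than fixing it in the standard.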

In any case, all these Marks (categories Mn and Mc) are simply intended to
convey extra information about the preceding base character, and I don't
see that these variations really alter that.

So my present inclination is to leave things exactly as they are, except
to refer to Unicode 3.2 in place of 3.1 as the minimum level addressed by
our draft.

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

