From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Tue May 28 2002 - 13:07:29 CDT
Some of the following is off-list email between Clive and myself. I am
copying this to the whole list to keep people up to date.
On Tue, 28 May 2002 10:38:09 +0100
"Clive D.W. Feather" <clive@demon.net> said...
>> Concerning the first NOTE in 4.4.1
> However, can we change the start to just:
>
> UTF-8 is an encoding for the Unicode character set with ...
>
> or
>
> UTF-8 is an encoding for large character sets with ...
>
> since it's in no way limited to, or based on, 16 bits.
Eh? That's why the NOTE in 4.4.1 includes the words "(and even 32 bit)".
The new draft for UTF-8 claims it is suitable for UCS-2 and UCS-4,
though I doubt anyone in the real world uses UCS-4.
Anyway, I now propose:
NOTE: UTF-8 is an encoding of the 16-bit UCS-2 (and even the 32-bit
UCS-4) character sets ...
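As a rough illustration of the point that UTF-8 is not tied to 16 bits, the following Python sketch reports how many octets each of a few code points needs. The helper name utf8_length and the sample code points are my own choices for illustration, not anything from the draft.

```python
# Sketch only: count the octets UTF-8 uses for a single code point.
# utf8_length and the sample values are illustrative assumptions.
def utf8_length(code_point: int) -> int:
    """Return the number of octets in the UTF-8 encoding of one code point."""
    return len(chr(code_point).encode("utf-8"))


samples = {
    "U+0041 (ASCII)": 0x41,
    "U+00E9 (below U+0800)": 0xE9,
    "U+20AC (still fits in 16 bits)": 0x20AC,
    "U+10400 (does not fit in 16 bits)": 0x10400,
}

for label, cp in samples.items():
    print(label, "->", utf8_length(cp), "octets")
```

The last sample lies outside the 16-bit range and is still encoded without any special treatment, which is why wording that ties UTF-8 to 16 bits alone would be misleading.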
>
> > Even the semantics I have written
> > for 2.4.2 seems OK. You MUST NOT generate the illegal stuff, you MUST
> > NOT interpret it as having any valid meaning, but you MAY pass it on
> > if you can't be bothered to check it (as relayers certainly won't with
> > stuff that they are not interested in).
>
> Don't you need to add "or replaced by other characters" at the end of that
> paragraph ?
Why? Perhaps it should say they MUST NOT be decoded, and perhaps "valid"
should be changed to "meaningful". So I now have:
...but they MUST NOT ever be decoded or otherwise interpreted as
meaningful characters.
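A minimal sketch of that three-way rule, on the ASSUMPTION that the "illegal stuff" means malformed UTF-8 octet sequences (the text of 2.4.2 is not quoted here, so that reading may be wrong). The function names interpret and relay are hypothetical.

```python
from typing import Optional


def interpret(octets: bytes) -> Optional[str]:
    """Decode strictly; malformed input yields None, never 'repaired' text.

    Assumption: 'illegal' means octets a strict UTF-8 decoder rejects.
    """
    try:
        return octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None  # MUST NOT be interpreted as meaningful characters


def relay(octets: bytes) -> bytes:
    """A relayer that does not care about the content MAY pass it on untouched."""
    return octets


overlong_slash = b"\xc0\xaf"  # overlong form of "/", rejected by strict decoders
print(interpret(overlong_slash))
print(relay(overlong_slash) == overlong_slash)
```

Note that the sketch never substitutes replacement characters on the interpreting side; whether substitution should be permitted is exactly the question Clive raises.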
On Tue, 28 May 2002 10:54:00 +0100
"Clive D.W. Feather" <clive@demon.net> said...
>
> Charles Lindsey said:
> >> (1) There is no easy way to determine what is and isn't a variation
> >> selector (they are all category Mn and combining class 0, but so are other
> >> characters). Thus we would have to enumerate them.
> > Well you can look in the PropList, where you will see that all of them
> > have the property "Other_Default_Ignorable_Code_Point".
>
> Yes, but that's a bit of a hack - ODICP is basically a way of building the
> Default_Ignorable_Code_Point property.
Indeed. The whole of the properties list seems to suffer from some
confusing and inconsistent terminology, so I would be glad not to get
involved with it.
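For anyone who wants to check the point about category and combining class, here is a small sketch using Python's standard unicodedata module. The helper name describe and the choice of U+0901 as a comparison character are mine.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (general category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)


# U+FE00 is VARIATION SELECTOR-1; U+0901 is DEVANAGARI SIGN CANDRABINDU,
# an ordinary mark that shares the same category and combining class.
# U+0301 (COMBINING ACUTE ACCENT) is shown for contrast.
for ch in ("\ufe00", "\u0901", "\u0301"):
    print(f"U+{ord(ch):04X}", describe(ch))
```

Since the first two characters give the same pair, category and combining class alone cannot pick out the variation selectors; they would have to be enumerated or taken from a separate property.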
>
> > So my present inclination is to leave things exactly as they are, except
> > to refer to Unicode 3.2 in place of 3.1 as the minimum level addressed by
> > our draft.
>
> I agree that we can ignore these selectors.
OK, so no change to our draft on that score.
> However, see my other email
> concerning changes to 5.5 to remove combining classes.
Your other email follows:
On Tue, 28 May 2002 10:54:07 +0100
"Clive D.W. Feather" <clive@demon.net> said...
>
> Charles Lindsey said:
> > So our definitions of combiner-extended and combiner-mark are
> > essentially wrong.
>
> After a lot of study of the Unicode standard, I think I agree.
>
> This sort of thing is *not* explained clearly, but I think that the
> difference is that the combining classes (numbers) affect the normalisation
> algorithm, whereas marks affect the grapheme.
Yes, I spent the best part of a day trying to find where we had got that
syntax from, and I eventually tracked it down in Unicode TR 15, which
deals with normalization. Essentially, our present syntax follows the
features necessary for the understanding of the two normalizations NFC
and NFKC.
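To show what "combining classes affect the normalisation algorithm" means in practice, here is a hedged sketch using Python's unicodedata module. The sample string (an "a" with an acute accent and a dot below) and the variable names are my own.

```python
import unicodedata

# COMBINING ACUTE ACCENT has combining class 230,
# COMBINING DOT BELOW has combining class 220.
acute_first = "a\u0301\u0323"
dot_first = "a\u0323\u0301"

print(unicodedata.combining("\u0301"), unicodedata.combining("\u0323"))

# Normalization sorts adjacent marks by combining class, so both
# spellings end up as the same sequence.
nfd_acute_first = unicodedata.normalize("NFD", acute_first)
nfd_dot_first = unicodedata.normalize("NFD", dot_first)
print(nfd_acute_first == nfd_dot_first)
```

The reordering depends only on the numeric classes, not on whether a character is a Mark, which is consistent with the view that the classes belong to normalization while Marks belong to the grapheme.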
>
> > What we should have said is that a combiner-extended
> > is everything excluding Cc, Cf, Cs, Zs, and Zp (as at present) plus the
> > Marks Mn, Mc and Me (no need to mention combining classes).
> > Then, our definition of combiner-mark should have been just the Marks
> > Mn, Mc and Me (whatever their combining class).
>
> Right, except that the names are now wrong.
>
> > So no component glyph can start with a Mark (I think that is a
> > fundamental Unicode property).
>
> Almost. There is a definition (D17a) for a defective sequence that does so.
Eh? I don't see that. D17A is right in the middle of the Hangul range.
>
> > Marks are reserved for modifying the
> > things preceding them (and thus variation selectors are properly
> > classified as marks). We can then choose to deprecate some marks (as
> > we currently do for Me), except where hierarchy admins (who know their
> > languages better than we do) choose otherwise.
> >
> > In which case, I still think there is no point in deprecating these new
> > variation selectors.
>
> I think I agree. The relevant text in 5.5 should change to:
>
> component-glyph = combiner-base *combiner-mark
> combiner-base = combiner-ASCII / combiner-extended
> combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
> combiner-extended = <any character with a Unicode code value of
> | 0080 or greater,
> but excluding any character in Unicode
> | categories Cc, Cf, Cs, M, and Z>
> combiner-mark = <any character with a Unicode code value of
> | 0080 or greater and in Unicode category M>
>
> and in the following note change "(Zs, Zl, Zp)" to "(Z)".
>
> I've deliberately used M and Z instead of listing the minor categories,
> just in case anything new gets added.
Right. That is one possibility. Its effect would be to exclude certain
currently allowed component glyphs that began with a Mark (i.e.
something that affected the following base-character instead of the
previous one). According to our understanding, such beasts are not
supposed to exist, so in fact there should be no change.
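For concreteness, here is a sketch of a checker for the syntax as Clive proposes it above. It is only an illustration under my reading of those rules; the Python function names are hypothetical, and the category data comes from whatever Unicode version the local Python ships.

```python
import unicodedata

# combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
ASCII_BASES = set(
    "0123456789"
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "+-_"
)


def is_combiner_mark(ch: str) -> bool:
    """Code value 0080 or greater and in Unicode category M."""
    return ord(ch) >= 0x80 and unicodedata.category(ch).startswith("M")


def is_combiner_base(ch: str) -> bool:
    """combiner-ASCII, or 0080 and above excluding Cc, Cf, Cs, M and Z."""
    if ord(ch) < 0x80:
        return ch in ASCII_BASES
    cat = unicodedata.category(ch)
    return cat not in ("Cc", "Cf", "Cs") and cat[0] not in ("M", "Z")


def split_component_glyphs(text: str) -> list:
    """Split text into component-glyphs; raise ValueError if it does not match."""
    glyphs = []
    i = 0
    while i < len(text):
        if not is_combiner_base(text[i]):
            raise ValueError(f"no combiner-base at index {i}: U+{ord(text[i]):04X}")
        j = i + 1
        while j < len(text) and is_combiner_mark(text[j]):
            j += 1  # absorb every following Mark into this glyph
        glyphs.append(text[i:j])
        i = j
    return glyphs


print(split_component_glyphs("q\u0302a"))
```

Under these rules a string that begins with a Mark is rejected outright, which is the one behavioural change discussed in the paragraph above.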
However, there is a less dangerous solution which I have adopted for
now, and that is simply to rename those syntax rules so that it is
clear they derive from the Normalization requirements, and to add an
explanatory note. So I propose to rename
combiner-base as glyph-starter
combiner-mark as glyph-marker
combiner-ASCII as ASCII-starter
combiner-extended as extended-starter
and to leave the syntax rules otherwise as they stand.
Then the following NOTE would be changed to:
NOTE: a component-glyph (a glyph-starter followed by some number
of glyph-markers) is the basic unit upon which the Unicode
normalizations NFC and NFKC operate; it is, to all intents and
purposes, what a user might regard as a single "character" as
displayed on his screen, though it might be transmitted as several
actual characters (e.g. q-circumflex is two characters). Note also
that, in some writing schemes, several component-glyphs will merge
into one visible object of variable size.
The characters excluded within an extended-starter are control
characters (Cc), format control characters (Cf), surrogates (Cs),
and separators (Zs, Zl, Zp). In particular, this excludes all
whitespace characters.
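The q-circumflex example and the separator categories in that NOTE can be checked with a short sketch like the following, again using Python's unicodedata module. The variable names are mine, and the results depend on the Unicode data bundled with the local Python.

```python
import unicodedata

# q followed by COMBINING CIRCUMFLEX ACCENT: no single precomposed
# character exists, so NFC leaves it as two characters.
q_circumflex = unicodedata.normalize("NFC", "q\u0302")

# e followed by the same mark does have a precomposed form (U+00EA).
e_circumflex = unicodedata.normalize("NFC", "e\u0302")

print(len(q_circumflex), len(e_circumflex))

# The three separator categories named in the NOTE, plus a control character.
separators = {
    "SPACE": unicodedata.category(" "),
    "LINE SEPARATOR": unicodedata.category("\u2028"),
    "PARAGRAPH SEPARATOR": unicodedata.category("\u2029"),
    "TAB": unicodedata.category("\t"),
}
print(separators)
```

TAB falls under the control characters (Cc) rather than the separators, so it is excluded by a different clause of the same rule.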
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133 Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5