From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Wed May 29 2002 - 12:17:51 CDT
On Wed, 29 May 2002 11:15:54 +0100
"Clive D.W. Feather" <clive@demon.net> said...
> >>> Concerning the first NOTE in 4.4.1
> > Anyway, I now propose:
> >
> > NOTE: UTF-8 is an encoding of the 16bit UCS-2 (and even the 32bit
> > UCS-4) character sets ...
>
> I really think that's too much detail for our document. We don't care about
> UCS-2 or UCS-4 (which aren't Unicode terms anyway). We mostly talk about
> just Unicode, so I continue to think that:
I am pretty sure they are ISO 10646 terms, though (or so says RFC 2279 bis).
>
> UTF-8 is an encoding for the Unicode character set with ...
>
> is far clearer.
I now have
NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set
(in both its 16 and 32 bit forms) with the property that any
octet less than 128 immediately represents the corresponding
US-ASCII character, thus ensuring upwards compatibility with
previous practice. ...
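
The compatibility property claimed in that NOTE can be checked mechanically. The following is only a sketch in Python; the function name is made up for illustration and is not part of any draft:

```python
def ascii_octets_are_ascii(text: str) -> bool:
    """Check the NOTE's property for every character of `text`.

    A US-ASCII character must encode to the single octet with the same
    value, and a non-ASCII character must encode only to octets >= 128.
    """
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 128:
            # ASCII characters are carried unchanged as one octet.
            if encoded != bytes([ord(ch)]):
                return False
        else:
            # No octet of a multi-octet sequence may look like ASCII.
            if any(octet < 128 for octet in encoded):
                return False
    return True
```

So an octet below 128 in a UTF-8 header can always be read as the US-ASCII character it appears to be, whatever surrounds it.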
>
>
> In a recent discussion I suggested that a bad UTF-8 sequence could be
> replaced by U+FFFD. You, I think, rejected that idea, so I was trying to
> rule it out. However, it's actually *suggested* by the Unicode specs:
> <http://www.unicode.org/unicode/faq/utf_bom.html#15>
> so your wording is right, I think.
Yes, I agree that reading agents might well display U+FFFD in place of
characters that they cannot display (always supposing they have a way of
displaying U+FFFD :-( ).
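
As an illustration of that behaviour in a reading agent, here is a sketch in Python. The function name is hypothetical; the only thing relied on is the standard "replace" error handler of bytes.decode, which substitutes U+FFFD for octets it cannot decode:

```python
def decode_for_display(raw: bytes) -> str:
    """Decode header octets as UTF-8 for display purposes only.

    Malformed sequences are shown as U+FFFD (REPLACEMENT CHARACTER)
    rather than causing the whole header to be rejected.
    """
    return raw.decode("utf-8", errors="replace")


# 0xFF can never occur in well-formed UTF-8.
shown = decode_for_display(b"abc\xffdef")
```

This is purely a display matter; it says nothing about what an injecting or relaying agent should do with such an article.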
>
>
> Delving into my email archive (fascinating in itself) I find that this bit
> of the syntax was introduced to ensure that you didn't put accents on the
> dots between components. At one point I proposed the words:
>
> Names are restricted to those that are invariant under Unicode
> normalization NFC; each component must furthermore begin with
> a character with a combining class of 0.
>
> Then you pointed out that this sort of thing was better done in syntax than
> in semantic wording. I agreed.
>
> >>> So no component glyph can start with a Mark (I think that is a
> >>> fundamental Unicode property).
> >> Almost. There is a definition (D17a) for a defective sequence that does so.
> > Eh? I don't see that. D17A is right in the middle of the Hangul range.
>
> No, not U+D17A, but Unicode Definition 17a, page 43 (chapter 3) of the
> Unicode standard (only available in PDF).
Ah! I haven't got a copy of that.
>
> >>> Marks are reserved for modifying the
> >>> things preceding them
>
> >> I think I agree. The relevant text in 5.5 should change to:
> >>
> >> component-glyph = combiner-base *combiner-mark
> >> combiner-base = combiner-ASCII / combiner-extended
> >> combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
> >> combiner-extended = <any character with a Unicode code value of
> >> | 0080 or greater,
> >> but excluding any character in Unicode
> >> | categories Cc, Cf, Cs, M, and Z>
> >> combiner-mark = <any character with a Unicode code value of
> >> | 0080 or greater and in Unicode category M>
> >>
> >
> > Right. That is one possibility. Its effect would be to exclude certain
> > currently allowed component glyphs that began with a Mark (i.e.
> > something that affected the following base-character instead of the
> > previous one).
>
> Right. More precisely, it includes marks as part of the component glyph
> they follow; this affects the interpretation of the length limits further
> down the section.
Yes, I find that argument a persuasive one.
>
> > According to our understanding, such beasts are not
> > supposed to exist, so in fact there should be no change.
>
> Right.
>
> > However, there is a less dangerous solution
>
> I don't see why it's "less dangerous" or a "solution".
I meant that it was "dangerous" to be making unnecessary syntax changes
so close to last IESG call. However, I think you have convinced me that
your change is both safe and desirable.
>
> The problem is that, in many cases, other markers are part of the single
> glyph. One issue we discussed last year was the enclosing-circle; we
> accepted that there were reasons why a group name might use this, it's a
> Mark, we don't want it at the start of a component, it doesn't alter the
> amount of screen space the character takes up (to a first approximation),
> but its combining class is zero.
Although we did deprecate it.
>
> The syntax changes were done while we were still understanding
> normalisation and character construction. I really suggest (re-)reading
> items D13 to D17a on page 43 of the Unicode Standard. It's clear that a
> basic grapheme (a "combining character sequence") is a character not of
> category M (a "base character") followed by zero or more characters of
> category M ("combining characters"). Therefore the correct test is
> category M/not-M, not class 0/not-0.
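
The difference between the two tests shows up with the enclosing circle mentioned above. A sketch using Python's unicodedata module (the two function names are illustrative only):

```python
import unicodedata


def starts_with_mark_by_class(component: str) -> bool:
    """Old test: first character has a non-zero combining class."""
    return unicodedata.combining(component[0]) != 0


def starts_with_mark_by_category(component: str) -> bool:
    """New test: first character is in a Unicode category M*."""
    return unicodedata.category(component[0]).startswith("M")


circle_first = "\u20dda"       # COMBINING ENCLOSING CIRCLE, then "a"
circumflex_first = "\u0302a"   # COMBINING CIRCUMFLEX ACCENT, then "a"
```

The circumflex is caught by both tests, but the enclosing circle (category Me, combining class 0) is caught only by the category test, which is why the class 0/not-0 wording let it through at the start of a component.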
Right. So here is what I now have. Note that I have changed 'glyph' to
'grapheme'. I note that they now also talk about 'grapheme clusters'. I think
these are adjacent characters/graphemes that are to be considered joined
together (for the purposes of some particular language). I reckon they should
still count as two when considering your length restrictions, however.

   header              =/ Newsgroups-header
   Newsgroups-header   = "Newsgroups" ":" SP Newsgroups-content
                            *( ";" other-parameter )
   Newsgroups-content  = [FWS] newsgroup-name
                            *( [FWS] ng-delim [FWS] newsgroup-name )
                            [FWS]
   newsgroup-name      = component *( "." component )
   component           = 1*component-grapheme
   ng-delim            = ","
   component-grapheme  = combiner-base *combiner-mark
   combiner-base       = combiner-ASCII / combiner-extended
   combiner-ASCII      = DIGIT / ALPHA / "+" / "-" / "_"
   combiner-extended   = <any character with a Unicode code value of
                            0080 or greater but excluding any character
                            in Unicode categories Cc, Cf, Cs, M* and Z*>
   combiner-mark       = <any character with a Unicode code value of
                            0080 or greater and in Unicode category M*>

NOTE: the excluded characters in a combiner-extended are control
characters (Cc), format control characters (Cf), surrogates
(Cs), Marks (M*) and separators (Z*). In particular, this
excludes all whitespace characters. To all intents and
purposes, a component-grapheme is what a user might regard as a
single "character" as displayed on his screen, though it might
be transmitted as several actual characters (e.g. q-circumflex
is two characters). Note also that, in some writing schemes,
several component-graphemes will merge into one visible object
of variable size.
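
For anyone wanting to experiment with the component-grapheme rules, here is a rough sketch in Python. It assumes the name has already been decoded from UTF-8, takes categories from whatever Unicode version the local unicodedata module carries, and ignores FWS, ng-delim and the length limits; all the function names are invented for the purpose:

```python
import string
import unicodedata

# combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
COMBINER_ASCII = set(string.ascii_letters + string.digits + "+-_")


def is_combiner_base(ch: str) -> bool:
    if ord(ch) < 0x80:
        return ch in COMBINER_ASCII
    cat = unicodedata.category(ch)
    # combiner-extended: exclude Cc, Cf, Cs, M* and Z*.
    return cat not in ("Cc", "Cf", "Cs") and cat[0] not in ("M", "Z")


def is_combiner_mark(ch: str) -> bool:
    return ord(ch) >= 0x80 and unicodedata.category(ch).startswith("M")


def split_component(component: str):
    """Return the component-graphemes of a component, or None if it
    does not match 1*component-grapheme."""
    graphemes = []
    for ch in component:
        if is_combiner_base(ch):
            graphemes.append(ch)       # start a new grapheme
        elif is_combiner_mark(ch) and graphemes:
            graphemes[-1] += ch        # attach mark to preceding base
        else:
            return None                # leading mark or excluded char
    return graphemes or None


def split_newsgroup_name(name: str):
    """Split on "." and parse each component; None if any is invalid."""
    parts = [split_component(c) for c in name.split(".")]
    return None if any(p is None for p in parts) else parts
```

With this, "q" followed by U+0302 counts as one component-grapheme, so a length limit expressed in graphemes would count q-circumflex once, while a component that begins with a Mark is rejected outright.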
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133 Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5