Re: Unicode and draft 07


From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Wed May 29 2002 - 12:17:51 CDT


        On Wed, 29 May 2002 11:15:54 +0100
        "Clive D.W. Feather" <clive@demon.net> said...

> >>> Concerning the first NOTE in 4.4.1

> > Anyway, I now propose:
> >
> > NOTE: UTF-8 is an encoding of the 16bit UCS-2 (and even the 32bit
> > UCS-4) character sets ...
>
> I really think that's too much detail for our document. We don't care about
> UCS-2 or UCS-4 (which aren't Unicode terms anyway). We mostly talk about
> just Unicode, so I continue to think that:

I am pretty sure they are ISO 10646 terms, though (or so says RFC 2279 bis).
>
> UTF-8 is an encoding for the Unicode character set with ...
>
> is far clearer.

I now have

        NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set
        (in both its 16 and 32 bit forms) with the property that any
        octet less than 128 immediately represents the corresponding
        US-ASCII character, thus ensuring upwards compatibility with
        previous practice. ...
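
As an illustration of the property claimed in that NOTE, here is a minimal Python sketch (not part of the draft; the function name low_octets_match_ascii is made up for the example). It checks that the octets below 128 in the UTF-8 form of a string are exactly its US-ASCII characters, in order, so that any non-ASCII character contributes only octets of 128 or more.

```python
def low_octets_match_ascii(text: str) -> bool:
    """Return True if the octets < 128 in the UTF-8 encoding of `text`
    are exactly the US-ASCII characters of `text`, in the same order."""
    encoded = text.encode("utf-8")
    low_octets = [octet for octet in encoded if octet < 128]
    ascii_code_points = [ord(ch) for ch in text if ord(ch) < 128]
    return low_octets == ascii_code_points


# A pure US-ASCII name is byte-for-byte identical in UTF-8.
assert "comp.lang.c".encode("utf-8") == "comp.lang.c".encode("ascii")

# U+00E9 becomes two octets, both 128 or greater.
assert "\u00e9".encode("utf-8") == b"\xc3\xa9"
assert low_octets_match_ascii("fr.caf\u00e9")
```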
>
>
> In a recent discussion I suggested that a bad UTF-8 sequence could be
> replaced by U+FFFD. You, I think, rejected that idea, so I was trying to
> rule it out. However, it's actually *suggested* by the Unicode specs:
> <http://www.unicode.org/unicode/faq/utf_bom.html#15>
> so your wording is right, I think.

Yes, I agree that reading agents might well display U+FFFD in place of
characters that they cannot display (always supposing they have a way of
displaying U+FFFD :-( ).
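
For what it is worth, a reading agent written in Python could get this behaviour from the standard decoder alone. The sketch below is only an illustration; the helper name decode_for_display is hypothetical, and whether an agent *should* substitute rather than reject is exactly the policy question under discussion.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def decode_for_display(octets: bytes) -> str:
    """Decode UTF-8 for display, substituting U+FFFD for any octets
    that do not form a valid UTF-8 sequence instead of raising."""
    return octets.decode("utf-8", errors="replace")


# 0xFF can never appear in well-formed UTF-8, so it is replaced.
shown = decode_for_display(b"a\xffb")
assert shown == "a" + REPLACEMENT_CHARACTER + "b"
```

A stricter agent would call decode with errors="strict" (the default) and handle the resulting UnicodeDecodeError itself.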
>
>
> Delving into my email archive (fascinating in itself) I find that this bit
> of the syntax was introduced to ensure that you didn't put accents on the
> dots between components. At one point I proposed the words:
>
> Names are restricted to those that are invariant under Unicode
> normalization NFC; each component must furthermore begin with
> a character with a combining class of 0.
>
> Then you pointed out that this sort of thing was better done in syntax than
> in semantic wording. I agreed.
>
> >>> So no component glyph can start with a Mark (I think that is a
> >>> fundamental Unicode property).
> >> Almost. There is a definition (D17a) for a defective sequence that does so.
> > Eh? I don't see that. D17A is right in the middle of the Hangul range.
>
> No, not U+D17A, but Unicode Definition 17a, page 43 (chapter 3) of the
> Unicode standard (only available in PDF).

Ah! I haven't got a copy of that.

>
> >>> Marks are reserved for modifying the
> >>> things preceding them
>
> >> I think I agree. The relevant text in 5.5 should change to:
> >>
> >> component-glyph = combiner-base *combiner-mark
> >> combiner-base = combiner-ASCII / combiner-extended
> >> combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
> >> combiner-extended = <any character with a Unicode code value of
> >>                      0080 or greater, but excluding any character
> >>                      in Unicode categories Cc, Cf, Cs, M, and Z>
> >> combiner-mark = <any character with a Unicode code value of
> >>                  0080 or greater and in Unicode category M>
> >>
> >
> > Right. That is one possibility. Its effect would be to exclude certain
> > currently allowed component glyphs that began with a Mark (i.e.
> > something that affected the following base-character instead of the
> > previous one).
>
> Right. More precisely, it includes marks as part of the component glyph
> they follow; this affects the interpretation of the length limits further
> down the section.

Yes, I find that argument a persuasive one.

>
> > According to our understanding, such beasts are not
> > supposed to exist, so in fact there should be no change.
>
> Right.
>
> > However, there is a less dangerous solution
>
> I don't see why it's "less dangerous" or a "solution".

I meant that it was "dangerous" to be making unnecessary syntax changes
so close to last IESG call. However, I think you have convinced me that
your change is both safe and desirable.

>
> The problem is that, in many cases, other markers are part of the single
> glyph. One issue we discussed last year was the enclosing-circle; we
> accepted that there were reasons why a group name might use this, it's a
> Mark, we don't want it at the start of a component, it doesn't alter the
> amount of screen space the character takes up (to a first approximation),
> but its combining class is zero.

Although we did deprecate it.
>
> The syntax changes were done while we were still understanding
> normalisation and character construction. I really suggest (re-)reading
> items D13 to D17a on page 43 of the Unicode Standard. It's clear that a
> basic grapheme (a "combining character sequence") is a character not of
> category M (a "base character") followed by zero or more characters of
> category M ("combining characters"). Therefore the correct test is
> category M/not-M, not class 0/not-0.
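
The difference between the two tests can be seen with Python's unicodedata module. This is a small sketch for illustration only (the helper is_mark is a made-up name): the enclosing circle, U+20DD, has combining class 0 yet is in category Me, so a class 0/not-0 test would wrongly accept it at the start of a component, whereas a category M/not-M test rejects it.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True if `ch` is in one of the Unicode Mark categories (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


ENCLOSING_CIRCLE = "\u20dd"   # COMBINING ENCLOSING CIRCLE
CIRCUMFLEX = "\u0302"         # COMBINING CIRCUMFLEX ACCENT

# The circumflex is caught by either test.
assert unicodedata.combining(CIRCUMFLEX) != 0
assert is_mark(CIRCUMFLEX)

# The enclosing circle is a Mark, but its combining class is zero.
assert unicodedata.combining(ENCLOSING_CIRCLE) == 0
assert is_mark(ENCLOSING_CIRCLE)
```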

Right. So here is what I now have. Note that I have changed 'glyph' to
'grapheme'. I note that they now also talk about 'grapheme clusters'. I think
these are adjacent characters/graphemes that are to be considered joined
together (for the purposes of some particular language). I reckon they should
still count as two when considering your length restrictions, however.

      header =/ Newsgroups-header
      Newsgroups-header = "Newsgroups" ":" SP Newsgroups-content
                                    *( ";" other-parameter )
      Newsgroups-content = [FWS] newsgroup-name
                               *( [FWS] ng-delim [FWS] newsgroup-name )
                               [FWS]
      newsgroup-name = component *( "." component )
      component = 1*component-grapheme
      ng-delim = ","
      component-grapheme = combiner-base *combiner-mark
      combiner-base = combiner-ASCII / combiner-extended
      combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
      combiner-extended = <any character with a Unicode code value of
                             0080 or greater but excluding any character
                             in Unicode categories Cc, Cf, Cs, M* and Z*>
      combiner-mark = <any character with a Unicode code value of
                             0080 or greater and in Unicode category M*>

        NOTE: the excluded characters in a combiner-extended are control
        characters (Cc), format control characters (Cf), surrogates
        (Cs), Marks (M*) and separators (Z*). In particular, this
        excludes all whitespace characters. To all intents and
        purposes, a component-grapheme is what a user might regard as a
        single "character" as displayed on his screen, though it might
        be transmitted as several actual characters (e.g. q-circumflex
        is two characters). Note also that, in some writing schemes,
        several component-graphemes will merge into one visible object
        of variable size.
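
To make the component-grapheme rule concrete, here is a rough Python sketch of how a checker might split one component into its component-graphemes. It is an illustration under stated assumptions, not text for the draft: the function names are invented, the category data comes from whatever Unicode version Python's unicodedata module ships, and the length limits themselves are not applied here (the sketch only yields the count they would be measured against).

```python
import unicodedata
from typing import List, Optional


def is_combiner_mark(ch: str) -> bool:
    """combiner-mark: code value 0080 or greater and in category M*."""
    return ord(ch) >= 0x80 and unicodedata.category(ch).startswith("M")


def is_combiner_base(ch: str) -> bool:
    """combiner-base: combiner-ASCII or combiner-extended."""
    if ord(ch) < 0x80:
        # combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
        return ch.isalnum() or ch in "+-_"
    category = unicodedata.category(ch)
    # combiner-extended excludes Cc, Cf, Cs, M* and Z*.
    return category not in {"Cc", "Cf", "Cs"} and category[0] not in "MZ"


def split_component(component: str) -> Optional[List[str]]:
    """Split a component into component-graphemes (a base followed by
    zero or more marks). Return None if it does not match the syntax."""
    graphemes: List[str] = []
    for ch in component:
        if is_combiner_base(ch):
            graphemes.append(ch)          # start a new component-grapheme
        elif is_combiner_mark(ch) and graphemes:
            graphemes[-1] += ch           # attach the mark to the preceding base
        else:
            return None                   # leading mark or excluded character
    return graphemes or None              # a component needs at least one grapheme


# q-circumflex is two characters but one component-grapheme.
assert split_component("q\u0302x") == ["q\u0302", "x"]
# A component may not begin with a Mark.
assert split_component("\u0302q") is None
```

Counting len(split_component(name)) rather than len(name) is what lets a mark count as part of the grapheme it follows when a length restriction is applied.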

Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133 Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

