Re: Unicode and draft 07

From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Thu May 30 2002 - 13:59:46 CDT


In <20020529101554.GD15398@finch-staff-1.demon.net> "Clive D.W. Feather" <clive@demon.net> writes:

I replied to this once last night, and I've got the syslogs to prove it,
but it doesn't seem to have made it to Landfield :-( .

>Charles Lindsey said:

>>>> Concerning the first NOTE in 4.4.1

>> Anyway, I now propose:
>>
>> NOTE: UTF-8 is an encoding of the 16bit UCS-2 (and even the 32bit
>> UCS-4) character sets ...

>I really think that's too much detail for our document. We don't care about
>UCS-2 or UCS-4 (which aren't Unicode terms anyway). We mostly talk about
>just Unicode, so I continue to think that:

> UTF-8 is an encoding for the Unicode character set with ...

>is far clearer.

According to RFC2279bis, the terms UCS-2 and UCS-4 are certainly defined
in ISO 10646, if not in Unicode. Anyway, I now have:

        NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set
        (in both its 16 and 32 bit forms) with the property that any
        octet less than 128 immediately represents the corresponding
        US-ASCII character, thus ensuring upwards compatibility with
        previous practice. ...
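
The property claimed in that NOTE can be checked mechanically. A minimal
sketch, assuming Python's built-in UTF-8 codec; the function name
ascii_octets_are_literal is hypothetical and appears in no draft:

```python
def ascii_octets_are_literal(text: str) -> bool:
    """Return True if the octets below 128 in the UTF-8 form of `text`
    are exactly the US-ASCII characters of `text`, in the same order."""
    encoded = text.encode("utf-8")
    # Octets below 128 that appear anywhere in the encoded form.
    low_octets = bytes(b for b in encoded if b < 128)
    # The US-ASCII characters of the original text, in order.
    ascii_chars = "".join(ch for ch in text if ord(ch) < 128).encode("ascii")
    return low_octets == ascii_chars


# A non-ASCII character contributes only octets of 128 or more.
e_acute_octets = "\u00e9".encode("utf-8")
```

Because no octet below 128 ever occurs inside a multi-octet sequence,
software that scans only for ASCII delimiters such as "." and "," keeps
working on UTF-8 newsgroup names.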

>In a recent discussion I suggested that a bad UTF-8 sequence could be
>replaced by U+FFFD. You, I think, rejected that idea, so I was trying to
>rule it out. However, it's actually *suggested* by the Unicode specs:
><http://www.unicode.org/unicode/faq/utf_bom.html#15>
>so your wording is right, I think.

Yes, I agree that is a good character for user agents to display if they
cannot display the genuine Unicode character (but how many of them can
display U+FFFD :-( ?)

>Delving into my email archive (fascinating in itself) I find that this bit
>of the syntax was introduced to ensure that you didn't put accents on the
>dots between components. At one point I proposed the words:

> Names are restricted to those that are invariant under Unicode
> normalization NFC; each component must furthermore begin with
> a character with a combining class of 0.

>Then you pointed out that this sort of thing was better done in syntax than
>in semantic wording. I agreed.
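
The earlier semantic wording quoted above can be sketched as a check. This
assumes Python's standard unicodedata module; the function name
satisfies_old_wording is hypothetical, and the rule shown is the one from
the quoted proposal, not the syntax that replaced it:

```python
import unicodedata


def satisfies_old_wording(name: str) -> bool:
    """Name is invariant under NFC, and each dot-separated component
    begins with a character whose combining class is 0."""
    if unicodedata.normalize("NFC", name) != name:
        return False
    return all(
        component and unicodedata.combining(component[0]) == 0
        for component in name.split(".")
    )
```

For example, "a.\u0301b" is unchanged by NFC but is rejected, because the
second component begins with U+0301 (combining class 230), which would
render as an accent on the dot.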

>>>> So no component glyph can start with a Mark (I think that is a
>>>> fundamental Unicode property).
>>> Almost. There is a definition (D17a) for a defective sequence that does so.
>> Eh? I don't see that. D17A is right in the middle of the Hangul range.

>No, not U+D17A, but Unicode Definition 17a, page 43 (chapter 3) of the
>Unicode standard (only available in PDF).

Ah! I have not actually got that document.

>>>> Marks are reserved for modifying the
>>>> things preceding them

>>> I think I agree. The relevant text in 5.5 should change to:
>>>

snipped (see below)

>>
>> Right. That is one possibility. Its effect would be to exclude certain
>> currently allowed component glyphs that began with a Mark (i.e.
>> something that affected the following base-character instead of the
>> previous one).

>Right. More precisely, it includes marks as part of the component glyph
>they follow; this affects the interpretation of the length limits further
>down the section.

Yes, I find that a persuasive argument.

>> However, there is a less dangerous solution

>I don't see why it's "less dangerous" or a "solution".

I meant dangerous in the sense that changing syntax on the brink of an
IESG Last Call is always dangerous. However, I now think your proposed
change is both safe and useful.

>The problem is that, in many cases, other markers are part of the single
>glyph. One issue we discussed last year was the enclosing-circle; we
>accepted that there were reasons why a group name might use this, it's a
>Mark, we don't want it at the start of a component, it doesn't alter the
>amount of screen space the character takes up (to a first approximation),
>but its combining class is zero.

>The syntax changes were done while we were still understanding
>normalisation and character construction. I really suggest (re-)reading
>items D13 to D17a on page 43 of the Unicode Standard. It's clear that a
>basic grapheme (a "combining character sequence") is a character not of
>category M (a "base character") followed by zero or more characters of
>category M ("combining characters"). Therefore the correct test is
>category M/not-M, not class 0/not-0.
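
The difference between the two tests shows up on the enclosing circle
mentioned above. A minimal sketch, assuming Python's standard unicodedata
module and taking U+20DD COMBINING ENCLOSING CIRCLE as the character
meant; the helper names are hypothetical:

```python
import unicodedata


def is_mark_by_category(ch: str) -> bool:
    """Category test: general category Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")


def is_mark_by_class(ch: str) -> bool:
    """Combining-class test: canonical combining class other than 0."""
    return unicodedata.combining(ch) != 0


ENCLOSING_CIRCLE = "\u20dd"  # category Me, combining class 0
CIRCUMFLEX = "\u0302"        # category Mn, combining class 230
```

Both tests agree on the circumflex, but only the category test classifies
the enclosing circle as a mark, so only that test keeps it from starting
a component.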

OK, here is what I have now got in 5.5. Note that I have taken the
opportunity to replace "glyph" by "grapheme", which seems to agree with
how Unicode uses that word.

      header =/ Newsgroups-header
      Newsgroups-header = "Newsgroups" ":" SP Newsgroups-content
                                    *( ";" other-parameter )
      Newsgroups-content = [FWS] newsgroup-name
                               *( [FWS] ng-delim [FWS] newsgroup-name )
                               [FWS]
      newsgroup-name = component *( "." component )
      component = 1*component-grapheme
      ng-delim = ","
      component-grapheme = combiner-base *combiner-mark
      combiner-base = combiner-ASCII / combiner-extended
      combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
      combiner-extended = <any character with a Unicode code value of
                             0080 or greater but excluding any character
                             in Unicode categories Cc, Cf, Cs, M* and Z*>

      combiner-mark = <any character with a Unicode code value of
                             0080 or greater and in Unicode category M*>

        NOTE: the excluded characters in a combiner-extended are control
        characters (Cc), format control characters (Cf), surrogates
        (Cs), Marks (M*) and separators (Z*). In particular, this
        excludes all whitespace characters. To all intents and
        purposes, a component-grapheme is what a user might regard as a
        single "character" as displayed on his screen, though it might
        be transmitted as several actual characters (e.g. q-circumflex
        is two characters). Note also that, in some writing schemes,
        several component-graphemes will merge into one visible object
        of variable size.

Note that the Unicode documents use "M*" to mean "all the M categories",
so we are safe there.
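
The newsgroup-name syntax above can be exercised with a short checker. This
is a sketch only, assuming Python's standard unicodedata module (whose
category data depends on the Unicode version shipped with the interpreter);
all function names are hypothetical and it covers only newsgroup-name, not
the FWS or parameter parts of the header:

```python
import unicodedata

# combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
ASCII_BASE = set(
    "0123456789"
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "+-_"
)


def is_combiner_base(ch):
    if ord(ch) < 0x80:
        return ch in ASCII_BASE
    cat = unicodedata.category(ch)
    # combiner-extended excludes Cc, Cf, Cs, M* and Z*.
    return not (cat in ("Cc", "Cf", "Cs") or cat[0] in "MZ")


def is_combiner_mark(ch):
    return ord(ch) >= 0x80 and unicodedata.category(ch).startswith("M")


def split_component(component):
    """Split into component-graphemes (a base plus any following marks).
    Returns None if the text does not match 1*component-grapheme."""
    graphemes = []
    for ch in component:
        if is_combiner_base(ch):
            graphemes.append(ch)
        elif is_combiner_mark(ch) and graphemes:
            graphemes[-1] += ch  # a mark attaches to the preceding base
        else:
            return None  # leading mark or excluded character
    return graphemes or None


def is_newsgroup_name(name):
    # newsgroup-name = component *( "." component )
    return all(split_component(c) is not None for c in name.split("."))
```

With this sketch, split_component("q\u0302") yields one grapheme built
from two characters (the q-circumflex case in the NOTE), while a component
that begins with a mark is rejected.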

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5



