From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Thu May 30 2002 - 13:59:46 CDT
In <20020529101554.GD15398@finch-staff-1.demon.net> "Clive D.W. Feather" <clive@demon.net> writes:
I replied to this once last night, and I've got the syslogs to prove it,
but it doesn't seem to have made it to Landfield :-( .
>Charles Lindsey said:
>>>> Concerning the first NOTE in 4.4.1
>> Anyway, I now propose:
>>
>> NOTE: UTF-8 is an encoding of the 16bit UCS-2 (and even the 32bit
>> UCS-4) character sets ...
>I really think that's too much detail for our document. We don't care about
>UCS-2 or UCS-4 (which aren't Unicode terms anyway). We mostly talk about
>just Unicode, so I continue to think that:
> UTF-8 is an encoding for the Unicode character set with ...
>is far clearer.
According to RFC2279bis, the terms UCS-2 and UCS-4 are certainly defined
in ISO 10646, if not in Unicode. Anyway, I now have:
NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set
(in both its 16 and 32 bit forms) with the property that any
octet less than 128 immediately represents the corresponding
US-ASCII character, thus ensuring upwards compatibility with
previous practice. ...
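
The compatibility property stated in this NOTE can be checked mechanically. Below is a minimal Python sketch; the sample string is an illustrative assumption and does not come from the draft.

```python
# Illustrative only: the sample text is an assumption, not taken from the draft.
sample = "comp.lang.caf\u00e9"   # ends in U+00E9, outside US-ASCII

encoded = sample.encode("utf-8")

# Octets below 128 each stand for the US-ASCII character with that value.
ascii_octets = bytes(b for b in encoded if b < 128)

# Every octet belonging to a multi-octet sequence is 128 or greater,
# so it can never be mistaken for a US-ASCII character.
high_octets = bytes(b for b in encoded if b >= 128)

ascii_text = ascii_octets.decode("ascii")
```

Software that only understands US-ASCII therefore still sees the ASCII part of such a name unchanged.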
>In a recent discussion I suggested that a bad UTF-8 sequence could be
>replaced by U+FFFD. You, I think, rejected that idea, so I was trying to
>rule it out. However, it's actually *suggested* by the Unicode specs:
><http://www.unicode.org/unicode/faq/utf_bom.html#15>
>so your wording is right, I think.
Yes, I agree that is a good character for user agents to display if they
cannot display the genuine Unicode character (but how many of them can
display U+FFFD :-( ?)
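
As a sketch of the display-side behaviour discussed here, Python's standard UTF-8 codec can substitute U+FFFD for an undecodable sequence. The byte string and the helper name is_valid_utf8 are hypothetical, chosen only for illustration; nothing here is mandated by the draft.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 0xC3 announces a continuation octet that never arrives,
# so this input is not valid UTF-8.
bad = b"caf\xc3.test"

# errors="replace" puts U+FFFD in place of the malformed sequence,
# which is what a user agent might choose to display.
shown = bad.decode("utf-8", errors="replace")
```

Whether the replacement is ever visible to the reader still depends on the user agent having a glyph for U+FFFD.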
>Delving into my email archive (fascinating in itself) I find that this bit
>of the syntax was introduced to ensure that you didn't put accents on the
>dots between components. At one point I proposed the words:
> Names are restricted to those that are invariant under Unicode
> normalization NFC; each component must furthermore begin with
> a character with a combining class of 0.
>Then you pointed out that this sort of thing was better done in syntax than
>in semantic wording. I agreed.
>>>> So no component glyph can start with a Mark (I think that is a
>>>> fundamental Unicode property).
>>> Almost. There is a definition (D17a) for a defective sequence that does so.
>> Eh? I don't see that. D17A is right in the middle of the Hangul range.
>No, not U+D17A, but Unicode Definition 17a, page 43 (chapter 3) of the
>Unicode standard (only available in PDF).
Ah! I have not actually got that document.
>>>> Marks are reserved for modifying the
>>>> things preceding them
>>> I think I agree. The relevant text in 5.5 should change to:
>>>
snipped (see below)
>>
>> Right. That is one possibility. Its effect would be to exclude certain
>> currently allowed component glyphs that began with a Mark (i.e.
>> something that affected the following base-character instead of the
>> previous one).
>Right. More precisely, it includes marks as part of the component glyph
>they follow; this affects the interpretation of the length limits further
>down the section.
Yes, I find that a persuasive argument.
>> However, there is a less dangerous solution
>I don't see why it's "less dangerous" or a "solution".
I meant dangerous in the sense that changing syntax on the brink of an
IESG Last Call is always dangerous. However, I now think your proposed
change is both safe and useful.
>The problem is that, in many cases, other markers are part of the single
>glyph. One issue we discussed last year was the enclosing-circle; we
>accepted that there were reasons why a group name might use this, it's a
>Mark, we don't want it at the start of a component, it doesn't alter the
>amount of screen space the character takes up (to a first approximation),
>but its combining class is zero.
>The syntax changes were done while we were still understanding
>normalisation and character construction. I really suggest (re-)reading
>items D13 to D17a on page 43 of the Unicode Standard. It's clear that a
>basic grapheme (a "combining character sequence") is a character not of
>category M (a "base character") followed by zero or more characters of
>category M ("combining characters"). Therefore the correct test is
>category M/not-M, not class 0/not-0.
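
The difference between the two tests can be seen with Python's unicodedata module. This is only a sketch: the helper name is_mark is hypothetical, and the results reflect whatever Unicode version the Python build ships with.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True if ch is in a Unicode category starting with M (Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")


enclosing_circle = "\u20dd"   # COMBINING ENCLOSING CIRCLE: category Me, class 0
circumflex = "\u0302"         # COMBINING CIRCUMFLEX ACCENT: category Mn, class 230

# A "class 0" test lets the enclosing circle start a component;
# a "category M" test does not.
passes_class_test = unicodedata.combining(enclosing_circle) == 0
passes_category_test = not is_mark(enclosing_circle)
```

Here passes_class_test is True while passes_category_test is False, which is exactly the case the class-based wording fails to catch.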
OK, here is what I have now got in 5.5. Note that I have taken the
opportunity to replace "glyph" by "grapheme", which seems to agree with
how Unicode uses that word.
   header              =/ Newsgroups-header
   Newsgroups-header   = "Newsgroups" ":" SP Newsgroups-content
                              *( ";" other-parameter )
   Newsgroups-content  = [FWS] newsgroup-name
                              *( [FWS] ng-delim [FWS] newsgroup-name )
                              [FWS]
   newsgroup-name      = component *( "." component )
   component           = 1*component-grapheme
   ng-delim            = ","
   component-grapheme  = combiner-base *combiner-mark
   combiner-base       = combiner-ASCII / combiner-extended
   combiner-ASCII      = DIGIT / ALPHA / "+" / "-" / "_"
   combiner-extended   = <any character with a Unicode code value of
                          0080 or greater but excluding any character
                          in Unicode categories Cc, Cf, Cs, M* and Z*>
   combiner-mark       = <any character with a Unicode code value of
                          0080 or greater and in Unicode category M*>
NOTE: the excluded characters in a combiner-extended are control
characters (Cc), format control characters (Cf), surrogates
(Cs), Marks (M*) and separators (Z*). In particular, this
excludes all whitespace characters. To all intents and
purposes, a component-grapheme is what a user might regard as a
single "character" as displayed on his screen, though it might
be transmitted as several actual characters (e.g. q-circumflex
is two characters). Note also that, in some writing schemes,
several component-graphemes will merge into one visible object
of variable size.
Note that the Unicode documents use "M*" meaning "all the Ms", so we are
safe there.
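
For anyone wanting to experiment with the newsgroup-name syntax above, here is a rough Python sketch of a checker. It covers only the component rules quoted in this message, not the length limits or any normalization requirement elsewhere in the draft; the function names are hypothetical, and category lookups follow the Unicode version bundled with Python.

```python
import string
import unicodedata

# combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
ASCII_BASE = set(string.ascii_letters + string.digits + "+-_")


def is_combiner_base(ch: str) -> bool:
    """combiner-ASCII, or a character >= U+0080 not in Cc, Cf, Cs, M* or Z*."""
    if ord(ch) < 0x80:
        return ch in ASCII_BASE
    cat = unicodedata.category(ch)
    return cat not in ("Cc", "Cf", "Cs") and cat[0] not in "MZ"


def is_combiner_mark(ch: str) -> bool:
    """A character >= U+0080 in one of the M* categories."""
    return ord(ch) >= 0x80 and unicodedata.category(ch).startswith("M")


def is_component(text: str) -> bool:
    """1*component-grapheme: a base first, then any mix of bases and marks."""
    if not text or not is_combiner_base(text[0]):
        return False
    return all(is_combiner_base(c) or is_combiner_mark(c) for c in text[1:])


def is_newsgroup_name(name: str) -> bool:
    """component *( "." component ); empty components are rejected."""
    return all(is_component(part) for part in name.split("."))
```

With this sketch, "q" followed by U+0302 is accepted as a component, while a component that begins with U+0302 or U+20DD is rejected.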
-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7  65 E8 64 7E 14 A4 AB A5