From: Clive D.W. Feather (clive@demon.net)
Date: Fri May 31 2002 - 05:46:41 CDT
Charles Lindsey said:
>>> NOTE: UTF-8 is an encoding of the 16bit UCS-2 (and even the 32bit
>>> UCS-4) character sets ...
>> I really think that's too much detail for our document. We don't care about
>> UCS-2 or UCS-4 (which aren't Unicode terms anyway).
> According to RFC2279bis, the terms UCS-2 and UCS-4 are certainly defined
> in ISO 10646, if not in Unicode.
Yes, but we don't really talk about Unicode v ISO (and nor should we).
> Anyway, I now have:
>
> NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set
> (in both its 16 and 32 bit forms) with the property that any
> octet less than 128 immediately represents the corresponding
> US-ASCII character, thus ensuring upwards compatibility with
> previous practice. ...
I can live with that, I suppose.
Um, ISO or Unicode in the brackets?
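
The compatibility property claimed in the NOTE can be checked mechanically. The following is only a sketch: the function name is made up for illustration, and it relies on Python's built-in UTF-8 codec rather than on anything in the draft.

```python
def low_octets_are_ascii(text):
    """Check the property from the NOTE for one string.

    Each character below U+0080 must encode to the single octet with the
    same value, and no other character may produce an octet below 128.
    (Hypothetical helper, not part of the draft.)
    """
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 0x80:
            # ASCII characters must be represented by themselves.
            if encoded != bytes([ord(ch)]):
                return False
        elif any(octet < 0x80 for octet in encoded):
            # A non-ASCII character leaked an ASCII-range octet.
            return False
    return True


if __name__ == "__main__":
    print(low_octets_are_ascii("comp.lang.c"))
    print(low_octets_are_ascii("news.\u0433\u0440\u0443\u043f\u043f\u0430"))
```

So a reader that only understands US-ASCII will never mistake part of a multi-octet sequence for a "." or a ",".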
>> No, not U+D17A, but Unicode Definition 17a, page 43 (chapter 3) of the
>> Unicode standard (only available in PDF).
> Ah! I have not actually got that document.
You can get the PDFs from <http://www.unicode.org/unicode/uni2book/u2.html>.
> Yes, I find that a persuasive argument.
Phew :-)
> I meant dangerous in the sense that changing syntax on the brink of an
> IESG Last Call is always dangerous.
Oh.
> OK, here is what I have now got in 5.5. Note that I have taken the
> opportunity to replace "glyph" by "grapheme", which seems to agree with
> how Unicode uses that word.
>
> header =/ Newsgroups-header
> Newsgroups-header = "Newsgroups" ":" SP Newsgroups-content
> *( ";" other-parameter )
> Newsgroups-content = [FWS] newsgroup-name
> *( [FWS] ng-delim [FWS] newsgroup-name )
> [FWS]
> newsgroup-name = component *( "." component )
> component = 1*component-grapheme
> ng-delim = ","
> component-grapheme = combiner-base *combiner-mark
> combiner-base = combiner-ASCII / combiner-extended
> combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
> combiner-extended = <any character with a Unicode code value of
> 0080 or greater but excluding any character
> in Unicode categories Cc, Cf, Cs, M* and Z*>
>
> combiner-mark = <any character with a Unicode code value of
> 0080 or greater and in Unicode category M*>
>
> NOTE: the excluded characters in a combiner-extended are control
> characters (Cc), format control characters (Cf), surrogates
> (Cs), Marks (M*) and separators (Z*). In particular, this
> excludes all whitespace characters. To all intents and
> purposes, a component-grapheme is what a user might regard as a
> single "character" as displayed on his screen, though it might
> be transmitted as several actual characters (e.g. q-circumflex
> is two characters). Note also that, in some writing schemes,
> several component-graphemes will merge into one visible object
> of variable size.
Apart from the blank line before combiner-mark, that's fine.
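
For anyone who wants to try names against the quoted syntax, here is a rough sketch of the newsgroup-name part only (no FWS, ng-delim or parameters). The function names are invented for illustration. It assumes Python's unicodedata module, whose category tables follow whatever Unicode version ships with the interpreter, and it accepts unassigned code points because the quoted rule does not exclude them.

```python
import unicodedata

# combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
ASCII_BASE = set(
    "0123456789"
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "+-_"
)


def is_combiner_base(ch):
    """combiner-base = combiner-ASCII / combiner-extended."""
    if ord(ch) < 0x80:
        return ch in ASCII_BASE
    cat = unicodedata.category(ch)
    # Excluded above 007F: Cc, Cf, Cs, all marks (M*), all separators (Z*).
    return cat not in ("Cc", "Cf", "Cs") and cat[0] not in "MZ"


def is_combiner_mark(ch):
    """combiner-mark = code value 0080 or greater and in category M*."""
    return ord(ch) >= 0x80 and unicodedata.category(ch).startswith("M")


def is_component(text):
    """component = 1*( combiner-base *combiner-mark ).

    This holds exactly when the text is non-empty, starts with a base,
    and every character is either a base or a mark.
    """
    if not text or not is_combiner_base(text[0]):
        return False
    return all(is_combiner_base(c) or is_combiner_mark(c) for c in text)


def is_newsgroup_name(name):
    """newsgroup-name = component *( "." component )."""
    return all(is_component(part) for part in name.split("."))


if __name__ == "__main__":
    for candidate in ("comp.lang.c++", "q\u0302.test", "\u0302bad", "a b"):
        print(repr(candidate), is_newsgroup_name(candidate))
```

The q-circumflex case from the NOTE is "q\u0302": two characters, but one component-grapheme under this reading.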
-- 
Clive D.W. Feather  | Work:  <clive@demon.net>   | Tel:    +44 20 8371 1138
Internet Expert     | Home:  <clive@davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
Thus plc            |                            | NOTE: fax number change