Re: Unicode and draft 07

From: Clive D.W. Feather (clive@demon.net)
Date: Fri May 31 2002 - 05:46:41 CDT


Charles Lindsey said:
>>> NOTE: UTF-8 is an encoding of the 16bit UCS-2 (and even the 32bit
>>> UCS-4) character sets ...
>> I really think that's too much detail for our document. We don't care about
>> UCS-2 or UCS-4 (which aren't Unicode terms anyway).

> According to RFC2279bis, the terms UCS-2 and UCS-4 are certainly defined
> in ISO 10646, if not in Unicode.

Yes, but we don't really talk about Unicode versus ISO (and nor should we).

> Anyway, I now have:
>
> NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set
> (in both its 16 and 32 bit forms) with the property that any
> octet less than 128 immediately represents the corresponding
> US-ASCII character, thus ensuring upwards compatibility with
> previous practice. ...

I can live with that, I suppose.

Um, ISO or Unicode in the brackets?
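
The compatibility property described in that NOTE can be checked
mechanically. What follows is only a minimal Python sketch, not anything
from the draft: the function name ascii_octets_preserved is made up for
illustration. It relies on the fact that UTF-8 encodes every character
above U+007F using only octets of 128 or greater, so the octets below 128
are exactly the US-ASCII characters of the text, in order.

```python
def ascii_octets_preserved(text: str) -> bool:
    """Return True if the octets < 128 in the UTF-8 form of `text`
    are exactly its US-ASCII characters, in the same order.

    Hypothetical helper for illustration; not part of any draft.
    """
    encoded = text.encode("utf-8")
    # Octets below 128 in the encoded form.
    low_octets = bytes(b for b in encoded if b < 0x80)
    # Characters of the original text that are US-ASCII.
    ascii_chars = "".join(ch for ch in text if ord(ch) < 0x80)
    return low_octets == ascii_chars.encode("ascii")


# A pure-ASCII name encodes to itself, octet for octet.
assert "comp.lang.c".encode("utf-8") == b"comp.lang.c"
# A non-ASCII character becomes a run of octets that are all >= 128.
assert all(b >= 0x80 for b in "\u00e9".encode("utf-8"))
```

The property holds for any string, which is why existing ASCII-only
newsgroup names need no change under such an encoding.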

>> No, not U+D17A, but Unicode Definition 17a, page 43 (chapter 3) of the
>> Unicode standard (only available in PDF).
> Ah! I have not actually got that document.

You can get the PDFs from <http://www.unicode.org/unicode/uni2book/u2.html>.

> Yes, I find that a persuasive argument.

Phew :-)

> I meant dangerous in the sense that changing syntax on the brink of an
> IESG Last Call is always dangerous.

Oh.

> OK, here is what I have now got in 5.5. Note that I have taken the
> opportunity to replace "glyph" by "grapheme", which seems to agree with
> how Unicode uses that word.
>
>    header              =/ Newsgroups-header
>    Newsgroups-header   = "Newsgroups" ":" SP Newsgroups-content
>                             *( ";" other-parameter )
>    Newsgroups-content  = [FWS] newsgroup-name
>                             *( [FWS] ng-delim [FWS] newsgroup-name )
>                             [FWS]
>    newsgroup-name      = component *( "." component )
>    component           = 1*component-grapheme
>    ng-delim            = ","
>    component-grapheme  = combiner-base *combiner-mark
>    combiner-base       = combiner-ASCII / combiner-extended
>    combiner-ASCII      = DIGIT / ALPHA / "+" / "-" / "_"
>    combiner-extended   = <any character with a Unicode code value of
>                           0080 or greater but excluding any character
>                           in Unicode categories Cc, Cf, Cs, M* and Z*>
>
>    combiner-mark       = <any character with a Unicode code value of
>                           0080 or greater and in Unicode category M*>
>
> NOTE: the excluded characters in a combiner-extended are control
> characters (Cc), format control characters (Cf), surrogates
> (Cs), Marks (M*) and separators (Z*). In particular, this
> excludes all whitespace characters. To all intents and
> purposes, a component-grapheme is what a user might regard as a
> single "character" as displayed on his screen, though it might
> be transmitted as several actual characters (e.g. q-circumflex
> is two characters). Note also that, in some writing schemes,
> several component-graphemes will merge into one visible object
> of variable size.

Apart from the blank line before combiner-mark, that's fine.
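
To illustrate how the quoted grammar behaves, here is a rough Python sketch
of a checker for a single newsgroup-name. It is a reading of the syntax
under stated assumptions, not a definitive implementation: the function
names are hypothetical, Python's unicodedata module stands in as the source
of Unicode general categories (so results depend on the Unicode version
shipped with the interpreter), and FWS, ng-delim and other-parameter are
ignored.

```python
import string
import unicodedata

# combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
COMBINER_ASCII = set(string.ascii_letters + string.digits + "+-_")


def is_combiner_base(ch: str) -> bool:
    """combiner-ASCII, or a character >= U+0080 outside Cc, Cf, Cs, M*, Z*."""
    if ord(ch) < 0x80:
        return ch in COMBINER_ASCII
    cat = unicodedata.category(ch)
    return cat not in ("Cc", "Cf", "Cs") and cat[0] not in "MZ"


def is_combiner_mark(ch: str) -> bool:
    """A character >= U+0080 in Unicode category M*."""
    return ord(ch) >= 0x80 and unicodedata.category(ch).startswith("M")


def is_component(text: str) -> bool:
    """component = 1*component-grapheme, where each component-grapheme is
    a combiner-base followed by zero or more combiner-marks."""
    if not text or not is_combiner_base(text[0]):
        return False
    # After a leading base, any mix of bases and marks parses as a
    # sequence of component-graphemes.
    return all(is_combiner_base(c) or is_combiner_mark(c) for c in text[1:])


def is_newsgroup_name(name: str) -> bool:
    """newsgroup-name = component *( "." component )"""
    return all(is_component(part) for part in name.split("."))


# "q" followed by U+0302 COMBINING CIRCUMFLEX ACCENT is one
# component-grapheme made of two characters, as in the NOTE.
assert is_component("q\u0302")
# A mark with no preceding base is not a valid component.
assert not is_component("\u0302q")
```

Because a mark may only follow a base or another mark, checking that the
first character is a base and that every later character is a base or a
mark is enough to accept exactly the strings the grammar generates.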

-- 
Clive D.W. Feather  | Work:  <clive@demon.net>   | Tel:  +44 20 8371 1138
Internet Expert     | Home:  <clive@davros.org>  | Fax:  +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
Thus plc            |                            | NOTE: fax number change



