Re: Newsgroup names and Unicode, attempt 3


From: Clive D.W. Feather (clive@demon.net)
Date: Mon Jul 09 2001 - 05:08:03 CDT


Charles Lindsey said:
>> * We put in a NOTE that there is insufficient experience in this area,
>> and implementers should be aware that a future version of this
>> document might change it to NFC.
> Yes, but combined with the warning to implementors below.

Fine.

>> * We *forbid* injection, serving, and relay agents to apply any other
>> change to names.
> Servers and relayers MUST NOT change anything. Injectors and posters MAY
> reject (well, so can servers and relayers, I suppose). Only posting agents
> MAY correct (e.g. by lowercasing). I would not be averse to removing even
> that.
>> * We say that posting agents SHOULD NOT (or Ought Not) apply any other
>> change to names.
> Hmm! I've lost count of exactly what "other" means here. I should leave
> well alone now (they will do what they want anyway :-( ).

I'm saying that posting agents SHOULD either do nothing, or apply NF[K]C
only, but no other change.

> And finally, I now use the syntactic object "component-glyph". It suggests
> a concept that people might feel comfortable with but, being clearly a
> technical term, people should realise they need to go back to the syntax
> if they are really bothered about its meaning. You can have 30 of them in
> a component (unless your hierarchy has decided otherwise).

Okay.

> 0.1. Newsgroups
>
> The Newsgroups header's content specifies the newsgroup(s) in which
> the article is intended to appear. It is an inheritable header
> (4.2.2.2) which then becomes the default Newsgroups header of any
> followup, unless a Followup-To header is present to prescribe
> otherwise.
>
> References to "Unicode" or "the latest version of the Unicode
> Standard" mean [UNICODE 3.1] or any standard that supersedes it. That
> document contains guarantees of strict future upwards compatibility
> (e.g. no character will be removed or change classification).
> Implementors should be aware that currently unassigned code points
> (Unicode category Cn) may become valid characters in future versions
> of Unicode. Since the poster of an article might have access to a
> newer version of that standard, relaying and serving agents MUST
> accept such characters, but posting agents (and indeed all agents)
> MUST NOT generate them.

Um, this needs adjusting somehow. If I reply to an article using a Cn
character in the name (perhaps in a group name that the local server
doesn't carry, but I got this through a cross-post), my posting agent
should assume that it's correct. That "generate" needs qualifying somehow.
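
To make the distinction concrete, here is a rough sketch (not proposed draft text) of how a posting agent might tell the two cases apart. The function names and the `inherited` flag are my own invention for illustration, and the result depends on which Unicode version the local library ships with, which is exactly the problem being discussed.

```python
import unicodedata

def unassigned_code_points(name: str) -> list[str]:
    """Characters of `name` that the local Unicode database classes as Cn."""
    return [ch for ch in name if unicodedata.category(ch) == "Cn"]

def posting_agent_accepts(name: str, inherited: bool) -> bool:
    """Hypothetical policy, one reading of the comment above.

    A name copied from the article being replied to is trusted as-is,
    since its poster may have had a newer Unicode version. A name typed
    in locally is refused if it contains code points we think are Cn.
    """
    if inherited:
        return True
    return not unassigned_code_points(name)

# The answer for any given name can differ between installations.
print("Unicode database version:", unicodedata.unidata_version)
```

With this reading, "generate" would mean "introduce a name that did not come from an existing article".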

> component-glyph = combiner-base *combiner-mark
> combiner-base = combiner-ASCII / combiner-extended
> combiner-ASCII = "0"-"9" / %x41-5A / %x61-7A / "+" / "-" / "_"
> combiner-extended = <any character with a Unicode code value of
> 0080 or greater and a combining class of 0,
> but excluding any character in Unicode
> categories Cc, Cf, Cs, Zs, Zl, and Zp>
> combiner-mark = <any character with a Unicode code value of
> 0080 or greater and a combining class other
> than 0>
>
> NOTE: the excluded characters are control characters (Cc),
> format control characters (Cf), surrogates (Cs), and separators
> (Zs, Zl, Zp). In particular, this excludes all whitespace
> characters.
>
> Each component MUST be invariant under Unicode normalization NFKC
> (cf. the weaker normalization requirement for other headers in
> section 4.4.1 which specifies no more than normalization NFC).

Following other comments, it might be worth adding a NOTE here that
explains things for Unicode-virgins. Something like:

    NOTE: informally, each component is a sequence of component-glyphs.
    Each component-glyph consists of a base character - which can be a
    letter, number, punctuation mark, or symbol - plus zero or more
    combining marks (such as accents, diacritics, and similar). The
    requirement that the component is invariant under normalization
    means that where the same component-glyph could be written in more
    than one way, only one of those forms is allowed.
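
For implementers who prefer code to ABNF, the quoted syntax and the invariance rule could be sketched roughly as follows. This is only an illustration under assumptions: the helper names are mine, the requirement that a component contain at least one component-glyph is assumed rather than quoted, and any length limit is left out.

```python
import unicodedata

# combiner-ASCII: digits, ASCII letters, "+", "-", "_"
ASCII_BASE = set(
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "+-_"
)
# Categories excluded from combiner-extended
EXCLUDED = {"Cc", "Cf", "Cs", "Zs", "Zl", "Zp"}

def is_base(ch: str) -> bool:
    """combiner-base = combiner-ASCII / combiner-extended"""
    if ord(ch) < 0x80:
        return ch in ASCII_BASE
    return (unicodedata.combining(ch) == 0
            and unicodedata.category(ch) not in EXCLUDED)

def is_mark(ch: str) -> bool:
    """combiner-mark: code value 0080 or greater, combining class not 0"""
    return ord(ch) >= 0x80 and unicodedata.combining(ch) != 0

def split_glyphs(component: str) -> list[str]:
    """Split a component into component-glyphs, or raise ValueError."""
    glyphs: list[str] = []
    for ch in component:
        if is_base(ch):
            glyphs.append(ch)          # start a new component-glyph
        elif is_mark(ch) and glyphs:
            glyphs[-1] += ch           # attach mark to the current base
        else:
            raise ValueError(f"character U+{ord(ch):04X} not allowed here")
    return glyphs

def is_valid_component(component: str) -> bool:
    """Syntax check plus invariance under NFKC."""
    try:
        glyphs = split_glyphs(component)
    except ValueError:
        return False
    # Assumption: an empty component is not valid.
    return bool(glyphs) and unicodedata.normalize("NFKC", component) == component
```

For example, `split_glyphs("q\u0302x")` treats q plus a combining circumflex as one component-glyph followed by a second one, while `is_valid_component("e\u0301")` fails because the same glyph has a precomposed spelling that normalization prefers.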

> NOTE: Alternatively, this restriction could have been expressed
[...]

No, this is specific to NFKC. You've also omitted the "may change".
Let me try:

          NOTE: The requirement that names be invariant under NFKC, rather
          than NFC, means that all characters with a "compatibility
          decomposition" are forbidden (Unicode provides the property
          NFKC_NO to make this test easier). The effect is to exclude
          variant forms of characters, such as superscripts and subscripts,
          wide and narrow forms, font variants, encircled forms, ligatures,
          and so on, as their use could cause confusion.

          There is insufficient experience in this area to determine
          whether this is the right long-term solution. Implementers
          should therefore be aware that a future version of this document
          might change the requirement to "Each component MUST be
          invariant under Unicode normalization NFC".
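
To illustrate the difference between the two requirements, here is a small sketch using a few of the variant forms listed above. The helper name and the sample set are mine; the point is only that these characters survive NFC unchanged but not NFKC.

```python
import unicodedata

def invariant_under(form: str, text: str) -> bool:
    """True if normalizing `text` to `form` leaves it unchanged."""
    return unicodedata.normalize(form, text) == text

# A few characters with compatibility decompositions.
samples = {
    "superscript two": "\u00b2",
    "ligature fi": "\ufb01",
    "fullwidth A": "\uff21",
    "circled one": "\u2460",
}

for label, ch in samples.items():
    print(
        f"{label}: NFC-invariant={invariant_under('NFC', ch)}, "
        f"NFKC-invariant={invariant_under('NFKC', ch)}, "
        f"NFKC form={unicodedata.normalize('NFKC', ch)!r}"
    )
```

Each sample is acceptable under an NFC-only rule and rejected under the NFKC rule, which is the set of names that would become legal if the requirement were relaxed.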

> As a result of this restriction, a name has only one valid
> form. Implementations can assume that a straight comparison of
> characters or octets is sufficient to compare two newsgroup-
> names.

Move this paragraph up to under my "Unicode-virgins" paragraph.

> NOTE: An implementation is not required to apply NFKC, or any
> other normalization, to newsgroup names. Only agencies that
> create new groups need to be careful to obey this restriction
> (7.1). However, if a posting agent neglects to normalize a
> newsgroup-name entered manually, this may lead to the user
> posting to a non-existent group without understanding why.

This can stay here.

> [3] Traditionally, newsgroup-names have been written in lowercase.
> Posting agents MAY convert these characters to the
> corresponding lowercase forms.
> [That may be better left unsaid, or rewritten]

I think it should be reversed:

     Posting agents Ought Not convert these characters to the corresponding
     lowercase forms except under the explicit instructions of the user.

See also the proposed wording below.

> NOTE: To all intents and purposes, a component-glyph is what a
> user might regard as a single "character" as displayed on his
> screen, though it might be transmitted as several actual
> characters (e.g. q-circumflex is two characters).

This might be better under the "virgins" paragraph as well. You might also
add:

    Those used to European alphabets should note that in some other
    writing schemes several component-glyphs will merge into one visible
    object of variable size.

> Since future extensions to this standard and the Unicode standard,
> plus any relaxations of the default restrictions introduced by
> specific hierarchies, might invalidate some such checks, warnings,
> and adjustments, implementations MUST incorporate means to disable
> them.

I would like to add:

    Furthermore, implementations Ought Not to adjust any name - other than
    by applying NFC normalization - without the specific agreement of the
    user.

> In particular, implementations must be prepared for a
> relaxation of the normalization requirements (e.g. from NFKC down to
> NFC), which have been made rather stringent due to a lack of
> practical experience in this area.

I think this is too far from the requirement it qualifies, which is why I
suggested the text further up. This text can remain even with my suggested
text. However, change it to:

   In particular, implementations must be prepared for a
   relaxation of the normalization requirement from NFKC down to
   NFC; the existing requirement was made rather stringent due to a
   lack of practical experience in this area.

-- 
Clive D.W. Feather  | Work:  <clive@demon.net>   | Tel:  +44 20 8371 1138
Internet Expert     | Home:  <clive@davros.org>  | Fax:  +44 20 8371 1037
Demon Internet      | WWW: http://www.davros.org | DFax: +44 20 8371 4037
Thus plc            |                            | Mobile: +44 7973 377646

