Re: Newsgroup names and Unicode, attempt 3


From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Tue Jul 10 2001 - 08:26:14 CDT


In <20010709110803.M16728@demon.net> "Clive D.W. Feather" <clive@demon.net> writes:

>Charles Lindsey said:

>> 0.1. Newsgroups
>>
>> ... Since the poster of an article might have access to a
>> newer version of that standard, relaying and serving agents MUST
>> accept such characters, but posting agents (and indeed all agents)
>> MUST NOT generate them.

>Um, this needs adjusting somehow. If I reply to an article using a Cn
>character in the name (perhaps in a group name that the local server
>doesn't carry, but I got this through a cross-post), my posting agent
>should assume that it's correct. That "generate" needs qualifying somehow.

I have added "(though they might well follow up to newsgroup-names
containing them)"

>>
>> Each component MUST be invariant under Unicode normalization NFKC
>> (cf. the weaker normalization requirement for other headers in
>> section 4.4.1 which specified no more than normalization NFC).

>Following other comments, it might be worth adding a NOTE here that
>explains things for Unicode-virgins. Something like:

> NOTE: informally, each component is a sequence of component-glyphs.
> Each component-glyph consists of a base character - which can be a
> letter, number, punctuation mark, or symbol - plus zero or more
> combining marks (such as accents, diacritics, and similar). The
> requirement that the component is invariant under normalization
> means that where the same component-glyph could be written in more
> than one way, only one particular one is allowed.

No, we explain this in section 4.4.1, which is the first place that NFC is
mentioned. I have rewritten the NOTE there to say:

        NOTE: Unicode allows for composite characters made up of a starter
        character - which can be a letter, number, punctuation mark, or
        symbol - plus zero or more combining marks (such as accents,
        diacritics, and similar). The requirement that a composite be
        invariant under normalization NFC means that, where it could be
        written in more than one way, only one particular one is allowed
        (for example, the single character E-acute is preferred over E
        followed by a non-spacing acute accent, and A-ring is preferred
        over the Angstrom symbol). At least for the main European
        languages, for which all the needed composites are already
        available as single characters, it is unlikely that posting agents
        will need to take any special steps to ensure normalization.
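
[Illustration only, not part of the draft text. A minimal Python sketch of
the two examples in the NOTE above, assuming the standard unicodedata
module, whose tables follow whatever Unicode version ships with the
interpreter rather than Unicode 3.1. The helper name is_nfc is
hypothetical.]

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """Return True if text is unchanged by normalization NFC."""
    return unicodedata.normalize("NFC", text) == text

e_acute = "\u00e9"          # single character E-acute (lowercase)
e_plus_accent = "e\u0301"   # e followed by a non-spacing acute accent
a_ring = "\u00c5"           # A-ring
angstrom = "\u212b"         # the Angstrom symbol

print(is_nfc(e_acute))                                   # True
print(is_nfc(e_plus_accent))                             # False
print(unicodedata.normalize("NFC", angstrom) == a_ring)  # True
```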

And I have inserted a pointer from here to that NOTE.

>> NOTE: Alternatively, this restriction could have been expressed
>[...]

>No, this is specific to NFKC. You've also omitted the "may change".
>Let me try:

> NOTE: The requirement that names be invariant under NFKC, rather
> than NFC, means that all characters with a "compatibility
> decomposition" are forbidden (Unicode provides the property
> NFKC_NO to make this test easier). The effect is to exclude
> variant forms of characters, such as superscripts and subscripts,
> wide and narrow forms, font variants, encircled forms, ligatures,
> and so on, as their use could cause confusion.

OK so far

> There is insufficient experience in this area to determine
> whether this is the right long-term solution. Implementers
> should therefore be aware that a future version of this document
> might change the requirement to "Each component MUST be
> invariant under Unicode normalization NFC".

It now says:
        There is insufficient experience in this area to determine
        whether this is the right long-term solution. Implementers should
        therefore be aware that a future version of this standard might
        reduce the requirement in the direction of NFC as opposed to NFKC.

>> As a result of this restriction, a name has only one valid
>> form. Implementations can assume that a straight comparison of
>> characters or octets is sufficient to compare two newsgroup-
>> names.

>Move this paragraph up to under my "Unicode-virgins" paragraph.

Yes, or rather to where the "Unicode-virgins" would have been.

>> [3] Traditionally, newsgroup-names have been written in lowercase.
>> Posting agents MAY convert these characters to the
>> corresponding lowercase forms.
>> [That may be better left unsaid, or rewritten]

>I think it should be reversed:

> Posting agents Ought Not convert these characters to the corresponding
> lowercase forms except under the explicit instructions of the user.

Yes, that is better. But should it be "SHOULD NOT" now?

>See also the proposed wording below.

>> NOTE: To all intents and purposes, a component-glyph is what a
>> user might regard as a single "character" as displayed on his
>> screen, though it might be transmitted as several actual
>> characters (e.g. q-circumflex is two characters).

>This might be better under the "virgins" paragraph as well.

I have added it to the existing note following the syntax.

> You might also add:

> Those used to European alphabets should note that in some other
> writing schemes several component-glyphs will merge into one visible
> object of variable size.

Or somesuch.

>> Since future extensions to this standard and the Unicode standard,
>> plus any relaxations of the default restrictions introduced by
>> specific hierarchies, might invalidate some such checks, warnings,
>> and adjustments, implementations MUST incorporate means to disable
>> them.

>I would like to add:

> Furthermore, implementations Ought Not to adjust any name - other than
> by applying NFC normalization - without the specific agreement of the
> user.

What I have now is:

        "... posting agents MAY attempt to correct them (but only with the
        explicit agreement of the poster for anything more than NF(K)C
        normalization)."
(other agents are not allowed to correct at all).

>> In particular, implementations must be prepared for a
>> relaxation of the normalization requirements (e.g. from NFKC down to
>> NFC), which have been made rather stringent due to a lack of
>> practical experience in this area.

>I think this is too far away, which is why I made the suggested text
>further up. This text can remain even with my suggested text. However,
>change it to:

> In particular, implementations must be prepared for a
> relaxation of the normalization requirement from NFKC down to
> NFC; the existing requirement was made rather stringent due to a
> lack of practical experience in this area.

No point in repeating what has already been said above. I have in fact now
shortened that paragraph to:

   "Since future extensions to this standard and the Unicode standard,
   including a possible relaxation of the NFKC normalization, plus any
   relaxations of the default restrictions introduced by specific
   hierarchies, might invalidate some such checks, warnings, and adjustments,
   implementations MUST incorporate means to disable them."

OK, after all that, here is (most of) the whole section again:

0.1. Newsgroups
 
   The Newsgroups header's content specifies the newsgroup(s) in which
   the article is intended to appear. It is an inheritable header
   (4.2.2.2) which then becomes the default Newsgroups header of any
   followup, unless a Followup-To header is present to prescribe
   otherwise.

   References to "Unicode" or "the latest version of the Unicode
   Standard" mean [UNICODE 3.1] or any standard that supersedes it. That
   document contains guarantees of strict future upwards compatibility
   (e.g. no character will be removed or change classification).
   Implementors should be aware that currently unassigned code points
   (Unicode category Cn) may become valid characters in future versions
   of Unicode. Since the poster of an article might have access to a
   newer version of that standard, relaying and serving agents MUST
   accept such characters, but posting agents (and indeed all agents)
   MUST NOT generate them (though they might well follow up to
   newsgroup-names containing them).

      Newsgroups-content = newsgroup-name
                               *( *FWS ng-delim *FWS newsgroup-name )
                               *FWS
      newsgroup-name = component *( "." component )
      component = 1*component-glyph
      ng-delim = ","
      component-glyph = combiner-base *combiner-mark
      combiner-base = combiner-ASCII / combiner-extended
      combiner-ASCII = "0"-"9" / %x41-5A / %x61-7A / "+" / "-" / "_"
      combiner-extended = <any character with a Unicode code value of
                             0080 or greater and a combining class of 0,
                             but excluding any character in Unicode
                             categories Cc, Cf, Cs, Zs, Zl, and Zp>
      combiner-mark = <any character with a Unicode code value of
                             0080 or greater and a combining class other
                             than 0>

        NOTE: the excluded characters are control characters (Cc),
        format control characters (Cf), surrogates (Cs), and separators
        (Zs, Zl, Zp). In particular, this excludes all whitespace
        characters. To all intents and purposes, a component-glyph is
        what a user might regard as a single "character" as displayed on
        his screen, though it might be transmitted as several actual
        characters (e.g. q-circumflex is two characters). Note also
        that, in some writing schemes, several component-glyphs will
        merge into one visible object of variable size.
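
[Illustration only, not part of the draft text. A rough Python sketch of
how the syntax above might be applied to split a newsgroup-name into
components and component-glyphs. The function names are hypothetical, and
the standard unicodedata module reflects the Unicode version shipped with
the interpreter, not necessarily Unicode 3.1. Unassigned (Cn) code points
have combining class 0 and so are accepted as bases, in line with the
requirement that agents accept them.]

```python
import unicodedata

# Categories excluded from combiner-extended by the syntax.
EXCLUDED = {"Cc", "Cf", "Cs", "Zs", "Zl", "Zp"}
# combiner-ASCII: digits, ASCII letters, "+", "-" and "_".
ASCII_OK = set(
    "0123456789"
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "+-_"
)

def is_base(ch: str) -> bool:
    """combiner-base = combiner-ASCII / combiner-extended"""
    if ord(ch) < 0x80:
        return ch in ASCII_OK
    return (unicodedata.combining(ch) == 0
            and unicodedata.category(ch) not in EXCLUDED)

def is_mark(ch: str) -> bool:
    """combiner-mark: code value 0080 or greater, combining class not 0"""
    return ord(ch) >= 0x80 and unicodedata.combining(ch) != 0

def split_glyphs(component: str) -> list:
    """Split one component into component-glyphs, or raise ValueError."""
    glyphs = []
    for ch in component:
        if is_base(ch):
            glyphs.append(ch)          # start a new component-glyph
        elif is_mark(ch) and glyphs:
            glyphs[-1] += ch           # attach mark to the current glyph
        else:
            raise ValueError(f"character U+{ord(ch):04X} not allowed here")
    if not glyphs:
        raise ValueError("empty component")
    return glyphs

def split_name(name: str) -> list:
    """newsgroup-name = component *( "." component )"""
    return [split_glyphs(component) for component in name.split(".")]

# q-circumflex is two characters but a single component-glyph.
print(len(split_glyphs("q\u0302")))   # 1
```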

   Each component MUST be invariant under Unicode normalization NFKC
   (cf. the weaker normalization requirement for other headers in
   section 4.4.1 which specified no more than normalization NFC, and see
   also the explanatory NOTE in that section).

        NOTE: As a result of this restriction, a name has only one
        valid form. Implementations can assume that a straight
        comparison of characters or octets is sufficient to compare two
        newsgroup-names.

        The requirement that names be invariant under NFKC, rather than
        NFC, means that all characters with a "compatibility
        decomposition" are forbidden (Unicode provides the property
        "NFKC_NO" to make this test easier). The effect is to exclude
        variant forms of characters, such as superscripts and
        subscripts, wide and narrow forms, font variants, encircled
        forms, ligatures, and so on, as their use could cause confusion.

        There is insufficient experience in this area to determine
        whether this is the right long-term solution. Implementers
        should therefore be aware that a future version of this standard
        might reduce the requirement in the direction of NFC as opposed
        to NFKC.

        NOTE: An implementation is not required to apply NFKC, or any
        other normalization, to newsgroup names. Only agencies that
        create new groups need to be careful to obey this restriction
        (7.1). However, if a posting agent neglects to normalize a
        newsgroup-name entered manually, this may lead to the user
        posting to a non-existent group without understanding why.
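
[Illustration only, not part of the draft text. A minimal Python sketch of
the NFKC invariance test and of the octet comparison it permits, assuming
the standard unicodedata module (whose normalization data may be newer
than Unicode 3.1). The function names are hypothetical.]

```python
import unicodedata

def is_nfkc_invariant(name: str) -> bool:
    """True if every component of name is unchanged by NFKC."""
    return all(
        unicodedata.normalize("NFKC", component) == component
        for component in name.split(".")
    )

def same_group(a: str, b: str) -> bool:
    """For names already known to be NFKC-invariant, a straight
    comparison of the UTF-8 octets is sufficient."""
    return a.encode("utf-8") == b.encode("utf-8")

print(is_nfkc_invariant("fr.caf\u00e9"))    # True: e-acute is already composed
print(is_nfkc_invariant("sci.x\u00b2"))     # False: superscript two becomes "2"
print(is_nfkc_invariant("misc.\ufb01le"))   # False: the fi ligature becomes "fi"
```

A posting agent that chooses to normalize a manually entered name could
apply unicodedata.normalize("NFKC", ...) before posting, which addresses
the non-existent-group problem mentioned in the last NOTE.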

   Newsgroup-names containing non-ASCII characters MUST be encoded in
   UTF-8 and not according to [RFC 2047].

   Components beginning with underline ("_") are reserved for use by
   future versions of this standard and MUST NOT occur in newsgroup
   names (whether in Newsgroups headers or in newgroup control messages
   (7.1)). However, such names MUST be accepted.

   Components beginning with "+" or "-" are reserved for use by
   implementations and MUST NOT occur in newsgroup names (whether in
   Newsgroups headers or in newgroup control messages). Implementors may
   assume that this rule will not change in any future version of this
   standard.

        NOTE: For example, implementors may safely use leading "+" and
        "-" to "escape" other entities within something that looks like
        a newsgroup-name.

   Agencies responsible for the administration of particular hierarchies
   Ought to place additional restrictions on the characters they allow
   in newsgroup-names within those hierarchies (such as to accord with
   the languages commonly used within those hierarchies, or to avoid
   perceived ambiguities pertinent to those languages). Where there is
   no such specific policy, the following restrictions SHOULD be applied
   to newsgroup names.

        NOTE: These restrictions are intended to reflect existing
        practice, with some additions to accommodate foreseeable
        enhancements, and are intended both to avoid certain technical
        difficulties and to avoid unnecessary confusion. It may well be
        that experience will allow future extensions to this standard to
        relax some or all of these restrictions.

   The specific restrictions (to be applied in the absence of
   established policies to the contrary) are:

   1. The following characters are forbidden, subject to the comments
      and notes at the end of the list:
 
      characters in category Cn (Other, Not assigned) [1]
      characters in category Co (Other, Private Use) [2]
      characters in category Lt (Letter, Titlecase) [3]
      characters in category Lu (Letter, Uppercase) [3]
      characters in category Me (Mark, Enclosing) [4]
      characters in category Pd (Punctuation, Dash) [4][5]
      characters in category Pe (Punctuation, Close) [4]
      characters in category Pf (Punctuation, Final quote) [4]
      characters in category Pi (Punctuation, Initial quote) [4]
      characters in category Po (Punctuation, Other) [4]
      characters in category Ps (Punctuation, Open) [4]
      characters in category Sc (Symbol, Currency) [4]
      characters in category Sk (Symbol, Modifier) [4]
      characters in category Sm (Symbol, Math) [4][5]
      characters in category So (Symbol, Other) [4]

      [1] As new characters are added to Unicode, their code points move
          from category Cn to some other category. As stated above,
          implementors should be prepared for this.

      [2] Specific private use characters can be used within a hierarchy
          or co-operating subnet that has agreed meanings for them.

      [3] Traditionally, newsgroup-names have been written in lowercase.
          Posting agents Ought Not to convert uppercase or titlecase
          characters to the corresponding lowercase forms except under
          the explicit instructions of the poster.

      [4] Traditionally newsgroup names have only used letters, digits,
          and the three special characters "+", "-" and "_". These
          categories correspond to characters outside that set.

      [5] Although the characters "+" and "-" are within categories Pd
          and Sm, they are not forbidden.

   2. A component name is forbidden to consist entirely of digits.

        NOTE: This requirement was in [RFC 1036] but nevertheless
        several such groups have appeared in practice and implementors
        should be prepared for them. A common implementation technique
        uses each component as the name of a directory and uses numeric
        filenames for each article within a group. Such an
        implementation needs to be careful when this could cause a clash
        (e.g. between article 123 of group xxx.yyy and the directory for
        group xxx.yyy.123).
[Open issue: a number of people think this should not be a default
requirement but simply be a NOTE; wording for such is further down.]

   3. A component is limited to 30 component-glyphs and a newsgroup-name
      to 71 component-glyphs. Whilst there is no longer any technical
      reason to limit the length of a component (formerly, it was
      limited to 14 octets) nor of a newsgroup-name, it should be noted
      that these names are also used in the newsgroups line (7.1.2)
      where an overall policy limit applies and, moreover, excessively
      long names can be exceedingly inconvenient in practical use.
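
[Illustration only, not part of the draft text. A rough Python sketch of
the three default restrictions above, as a checker that a posting agent
might offer and that can be switched off. All names are hypothetical.
Assumptions: "digits" means the ASCII digits 0-9; the 71-glyph limit
counts component-glyphs only and not the "." separators; and category
data comes from the interpreter's unicodedata tables, not Unicode 3.1.]

```python
import unicodedata

# Categories forbidden by restriction 1.
FORBIDDEN = {"Cn", "Co", "Lt", "Lu", "Me", "Pd", "Pe", "Pf",
             "Pi", "Po", "Ps", "Sc", "Sk", "Sm", "So"}
ALLOWED_ANYWAY = {"+", "-"}      # footnote [5]
MAX_COMPONENT_GLYPHS = 30        # restriction 3
MAX_NAME_GLYPHS = 71             # restriction 3

def glyph_count(component: str) -> int:
    """Count component-glyphs: one per character of combining class 0."""
    return sum(1 for ch in component if unicodedata.combining(ch) == 0)

def policy_problems(name: str, enabled: bool = True) -> list:
    """Return a list of policy violations (empty if none or if disabled)."""
    if not enabled:              # the checks must be possible to disable
        return []
    problems = []
    components = name.split(".")
    for comp in components:
        for ch in comp:
            cat = unicodedata.category(ch)
            if ch not in ALLOWED_ANYWAY and cat in FORBIDDEN:
                problems.append(f"U+{ord(ch):04X} is in forbidden category {cat}")
        if comp and all(ch in "0123456789" for ch in comp):
            problems.append(f"component '{comp}' consists entirely of digits")
        if glyph_count(comp) > MAX_COMPONENT_GLYPHS:
            problems.append(f"component '{comp}' exceeds 30 component-glyphs")
    if sum(glyph_count(c) for c in components) > MAX_NAME_GLYPHS:
        problems.append("name exceeds 71 component-glyphs")
    return problems

print(policy_problems("comp.lang.c++"))   # []
print(policy_problems("alt.2600"))
print(policy_problems("Comp.lang"))
```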
 
   Serving and relaying agents MUST accept any newsgroup-name that meets
   the above requirements, even if it violates one or more of the
   policy restrictions. Posting and injecting agents MAY reject articles
   containing newsgroup-names that do not meet these restrictions, and
   posting agents MAY attempt to correct them (but only with the
   explicit agreement of the poster for anything more than NFC or NFKC
   normalization). However, because of the large and changing tables
   required to do these checks and corrections throughout the whole of
   Unicode, this standard does not require them to do so. Rather, the
   onus is placed on those who create new newsgroups (7.1) to check the
   mandatory requirements, to consider the effects of relaxing the other
   restrictions, and to consider how all this may affect propagation of
   the group.

   Since future extensions to this standard and the Unicode standard,
   including a possible relaxation of the NFKC normalization, plus any
   relaxations of the default restrictions introduced by specific
   hierarchies, might invalidate some such checks, warnings, and
   adjustments, implementations MUST incorporate means to disable them.

[Alternative text for Open issue]

        NOTE: Components composed entirely of digits were forbidden by
        [RFC 1036] but have nevertheless been used in practice, and are
        therefore permitted by this specification. A common
        implementation technique uses each component as the name of a
        directory and uses numeric filenames for each article within a
        group. Such an implementation needs to be careful when this
        could cause a clash (e.g. between article 123 of group xxx.yyy
        and the directory for group xxx.yyy.123).
[Open issue: delete the above text if we retain the default requirement
above.]

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5




This archive was generated by hypermail 2b29.