From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Thu Jul 05 2001 - 11:42:24 CDT
In <7R2snW2UwkQ7EwTI@romana.davros.org> "Clive D. W. Feather" <clive@on-the-train.demon.co.uk> writes:
New text for Newsgroups follows. I have dealt with various issues raised
as follows:
>Can I propose a compromise, which might also address other issues ?
>* All names must be invariant in NFKC.
Yes.
>* We put in a NOTE that there is insufficient experience in this area,
> and implementers should be aware that a future version of this
> document might change it to NFC.
Yes, but combined with the warning to implementors below.
>* We allow agents to normalize names to NFC.
Not explicitly said. They are not _required_ to normalize anything, though
the warning about manually entered names pointing to non-existent groups
is still there. Neither are they forbidden to normalize, but they are
still subject to the requirement to disable it if the standard changes.
>* We *forbid* injection, serving, and relay agents to apply any other
> change to names.
Servers and relayers MUST NOT change anything. Injectors and posters MAY
reject (well, so can servers and relayers, I suppose). Only posting agents
MAY correct (e.g. by lowercasing). I would not be averse to removing even
that.
>* We say that posting agents SHOULD NOT (or Ought Not) apply any other
> change to names.
Hmm! I've lost count of exactly what "other" means here. I should leave
well alone now (they will do what they want anyway :-( ).
>* We add "uses compatibility characters" to the list of things that
> could be warned against but should be disableable if necessary.
>This effectively puts us in an intermediate position: no compatibility
>characters allowed, but implementers strongly warned not to assume that
>as an eternal truth.
Yes. See above.
And finally, I now use the syntactic object "component-glyph". It suggests
a concept that people might feel comfortable with but, being clearly a
technical term, people should realise they need to go back to the syntax
if they are really bothered about its meaning. You can have 30 of them in
a component (unless your hierarchy has decided otherwise).
Oh! And there is STILL one open issue.
0.1. Newsgroups
The Newsgroups header's content specifies the newsgroup(s) in which
the article is intended to appear. It is an inheritable header
(4.2.2.2) which then becomes the default Newsgroups header of any
followup, unless a Followup-To header is present to prescribe
otherwise.
References to "Unicode" or "the latest version of the Unicode
Standard" mean [UNICODE 3.1] or any standard that supersedes it. That
document contains guarantees of strict future upwards compatibility
(e.g. no character will be removed or change classification).
Implementors should be aware that currently unassigned code points
(Unicode category Cn) may become valid characters in future versions
of Unicode. Since the poster of an article might have access to a
newer version of that standard, relaying and serving agents MUST
accept such characters, but posting agents (and indeed all agents)
MUST NOT generate them.
Newsgroups-content = newsgroup-name
                     *( *FWS ng-delim *FWS newsgroup-name )
                     *FWS
newsgroup-name     = component *( "." component )
component          = 1*component-glyph
ng-delim           = ","
component-glyph    = combiner-base *combiner-mark
combiner-base      = combiner-ASCII / combiner-extended
combiner-ASCII     = "0"-"9" / %x41-5A / %x61-7A / "+" / "-" / "_"
combiner-extended  = <any character with a Unicode code value of
                     0080 or greater and a combining class of 0,
                     but excluding any character in Unicode
                     categories Cc, Cf, Cs, Zs, Zl, and Zp>
combiner-mark      = <any character with a Unicode code value of
                     0080 or greater and a combining class other
                     than 0>
NOTE: the excluded characters are control characters (Cc),
format control characters (Cf), surrogates (Cs), and separators
(Zs, Zl, Zp). In particular, this excludes all whitespace
characters.
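As an illustration of the syntax above, a component can be split into
its component-glyphs along the following lines. This is only a sketch
in Python: the function names are invented for the example, and the
standard unicodedata module reports the properties of whatever Unicode
version the interpreter ships with, which need not be [UNICODE 3.1].

```python
import unicodedata

# Categories excluded from combiner-extended (controls, format
# controls, surrogates, and separators).
EXCLUDED_CATEGORIES = {"Cc", "Cf", "Cs", "Zs", "Zl", "Zp"}

# combiner-ASCII: digits, letters, "+", "-" and "_".
ASCII_BASES = set("0123456789"
                  "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "abcdefghijklmnopqrstuvwxyz"
                  "+-_")


def is_combiner_base(ch):
    if ord(ch) < 0x80:
        return ch in ASCII_BASES
    return (unicodedata.combining(ch) == 0
            and unicodedata.category(ch) not in EXCLUDED_CATEGORIES)


def is_combiner_mark(ch):
    return ord(ch) >= 0x80 and unicodedata.combining(ch) != 0


def split_component(component):
    """Return the list of component-glyphs, or raise ValueError."""
    glyphs = []
    for ch in component:
        if is_combiner_base(ch):
            glyphs.append(ch)          # a base starts a new glyph
        elif is_combiner_mark(ch) and glyphs:
            glyphs[-1] += ch           # a mark attaches to the last base
        else:
            raise ValueError("U+%04X not allowed here" % ord(ch))
    if not glyphs:
        raise ValueError("empty component")
    return glyphs


def split_newsgroup_name(name):
    """Split a newsgroup-name into components, each a list of glyphs."""
    return [split_component(c) for c in name.split(".")]
```

For example, split_component("q\u0302x") yields two component-glyphs,
the first being "q" followed by the combining circumflex.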
Each component MUST be invariant under Unicode normalization NFKC
(cf. the weaker normalization requirement for other headers in
section 4.4.1, which specifies no more than normalization NFC).
NOTE: Alternatively, this restriction could have been expressed
by saying:
o All characters with a compatibility decomposition are
forbidden;
or else
o All characters with property NFKC-NO are forbidden.
The effect is to exclude variant forms of characters, such as
superscripts and subscripts, wide and narrow forms, font
variants, encircled forms, ligatures, and so on, as their use
could cause confusion.
As a result of this restriction, a name has only one valid
form. Implementations can assume that a straight comparison of
characters or octets is sufficient to compare two
newsgroup-names.
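The invariance requirement can be tested mechanically. A minimal
sketch in Python follows, assuming the standard unicodedata module
(whose tables follow the Unicode version bundled with the interpreter);
the function names are hypothetical.

```python
import unicodedata


def is_nfkc_invariant(component):
    """True if applying NFKC would leave the component unchanged."""
    return unicodedata.normalize("NFKC", component) == component


def same_group(name_a, name_b):
    """Compare two valid newsgroup-names octet by octet (as UTF-8)."""
    return name_a.encode("utf-8") == name_b.encode("utf-8")
```

A precomposed e-acute passes the test, whereas "e" followed by a
combining acute, the "fi" ligature (U+FB01), and superscript two
(U+00B2) all fail it because NFKC would replace them.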
NOTE: An implementation is not required to apply NFKC, or any
other normalization, to newsgroup names. Only agencies that
create new groups need to be careful to obey this restriction
(7.1). However, if a posting agent neglects to normalize a
newsgroup-name entered manually, this may lead to the user
posting to a non-existent group without understanding why.
Newsgroup-names containing non-ASCII characters MUST be encoded in
UTF-8 and not according to [RFC 2047].
Components beginning with underline ("_") are reserved for use by
future versions of this standard and MUST NOT occur in newsgroup
names (whether in Newsgroups headers or in newgroup control messages
(7.1)). However, such names MUST be accepted.
Components beginning with "+" or "-" are reserved for use by
implementations and MUST NOT occur in newsgroup names (whether in
Newsgroups headers or in newgroup control messages). Implementors may
assume that this rule will not change in any future version of this
standard.
NOTE: For example, implementors may safely use leading "+" and
"-" to "escape" other entities within something that looks like
a newsgroup-name.
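An agency creating groups, or a posting agent wishing to warn its
user, might detect the reserved components roughly as follows. This is
a sketch in Python with invented function names; note that it only
reports the components, since names beginning with "_" still have to be
accepted in transit.

```python
def reserved_reason(component):
    """Return why a component is reserved, or None if it is not."""
    if component.startswith("_"):
        return "reserved for future versions of the standard"
    if component.startswith(("+", "-")):
        return "reserved for use by implementations"
    return None


def find_reserved(name):
    """List (component, reason) pairs for every reserved component."""
    found = []
    for component in name.split("."):
        reason = reserved_reason(component)
        if reason is not None:
            found.append((component, reason))
    return found
```

Only a leading character matters here, so a name such as
comp.lang.c++ is not affected.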
Agencies responsible for the administration of particular hierarchies
Ought to place additional restrictions on the characters they allow
in newsgroup-names within those hierarchies (such as to accord with
the languages commonly used within those hierarchies, or to avoid
perceived ambiguities pertinent to those languages). Where there is
no such specific policy, the following restrictions SHOULD be applied
to newsgroup names.
NOTE: These restrictions are intended to reflect existing
practice, with some additions to accommodate foreseeable
enhancements, and are intended both to avoid certain technical
difficulties and to avoid unnecessary confusion. It may well be
that experience will allow future extensions to this standard to
relax some or all of these restrictions.
The specific restrictions (to be applied in the absence of
established policies to the contrary) are:
1. The following characters are forbidden, subject to the comments
and notes at the end of the list:
characters in category Cn (Other, Not assigned) [1]
characters in category Co (Other, Private Use) [2]
characters in category Lt (Letter, Titlecase) [3]
characters in category Lu (Letter, Uppercase) [3]
characters in category Me (Mark, Enclosing) [4]
characters in category Pd (Punctuation, Dash) [4][5]
characters in category Pe (Punctuation, Close) [4]
characters in category Pf (Punctuation, Final quote) [4]
characters in category Pi (Punctuation, Initial quote) [4]
characters in category Po (Punctuation, Other) [4]
characters in category Ps (Punctuation, Open) [4]
characters in category Sc (Symbol, Currency) [4]
characters in category Sk (Symbol, Modifier) [4]
characters in category Sm (Symbol, Math) [4][5]
characters in category So (Symbol, Other) [4]
[1] When a new character is added to Unicode, its code point moves
from category Cn to some other category. As stated above,
implementors should be prepared for this.
[2] Specific private use characters can be used within a hierarchy
or co-operating subnet that has agreed meanings for them.
[3] Traditionally, newsgroup-names have been written in lowercase.
Posting agents MAY convert these characters to the
corresponding lowercase forms.
[That may be better left unsaid, or rewritten]
[4] Traditionally newsgroup names have only used letters, digits,
and the three special characters "+", "-" and "_". These
categories correspond to characters outside that set.
[5] Although the characters "+" (category Sm) and "-" (category Pd)
fall within these categories, they are not forbidden.
2. A component name is forbidden to consist entirely of digits.
NOTE: This requirement was in [RFC 1036] but nevertheless
several such groups have appeared in practice and implementors
should be prepared for them. A common implementation technique
uses each component as the name of a directory and uses numeric
filenames for each article within a group. Such an
implementation needs to be careful when this could cause a clash
(e.g. between article 123 of group xxx.yyy and the directory for
group xxx.yyy.123).
[Open issue: a number of people think this should not be a default
requirement but simply a NOTE; wording for such is further down.]
3. A component is limited to 30 component-glyphs and a newsgroup-name
to 71 component-glyphs. Whilst there is no longer any technical
reason to limit the length of a component (formerly, it was
limited to 14 octets) nor of a newsgroup-name, it should be noted
that these names are also used in the newsgroups line (7.1.2)
where an overall policy limit applies and, moreover, excessively
long names can be exceedingly inconvenient in practical use.
NOTE: To all intents and purposes, a component-glyph is what a
user might regard as a single "character" as displayed on his
screen, though it might be transmitted as several actual
characters (e.g. q-circumflex is two characters).
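The three default restrictions above might be checked along these
lines. This is a sketch only, in Python, and rests on several
assumptions: the input is already syntactically valid, "digits" means
the ASCII digits, the "." delimiters do not count towards the limit of
71, and the category tables are those of the interpreter's own Unicode
version. The names used are invented for the example.

```python
import unicodedata

# General categories forbidden by default restriction 1.
FORBIDDEN_CATEGORIES = {
    "Cn", "Co", "Lt", "Lu", "Me", "Pd", "Pe", "Pf",
    "Pi", "Po", "Ps", "Sc", "Sk", "Sm", "So",
}
# "+" (Sm) and "-" (Pd) are explicitly exempted by comment [5].
EXEMPT = {"+", "-"}
MAX_COMPONENT_GLYPHS = 30
MAX_NAME_GLYPHS = 71


def count_glyphs(text):
    """Count component-glyphs: each character of combining class 0
    starts a new glyph; combining marks attach to the one before."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


def policy_violations(name):
    """Return a list of violations of the default policy restrictions."""
    problems = []
    components = name.split(".")
    for comp in components:
        for ch in comp:
            cat = unicodedata.category(ch)
            if cat in FORBIDDEN_CATEGORIES and ch not in EXEMPT:
                problems.append("%r: U+%04X is in category %s"
                                % (comp, ord(ch), cat))
        if comp and all(ch in "0123456789" for ch in comp):
            problems.append("%r: consists entirely of digits" % comp)
        if count_glyphs(comp) > MAX_COMPONENT_GLYPHS:
            problems.append("%r: more than %d component-glyphs"
                            % (comp, MAX_COMPONENT_GLYPHS))
    if sum(count_glyphs(c) for c in components) > MAX_NAME_GLYPHS:
        problems.append("name has more than %d component-glyphs"
                        % MAX_NAME_GLYPHS)
    return problems
```

Since these are policy restrictions rather than syntax, a caller would
normally treat the result as warnings, and a hierarchy with its own
policy would substitute its own tables.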
Serving and relaying agents MUST accept any newsgroup-name that meets
the above requirements, even if it violates one or more of the
policy restrictions. Posting and injecting agents MAY reject articles
containing newsgroup-names that do not meet these restrictions, and
posting agents MAY attempt to correct them (e.g. by lowercasing).
However, because of the large and changing tables required to do
these checks and corrections throughout the whole of Unicode, this
standard does not require them to do so. Rather, the onus is placed
on those who create new newsgroups (7.1) to check the mandatory
requirements, to consider the effects of relaxing the other
restrictions, and to consider how all this may affect propagation of
the group.
Since future extensions to this standard and the Unicode standard,
plus any relaxations of the default restrictions introduced by
specific hierarchies, might invalidate some such checks, warnings,
and adjustments, implementations MUST incorporate means to disable
them. In particular, implementations must be prepared for a
relaxation of the normalization requirements (e.g. from NFKC down to
NFC), which have been made rather stringent due to a lack of
practical experience in this area.
[Alternative text for Open issue]
NOTE: Components composed entirely of digits were forbidden by
[RFC 1036] but have nevertheless been used in practice, and are
therefore permitted by this specification. A common
implementation technique uses each component as the name of a
directory and uses numeric filenames for each article within a
group. Such an implementation needs to be careful when this
could cause a clash (e.g. between article 123 of group xxx.yyy
and the directory for group xxx.yyy.123).
[Open issue: delete the above text if we retain the default
requirement above.]
NOTE: The newsgroup-name as encoded in UTF-8 should be regarded
as the canonical form. Reading agents may convert it to whatever
character set they are able to display (see 4.4.1) and serving
agents may possibly need to convert it to some form more
suitable as a filename. Simple algorithms for both kinds of
conversion are readily available. Observe that the syntax does
not allow comments within the Newsgroups header; this is to
simplify processing by relaying and serving agents which have a
requirement to process this header extremely rapidly.
The inclusion of folding white space within a Newsgroups-content is a
newly introduced feature in this standard. It MUST be accepted by all
conforming implementations (relaying agents, serving agents and
reading agents). Posting agents should be aware that such postings
may be rejected by overly-critical old-style relaying agents. When a
sufficient number of relaying agents are in conformance, posting
agents SHOULD generate such whitespace in the form of <CRLF WS> so as
to keep the length of lines in the relevant headers (notably
Newsgroups and Followup-To) to no more than 79 characters (or
other agreed policy limit - see 4.5). Before such critical mass
occurs, injecting agents MAY reformat such headers by removing
whitespace inserted by the posting agent, but relaying agents MUST
NOT do so.
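Folding by a posting agent, and the removal of that whitespace by an
injecting agent, could look something like this. It is a sketch in
Python under stated assumptions: the limit is counted in characters,
folds are placed only after an ng-delim, and a single name longer than
the limit is left on a line of its own. The function names are
hypothetical.

```python
import re


def fold_newsgroups(names, limit=79, header="Newsgroups"):
    """Build a folded header, breaking with <CRLF WS> after commas."""
    lines = []
    current = header + ": "
    has_name = False                    # does the current line hold a name?
    for i, name in enumerate(names):
        piece = name + ("," if i < len(names) - 1 else "")
        if has_name and len(current) + len(piece) > limit:
            lines.append(current)       # close this line and start a
            current = " " + piece       # continuation with one space
        else:
            current += piece
        has_name = True
    lines.append(current)
    return "\r\n".join(lines)


def strip_fws(content):
    """Remove all folding white space from a Newsgroups-content.

    This is safe because a newsgroup-name cannot contain whitespace.
    """
    return re.sub(r"[ \t\r\n]+", "", content)
```

Applying strip_fws to the content of a folded header gives back the
comma-separated list exactly as it was before folding.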
Posters SHOULD use only the names of existing newsgroups in the
Newsgroups header. However, it is legitimate to cross-post to
newsgroup(s) which do not exist on the posting agent's host, provided
that at least one of the newsgroups DOES exist there, and followup
agents SHOULD accept this (posting agents MAY accept it, but Ought at
least to alert the poster to the situation and request confirmation).
Relaying agents MUST NOT rewrite Newsgroups headers in any way, even
if some or all of the newsgroups do not exist on the relaying agent's
host. Serving agents MUST NOT create new newsgroups simply because an
unrecognised newsgroup-name occurs in a Newsgroups header (see 7.1
for the correct method of newsgroup creation).
The Newsgroups header is intended for use in Netnews articles rather
than in mail messages. It MAY be used in a mail message to indicate
that it is a copy also posted to the listed newsgroups, but it SHOULD
NOT be used in a mail-only reply to a Netnews article (thus the
"inheritable" property of this header applies only to followups to a
newsgroup, and not to followups to the poster). Moreover, if a
newsgroup-name contains any non-ASCII character, it MAY be encoded
using the mechanism defined in [RFC 2047] when sent by mail but, if
it is subsequently returned to the Netnews environment, it MUST then
be re-encoded into UTF-8.
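For a gateway returning such a header from mail to Netnews, the
re-encoding might be done as below. This is a sketch using Python's
standard email.header module; the function name is invented, and a real
gateway would also need to handle malformed encoded-words.

```python
from email.header import decode_header, make_header


def newsgroups_from_mail(value):
    """Decode any RFC 2047 encoded-words and return UTF-8 octets."""
    text = str(make_header(decode_header(value)))
    return text.encode("utf-8")
```

A header value containing no encoded-words passes through unchanged
apart from being returned as octets.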
[RFC 1036] M. Horton and R. Adams, "Standard for Interchange of
USENET Messages", RFC 1036, December 1987.
[RFC 2047] K. Moore, "MIME (Multipurpose Internet Mail Extensions)
Part Three: Message Header Extensions for Non-ASCII Text", RFC
2047, November 1996.
[UNICODE 3.0] The Unicode Consortium, "The Unicode Standard - Version
3.0", Addison-Wesley, 2000.
[UNICODE 3.1] The Unicode Consortium, "The Unicode Standard - Version
3.1, being an amendment to [UNICODE 3.0]", Unicode Standard
Annex #27 <http://www.unicode.org/unicode/reports/tr27>, 2001.
-- Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133 Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5