From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Fri Jul 05 2002 - 13:40:27 CDT
My proposal seems broadly acceptable, so I have now embarked on the
consequential changes.
1. In 2.4.2
Original:
The syntax for UTF8-xtra-char excludes those redundant sequences of
octets which cannot occur in UTF-8, as defined by [RFC 2279], either
because they would not be the shortest possible encodings of some UCS
character [ISO/IEC 10646], or they would represent one of the
characters D800 through DFFF, disallowed in UCS because of their
surrogate use in the UTF-16 encoding. These sequences MUST NOT be
generated by posting agents. Where they occur inadvertently, they MAY
be passed on untouched by other agents, but they MUST NOT ever be
decoded or otherwise interpreted as meaningful characters.
New version:
The syntax for UTF8-xtra-char excludes those redundant sequences of
octets which cannot occur in UTF-8, as defined by [RFC 2279], either
because they would not be the shortest possible encodings of some UCS
character [ISO/IEC 10646], or they would represent one of the
characters D800 through DFFF, disallowed in UCS because of their
surrogate use in the UTF-16 encoding. These sequences MUST NOT be
generated by posting agents. Where they occur inadvertently, they
SHOULD be passed on untouched by other agents, but attempts to
interpret them as malformed UTF-8 MUST NOT be made. However, if there
is reason to suppose they are representations of some other character
set they MAY, as suggested in section 4.4.1, be interpreted as such.
Note the change from "MAY pass on untouched" to "SHOULD pass on
untouched". Is that correct?
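For illustration, the two excluded classes (non-shortest encodings and the surrogates D800 through DFFF) can be recognised mechanically. The following is only a sketch in Python, not part of the draft; the names FORMS and classify are invented for the example, and it assumes the one-to-six octet forms tabulated in [RFC 2279].

```python
# Hypothetical helper, not taken from the draft.
# Each row: (lowest lead octet, highest lead octet, sequence length,
#            payload mask for the lead octet, smallest value needing that length)
FORMS = [
    (0xC0, 0xDF, 2, 0x1F, 0x80),
    (0xE0, 0xEF, 3, 0x0F, 0x800),
    (0xF0, 0xF7, 4, 0x07, 0x10000),
    (0xF8, 0xFB, 5, 0x03, 0x200000),
    (0xFC, 0xFD, 6, 0x01, 0x4000000),
]


def classify(seq: bytes) -> str:
    """Classify one multi-octet sequence as 'ok', 'overlong', 'surrogate' or 'malformed'."""
    if not seq:
        return "malformed"
    for low, high, length, mask, minimum in FORMS:
        if low <= seq[0] <= high:
            break
    else:
        return "malformed"  # ASCII octet, stray continuation octet, or FE/FF
    if len(seq) != length:
        return "malformed"
    value = seq[0] & mask
    for octet in seq[1:]:
        if octet & 0xC0 != 0x80:  # continuation octets have the form 10xxxxxx
            return "malformed"
        value = (value << 6) | (octet & 0x3F)
    if value < minimum:
        return "overlong"  # a shorter encoding of the same character exists
    if 0xD800 <= value <= 0xDFFF:
        return "surrogate"  # reserved for UTF-16, disallowed in UCS
    return "ok"
```

Under these assumptions, classify(b"\xc0\xaf") reports "overlong" (a two-octet form of "/") and classify(b"\xed\xa0\x80") reports "surrogate" (D800). An agent following the new wording would pass such octets on untouched rather than decode them.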
2. In 4.4.2. Character Sets within Article Bodies
Original:
Within article bodies, characters are represented as octets according
to the encoding scheme implied by any Content-Transfer-Encoding- and
Content-Type-headers [RFC 2045]. In the absence of such headers,
reading agents cannot be relied upon to display correctly more than
the US-ASCII characters, though they MUST display at least those.
NOTE: Observe that reading agents are not forbidden to "guess",
or to interpret as UTF-8 regardless, which would be the simplest
course for them to take.
New version of the NOTE:
NOTE: Observe that reading agents are not forbidden to "guess"
when confronted with unannounced non-ASCII characters, and in
particular it would be reasonable at least to test whether they
were in the form of valid UTF-8 (see also the suggestion for
such a test in 4.4.1).
3. 6.9. Posted-And-Mailed
Original:
.... All other headers defined in this standard (excluding
variant headers, but including specifically the Message-ID-header)
MUST be identical in both the posted and mailed versions of the
article, and so MUST the body.
New version:
.... All other headers defined in this standard (excluding
variant headers, but including specifically the Message-ID-header)
MUST be identical in both the posted and mailed versions of the
article, except that headers rendered in UTF-8 in the posted version
MAY be encoded according to [RFC 2047] in the emailed version. The
bodies MUST be identical in both, apart from a possible change of
Content-Transfer-Encoding.
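As an illustration of the permitted difference between the two versions, here is a sketch in Python using the standard email.header module. The function names header_for_email and header_from_email are invented for the example, and it assumes an unstructured header such as Subject, where [RFC 2047] encoded-words are allowed.

```python
from email.header import Header, decode_header, make_header


def header_for_email(value: str) -> str:
    """Return the header content for the mailed copy.

    Pure ASCII content is left alone; anything else is rendered as
    RFC 2047 encoded-word(s) with charset utf-8.
    """
    if value.isascii():
        return value
    return Header(value, charset="utf-8").encode()


def header_from_email(value: str) -> str:
    """Recover the UTF-8 text from the mailed form (the inverse of the above)."""
    return str(make_header(decode_header(value)))


posted = "Grüße aus Köln"          # as it appears in the posted article
mailed = header_for_email(posted)   # ASCII-only form for the emailed copy
```

The mailed form differs octet for octet from the posted one, but decoding it yields the identical text, which is the sense in which the two versions remain "identical" under the new wording.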
4. 6.21.2.2. Message/rfc822
Original:
NOTE: It is likely, though not guaranteed, that headers
containing UTF8-xtra-chars will pass safely through email
transports supporting 8BITMIME if the "message/rfc822" object is
sent as an attachment (i.e. as a part of a multipart) rather
than as the top-level body of the email message. Moreover, it is
anticipated that future extensions to the Email standards will
permit headers containing UTF8-xtra-chars to be carried without
further ado over conforming transports.
New version:
NOTE: It is likely, though not guaranteed, that headers
containing UTF8-xtra-chars will pass safely through email
transports supporting 8BITMIME if the "message/rfc822" object is
sent as an attachment (i.e. as a part of a multipart) rather
than as the top-level body of the email message.
5. 8.8.1. Duties of an Outgoing Gateway
Original:
Where the format of the news article is incompatible with that of the
target medium, it may be necessary to apply transformations. In
particular, the presence of UTF8-xtra-chars in headers may be a
source of such incompatibility when gatewaying into Email. On the
other hand, some email systems (especially those supporting the
8BITMIME extensions [RFC 2821]) may well transport such material
correctly, and some user agents may even display it.
..................
o Encapsulating the whole article as a message/rfc822 (6.21.2.2) may
make it less likely to be mutilated during transport, especially
where 8BITMIME is supported. Alternatively, encapsulating as an
application/news-transmission (6.21.6.1) will guarantee correct
transmission and is the method of choice where the intent is to
gateway it back into Netnews later on.
o Encoding words containing UTF8-xtra-chars according to [RFC 2047],
where permitted by that standard (i.e. within phrases and
unstructured headers), and preferably using the charset utf-8,
should ensure their correct display upon arrival. Indeed, many
user agents will display this encoding correctly in contexts not
allowed by [RFC 2047].
o In particular, treating a newsgroup-name as an encoded word
according to [RFC 2047] is recommended (see also 5.5). Even if it
is not decoded at the far end, it is preferable to display the
encoded form than to display nothing at all. Note, however, that
such encoded newsgroup-names MUST be restored to their canonical
form before reinjection into any Netnews system.
o Parameters whose values contain UTF8-xtra-chars may use the
encoding defined in [RFC 2231], again preferably using the charset
utf-8.
New version:
Where the format of the news article is incompatible with that of the
target medium, it may be necessary to apply transformations.
..................
o Transporting headers containing non-ASCII characters without first
encoding them is contrary to the current Email standards [RFC
2821] and [RFC 2822]. This applies both to the top-level headers
of the email, and also to headers contained within any embedded
message or multipart Content-Types (and so recursively). However,
it is well known that most mail transport agents will in fact
convey these characters intact, especially for non-top-level
headers in the case of transports which support the 8BITMIME
extension, and it is to be expected that the prevalence of this
ability will increase in the future (and may even be compliant
with future versions of the Email standards). Moreover, many mail
user agents will also display such characters correctly, or at
least adequately. Therefore, some implementors of gateways may
consider it an acceptable risk not to transform these headers in
any way, especially in the case of the lower-level ones.
NOTE: It is not the purpose of this standard either to condemn
or to condone behaviours which may be non-compliant with other
standards. That is a matter for those implementors.
o Where an implementor considers the risk too high for the top-level
headers, encapsulating the whole article as a message/rfc822
(6.21.2.2) may make it less likely to be mutilated during
transport, especially where 8BITMIME is supported. Alternatively,
encapsulating as an application/news-transmission (6.21.6.1) will
guarantee correct transmission in all cases and is the method of
choice where the intent is to gateway it back into Netnews later
on.
o To ensure full compliance with the Email standards it is necessary
to encode words containing UTF8-xtra-chars according to [RFC 2047]
(but only where permitted by that standard, i.e. within phrases
and unstructured headers, although many user agents will display
this encoding correctly in other contexts also). Likewise, within
parameters the proper encoding is that defined in [RFC 2231]. In
both cases, it is preferable to encode using the charset UTF-8,
although it might be wise first to confirm that that is indeed the
charset which had been used (see 4.4.1).
o In particular, treating a newsgroup-name ...
[OK, that is waiting for the newsgroup-name algorithm to be finalized.]
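For the parameter case in the third bullet above, a sketch in Python of the [RFC 2231] extended form with charset utf-8 follows. The names encode_param and decode_param_value are invented for the example; email.utils.encode_rfc2231 is the standard library call. Escaping of quotes and backslashes in the ASCII branch, and continuation of long values, are deliberately left out.

```python
from email.utils import encode_rfc2231
from urllib.parse import unquote


def encode_param(name: str, value: str) -> str:
    """Render name=value, using the RFC 2231 extended form only for non-ASCII values."""
    if value.isascii():
        # Simplification: assumes the value contains no '"' or '\' needing escape.
        return f'{name}="{value}"'
    # Extended form: name*=charset'language'percent-encoded-octets (language left empty).
    return f"{name}*={encode_rfc2231(value, 'utf-8')}"


def decode_param_value(extended: str) -> str:
    """Decode the right-hand side of an extended parameter back to text."""
    charset, _language, text = extended.split("'", 2)
    return unquote(text, encoding=charset)


param = encode_param("filename", "Grüße.txt")
```

Here param comes out as filename*=utf-8''Gr%C3%BC%C3%9Fe.txt, which is pure ASCII and so safe for any email transport.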
6. AND FINALLY ....
Here is the latest version of the revised text within 4.4.1:
[Alternative proposal for the above three paragraphs:]
In the particular case of newsgroup-names (see 5.5) there are more
stringent requirements regarding the normalization and other usages
of Unicode.
Where the use of non-ASCII characters is permitted as above, they MAY
be encoded in UTF-8 and they MAY be encoded using the MIME mechanisms
defined in [RFC 2047] and [RFC 2231], but only in those contexts
explicitly mentioned in those documents (unstructured headers,
phrases and comments in the one, quoted-strings within parameters in
the other).
Encoding by other means is not compliant with this standard.
Nevertheless, encoding using other character sets (with no indication
of which one beyond the user's ability to guess based upon other
clues in the article, or custom within the newsgroup) has been in use
in some hierarchies, and such usage may be expected to continue for
some period after the introduction of this standard. Reading agents
MUST support the use of UTF-8, [RFC 2047] and [RFC 2231] in headers
and they MAY, when it is detected that none of these has been used,
attempt to interpret the header according to whatever other character
set can be deduced, or has been configured as a default by the reader.
NOTE: It is possible to determine, with a high degree of
accuracy, when a given text containing octets with the 8th bit
set was not encoded using UTF-8, and using this test to recover
such non-compliant texts is therefore commended where no other
harm could arise.
Exceptionally, Newsgroups-headers (5.5) MUST use UTF-8 in order to
ensure that they appear in their canonical form (in any case, a
Newsgroups-header is not one of the acceptable contexts of [RFC
2047]). Certain exceptions to this rule are provided (8.7 and 8.8.1)
for use when mailing to moderators and other gatewaying applications.
NOTE: The choice between UTF-8 and [RFC 2047] when posting
depends on various factors. Some reading agents do not recognize
[RFC 2047], and some are incapable of decoding UTF-8 (though
there is an increasing tendency for modern reading agents to
understand, or to be configurable to understand, both). Since
headers encoded in UTF-8 are currently prohibited in Email,
special consideration needs to be given to articles that are
both posted and mailed (6.9) or which are mailed to moderators
(see 8.2.2). Posters and implementors of posting agents need to
take account of all these factors when deciding which method to
use.
[End of alternative text.]
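The test mentioned in the first NOTE of the text above could look something like the following sketch in Python. The function name decode_raw_header and the iso-8859-1 default are assumptions made for the example, standing in for "whatever other character set can be deduced, or has been configured as a default by the reader". Note that Python's strict decoder follows the later, narrower definition of UTF-8 and so also rejects the five- and six-octet forms of [RFC 2279].

```python
from typing import Tuple


def decode_raw_header(raw: bytes, fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Return (text, charset_used) for the raw octets of a header.

    Strict UTF-8 decoding fails on almost any 8-bit text written in a
    legacy single-octet charset, so a failure is good evidence that
    UTF-8 was not used and the fallback charset may be tried instead.
    """
    try:
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace"), fallback


compliant = decode_raw_header("Köln".encode("utf-8"))
legacy = decode_raw_header("Köln".encode("iso-8859-1"))
```

Both calls recover the text "Köln"; only the second element of the result shows which interpretation was needed. Pure ASCII input is reported as utf-8, which is harmless since ASCII is a subset of it.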
Observe that I have removed from that last NOTE the sentence which read:
It is the intention that, ultimately, UTF-8 will become the method of
choice, and future versions of this standard are likely to indicate
that it SHOULD be used.
That leaves the matter of the phrase "Encoding by other means is not
compliant with this standard." Some people have asked for an explicit "MUST
NOT generate" instead (perhaps they feel we have been liberal enough by
suggesting that usage MAY be accepted). I am half-inclined to agree,
though I would like to hear further opinions before actually doing it.
-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5