Re: UTF-8 and RFC 2047


From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Fri Jul 05 2002 - 13:40:27 CDT


My proposal seems broadly acceptable, so I have now embarked on the
consequential changes.

1. In 2.4.2

Original:
   The syntax for UTF8-xtra-char excludes those redundant sequences of
   octets which cannot occur in UTF-8, as defined by [RFC 2279], either
   because they would not be the shortest possible encodings of some UCS
   character [ISO/IEC 10646], or they would represent one of the
   characters D800 through DFFF, disallowed in UCS because of their
   surrogate use in the UTF-16 encoding. These sequences MUST NOT be
   generated by posting agents. Where they occur inadvertently, they MAY
   be passed on untouched by other agents, but they MUST NOT ever be
   decoded or otherwise interpreted as meaningful characters.

New version:
   The syntax for UTF8-xtra-char excludes those redundant sequences of
   octets which cannot occur in UTF-8, as defined by [RFC 2279], either
   because they would not be the shortest possible encodings of some UCS
   character [ISO/IEC 10646], or they would represent one of the
   characters D800 through DFFF, disallowed in UCS because of their
   surrogate use in the UTF-16 encoding. These sequences MUST NOT be
   generated by posting agents. Where they occur inadvertently, they
   SHOULD be passed on untouched by other agents, but attempts to
   interpret them as malformed UTF-8 MUST NOT be made. However, if there
   is reason to suppose they are representations of some other character
   set they MAY, as suggested in section 4.4.1, be interpreted as such.

Note the change from "MAY be passed on untouched" to "SHOULD be passed on
untouched". Is that correct?
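
To make the two excluded classes of sequence concrete, here is a minimal
sketch in Python. It assumes that Python 3's strict UTF-8 decoder is an
acceptable stand-in for the syntax in the draft; that decoder follows the
later, narrower definition of UTF-8 (nothing above U+10FFFF) rather than
[RFC 2279] itself. The function name is purely illustrative.

```python
def is_wellformed_utf8(octets: bytes) -> bool:
    """Return True if the octets decode as strict UTF-8.

    Assumption: Python 3's built-in decoder is used as the test. It rejects
    non-shortest ("overlong") encodings and the surrogates D800 through DFFF.
    """
    try:
        octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# C0 80 is a two-octet (non-shortest) encoding of U+0000.
overlong_nul = b"\xc0\x80"
# ED A0 80 would represent the surrogate U+D800.
surrogate = b"\xed\xa0\x80"

print(is_wellformed_utf8("é".encode("utf-8")))  # True
print(is_wellformed_utf8(overlong_nul))         # False
print(is_wellformed_utf8(surrogate))            # False
```

An agent that merely relays the article would not need such a test at all,
since the octets are passed on untouched either way.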

2. In 4.4.2. Character Sets within Article Bodies

Original:
   Within article bodies, characters are represented as octets according
   to the encoding scheme implied by any Content-Transfer-Encoding- and
   Content-Type-headers [RFC 2045]. In the absence of such headers,
   reading agents cannot be relied upon to display correctly more than
   the US-ASCII characters, though they MUST display at least those.

        NOTE: Observe that reading agents are not forbidden to "guess",
        or to interpret as UTF-8 regardless, which would be the simplest
        course for them to take.

New version of the NOTE:
        NOTE: Observe that reading agents are not forbidden to "guess"
        when confronted with unannounced non-ASCII characters, and in
        particular it would be reasonable at least to test whether they
        were in the form of valid UTF-8 (see also the suggestion for
        such a test in 4.4.1).

3. In 6.9. Posted-And-Mailed

Original:
   .... All other headers defined in this standard (excluding
   variant headers, but including specifically the Message-ID-header)
   MUST be identical in both the posted and mailed versions of the
   article, and so MUST the body.

New version:
   .... All other headers defined in this standard (excluding
   variant headers, but including specifically the Message-ID-header)
   MUST be identical in both the posted and mailed versions of the
   article, except that headers rendered in UTF-8 in the posted version
   MAY be encoded according to [RFC 2047] in the emailed version. The
   bodies MUST be identical in both, apart from a possible change of
   Content-Transfer-Encoding.
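
As an illustration of what the new wording permits, the following Python
sketch turns a header value rendered in UTF-8 in the posted version into
[RFC 2047] encoded-words for the emailed version, and back again. It uses the
standard library's email.header module; the helper names and the sample
subject are hypothetical, and the sketch only makes sense for unstructured
headers such as Subject.

```python
from email.header import Header, decode_header, make_header


def to_rfc2047(value: str) -> str:
    """Encode a header value as RFC 2047 encoded-words with charset utf-8."""
    return Header(value, charset="utf-8").encode()


def from_rfc2047(value: str) -> str:
    """Decode any RFC 2047 encoded-words in a header value."""
    return str(make_header(decode_header(value)))


posted_subject = "Grüße aus Köln"            # as carried in the posted version
mailed_subject = to_rfc2047(posted_subject)  # as carried in the emailed version

print(mailed_subject)
print(from_rfc2047(mailed_subject) == posted_subject)
```

The emailed form contains only US-ASCII octets, and decoding it recovers the
value used in the posted version.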

4. In 6.21.2.2. Message/rfc822

Original:
        NOTE: It is likely, though not guaranteed, that headers
        containing UTF8-xtra-chars will pass safely through email
        transports supporting 8BITMIME if the "message/rfc822" object is
        sent as an attachment (i.e. as a part of a multipart) rather
        than as the top-level body of the email message. Moreover, it is
        anticipated that future extensions to the Email standards will
        permit headers containing UTF8-xtra-chars to be carried without
        further ado over conforming transports.

New version:
        NOTE: It is likely, though not guaranteed, that headers
        containing UTF8-xtra-chars will pass safely through email
        transports supporting 8BITMIME if the "message/rfc822" object is
        sent as an attachment (i.e. as a part of a multipart) rather
        than as the top-level body of the email message.

5. In 8.8.1. Duties of an Outgoing Gateway

Original:
   Where the format of the news article is incompatible with that of the
   target medium, it may be necessary to apply transformations. In
   particular, the presence of UTF8-xtra-chars in headers may be a
   source of such incompatibility when gatewaying into Email. On the
   other hand, some email systems (especially those supporting the
   8BITMIME extensions [RFC 2821]) may well transport such material
   correctly, and some user agents may even display it.

   ..................

    o Encapsulating the whole article as a message/rfc822 (6.21.2.2) may
      make it less likely to be mutilated during transport, especially
      where 8BITMIME is supported. Alternatively, encapsulating as an
      application/news-transmission (6.21.6.1) will guarantee correct
      transmission and is the method of choice where the intent is to
      gateway it back into Netnews later on.
    o Encoding words containing UTF8-xtra-chars according to [RFC 2047],
      where permitted by that standard (i.e. within phrases and
      unstructured headers), and preferably using the charset utf-8,
      should ensure their correct display upon arrival. Indeed, many
      user agents will display this encoding correctly in contexts not
      allowed by [RFC 2047].
    o In particular, treating a newsgroup-name as an encoded word
      according to [RFC 2047] is recommended (see also 5.5). Even if it
      is not decoded at the far end, it is preferable to display the
      encoded form than to display nothing at all. Note, however, that
      such encoded newsgroup-names MUST be restored to their canonical
      form before reinjection into any Netnews system.
    o Parameters whose values contain UTF8-xtra-chars may use the
      encoding defined in [RFC 2231], again preferably using the charset
      utf-8.

New version:
   Where the format of the news article is incompatible with that of the
   target medium, it may be necessary to apply transformations.

   ..................

    o Transporting headers containing non-ASCII characters without first
      encoding them is contrary to the current Email standards [RFC
      2821] and [RFC 2822]. This applies both to the top-level headers
      of the email, and also to headers contained within any embedded
      message or multipart Content-Types (and so recursively). However,
      it is well known that most mail transport agents will in fact
      convey these characters intact, especially for non-top-level
      headers in the case of transports which support the 8BITMIME
      extension, and it is to be expected that the prevalence of this
      ability will increase in the future (and may even be compliant
      with future versions of the Email standards). Moreover, many mail
      user agents will also display such characters correctly, or at
      least adequately. Therefore, some implementors of gateways may
      consider it an acceptable risk not to transform these headers in
      any way, especially in the case of the lower-level ones.

        NOTE: It is not the purpose of this standard either to condemn
        or to condone behaviours which may be non-compliant with other
        standards. That is a matter for those implementors.

    o Where an implementor considers the risk too high for the top-level
      headers, encapsulating the whole article as a message/rfc822
      (6.21.2.2) may make it less likely to be mutilated during
      transport, especially where 8BITMIME is supported. Alternatively,
      encapsulating as an application/news-transmission (6.21.6.1) will
      guarantee correct transmission in all cases and is the method of
      choice where the intent is to gateway it back into Netnews later
      on.
    o To ensure full compliance with the Email standards it is necessary
      to encode words containing UTF8-xtra-chars according to [RFC 2047]
      (but only where permitted by that standard, i.e. within phrases
      and unstructured headers, although many user agents will display
      this encoding correctly in other contexts also). Likewise, within
      parameters the proper encoding is that defined in [RFC 2231]. In
      both cases, it is preferable to encode using the charset UTF-8,
      although it might be wise first to confirm that that is indeed the
      charset which had been used (see 4.4.1).
    o In particular, treating a newsgroup-name ...
[OK, that is waiting for the newsgroup-name algorithm to be finalized.]
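
For the parameter case mentioned in the list above, a gateway might apply the
[RFC 2231] encoding roughly as follows. This is only a sketch: it relies on
email.utils.encode_rfc2231 from the Python standard library, the helper names
and the sample filename are hypothetical, and the decoder shown handles only
the simple single-section form.

```python
from email.utils import encode_rfc2231
from urllib.parse import unquote


def encode_param_value(value: str) -> str:
    """Encode a parameter value per RFC 2231, using the charset utf-8."""
    return encode_rfc2231(value, charset="utf-8")


def decode_param_value(encoded: str) -> str:
    """Decode a simple charset'language'percent-encoded value."""
    charset, _language, text = encoded.split("'", 2)
    return unquote(text, encoding=charset)


filename = "résumé.txt"
extended = encode_param_value(filename)

# The result would be carried as an extended parameter, i.e. filename*=<extended>
print(extended)
print(decode_param_value(extended) == filename)
```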

6. AND FINALLY ....

Here is the latest version of the revised text within 4.4.1:

[Alternative proposal for the above three paragraphs:]
 
   In the particular case of newsgroup-names (see 5.5) there are more
   stringent requirements regarding the normalization and other usages
   of Unicode.

   Where the use of non-ASCII characters is permitted as above, they MAY
   be encoded in UTF-8 and they MAY be encoded using the MIME mechanisms
   defined in [RFC 2047] and [RFC 2231], but only in those contexts
   explicitly mentioned in those documents (unstructured headers,
   phrases and comments in the one, quoted-strings within parameters in
   the other).

   Encoding by other means is not compliant with this standard.
   Nevertheless, encoding using other character sets (with no indication
   of which one beyond the user's ability to guess based upon other
   clues in the article, or custom within the newsgroup) has been in use
   in some hierarchies, and such usage may be expected to continue for
   some period after the introduction of this standard. Reading agents
   MUST support the use of UTF-8, [RFC 2047] and [RFC 2231] in headers
   and they MAY, when it is detected that none of these has been used,
   attempt to interpret the header according to whatever other character
   set can be deduced, or has been configured as a default by the reader.

        NOTE: It is possible to determine, with a high degree of
        accuracy, when a given text containing octets with the 8th bit
        set was not encoded using UTF-8, and using this test to recover
        such non-compliant texts is therefore commended where no other
        harm could arise.

   Exceptionally, Newsgroups-headers (5.5) MUST use UTF-8 in order to
   ensure that they appear in their canonical form (in any case, a
   Newsgroups-header is not one of the acceptable contexts of [RFC
   2047]). Certain exceptions to this rule are provided (8.7 and 8.8.1)
   for use when mailing to moderators and other gatewaying applications.

        NOTE: The choice between UTF-8 and [RFC 2047] when posting
        depends on various factors. Some reading agents do not recognize
        [RFC 2047], and some are incapable of decoding UTF-8 (though
        there is an increasing tendency for modern reading agents to
        understand, or to be configurable to understand, both). Since
        headers encoded in UTF-8 are currently prohibited in Email,
        special consideration needs to be given to articles that are
        both posted and mailed (6.9) or which are mailed to moderators
        (see 8.2.2). Posters and implementors of posting agents need to
        take account of all these factors when deciding which method to
        use.
[End of alternative text.]
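
The behaviour permitted to reading agents in that text (test for UTF-8 first,
and only then fall back on some other character set) might be sketched in
Python as below. The function name and the choice of iso-8859-1 as the
reader's configured default are assumptions made for the example, not
anything the draft prescribes.

```python
def decode_unlabelled(octets: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode octets with the 8th bit set when no charset was announced.

    Valid UTF-8 is taken at face value; anything else is interpreted in the
    reader's configured default charset (hypothetical parameter 'fallback').
    """
    try:
        return octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not UTF-8, so it is never interpreted as malformed UTF-8.
        return octets.decode(fallback, errors="replace")


print(decode_unlabelled("naïve".encode("utf-8")))  # naïve
print(decode_unlabelled(b"na\xefve"))              # naïve
```

The test is reliable in one direction only: text in a legacy 8-bit charset
rarely happens to form valid UTF-8, but pure US-ASCII text decodes the same
either way and so gives no information.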

Observe that I have removed from that last NOTE the sentence which read:

   It is the intention that, ultimately, UTF-8 will become the method of
   choice, and future versions of this standard are likely to indicate
   that it SHOULD be used.

That leaves the matter of the phrase "Encoding by other means is not
compliant with this standard." Some people have asked for an explicit "MUST
NOT generate" instead (perhaps they feel we have been liberal enough by
suggesting that usage MAY be accepted). I am half-inclined to agree,
though I would like to hear further opinions before actually doing it.

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

