
Re: Fwd: I-D ACTION:draft-hoffman-utf8headers-00.txt




On Mon, 22 Dec 2003 09:14:25 -0800, Paul Hoffman / IMC <phoffman@xxxxxxx> wrote:


Er, any comments at all?

OK, I'll bite. It has taken me a few days to get onto this mailing list, and some time to digest the document.


First let me introduce myself as the Editor of the Usefor Working Group. As some of you likely know, Usefor had intended UTF-8 headers to become the norm in Usenet, allowing for internationalized newsgroup-names. But it got bogged down in gatewaying into email, and forwards/backwards compatibility arguments, and "why didn't we invent yet another 8bit->7bit encoding". So our arms were twisted and we have now agreed to remove all that from the draft and, instead, produce an Experimental Protocol to deal with I18N issues. In the meantime, Usenet will have to get by with RFC 2047 and RFC 2231.

So I was delighted to see this proposal, because if it gets accepted for Email, there is a greater chance that our Experimental I18N Protocol will be able to build upon it.

Now I note that the proposal comes in two parts. How to deal with local parts, and how to introduce UTF-8 in headers. These are somewhat orthogonal issues, so I will reserve my comments to the UTF-8 part, though I do have some concerns about local-parts too.

So here is your section 3, with my (indented) remarks.

3.1 UTF-8-HEADERS extension

[snip]

The terminal SMTP server is responsible for knowing whether or not the
message store can handle UTF-8 headers. A terminal SMTP server MUST NOT
advertise the UTF-8-HEADERS extension if the message store for which it
is responsible cannot
handle UTF-8 headers.

If an SMTP client does not see the UTF-8-HEADERS extension advertised
by an SMTP server, the SMTP client MUST downgrade the
non-ASCII contents of all header bodies before continuing to send
the message. The SMTP client SHOULD send the message with the downgraded
header bodies as a normal message.
If any header body cannot be downgraded, the SMTP client
MUST bounce the message with an error code of 558.

    No, I don't think that works, since the concept of a "terminal" server
    is not well defined. A typical SMTP server might be able to do
    UTF-8-HEADERS for some addresses, but not others. It needs to see a
    RCPT TO before it really knows. For example, if it is acting as the
    secondary MX relay for another server it might be able to accept the
    message, but not so if it was for local delivery, or maybe only for
    certain known users. Again, if the server is a 'smarthost' willing to
    deliver mail anywhere worldwide (but also to some local users/stores),
    what is it to say?

So I think you need to say something more like:

    A server which advertises the UTF-8-HEADERS extension accepts
    responsibility for forwarding to other servers with that capability,
    or to enabled POP3/IMAP stores, or to enabled MUAs. Absent such
    capability in those other servers/stores/MUAs, it MUST/SHOULD/MAY
    downgrade before forwarding. If it cannot downgrade (for whatever
    reason), it MUST respond with 558.

    Note that I do not believe downgrading is as easy as you have
    suggested (see below), hence the MUST/SHOULD/MAY. I would be happy to
    regard a server which never downgraded as being minimally compliant
    (and rather easy to implement, to get us started).
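
    As a minimal sketch of the rule suggested above, the per-recipient
    decision might look like the following. The names `Action` and
    `relay_action` are hypothetical, and the sketch assumes the server
    already knows, after RCPT TO, whether the next hop is enabled.

```python
from enum import Enum


class Action(Enum):
    FORWARD_AS_IS = "forward unchanged"
    DOWNGRADE = "downgrade, then forward"
    REJECT_558 = "reject with 558"


def relay_action(has_non_ascii: bool, next_hop_enabled: bool,
                 can_downgrade: bool) -> Action:
    """Decide what to do with one message for one recipient (illustrative only)."""
    # Pure ASCII headers, or an enabled next hop: nothing to do.
    if not has_non_ascii or next_hop_enabled:
        return Action.FORWARD_AS_IS
    # Next hop is not enabled: downgrade if this server is able to.
    if can_downgrade:
        return Action.DOWNGRADE
    # A server that never downgrades simply ends up here.
    return Action.REJECT_558
```

    A minimally compliant server, in the sense described above, would
    always pass can_downgrade=False.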

All UTF-8 headers bodies can be downgraded to being all-ASCII.
However, any header body that contains a non-ASCII mailbox name might
not be able to be downgraded if there is no Address-map header that
gives a mapping for the downgrading.

    BTW, I see that you use the term "header body". Can I persuade you to
    use the term "header content"? That was used in Son-of-1036, and is
    being used in Usefor. And the term "body" has the widely recognized
    connotation of the message body. Also, you are using the term "header"
    when you really mean "header field". That is very naughty, and your
    wrist should be slapped (I entirely approve of your usage, of course
    :-) ).

3.2 Downgrading header bodies

This section defines how to downgrade header bodies. Note that
downgrading MUST only be done if necessary. That is, downgrading
MUST never be done on fields or bodies that are all-ASCII.

3.2.1 Mailboxes

Mailboxes appear in many standard headers, such as To:, From:, Sender:,
Reply-to:, Cc:, Bcc:, Received:, and some of the Resent-: headers.
Downgrading mailboxes is done as follows:

    Yes, but that list is not exhaustive. There are many headers
    containing mailboxes not in that list (Approved: is the obvious
    example), and many more will be invented over time. How is a server
    supposed to know which headers contain them and with what syntax? For
    example, do you know, off the top of your head, whether
    Mail-Copies-To: includes a mailbox (in fact, it does)?

    Note that there is no easy solution to this problem (essentially the
    same problem that makes RFC 2047 unusable as written). Maybe a system
    will try to recognize mailboxes by their syntax and will get it right
    often enough to be useful. Maybe it would help if you were to insist
    that all Non-ASCII addr-specs were REQUIRED to have <...> around them.

1) If necessary, convert the domain using IDNA.

   2) If necessary,convert the local-parts using values from an
      Address-map: header in the message

   3) If necessary,convert any display-name or comment using
      quoted-printable with UTF-8 encoding
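
    Steps 1 and 2 can be sketched as follows, assuming the mailbox has
    already been located (which, as noted above, is the hard part). The
    function name, the exception, and the layout of `local_map` (non-ASCII
    local-part to ASCII local-part) are assumptions made for illustration;
    the draft does not define them. Python's built-in "idna" codec
    implements IDNA 2003.

```python
from typing import Mapping


class CannotDowngrade(Exception):
    """Raised when a non-ASCII local-part has no ASCII mapping."""


def downgrade_addr_spec(addr_spec: str, local_map: Mapping[str, str]) -> str:
    """Downgrade local-part@domain to all-ASCII, touching only what needs it."""
    local, _, domain = addr_spec.rpartition("@")
    if not domain.isascii():
        # Step 1: convert the domain using IDNA.
        domain = domain.encode("idna").decode("ascii")
    if not local.isascii():
        # Step 2: look up an ASCII replacement (hypothetical map layout).
        try:
            local = local_map[local]
        except KeyError:
            raise CannotDowngrade(addr_spec) from None
    return f"{local}@{domain}"


if __name__ == "__main__":
    mapping = {"bj\u00f6rn": "bjorn-ascii"}
    print(downgrade_addr_spec("bj\u00f6rn@r\u00e4ksm\u00f6rg\u00e5s.se", mapping))
```

    An all-ASCII addr-spec passes through unchanged, and a missing mapping
    raises CannotDowngrade, which is the case where the message would have
    to be bounced.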

3.3.2 Message-ids

Downgrading message-ids is done as follows

AAAAARRRRRRRRRRRGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHH!

    PLEASE no downgrading of Message-ids. No Non-ASCII in Message-ids.
    Netnews is the main user of Message-ids (they are hardly useful in
    email except for copying into References so as to make threading
    work). But if the slightest amount of
    munging/downgrading/rewriting/whatever were ever allowed in
    Message-ids, the whole of Usenet would collapse in a heap, even before
    it was time to show the Film at 11. The RFC 2822 msg-ids are already
    too liberal to work on Usenet.

    Generally speaking, it would be far better to restrict the use of
    UTF-8 in headers to those contexts where it is explicitly allowed. In
    pure RFC 2822 mail, that would be local-parts, domains, phrases,
    unstructureds and comments. Period. Extensions might specify other
    contexts (parameters in Content-Type raises its ugly head) and would
    also specify how to downgrade (if at all). I believe some headers
    currently allow URIs; they could be extended to allow IRIs (for which
    a downgrading is already defined).  Usenet would add Newsgroups. And
    so on.

[snip]

3.3.3 Informational headers

If necessary, downgrading the bodies of informational headers (Subject:,
Comments:, and Keywords:) is done using quoted-printable with UTF-8
encoding.

    Yes, but it might be wiser (though uglier) to use the existing RFC
    2047 downgrade, which is at least understood by many/most MUAs now.
    Otherwise, you have to define when to upgrade (and maybe the man who
    wrote
        Subject: =20 considered harmful
    really didn't want it to be upgraded).

3.3.4 Address-map headers

If necessary, the Address-map: header is downgraded using Base64 for
local-parts, and IDNA for domain names.

[snip]

As another example:

  Address-map: bj<oumlaut>n@r<aumlaut>ksm<oumlaut>rg<aring>s.se,
      bjorn-ascii@xxxxxxxxxxxxxxxxx

would be downgraded to:

  Address-map: YmrDtnJu@xxxxxxxxxxxxxxxxx,
      bjorn-ascii@xxxxxxxxxxxxxxxxx

    All right, but how do you know when/whether to upgrade again? If the
    LHS of an Address-Map pair is
        frederic@xxxxxxxxxxxxxxxxx
    how do you know that 'frederic' is not a base 64 representation of some
    unpronounceable Mongolian name?
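
    A short sketch of the ambiguity: a purely syntactic test cannot tell an
    ordinary ASCII local-part from a Base64 one. The helper name is
    hypothetical.

```python
import base64
import binascii
from typing import Optional


def base64_decode_or_none(local_part: str) -> Optional[bytes]:
    """Return the decoded octets if local_part is syntactically valid Base64."""
    try:
        return base64.b64decode(local_part, validate=True)
    except (binascii.Error, ValueError):
        return None


if __name__ == "__main__":
    for candidate in ("YmrDtnJu", "frederic", "bjorn-ascii"):
        print(candidate, base64_decode_or_none(candidate))
```

    Both 'YmrDtnJu' and 'frederic' are accepted as Base64; only
    'bjorn-ascii' is rejected, and only because of the hyphen. Checking
    that the decoded octets are valid UTF-8 would narrow the ambiguity, but
    it remains a heuristic and not a marker the sender put there.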

3.3 Things not changed from RFC 2822

    No, before you do that you need to consider all the other headers that
    might contain Non-ASCII. For example,

Content-Disposition: attachment; filename="Jos<eacute>'s_file"

    To which the answer might (or might not) be RFC 2231. Yes, it is ugly,
    but it is already in the field.

    And I am sure there are lots of other problem cases to be considered
    (not forgetting X-headers).
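
    For what it is worth, the RFC 2231 form of that filename can be
    produced with Python's standard library. This is a sketch only; the two
    helper names are hypothetical, and the result would be carried as a
    filename*= parameter.

```python
from email.utils import decode_rfc2231, encode_rfc2231
from urllib.parse import unquote


def downgrade_param(value: str) -> str:
    """Encode a parameter value as charset'language'percent-encoded-text."""
    return encode_rfc2231(value, charset="utf-8")


def upgrade_param(ext_value: str) -> str:
    """Reverse downgrade_param()."""
    charset, _language, text = decode_rfc2231(ext_value)
    # decode_rfc2231 only splits the three parts; percent-decoding is separate.
    return unquote(text, encoding=charset or "us-ascii")


if __name__ == "__main__":
    ext = downgrade_param("Jos\u00e9's_file")
    print("Content-Disposition: attachment; filename*=" + ext)
    print(upgrade_param(ext))
```

    Unlike the Base64 case above, the charset prefix and the trailing "*"
    on the parameter name tell the receiver that an upgrade is possible.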

3.3 Things not changed from RFC 2822

Note that this protocol does not change the definition of header field
names. That is, only the bodies of headers are allowed to have non-ASCII
characters; the rules in RFC 2822 for header names are not changed.

Similarly, this protocol does not change the date and time specification
in RFC 2822.

    Agreed about those cases but, as I said above, it is better to specify
    where Non-ASCII IS allowed, rather than where it ISN'T.

3.4 Additional processing rules

[snip]

Terminal SMTP servers MAY look into the headers of a message to
determine whether they should upgrade a downgraded set of headers to
UTF-8. This is easy to determine: if the Address-map: header contains
only ASCII, it was downgraded earlier in the chain of SMTP server.
Upgrading is particularly useful on bounce messages caused by bad
mappings.

    No, that doesn't work. It may be that the message contained no
    Non-ASCII local-parts or domains. Maybe it had been downgraded because
    of UTF-8 in the Subject, or in some comment or display-name.


Indeed, the next big problem is how servers and other agents are to
recognize whether any of the headers of a message contain any Non-ASCII.
Yes, you could scan the headers of every message looking for an octet > 127,
but that is a great expenditure of effort considering that 99.9% of
the world's emails will have pure ASCII headers for several years to come.
Far better to have some indication in the message that it contains 8bit
stuff (most likely an extra header to say so). Indeed, Mark Crispin is on
record as saying that, if he is to have his arm twisted into having UTF-8
headers in IMAP, he would insist on such a header.
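
For comparison, the brute-force scan is easy enough to write; the objection is to running it on every message. A minimal sketch, with a hypothetical function name, looking only at the top-level header section (headers of nested body parts are not examined):

```python
import re

# The header section ends at the first empty line (CRLF or bare LF).
_HEADER_END = re.compile(rb"\r?\n\r?\n")


def headers_have_non_ascii(raw_message: bytes) -> bool:
    """Return True if any octet in the top-level headers is above 127."""
    head = _HEADER_END.split(raw_message, maxsplit=1)[0]
    return any(octet > 127 for octet in head)


if __name__ == "__main__":
    msg = b"Subject: Bj\xc3\xb6rn\r\nTo: chl@example.org\r\n\r\nbody\r\n"
    print(headers_have_non_ascii(msg))
```

With a marker header, an agent would only need to look for one field instead of examining every octet of the headers.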

In addition to that, SMTP is not the only mechanism for transporting email
(or netnews). There is UUCP. There is NNTP. There is X.400 (complete
with complex gatewaying rules in and out). There are satellites and
carrier pigeons and goodness knows what. Not all of these protocols will
want to implement a UTF-8-HEADERS extension. Indeed, for UUCP and NNTP it
is quite unnecessary, because they are 8bit clean already, and the
upcoming NNTP draft already assumes UTF-8 (in the few places where it
would notice).

So if a message passes through one of these protocols, it must carry
something with it that warns of Non-ASCII characters should it enter a
"normal"/SMTP environment at the far end.

But far more than that is the political advantage in having such a header.
Today, the great bulk of the internet message system uses ASCII headers
and nothing else. A few brave souls are determined to use UTF-8 (or,
shudder!  GBxxxx) in their headers. OK. They should bear the cost of
bringing it in.  That includes the trouble of having to mark their
messages as "unclean". Of causing suitable user agents to be implemented.
Of persuading their server admins to provide enabled POP3 and IMAP
servers. But, most of all, to persuade SMTP servers around the world to
carry their stuff at least without destroying/munging it. Their own user
agents and local servers are more or less under their control. Not so the
uncaring SMTP relays through which their messages may have to pass (we may
assume that the bulk of the people they want to communicate with will be
speaking their own languages, and will thus also have enabled software
available). But to get random SMTP servers worldwide to upgrade will be a
hard slog, and it will only be the dedicated people who want to use the
facility who will have reason to apply the pressure to make it happen.

Which is why I think it better for this to be an Experimental Protocol in
the first instance. It is less "threatening" to the IETF establishment; it
silences those people who will not allow anything incompatible with what
is already deployed without workarounds and kludges and yet more encodings
already in place. By all means, if you can get it through on the standards
track, then good luck to you, but not at the price of holding it up for 5
years. Time is not on our side. People are already using UTF-8 (and,
shudder!, GBxxxx) in headers because "it works for them". They are not
going to wait.

Usefor has already been through this. Internationalized newsgroup-names
were to have been the major advance of the project. But we have been
persuaded to remove them from the draft and to bring them forth later as
an experimental protocol. Even though they had been shown to work without
problem within the existing Usenet without any server upgrades.

So let me suggest a header so that UTF-8 users can mark their messages as
"unclean".

Header-Transfer-Encoding : "Header-Transfer-Encoding:" ( "8bit" / "7bit" )
                               *( ";" parameter )

OK, it needs CFWS and all that jazz in the proper places. We can argue
later whether the operative keyword is "8bit" or "utf-8". Note the
optional parameters (syntax as RFC 2045) which allow extensibility. The
only parameter I would propose initially is "language = <language-code>".
I explicitly OMIT a charset parameter, because the REQUIRED charset for
Non-ASCII headers is UTF-8. And I make that omission very EXPLICIT because
it indicates to the Chinese how they could work around using GBxxxx within
their own borders supposing that they refuse to use UTF-8, as they most
assuredly will.

It might be argued that this header SHOULD precede any use of Non-ASCII in
the headers (but given the propensity for transports to reorder headers, I
doubt that would survive).

Some people have doubts about including a language header. I put it there
to forestall Bruce Lilly who will otherwise come before us pointing out
that the word "boot" has different meanings in German and English, and
more importantly by pointing out that there is an IETF requirement to
include language specifications in all protocols. And even with that
parameter in place, he will still complain that it does not allow
different languages to be specified in different headers :-( .

So you now say that this header MUST be present, with "8bit", in any
message which includes any Non-ASCII character in any of its headers (or
included body part header fields or message types). And it MAY be present,
with "7bit" in pure ASCII messages. And that "MUST" MAY be removed in a
future version of the document (in 20 years time when all non-compliant
implementations are long dead).
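
As a rough sketch of how an agent might check that rule, assuming the header is spelled as proposed above: the function name is hypothetical, folded field values and the headers of nested body parts are ignored, and any parameters after the keyword are skipped.

```python
import re

_HEADER_END = re.compile(rb"\r?\n\r?\n")
# Capture only the keyword; parameters such as "; language=sv" are ignored here.
_HTE = re.compile(
    rb"^Header-Transfer-Encoding[ \t]*:[ \t]*([0-9A-Za-z-]+)",
    re.IGNORECASE | re.MULTILINE,
)


def marking_is_valid(raw_message: bytes) -> bool:
    """Check the MUST: Non-ASCII headers require a declared "8bit"."""
    head = _HEADER_END.split(raw_message, maxsplit=1)[0]
    has_non_ascii = any(octet > 127 for octet in head)
    if not has_non_ascii:
        # Pure ASCII headers: the marker is optional (the MAY case).
        return True
    match = _HTE.search(head)
    return match is not None and match.group(1).lower() == b"8bit"


if __name__ == "__main__":
    marked = (b"Header-Transfer-Encoding: 8bit; language=sv\r\n"
              b"Subject: Bj\xc3\xb6rn\r\n\r\nbody\r\n")
    print(marking_is_valid(marked))
```

A receiver that trusts the marker would skip the octet scan entirely; the scan appears here only so the sketch can confirm the marking.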

And with a header like that in place, I think this idea might very well
fly.

--
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl@xxxxxxxxxxxxxxxx      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5