Re: if you really want utf-8 headers...
John C Klensin <john-ietf@xxxxxxx> wrote:
> 1.3 The gateway converts all envelope addresses to IMAA
> form and encapsulates the original message using
> message/rfcNNNN, so we have a MIME body of...
>
> From: "1342/2047 PersonalName" <IMAA-local-part@IDNA-domain>
> To: "1342/2047 PersonalName2" <IMAA-local-part2@IDNA-domain2>
> Date: RFC2822-date
> MIME-Version: 1.0
> content-type: message/rfcNNNN
> content-transfer-encoding: <as needed>
>
> <original message, with original headers, in original form>
Thomas Roessler <roessler@xxxxxxxxxxxxxxxxxx> replied:
> Wouldn't that construction violate MIME's "no nested encodings"
> rule when transferred in a 7bit environment?
Yes. RFC 2045 section 6.4 says:
it is EXPRESSLY FORBIDDEN to use any encodings other than "7bit",
"8bit", or "binary" with any composite media type, i.e. one that
recursively includes other Content-Type fields.
Therefore, if you try to encapsulate the 8-bit header and pass the
message to a 7-bit MTA, you won't be able to apply quoted-printable or
base64 encoding, and you're stuck.
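As a rough illustration of that rule (not part of anyone's proposal), a check along the following lines would flag the problematic combination. The function name is made up for this sketch, and it assumes the composite top-level types are exactly "multipart" and "message", as in RFC 2045 and RFC 2046.

```python
# Encodings that RFC 2045 section 6.4 permits on composite media types.
IDENTITY_ENCODINGS = {"7bit", "8bit", "binary"}

# Assumption: composite top-level types are "multipart" and "message".
COMPOSITE_TYPES = {"multipart", "message"}


def violates_nested_encoding_rule(content_type: str, cte: str) -> bool:
    """Return True if a composite type carries a non-identity encoding.

    Hypothetical helper, named for this sketch only.
    """
    main_type = content_type.split("/", 1)[0].strip().lower()
    is_composite = main_type in COMPOSITE_TYPES
    return is_composite and cte.strip().lower() not in IDENTITY_ENCODINGS


if __name__ == "__main__":
    # The message/rfcNNNN wrapper would need base64 or quoted-printable
    # to carry 8-bit headers through a 7-bit MTA, which is forbidden.
    print(violates_nested_encoding_rule("message/rfcNNNN", "base64"))
    # A leaf type such as text/plain may be encoded freely.
    print(violates_nested_encoding_rule("text/plain", "base64"))
```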
John countered:
> If it did, moving to something like
>
> multipart/used-to-be-utf8-headers; boundary = "--foo"
> --foo
> message/utf-8-headers
> > content-transfer-encoding: ...
>
> <header text>
> --foo
> message/ ???
>
> <message text, possibly encoded>
> --foo--
>
> would seem to work, although it would unquestionably be less
> attractive.
You have almost arrived at a structure I thought of a few weeks ago but
never finished writing up. If you make one more tweak you get something
more attractive than the initial message/rfcNNNN idea:
#### begin example message ####
From: ASCII@ASCII
To: ASCII@ASCII
Subject: ENCODED_WORD
Content-Type: multipart/header8; boundary=boundary123
--boundary123
Content-Disposition: inline
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit or quoted-printable or base64 as needed
8:From: UTF8@UTF8
8:To: UTF8@UTF8
8:Subject: UTF8
--boundary123
Content-Disposition: inline
Content-Type: WHATEVER; charset=WHATEVER
Content-Transfer-Encoding: WHATEVER
BODY
--boundary123--
#### end example message ####
The tweak I spoke of is in the content-types of the two inner parts.
The second part has the appropriate content-type for the body, the same
content-type that would have gone in the top-level header before we
inserted this multipart/header8 shim. The first part has content-type
text/plain, which is not a lie because a message header is indeed plain
text. The fact that it is not only plain text but also a message header
is conveyed by the content-type in the outer header: multipart/header8
is defined to contain exactly two parts, of which the first is a UTF-8
header and the second is an arbitrary message body.
What makes this style of encapsulation more attractive than the
message/rfcNNNN style is that it can be displayed by today's MUAs.
An MUA today will not know what to do with message/rfcNNNN, but it
will be able to cope with multipart/header8: it will treat it as
multipart/mixed, according to RFC 2046 section 5.1.7. And it will also
be able to cope with the UTF-8 header tagged as text/plain: it will
simply display it. I've tried this with my own MUA (mutt) and indeed
it makes no attempt to display message/foo but does correctly display
multipart/foo.
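The same experiment can be approximated with a generic MIME library standing in for an MUA. The sketch below uses Python's standard email package (an assumption of this illustration, not something the thread used) to build a multipart/header8 message and parse it back; the addresses and the residual header text are placeholders.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Placeholder residual header containing non-ASCII text.
RESIDUAL = "8:From: caf\u00e9@example.invalid\n8:Subject: caf\u00e9\n"

# Outer (fallback) header plus the multipart/header8 shim.
outer = MIMEMultipart("header8", boundary="boundary123")
outer["From"] = "ascii@example.invalid"
outer["Subject"] = "=?utf-8?b?Y2Fmw6k=?="

# First part: the UTF-8 header, labeled text/plain. MIMEText picks a
# transfer encoding for the leaf part, which the MIME rules allow.
header_part = MIMEText(RESIDUAL, "plain", "utf-8")
header_part["Content-Disposition"] = "inline"

# Second part: the real body with its own content type.
body_part = MIMEText("BODY\n", "plain", "us-ascii")
body_part["Content-Disposition"] = "inline"

outer.attach(header_part)
outer.attach(body_part)

# A parser that has never heard of "header8" still sees two parts.
parsed = email.message_from_string(outer.as_string())
first, second = parsed.get_payload()
recovered = first.get_payload(decode=True).decode("utf-8")

if __name__ == "__main__":
    print(parsed.get_content_type(), parsed.is_multipart())
    print(recovered)
```

The point of the sketch is only that an unknown multipart subtype is still walked as a multipart, and that the encoded first part decodes back to the original header text.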
This structure is rather ugly to propose as the next generation message
format, but it can instead be proposed as the downgraded form of the
next generation format. In this model, there are two classes of header
fields: old-style (what we have today) and new-style (similar, but with
direct support for non-ASCII text and maybe some other extensions). An
old-style header is a sequence of old-style fields, and a new-style
header is a sequence of either-style fields; that is, a new-style header
can contain both old-style and new-style fields.
A new-style message would simply be a new-style header and a body, but
it could be downgraded to an old-style message by splitting the header
into an old-style fallback header, a new-style residual header, and an
old-style content header, using the structure described above. For
example:
#### begin new-style message ####
Date: Mon, 5 Jan 2004 05:14:38 +0000
8:From: UTF8@UTF8
8:To: UTF8@UTF8
8:Subject: UTF8
In-Reply-To: <blah@blah>
Content-Type: text/plain; charset=iso-2022-jp
BODY
#### end new-style message ####
That could be downgraded to:
#### begin old-style message ####
Date: Mon, 5 Jan 2004 05:14:38 +0000
From: ASCII@ASCII
To: ASCII@ASCII
Subject: ENCODED_WORD
In-Reply-To: <blah@blah>
Content-Type: multipart/header8; boundary=boundary123
--boundary123
Content-Disposition: inline
Content-Type: text/plain; charset=utf-8
8:From: UTF8@UTF8
8:To: UTF8@UTF8
8:Subject: UTF8
--boundary123
Content-Disposition: inline
Content-Type: text/plain; charset=iso-2022-jp
BODY
--boundary123--
#### end old-style message ####
The fallback header is the outer header minus the Content-* fields
(which are part of the shim). The residual header is the body of
the first part of the multipart/header8. The content header is
the header of the second part of the multipart/header8 minus the
Content-Disposition field (which is part of the shim).
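To make those three definitions concrete, here is a minimal sketch that pulls the fallback, residual, and content headers out of a downgraded message using Python's standard email package. The function name is hypothetical, the sample message is an abbreviated copy of the example with blank lines separating headers from bodies, and folded (multi-line) fields in the residual header are not handled.

```python
import email

DOWNGRADED = """\
Date: Mon, 5 Jan 2004 05:14:38 +0000
From: ASCII@ASCII
Subject: ENCODED_WORD
Content-Type: multipart/header8; boundary=boundary123

--boundary123
Content-Disposition: inline
Content-Type: text/plain; charset=utf-8

8:From: UTF8@UTF8
8:Subject: UTF8
--boundary123
Content-Disposition: inline
Content-Type: text/plain; charset=iso-2022-jp

BODY
--boundary123--
"""


def split_downgraded(msg):
    """Return (fallback, residual, content) as lists of (name, value)."""
    first, second = msg.get_payload()
    # Fallback header: outer header minus the Content-* shim fields.
    fallback = [(k, v) for k, v in msg.items()
                if not k.lower().startswith("content-")]
    # Residual header: the decoded body of the first part, one field
    # per line (assumption: no folded fields).
    text = first.get_payload(decode=True).decode("utf-8")
    residual = []
    for line in text.splitlines():
        if line:
            name, _, value = line.partition(": ")
            residual.append((name, value))
    # Content header: second part's header minus Content-Disposition.
    content = [(k, v) for k, v in second.items()
               if k.lower() != "content-disposition"]
    return fallback, residual, content


if __name__ == "__main__":
    for part in split_downgraded(email.message_from_string(DOWNGRADED)):
        print(part)
```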
A downgraded message can be upgraded back into a new-style message,
but before I discuss that I need to clarify the relationship between
old-style and new-style fields.
Given an old-style field Foo:, there does not automatically exist a
new-style field 8:Foo:. The new-style field does not exist without its
own specification. Similarly, if someone defines a new new-style field
8:Bar:, they are not obligated to specify a corresponding old-style
field.
However, if both new-style and old-style versions of a field are
specified, then they must agree on whether multiple instances of the
field are allowed in a header. If multiple instances are allowed, then
there is no special significance to the occurrence of both old-style and
new-style forms within a header; they are simply independent instances
of the field, same as they would be if they were all old-style or all
new-style. But if multiple instances are not allowed, and both forms
occur in the same header, then they are alternates, and one must be
respected over the other. Obviously, old software will respect the
old-style form (because the new-style form won't be recognized), but new
software that understands new-style header fields should respect the
new-style form.
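For a field that allows only one instance, the precedence rule could be sketched as follows. The header is modeled as a list of (name, value) pairs and the function name is invented for this illustration; neither is part of the proposal.

```python
def effective_value(header, name, understands_new_style):
    """Pick the value a reader should respect for a single-instance field.

    header: list of (field_name, value) pairs, where new-style fields
    carry the "8:" prefix. Hypothetical helper for illustration.
    """
    wanted = name.lower()
    new = [v for k, v in header if k.lower() == "8:" + wanted]
    old = [v for k, v in header if k.lower() == wanted]
    # New software prefers the new-style alternate when one exists.
    if understands_new_style and new:
        return new[0]
    # Old software never recognizes "8:..." and sees only the old form.
    return old[0] if old else None


if __name__ == "__main__":
    header = [("Subject", "ENCODED_WORD"), ("8:Subject", "UTF8")]
    print(effective_value(header, "Subject", understands_new_style=True))
    print(effective_value(header, "Subject", understands_new_style=False))
```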
The specification of the new-style field may define a downgrade
conversion to the old-style form, possibly using encoded-words
and/or ACEs and/or lookups to special servers. Downgrade
conversions would be defined by at least the standard fields
8:From:, 8:Sender:, 8:Reply-To:, 8:To:, 8:Cc:, 8:Bcc:, and 8:Subject:.
The procedure for downgrading a message is as follows: The
Content-* fields go into the content header (in the second part
of the multipart/header8), the other old-style fields go into the
fallback header (in the outer header), and the new-style fields
go into the residual header (in the body of the first part of the
multipart/header8). Furthermore, old-style copies of some of the
new-style fields are created and put into the fallback header. A copy
is made if and only if the following four conditions are met:
1. the corresponding old-style field was not already present in the
original new-style header
2. the new-style field is recognized
3. multiple instances of the field are not allowed
4. a downgrade conversion is defined for the field
Finally, the shim structure is created around all of that.
Condition 1 allows the original message creator to supply a precomputed
downgraded field in the original new-style header, possibly different
from the one that would result from the standard downgrade algorithm for
that field.
There is a small exception to the rule that old-style fields go into
the fallback header, motivated by the unique role of Received: as a
trace field added in a particular order by multiple agents. If the
original new-style header contains both Received: and 8:Received:
fields, then they all go into the residual header, so that their order
can be preserved.
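The downgrade procedure, including the four conditions and the Received: exception, might be sketched like this. Everything named here is an assumption of the sketch: the header is a list of (name, value) pairs, DOWNGRADERS is a hypothetical registry that folds conditions 2 and 4 together, and only Subject has a conversion (an RFC 2047 encoded-word), because the conversions for address fields are left unspecified in the text. Building the MIME shim around the result is not shown.

```python
from email.header import Header


def encode_subject(value):
    """Downgrade conversion for Subject: an RFC 2047 encoded-word."""
    return Header(value, "utf-8").encode()


# Hypothetical registry: old-style name -> (multiple_allowed, converter).
# Presence in the registry stands for "recognized" and "has a conversion".
DOWNGRADERS = {"subject": (False, encode_subject)}


def downgrade(header):
    """Split a new-style header into (fallback, residual, content)."""
    names = {name.lower() for name, _ in header}
    # Exception: mixed Received:/8:Received: fields stay together.
    keep_received = "received" in names and "8:received" in names
    fallback, residual, content = [], [], []
    for name, value in header:
        lname = name.lower()
        if lname.startswith("content-"):
            content.append((name, value))
        elif lname.startswith("8:") or (keep_received and lname == "received"):
            residual.append((name, value))
        else:
            fallback.append((name, value))
    # Create old-style copies where all four conditions hold.
    for name, value in header:
        if not name.startswith("8:"):
            continue
        old_name = name[2:]
        spec = DOWNGRADERS.get(old_name.lower())
        if spec is None:                # conditions 2 and 4
            continue
        multiple_allowed, convert = spec
        if old_name.lower() in names:   # condition 1
            continue
        if multiple_allowed:            # condition 3
            continue
        fallback.append((old_name, convert(value)))
    return fallback, residual, content
```

Note that a Subject: field already present in the new-style header suppresses the generated copy, which is what lets the message creator supply a precomputed downgrade.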
The procedure for upgrading a downgraded message is simple: Concatenate
the fallback header, the residual header, and the content header to form
the new-style header, and discard the shim.
By definition, every old-style header is also a new-style header, so if
you want to add a new-style field to a header, you can in general just
do it, and let the header get downgraded later if necessary. However,
we don't want to create nested shims; therefore new-style fields must
not be added to fallback headers (headers containing Content-Type:
multipart/header8). In that case, first upgrade the message, then add
the new-style field.
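A small sketch of the upgrade step and of the "no nested shims" guard, again with headers modeled as lists of (name, value) pairs. The function names are invented for this illustration, and the three inputs to upgrade() are assumed to have already had the shim fields removed.

```python
def is_fallback_header(header):
    """True if the header carries Content-Type: multipart/header8."""
    for name, value in header:
        if name.lower() == "content-type":
            media_type = value.split(";", 1)[0].strip().lower()
            if media_type == "multipart/header8":
                return True
    return False


def upgrade(fallback, residual, content):
    """Concatenate the three headers to form the new-style header."""
    return list(fallback) + list(residual) + list(content)


def add_new_style_field(header, name, value):
    """Append a new-style field, refusing to touch a fallback header."""
    if is_fallback_header(header):
        raise ValueError("upgrade the message before adding a new-style field")
    return list(header) + [(name, value)]


if __name__ == "__main__":
    new_header = upgrade(
        [("Date", "Mon, 5 Jan 2004 05:14:38 +0000")],
        [("8:Subject", "UTF8")],
        [("Content-Type", "text/plain; charset=iso-2022-jp")],
    )
    print(add_new_style_field(new_header, "8:Comments", "UTF8"))
```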
By the way, for anyone concerned that old parsers would see every
new-style field as another instance of a single field named "8", the
prefix could be "8;" instead of "8:", resulting in
distinct field names "8;From", "8;To", "8;Subject", etc. Also, we could
leave room for backward-compatible expansion by using a prefix of "8::"
or "8;;", so that non-critical parameters could be inserted between
the (semi)colons, which would be ignored by implementations that don't
understand them.
As for the "other extensions" I alluded to when introducing the
term "new-style field"... as long as we're defining a new header
field syntax, we might as well consider making other changes besides
allowing non-ASCII. For example, people have expressed a desire for
alternate addresses. One approach is semi-stable cachable-or-lookupable
equivalent addresses via new servers and/or an Address-Map field.
Another approach is one-shot inline alternate addresses via an extension
of the grammar.
The current grammar defines the mailbox token as:
mailbox = name-addr / addr-spec
Suppose that in new-style fields the mailbox token is redefined as:
mailbox = single-addr / any-of-addr
single-addr = name-addr / addr-spec
any-of-addr = [display-name] "[" single-addr-list "]"
single-addr-list = single-addr *("," single-addr)
For example:
8:To: [ Joe1 <joe@xxxxx>, Joe2 <joe@xxxxx> ], Foo <foo@bar>
(Of course the expected common usage is addresses of different scripts
or languages.)
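A simplified parser for this extended mailbox syntax might look like the sketch below. It leans on parseaddr from Python's standard email.utils for each single-addr, and it makes several loudly stated simplifications: domain literals such as user@[192.0.2.1] are not supported (the bracket would be mistaken for a list), backslash escapes inside quoted strings are ignored, and the example addresses are placeholders. The function names are invented for this illustration.

```python
from email.utils import parseaddr


def split_commas(text):
    """Split on commas outside [...] and outside quoted strings."""
    parts, current = [], []
    depth, quoted = 0, False
    for ch in text:
        if ch == '"':
            quoted = not quoted
        elif not quoted and ch == "[":
            depth += 1
        elif not quoted and ch == "]":
            depth -= 1
        if ch == "," and depth == 0 and not quoted:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def parse_mailbox(text):
    """Return (display_name, choices); each choice is (name, addr)."""
    if "[" in text and text.endswith("]"):
        # any-of-addr = [display-name] "[" single-addr-list "]"
        display, _, inner = text.partition("[")
        choices = [parseaddr(s) for s in split_commas(inner[:-1])]
        return display.strip() or None, choices
    # single-addr = name-addr / addr-spec
    return None, [parseaddr(text)]


def parse_mailbox_list(value):
    """Parse a comma-separated list of new-style mailboxes."""
    return [parse_mailbox(m) for m in split_commas(value)]


if __name__ == "__main__":
    value = ("[ Joe1 <joe1@example.org>, Joe2 <joe2@example.net> ], "
             "Foo <foo@example.com>")
    for mailbox in parse_mailbox_list(value):
        print(mailbox)
```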
Unlike the Address-Map or server-based approaches, there is no claim
here that the addresses are equivalent; this is simply a way to write
"address1 or address2 or address3..." in a particular place in a header.
Anyone can create such a multiple-choice list for whatever purpose they
wish; there is no question of whether an any-of-addr is authentic or
bogus. An any-of-addr can be added to an address book, but should not
be automatically cached and reused, because there is no reason to assume
that an any-of-addr that made sense for one field of one message will be
appropriate in any other context.
The intent here is *not* to create an illusion of a single address with
multiple appearances, but rather to invite a choice among multiple
distinct addresses. An MUA could display the any-of-addr literally, and
the user could simply ignore the scripts that are hard to remember and
remember the one that's easy to remember. The MUA could also provide an
option to auto-hide all but one of the choices (the one estimated to be
the best suited to the user), just as MUAs provide options to hide some
header fields.
When an MUA sends a message to an any-of-addr, it should make the
envelope recipient match whichever single-addr was displayed to the
user; if multiple addresses were displayed, it should use the one that
was first in the displayed list, or let the user choose one.
New-style fields that contain mailboxes and define downgrade conversions
will need to specify how to downgrade an any-of-addr to a single-addr,
perhaps by simply discarding all but the first single-addr in the list.
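Both of these choices reduce to picking one entry from the list. A minimal sketch, assuming an any-of-addr has already been parsed into a list of (display_name, address) pairs and using invented function names:

```python
def envelope_recipient(choices, displayed=None):
    """Pick the envelope address for a message sent to an any-of-addr.

    choices: all single-addrs in the any-of-addr, in header order.
    displayed: the subset the MUA showed to the user, in display order;
    if omitted, all choices are assumed to have been displayed.
    """
    shown = displayed if displayed else choices
    # Use the first displayed address (an MUA could instead ask the user).
    return shown[0][1]


def downgrade_any_of(choices):
    """Downgrade by discarding all but the first single-addr."""
    return choices[0]


if __name__ == "__main__":
    choices = [("Joe1", "joe1@example.org"), ("Joe2", "joe2@example.net")]
    # The MUA hid Joe1 and displayed only Joe2.
    print(envelope_recipient(choices, displayed=[choices[1]]))
    print(downgrade_any_of(choices))
```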
AMC