Unknown character sets
Hello SMTP-WG --
Some comments on being a gateway between some form of
Just-Send-8 and ESMTP/8bitMIMEtransport, following the thread
about the Transition document (Vaudreuil, Greg: Transition of
Internet Mail from Just-Send-8 to MIME/8bit ESMTP, Internet
Draft, Version 1, 18 November 1992.)
I agree, strongly, with Greg's original suggestion.
There are these cases where the receiver gateway gets a message
and has to convert to MIME:
1) Receiver knows nothing, except 7-bit
If the sender is in the Internet (whatever that
might mean) it is forced by RFC-1341 to regard this as
"Content-type: text/plain; Charset=US-ASCII",
and it specifically notes that it might be
some horrible uuencoded thing: we are still
going to regard it as text -- because it
*is* text to us (albeit funny text describing
some binary file).
If the sender isn't in the Internet, then
the gateway is also a not-necessarily-822 to
MIME gateway, in which case it might want
(according to Greg's suggestion) to send
"Content-type: text/plain; Charset=unknown".
This seems fine by me; see stuff below about
the MUA.
2) Receiver knows nothing, except 8-bit
If there was a distinguished 8-bit character
set (in the way that US-ASCII is a distinguished
7-bit set) we would choose that. I would be
perfectly happy to call 8859-1 the distinguished
8-bit set, but I accept that many might not be.
If it's not possible to do this, we have to
make up a character set name, "unknown".
Alternatively, a within-Internet receiver could bounce this
as non-conforming (either to RFC-821, if the
sender sent 8-bit characters without EHLO; or
to 8bitMIMEtransport if it sent 8-bit characters
without MIME tags).
3) Sender is known to use a particular character set
Such as if the gateway had a table of hosts and
their character sets:
mycompany.de: x-iso-646-german
mycompany.uk: x-iso-646-unitedkingdom
These names (specifically disparaged in RFC1341) would
be used internally to the gateway to translate to what
it found was best in its relaying. Preferably it
would translate these to ISO-8859-X. The base case
of this is private zone agreement, which is like
this in its most horrible instantiation:
*: x-ibm-extended-ascii
4) Sender and Receiver have private agreement
Such as if they have privately extended SMTP
to add a verb, such as ECMA, which indicates that
the sender is going to use ECMA-94, which can then
be transformed. (This would be like x-ecma-94 in
the previous scheme, except per SMTP connection.)
5) Sender and Receiver have private X-CHARSET header
Privately, they agreed on some header extension:
X-CHARSET: ECMA-94
Easy: just do the translation to MIME.
6) Receiver is 8-bit clean, sender is local
Perhaps the gate can look up the sending user's
locale or similar information.
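To make the cases above concrete, here is a rough sketch of how a gateway might pick the charset label. It is only an illustration under stated assumptions: the function names, the `sender_in_internet` flag, and the order in which the cases are tried are all made up for this example, and the table entries are simply the ones from case 3. Cases 4 and 6 (a private SMTP verb, a local locale lookup) are not modelled.

```python
from typing import Mapping, Optional

# Hypothetical per-host table for case 3, using the example entries above.
HOST_CHARSETS = {
    "mycompany.de": "x-iso-646-german",
    "mycompany.uk": "x-iso-646-unitedkingdom",
}


def is_seven_bit(body: bytes) -> bool:
    """True if no octet in the body has the high bit set."""
    return all(octet < 0x80 for octet in body)


def choose_charset(
    body: bytes,
    sender_host: Optional[str] = None,
    x_charset: Optional[str] = None,
    sender_in_internet: bool = True,
    host_charsets: Mapping[str, str] = HOST_CHARSETS,
) -> str:
    """Pick a charset label for an untagged message (illustrative only)."""
    # Case 5: a privately agreed X-CHARSET header says what the text is.
    if x_charset:
        return x_charset.strip()
    # Case 3: the gateway has a table of hosts and their character sets.
    if sender_host is not None and sender_host.lower() in host_charsets:
        return host_charsets[sender_host.lower()]
    # Case 3, base case: a "*" entry stands for a private zone agreement.
    if "*" in host_charsets:
        return host_charsets["*"]
    # Case 1: 7-bit text from within the Internet is read as US-ASCII.
    if is_seven_bit(body) and sender_in_internet:
        return "US-ASCII"
    # Case 1 (non-Internet sender) and case 2: nothing more is known.
    return "unknown"


def content_type_header(charset: str) -> str:
    """Format the header line in the style used in this message."""
    return f"Content-type: text/plain; Charset={charset}"
```

For instance, `content_type_header(choose_charset(b"hello"))` gives the US-ASCII header from case 1, and a body containing 8-bit octets from an unlisted host falls through to "unknown".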
The suggestion that we use "x-unknown" doesn't make sense to
me. The "x-*" names are those we do not have to agree on and
which we are not obliged to implement, send, receive,
recognise; nothing. The only "Charset=X*" which would make
sense are those like "Charset=X-EBCDIC", which are explicitly
discouraged by RFC-1341, and "Charset=X-My-New-Experimental".
The only other thing that might make sense is saying
Charset=unknown-8bit
to indicate that it's liable to break various things. Indeed,
I'd probably like the idea of defining "unknown", "unknown-8bit"
and "unknown-7bit". I wish that RFC-1341 didn't require the
interpretation "Charset=US-ASCII" for untagged-messages, and
one could instead interpret them as "Charset=unknown" and the
receiving MUA could apply rules or heuristics as it thought
best. That way you get an end-to-end agreement about the
non-MIME-tagged character set, just as you do about the
encoding.
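As a small illustration of the "unknown-7bit" / "unknown-8bit" idea: the only thing a relay can tell from the octets alone is whether any of them has the high bit set. The sketch below does just that. The function name is hypothetical, and the labels are the proposed ones from this message, not registered charset names.

```python
def unknown_charset_label(body: bytes) -> str:
    """Label untagged text by octet range only (illustrative sketch)."""
    # Any octet >= 0x80 means the text is liable to break 7-bit paths.
    if any(octet >= 0x80 for octet in body):
        return "unknown-8bit"
    return "unknown-7bit"
```

The receiving MUA would still have to apply its own rules or heuristics to decide what the text really is; the label only warns it about what kind of octets to expect.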
We should not use "Content-type: application/unknown" under any
circumstances, because this incoming message is masquerading as
a text message (RFC-822: Standard for the format of ARPA
Internet *text* Messages); if it isn't really a text message,
then there is an *end-to-end* agreement about what it is. Any
intervening relays don't know anything about it.
Olle and Peter say:
> We consider it dangerous that the gateway assume _anything_ about the
> contents of a non-MIME message. Shouldn't we promote that the
> Content-Type header always can be trusted? A "text/plain" message
> containing a Binhexed Mac program (with a sentence before it or a
> signature containing some eightbit octet) isn't what we would like to
> see, is it?
I think that the gateway might have more knowledge than what's
in the message; as I illustrated above, it might have an
out-of-band agreement with communicating sites which haven't
yet upgraded, or which form part of a gateway out of the
Internet.
Regarding Binhex: RFC-1341 specifically notes that untagged
messages might well have horrible encoded insides. The 8-bit
case should be the same as the 7-bit one.
Neil says:
> I admit to being arguing with Greg to put this in, but I find
> Olle and Peter's argument convincing. I was trying to avoid
> having separate types when the charset was known versus unknown.
> But having something like:
>
> MIME-Version: 1
> Content-Type: application/unknown; Charset = "<Character set>"
> Content-Description: Untagged text converted to MIME.
>
> would be fine to me also, seeing how we really don't know the
> content type even if we *do* know the character set.
I don't like this one at all. We should assume that the thing
is text unless we're told otherwise; for two reasons. A) the
vast majority of messages *are* text; we should not penalise
the possibly-naive text senders at the expense of
probably-computer-sophisticated non-text senders.
Best regards to all,
Jonathan.
---------------------
Jonathan Laventhol
Systems Administrator
D. E. Shaw & Co.
120 West 45th Street
New York, NY 10036
<jcl@deshaw.com>
---------------------