Re: Whether 8-bit SMTP? And how?
> Date: Tue, 26 Feb 91 23:38:56 +0100
> From: Keld J|rn Simonsen <keld@dkuug.dk>
> Message-Id: <>
> To: ietf-smtp@dimacs.rutgers.edu, moore@cs.utk.edu
> Subject: Re: Whether 8-bit SMTP? And how?
> X-Charset: ASCII
> X-Char-Esc: 29
[ # = Keith Moore (that's me) ]
# I don't want extended-SMTP to worry about character sets. First of all,
# there's no need for this information to be kept in the message envelope unless
# SMTP implementations are going to try to convert between character sets. This
# is a bad idea. Consider that a message frequently passes through several
# SMTPs before reaching its destination. Now consider that SMTP-1 might support
# character sets A and B, SMTP-2 B and C, and SMTP-3 A and C. Now a
# message in character set A sent through this path is converted from character
# set A to B, then from B to C, and finally, perhaps, from C to A (in a
# desperate attempt to restore the original). Each of these conversions is
# likely to result in a loss of information. It's far better if the
# intermediate SMTPs simply *copy bits*, and if any necessary conversions take
# place on the ends (during posting/delivery or by the UAs).
> There are surely ways of doing character set conversions
> which are guaranteed to not lose information. I think your
> arguments are building on false assumptions.
I guess it depends on what you mean by "losing information". One type of
information loss occurs when a character that exists in the original message
as composed is not displayed correctly (as the same symbol) when the message
is read by the recipient. Unless every set you map into is a superset of the
set of characters used in the original message, this type of information loss
is unavoidable.
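
As an illustration of this first kind of loss, here is a minimal sketch in
Python. It assumes a relay that re-encodes with the standard codecs and
substitutes "?" for anything the target set lacks; the function name
relay_convert and the choice of ISO 8859-1 and ISO 8859-2 as sets "A" and "B"
are made up for the example, not taken from any SMTP implementation.

```python
def relay_convert(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode data from charset src to charset dst, as a converting relay might."""
    # Characters that do not exist in dst are replaced with '?'.
    # That substitution is the point where information is lost.
    return data.decode(src).encode(dst, errors="replace")


# "Jorn" with o-slash (U+00F8), which exists in ISO 8859-1 but not in ISO 8859-2.
original = "J\u00f8rn".encode("iso8859-1")

hop1 = relay_convert(original, "iso8859-1", "iso8859-2")  # A -> B
hop2 = relay_convert(hop1, "iso8859-2", "iso8859-1")      # B -> A, "restoring" the original

print(original, hop1, hop2)
```

The last hop maps back into the original character set, but the o-slash was
already replaced on the first hop, so the recipient never sees it.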
Another type of information loss occurs when a message, having been translated
from its original encoding into another encoding (or perhaps having suffered
multiple translations), cannot be restored to its original encoding. You
could prevent this type of information loss by defining reversible mappings
between every pair of 8-bit character sets. But this kind of translation
isn't terribly useful--since it assumes that either all encoding schemes
support the same set of characters, or you are going to translate back into
the original encoding before delivery. (Somehow I don't think this is what
you had in mind.)
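
A rough sketch of what such a reversible mapping could look like, in Python.
Everything here is an assumption for illustration: the helper names, the use
of ISO 8859-1 and ISO 8859-2, and the rule that characters with no counterpart
are parked in whatever byte values happen to be left over.

```python
def build_reversible_table(src: str, dst: str) -> bytes:
    """Build a 256-entry byte permutation from single-byte charset src to dst."""
    forward = {}
    unmatched = []
    for b in range(256):
        try:
            # Same character exists in both sets: map it to its proper code.
            forward[b] = bytes([b]).decode(src).encode(dst)[0]
        except (UnicodeDecodeError, UnicodeEncodeError):
            unmatched.append(b)
    used = set(forward.values())
    free = [b for b in range(256) if b not in used]
    # Characters with no counterpart get an arbitrary unused slot, so the
    # table stays one-to-one even though the displayed glyph will be wrong.
    for b, slot in zip(unmatched, free):
        forward[b] = slot
    return bytes(forward[b] for b in range(256))


def invert_table(table: bytes) -> bytes:
    """Return the inverse permutation of a 256-entry translation table."""
    inverse = bytearray(256)
    for src_byte, dst_byte in enumerate(table):
        inverse[dst_byte] = src_byte
    return bytes(inverse)


original = b"J\xf8rn"  # "Jorn" with o-slash, encoded in ISO 8859-1
to_latin2 = build_reversible_table("iso8859-1", "iso8859-2")

converted = original.translate(to_latin2)
shown = converted.decode("iso8859-2")                    # what a Latin-2 display would show
restored = converted.translate(invert_table(to_latin2))  # back to the original bytes
```

The round trip restores every byte, but the text shown on the ISO 8859-2
side carries a different letter in place of the o-slash, which is why the
reversibility alone does not help the reader.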
The reason you perform a character set translation is to make the message
readable by humans using the display hardware that they have available. SMTP
doesn't have to display mail to humans, and therefore it doesn't need to care
about what character set a message is written in. (Beyond the requirement
that the headers be expressed in something close to ASCII.) Allowing *every*
SMTP transfer to potentially involve a character set translation has no useful
purpose, and adds lots of complexity to the SMTP implementation.
It's true that not everyone will have the hardware or system support to
display a particular character set, and some conversions may be needed to make
a message as readable as possible. However, SMTP is the wrong place to do the
conversion. A message may have to transit several SMTP servers before
reaching its destination. None of these, except possibly the last one, has
any idea of what the "best" character set is for final delivery, and they
shouldn't be making guesses and doing conversions that might lose information.
(Note that conversion between character sets is a subtly different issue than
encapsulation of 8-bit wide messages in a 7-bit encoding. 8-bit SMTP, if
adopted, will have to be able to do *some* kind of conversion of 8-bit
messages to 7-bit, but that doesn't imply that the set of characters used in
the visual representation of that message must be changed by that process.)
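
To make the distinction concrete, here is a small Python sketch of
encapsulation. Base64 is only a stand-in chosen for the example (this thread
does not settle on an encoding), and the function names are hypothetical. The
bytes on the wire become 7-bit, but decoding gives back exactly the same 8-bit
message, in the same character set.

```python
import base64


def encapsulate_7bit(body: bytes) -> bytes:
    """Wrap arbitrary 8-bit data in a 7-bit-safe representation."""
    return base64.encodebytes(body)


def decapsulate_7bit(encoded: bytes) -> bytes:
    """Recover the original 8-bit data, byte for byte."""
    return base64.decodebytes(encoded)


body = b"se\xf1or\r\n"               # 8-bit text; 0xF1 is n-tilde in ISO 8859-1
on_the_wire = encapsulate_7bit(body)
recovered = decapsulate_7bit(on_the_wire)
```

Nothing in this process needs to know which character set the body uses.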
> I think your simple solutions "simply" leads to simple chaos.
> A lot of people would simply not be able to read messages coming
> from other places. We simply lose the current interoperability
> of internet mail, IMHO.
I don't expect that everybody will be able to read messages sent from
anywhere. The biggest limitation here is in display hardware that only
supports a limited set of languages, and I don't expect to see that limitation
lifted anytime soon.
I do think it's likely that people will usually be able to display characters
required by the languages that they use frequently. Perhaps the recipient's
machine can express the necessary characters, but prefers a different encoding
scheme than the sender's machine. Fine. Let's define things in such a way
that, say, Spanish is always expressed in ISO 8859-1 in messages transmitted
with SMTP, and leave it to the posting and delivery agents to perform the
necessary translations from and to the local character sets. EBCDIC machines
that speak SMTP already do this kind of translation to and from ASCII, and it
works pretty well. I would hope, however, that most computers used by
Spanish-speaking computer users would already understand 8859. If not, their
OS and hardware vendors should be encouraged to support it.
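
A minimal sketch of that division of labor, in Python. The function names
post and deliver are invented for the example, and the local character sets
(IBM PC code page 437 for the sender, EBCDIC code page 037 for the recipient)
are assumptions; the only thing taken from the text above is that ISO 8859-1
is what travels over SMTP.

```python
WIRE_CHARSET = "iso8859-1"  # what is carried between SMTPs in this sketch


def post(local_body: bytes, local_charset: str) -> bytes:
    """Posting agent: translate from the sender's local charset to the wire charset."""
    return local_body.decode(local_charset).encode(WIRE_CHARSET)


def deliver(wire_body: bytes, local_charset: str) -> bytes:
    """Delivery agent: translate from the wire charset to the recipient's local charset."""
    return wire_body.decode(WIRE_CHARSET).encode(local_charset)


typed = "se\u00f1or".encode("cp437")   # as stored on the sender's machine
wire = post(typed, "cp437")            # relays in between just copy these bytes
stored = deliver(wire, "cp037")        # as stored on an EBCDIC recipient
```

Both conversions happen at the ends, where the local character set is
actually known. With the default strict error handling, a character the
target set cannot express raises an error at the end that cannot handle it,
rather than being changed silently in transit.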
# * How to convert from 8-bit to 7-bit?
#
# Most of the discussions I've seen on this list with respect to 8-to-7 bit
# conversion have assumed that the conversion should take place with no loss of
# information, usually by having the sender-SMTP encode the entire message as
# 7-bit characters, to be decoded somewhere down the line. Once again, this
# makes SMTP complicated. Instead of having to deal with a single type of
# message, SMTPs would have to keep track of what kind of message is being sent.
# It has to know, for instance, whether a message being sent is already in 7-bit
# format, so it won't try to specially encode an already 8-bit clean message.
# Furthermore, it should probably try to distinguish between a message that is
# 8-bits encoded in 7-bits, and another that is plain 7-bits, so it can perform
# the reverse encoding when receiving a message from a 7-bit-only system. This
# requires that the receiving-SMTP parse the message header of an incoming
# message to determine whether the message is encoded -- it cannot do the
# conversion "on-the-fly" (otherwise, how does a present-day SMTP
# implementation, that knows nothing about message encodings, tell the receiver
# SMTP how this message is encoded).
> Still some false assumptions: This can be handled quite easily.
> You could decode the 7-bit code into some intermediate form
> and then encode it into 7-bit (or 8-bit) again. This could be done
> on the fly.
The problem I was describing is the necessity to parse the message header in
order to figure out what format the message (including the header) is encoded
in. If you expect extended-SMTP to do 8-to-7-bit information-preserving
encoding when talking to a present-day SMTP, you either have to constrain the
headers to be in ASCII, or you have to have a "fake" ASCII header in the 7-bit
encoded form, which is merged back in with the "real" (8-bit) header during
the conversion process back to 8-bits.
> A present-day dumb SMTP implementation could just forward the
> header telling the encoding to the receiver SMTP, this is the default
> for headers.
I don't want SMTP to look at headers. Currently there is no need for it to do
so -- all it has to do is look at the recipient addresses and forward mail to
the right places. The Received: header is normally added by prepending it to
the message, which doesn't involve any parsing at all.
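
For illustration, prepending a trace line can be as simple as the following
Python sketch. The function name, its parameters, and the exact layout of the
Received: line are assumptions for the example; the point is only that the
existing message is treated as an opaque block of bytes.

```python
def prepend_received(message: bytes, from_host: str, by_host: str, date: str) -> bytes:
    """Add a Received: line in front of a message without looking inside it."""
    trace = f"Received: from {from_host} by {by_host}; {date}\r\n"
    # The original header and body are never parsed, only copied.
    return trace.encode("ascii") + message


message = b"From: a@example.org\r\nSubject: hola\r\n\r\nse\xf1or\r\n"
relayed = prepend_received(message, "one.example.org", "two.example.org",
                           "Tue, 26 Feb 91 23:38:56 +0100")
```

The 8-bit body passes through untouched, whatever character set it is in.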
I'm trying to define a way to extend SMTP to 8-bit data paths without major
changes to existing implementations. Requiring SMTP servers to add
header-parsing code strikes me as a major change.
# My proposal is as follows: if a sender-SMTP finds that the receiver-SMTP
# cannot accept 8-bit SMTP, it should (a) zero the 0x80 bit on all bytes sent,
# and (b) add a header something like:
#
# Data-Conversion-Warning: 0x80 bit stripped while sending message from
# smarthost.com to stupid-7-bit-only-host.edu. Some information may be lost.
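
A minimal Python sketch of the proposal quoted above. The function name and
parameters are hypothetical, and placing the warning in front of the message
(as a folded header line) is an assumption; the proposal only says that a
header is added.

```python
def strip_high_bit(message: bytes, from_host: str, to_host: str) -> bytes:
    """Zero the 0x80 bit on every byte and prepend a warning header."""
    warning = (
        "Data-Conversion-Warning: 0x80 bit stripped while sending message from\r\n"
        f" {from_host} to {to_host}. Some information may be lost.\r\n"
    )
    # (a) clear the top bit of each byte; no header parsing is involved
    stripped = bytes(b & 0x7F for b in message)
    # (b) add the warning in front of the message
    return warning.encode("ascii") + stripped


message = b"Subject: hola\r\n\r\nse\xf1or\r\n"
downgraded = strip_high_bit(message, "smarthost.com", "stupid-7-bit-only-host.edu")
```

In this example the n-tilde (0xF1) arrives as "q" (0x71), which is the kind
of loss the warning header is there to announce.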
> Can't we really do something better (sigh).
>
> Keld Simonsen
Well, ideally, everyone would upgrade to 8-bit SMTP and we would never see the
results of 7-bit conversion again. I actually think this might happen pretty
quickly in countries where English is not the language of choice. However,
this is a political question, not a technical one, and thus removed from my
main area of expertise.
Keith Moore