
DIS 10646, embedding, universal character sets, etc



Dan Oscarsson <Dan.Oscarsson@dna.lth.se> writes...

>Todays ASCII only format is to limited for international use. What we need
>is a character set that contans all character in the world.

  I don't think there is such a character set.  The current draft ISO
[DIS] 10646 is certainly not it.  Not only does it omit a number of
important languages, but it omits characters that are "obsolete" or
"unused" in existing languages.  While those omissions may be
appropriate for many "data communications" purposes, people whose
interests are literature, or linguistics, or issues in translation or
interpretation of texts are as legitimate users of electronic mail to
discuss their ideas with their colleagues as are those of us who use
electronic mail to discuss electronic mail.  Similar comments arise
with the forms of ideographic languages in which classical texts, or
very literary texts, are written--these involve more characters, and
different characters, than the modern "communications" or "business"
subsets of those languages.  One need only observe that there are
"unabridged" dictionaries of English and, I assume, Swedish and most
other alphabetic languages, and they tend to be LOTS larger than the
pocket dictionary that lists the words every secondary school graduate
is expected to know.
  In addition, 10646 is not a Standard.  We know that there are going
to be negative votes, citing a range of technical justifications.  I
believe that the "extension level 2 / dynamic compaction mode" may be
objected to in some of those negative votes.  It would be unwise of the
Internet community at this stage, as a purely pragmatic matter, to
assume that this feature will survive in its present form.  So, if
10646, especially with compaction mode, is concluded to be The Right
Thing, I think this working group should suspend itself until it
becomes finally clear there is a 10646 and how compaction mode is
handled.
  Now I want to stress that I don't think Unicode, or any other system,
is any better.  There are a lot of languages in the world, past and
present, and a lot of characters and character variations in those
languages (more if the "no Han unification" principle of DIS 10646 is
carried to its logical extreme, but this is a small quantitative
difference, not a significant qualitative one).  The problem is,
ultimately, that a "single character set" that is relatively static
enough to be used in the way that is anticipated for email and that
contains "all the characters in the world" is simply not a realizable
goal.
  IMHO, that implies that no matter what we do, no matter what
"universal" or "not so universal" character set is chosen, some users,
some of the time, will need to make local arrangements or conventions
to communicate, with precision, with their correspondents and
colleagues.  The question, then, is where to draw the line between the
characters that are represented, processed, and accepted "easily", that
every (or nearly every) system is expected to support, and those that
require some special (and probably unpleasant) representational
mechanism. 
  For the last 20 years or so, we have drawn that line at "ASCII".
That is clearly not satisfactory any more.  As we move toward other
things, we need to balance some kind of utility function against some
kind of loss function--involving what characters get lost.  The loss
function, however, has another element as we move toward multiple or
"universal" character sets, which is that we should assume that not all
sites will be able to handle all of them in a reasonable fashion.  We
could invent "accurate and precise, but undisplayable and
uninterpretable here", with a different subset of characters falling
under this rubric at different sites.  A new version of the Tower of
Babel, at the character, rather than the word, level.
   That argument, which I hope is clear (but suspect it may not be)
applies to the character sets themselves, whether they are encoded in 7
bits or 8, in lines or counted messages, in something "RFC822-like" or,
as Dan puts it, in "something totally new that is a lot of work to
interchange..." 

So, what to do?

Here is a conservative, and easily implementable, suggestion that does not
require going beyond what is now standardized in terms of character
sets and does not require embedded escape sequences in most cases.  I
dread trying to move embedded escape sequences across gateways into
systems that use variations of the RFC822 format, such as different
base character sets.  I think we should avoid that if we can.

Part of my presumption here is that "SMTP" and "mail format" are not
quite as obviously separate as Dan (and several others) have assumed.
There may well be ways to rearrange what follows so as to separate them
again (I suggest one below as a strawman, knowing that it is
unacceptable) that would be within the spirit of this general proposal.

(1) Make TWO new envelope verbs (or some combination of separate
arguments to one verb that would have the same effect).  The terms used
are intended to be suggestive of meaning, not proposals as to what
should be actually used.

  (i) ENCODING keyword
  This specifies the representation of the bits of the characters
themselves.  I would expect that the modal value would be "eight",
i.e., an eight bit character gets transmitted as eight bits.  The other
obvious value is some indication of how eight bit characters are
encoded over a seven bit data path.  I suggest that the options for
doing that can be explored separately.  If I correctly understand Mark
Crispin's comments about what [even] TOPS-20 might be able to do between
MTAs (as distinct from within the system) we might not even need to
support a seven-bit encoding.  Advocates of character sets involving
more than one octet per character should consider whether values like
"sixteen" would be useful--IMHO, these character sets are lots easier
to deal with in their fixed-length forms.  I suggest that the right
thing to do is to write one or two keywords into the "standard" RFC,
and provide a registration procedure for others using, perhaps, the 
Telnet option list as a model.

Note that "ENCODING", and not "CHARSET", below, is related to Dan's
"ISOC". 

  (ii) CHARSET keyword
  Here, we specify character sets.  "keyword" could be chosen from the
finally-approved members of the ISO8859 family with provision for
addition and registration of whatever else came along, was adequately
standardized and defined, and seemed appropriate.  I would hope that we
would and could write a rule that specified that everyone who supported
extended SMTP at all was expected to support ISO8859-1, not because
Latin-1 is a good universal choice, but because we should push "the
character set that *everyone* is expected to understand" up from ASCII,
and Latin-1 seems to be the right choice.  That reflects a clear North
American/ Western European bias, which is unfortunate.  But Dan's
suggestion (and that of others) to use eight bit minimized DIS10646,
restricted latin subset reflects, in practice, exactly the same bias,
as does 10646 itself.

Now, the receiving SMTP can either accept these verbs and their
arguments and return a 2xx code for each, or it can reject them, as an
error.   The existing SMTP model is retained, in which the sender can
then send some other verb and argument, in the hope that it will be
accepted, or can reset or close the connection.  This model is a bit
less elegant than having the sender ask the receiver what it can
accept, but preserves the concept --important for other reasons-- of
being able to "batch" or "stream" SMTP.  A true inquiry/response
dialogue of the "tell me what you can accept", "I can accept A, B, and
D", "ok, I'll use B" flavor prohibits, I think, streaming behavior.  We
should not discard that lightly.
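
To make the accept/reject model concrete, here is a minimal sketch in
Python.  Everything in it is an assumption for illustration: ENCODING
and CHARSET are only the placeholder verb names used above, the
ToyReceiver class and offer() helper are invented, and the choice of
250 and 504 as the reply codes is mine (both are existing RFC821 reply
codes, but nothing has been decided about which ones would apply).

```python
# Hypothetical sketch: ENCODING and CHARSET are placeholder verb names
# from this proposal, not real SMTP commands.

class ToyReceiver:
    """Stands in for a receiving SMTP that knows a fixed set of values."""

    def __init__(self, encodings, charsets):
        self.supported = {"ENCODING": set(encodings), "CHARSET": set(charsets)}

    def command(self, verb, argument):
        # 250 = accepted; 504 = command parameter not implemented.
        if argument in self.supported.get(verb, set()):
            return 250
        return 504


def offer(receiver, verb, preferences):
    """Send the verb with each value in preference order.

    Returns the first value that draws a 2xx reply, or None if every
    value is rejected (the sender would then reset or close).
    """
    for value in preferences:
        reply = receiver.command(verb, value)
        if 200 <= reply < 300:
            return value
    return None


receiver = ToyReceiver(encodings=["eight"], charsets=["ISO8859-1"])
chosen_encoding = offer(receiver, "ENCODING", ["sixteen", "eight"])
chosen_charset = offer(receiver, "CHARSET", ["ISO8859-5", "ISO8859-1"])
print(chosen_encoding, chosen_charset)
```

The sender never asks what the receiver supports; it just proposes
and reads reply codes, which is the property that keeps the exchange
compatible with the existing SMTP model.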

But character set implies a little bit more than a mapping of graphics
onto code positions, since some of the "keywords" could be associated
with "character sets", or character protocols, that imply some
switching around and remapping by control sequences.  If it is
approved, "10646-with-minimization" is one of these ("10646-8-bit-
Latin-subset" is not).  And, while I would hope to never see it because
I think it is the architectural design plan for the Tower of Babel, one
could, in principle, specify "ISO2022 and any registered character set"
here, implying arbitrary switching back and forth.  That, incidentally,
could yield a good approximation to a "universal" set, since one can
register almost anything as long as certain conventions are followed.

I can find nothing in RFC822 or 821 now that prevents a receiving MTA
from filtering out "unrecognized" control characters from the data
stream.  Some do today; it is a popular way to avoid the effects of
little children mailing files that put remote systems or terminals into
undesirable states.   By having the character set specification in the
envelope an MTA also retains the ability to determine what can, and
cannot, be filtered.  It seems to me that this is desirable.
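
As a rough illustration of why the envelope declaration helps, the
sketch below (Python) picks the set of octets an MTA may drop based on
the declared character set.  The table contents, the keyword spellings,
and the strip_controls() name are all assumptions made up for this
example; a real policy would need much more care.

```python
# Hypothetical policy table: which octets an MTA may drop for each
# declared character set.  Names and choices are illustrative only.
KEEP_ALWAYS = {0x09, 0x0A, 0x0D}            # TAB, LF, CR
C0 = (set(range(0x00, 0x20)) | {0x7F}) - KEEP_ALWAYS
C1 = set(range(0x80, 0xA0))                 # 8-bit control range
ISO2022_SHIFTS = {0x0E, 0x0F, 0x1B}         # SO, SI, ESC

DROPPABLE = {
    "ASCII": C0,
    "ISO8859-1": C0 | C1,
    # With code switching declared, the shift controls carry meaning
    # and must survive, so they are removed from the droppable set.
    "ISO2022": C0 - ISO2022_SHIFTS,
}


def strip_controls(data: bytes, charset: str) -> bytes:
    """Remove the octets that are droppable for the declared charset."""
    drop = DROPPABLE[charset]
    return bytes(b for b in data if b not in drop)


latin1 = strip_controls(b"caf\xe9\x1b[2J\r\n", "ISO8859-1")
switched = strip_controls(b"\x1b$Bab\x07", "ISO2022")
```

In the first call the ESC that would clear a terminal is removed while
the Latin-1 letter survives; in the second the ESC is kept because the
envelope said switching sequences were expected.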

I think the thing that is controversial here is the idea of the
character set specification in the envelope.  Why is it important?

Dan says...
>Every header line starts in initial state (because the[y] can change
> order, be added or removed).
  But I'm arguing against "state switching" (i.e., embedded control
sequences) in the headers (or anywhere else for that matter) if it can
be avoided.  I therefore want the intent to do that declared, at
envelope time, so that the receivers who won't handle it (I predict
"lots") have an opportunity to demand something else.
   And it means that the header -- particularly personal name phrases,
comments, subject lines, etc. -- can use the "extended" character set
without having to do state-switching or embedded controls.  Seems to me
that is a big advantage.

Strawman:
  The easiest ways out of this problem of needing the character
specification in the envelope in order to use extended characters in
the header without state-switching are:
  (i) Prohibit anything but ASCII in the headers, so the current RFC822
rules about header character sets are unchanged.
  (ii) Permit the character set specification in the header, but force
an order on it, so that it must precede anything but Received (and, on
delivery, Return-Path) lines.
  I know the first is unacceptable; I think the second is too.  And I
think I've given some positive arguments for character set
specification in the envelope in addition to this "no other place to
put it" one.

Now the rest of the header and the message:
  -- I believe that embedded *anything* complicates the process of
sending clear-text, extended character set, messages unnecessarily.  I
note with interest that the new WG concerned with ODA documents has
given itself many more months than the schedule on the SMTP extensions
WG to deal with ODA-over-SMTP.  And ODA-over-SMTP is, presumably, lots
easier than "embedding most anything over SMTP".  In the former case,
they know, with relatively firm Standards in place, the properties of
what they are trying to embed.
   I suggest that we should postpone this.  
   However, note that a plausible future extension to CHARSET, if that
were in the envelope, would be a construction like
     CHARSET keyword,BINARY
where "keyword" would be a character set, as above, and "binary" (or
some specific variation on that theme) would indicate that the message
body consisted of "binary data" rather than "text message".  I would
presume, in this case, that the nature of the
"data" --e.g., the embedding system and what was embedded-- could be
specified in [new] header fields.   I would also assume that this would
not relax the 1000 character, CRLF line restriction, but I'd like to
hear from the ODA WG before thinking very much more about this.
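
Purely as a sketch of how a receiver might pick such an argument apart
(Python; the function name and the exact "keyword,BINARY" syntax are
assumptions, since none of this has been specified):

```python
def parse_charset_argument(argument: str):
    """Split a hypothetical 'keyword[,BINARY]' argument.

    Returns (charset_keyword, is_binary).  The comma-separated form is
    only the strawman syntax from the text, not a defined grammar.
    """
    name, _, flag = argument.partition(",")
    return name.strip(), flag.strip().upper() == "BINARY"


text_case = parse_charset_argument("ISO8859-1")
binary_case = parse_charset_argument("ISO8859-1,BINARY")
```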

Risto Kankkunen then said, in response to Dan's message...
>> Format:
>> <APC> format text CRLF
>> encoded object lines
>> <ST>
>
>Hmm. I like the line-counting or key-word methods more than this.

   Now, as indicated above, I would prefer to avoid dealing with this
entirely right now.  I think we can succeed in the "international
character set" objective without touching it.  If we *do* try to deal
with it, I fear that we may fail entirely.
  However, I agree with Risto if the issue must be faced.  I think
that, of all the things that can be placed in mail to denote the
boundaries and type of something embedded, control sequences are the
worst.  They are too subject to catastrophic error as a result of
single-character errors and too subject to gateway and translation
mangling, interception by systems that take them as local instructions
rather than instructions for a remote SMTP server or UA, and so forth.
Knowing the problems we get into with "plain ASCII" -- which is much
less vulnerable -- I despair of making anything with embedded controls
robust.   I'm not suggesting that it cannot be defined perfectly well,
only that I'd rather see a lot of implementations that work, even if
slightly less elegantly, than have us get our exercise pointing at
implementations and shouting "broken".

>> SMTP:
>...
>  What happens, if it cannot,
>doesn't need to be written down to the protocol.
>...
>What the mailer does in these situations isn't any concern of the SMTP
>protocol.
  I disagree with the basic philosophy here, independent of Risto's
particular suggestions.  I think that, if there is going to be any hope
of large-scale interoperability between systems with varying "extended"
capabilities, the expected minimum and expected fallback behaviors must
be specified.  If that specification of expectations and conventional
behavior does not go into "the protocol", we will just end up having to
clean up the resulting mess in a subsequent "host requirements"
document. 

>Because handling binary requires the ability to
>handle arbitrary long lines (=distance between CRLF's can be whatever), 
   This is actually not true.  Just as one can have an escape
convention for a leading ".", one could, if "binary" were otherwise
supported, adopt the convention that a *real* CR or LF or both would be
encoded somehow, and that a "mail reader" for "binary" would simply
strip all CRLF sequences that appeared in the message as transport
artifacts, then decode those octets that happened to have the same bit
patterns as the CR and LF characters.
   I'm not suggesting this.  Indeed, what I'd suggest is deferring the
whole binary discussion and focusing on unstructured, clear text,
character messages.  I think that is also what Greg has twice asked the
WG and discussion to focus itself on.
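
   For concreteness only, the CR/LF escaping convention described
above might be sketched as follows (Python).  The escape octet "=",
the two-hex-digit form, and the 76-octet line width are arbitrary
assumptions chosen for the example, and the leading "." case is left
to the existing SMTP mechanism.

```python
ESC = 0x3D                      # '=' chosen arbitrarily as the escape octet
SPECIAL = {0x0D, 0x0A, ESC}     # real CR, real LF, and the escape itself


def encode(payload: bytes, width: int = 76) -> bytes:
    """Escape real CR/LF octets, then break into short CRLF-ended lines."""
    out = bytearray()
    for b in payload:
        if b in SPECIAL:
            out += b"=%02X" % b          # e.g. CR becomes b"=0D"
        else:
            out.append(b)
    lines = [bytes(out[i:i + width]) for i in range(0, len(out), width)]
    return b"\r\n".join(lines) + b"\r\n"


def decode(wire: bytes) -> bytes:
    """Strip every CR and LF as a transport artifact, then undo escapes."""
    stripped = wire.replace(b"\r", b"").replace(b"\n", b"")
    out = bytearray()
    i = 0
    while i < len(stripped):
        if stripped[i] == ESC:
            out.append(int(stripped[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(stripped[i])
            i += 1
    return bytes(out)


payload = bytes(range(256)) * 3
wire = encode(payload)
```

Because no unescaped CR or LF ever appears in the encoded octets, the
decoder can discard all of them without looking at line boundaries,
and no line on the wire exceeds the chosen width.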

   I have one additional problem with [my perception of the trend of]
some of the recent suggestions.  One of the more interesting ISO
contributions to the world is a concept of standardization without
interoperability.  This creative activity involves writing Standards
with so many options that
    (i) no one can really understand all of them
    (ii) two implementations that conform to the same standards can,
by different choices of options, manage to not interoperate at all.  Of
course, if they can interoperate, they can do all sorts of interesting
things.  Maybe.
  Having created this problem, ISO has produced a solution, which is to
impose another layer, the so-called "standard profiles", which specify
combinations of things that can actually be expected to work together
and, if two installations adopt the same profile, they might actually
be able to interoperate.

  I would hope that we can avoid a situation with SMTP extensions in
which there are so many options that we must follow in ISO's footsteps
in that regard and invent profiles.  I get a little concerned about the
possibility when I see, e.g., a suggestion that we use ISO (sic) 10646
with the 8 bit, Latin-1, minimization options "initially".
   --john
   Klensin@INFOODS.MIT.EDU
-------