822/SMTP Meeting: Character Set Concerns



In reply to message <>, 
sent by Greg Vaudreuil <gvaudre@nri.reston.va.us>:

I will not be attending the Internet Mail extensions WG meeting
to be held in Copenhagen.  Accordingly, this note addresses
the "two meta-issues" raised in the posting to the lists.

%1) Character Set issues

%   1) Pick one or more than one?
%	There is a compelling reason to designate a "common" or
%	"preferred" character set for Internet use.  Unfortunately no clear
%	candidate has emerged.  How to maximize interoperation is the 
%	focus of this discussion.

  The default character set must be US ASCII in order to be backwards
compatible with existing implementations.  ASCII messages should be
assumed if no Character-Set: or equivalent header exists with a given
message.
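
  As a rough illustration of the defaulting rule above, here is a
minimal Python sketch.  The "Character-Set:" header name is the one
proposed in this note and not an existing standard, and
declared_charset is a hypothetical helper invented for the example.

```python
from email import message_from_string

# Assumption: US ASCII is the fallback when no header is present.
DEFAULT_CHARSET = "US-ASCII"


def declared_charset(raw_message):
    """Return the character set a message declares, or the ASCII default."""
    msg = message_from_string(raw_message)
    value = msg.get("Character-Set")  # hypothetical header name
    if value is None:
        # Existing (pre-extension) mail carries no such header.
        return DEFAULT_CHARSET
    return str(value).strip().upper()
```

  A message from an existing implementation, which carries no such
header, is thus treated as US ASCII without any change on the
sending side.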

  Other character sets should be supported via some mechanism similar
to a "Character-Set:" header in text mail messages that conform to the
extensions defined by the RFCs.  Predefined standard values for that
header should include all of the ISO 8859 family of 8-bit standards
(not just "ISO Latin 1" a.k.a. ISO-8859-1 ).  

  The RFC should clearly indicate that encoding using an ISO 8859
standard is preferable to one of the ISO 646 encodings (which are
rapidly fading from use anyway).  The ISO 646 encodings should not be
in the set of "formally supported" character sets, for two reasons:
they are superseded by the ISO 8859 encodings, and they are not
backwards compatible with US ASCII (the currently specified and
implemented mail character set).  ISO 8859 encodings are backwards
compatible with US ASCII.  (Also see comments below on local conventions.)
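
  The compatibility difference can be checked mechanically.  The
sketch below is only illustrative: it relies on Python's built-in
ISO 8859 codecs, and because Python ships no ISO 646 national-variant
codec, ISO646_DK_OVERRIDES is a hand-written, partial table (an
assumption covering just six reassigned positions of the
Danish/Norwegian variant).

```python
ASCII_BYTES = bytes(range(128))


def is_ascii_superset(codec):
    """True if every 7-bit octet decodes exactly as it does in US ASCII."""
    try:
        return ASCII_BYTES.decode(codec) == ASCII_BYTES.decode("ascii")
    except (UnicodeDecodeError, LookupError):
        return False


# Partial, illustrative table: six positions that the Danish/Norwegian
# ISO 646 variant reassigns to national letters.
ISO646_DK_OVERRIDES = {
    0x5B: "\u00c6", 0x5C: "\u00d8", 0x5D: "\u00c5",  # [ \ ]  ->  AE, O-slash, A-ring
    0x7B: "\u00e6", 0x7C: "\u00f8", 0x7D: "\u00e5",  # { | }  ->  ae, o-slash, a-ring
}


def decode_iso646_dk(data):
    """Decode 7-bit octets using the partial override table above."""
    if any(b > 0x7F for b in data):
        raise ValueError("ISO 646 is a 7-bit code")
    return "".join(ISO646_DK_OVERRIDES.get(b, chr(b)) for b in data)
```

  Under these assumptions the octets of "a[i] = {0}" read back
unchanged through any ISO 8859 codec, but come out with national
letters in place of the brackets and braces under the ISO 646 variant.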

  If Asian language support is required, and if the enhanced 8-bit
mechanism would also permit passing a 32-bit character as 4 octets
without mangling, a predefined value for the ISO-10646 standard (once
ISO finalises and approves it -- it isn't finished yet) should also be
present.  Long-term interoperability will thus be enhanced.  The
Japanese standards for Kanji are incompatible with the Chinese
standards for characters, even though the glyphs are identical.  This
leads to the conclusion that using ISO 10646 encoding will be more
portable and will avoid any Chinese/Japanese politics.  It appears that
a 1-1 mapping will be possible between the ISO DIS and the
pre-existing Chinese and Japanese standard encodings.  Conversion is
thus straightforward if it should be locally required.
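
  To make "4 octets without mangling" concrete, here is a small
Python sketch.  It is an assumption-laden stand-in: it uses Python's
"utf-32-be" codec, whose code points are not necessarily those of the
draft ISO DIS 10646 discussed here, and strip_high_bit merely
simulates a transport that is not 8-bit clean.

```python
def to_four_octets(text):
    """Encode each character as one big-endian 32-bit value (4 octets)."""
    return text.encode("utf-32-be")


def from_four_octets(data):
    """Decode a sequence of 4-octet characters back into text."""
    if len(data) % 4 != 0:
        raise ValueError("length is not a multiple of 4 octets")
    return data.decode("utf-32-be")


def strip_high_bit(data):
    """Simulate a 7-bit transport that clears the top bit of every octet."""
    return bytes(b & 0x7F for b in data)
```

  A character whose 4-octet form contains any octet above 0x7F
survives an 8-bit-clean path intact, but decodes as a different
character once a 7-bit path has cleared the high bits.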

  It should be permitted but explicitly not required for sets of
cooperating sites to support other mutually agreed upon character set
encodings using some sort of X-some-other-definition values in the
Character-Set: header field.  
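
  One way a receiver might tell predefined values from privately
agreed ones is sketched below in Python.  The contents of STANDARD
(US ASCII plus ISO 8859 parts 1 through 9) and the helper name
classify_charset are assumptions made for the example, not values
fixed by any RFC.

```python
# Assumption: the predefined values are US ASCII and ISO 8859 parts 1-9.
STANDARD = {"US-ASCII"} | {"ISO-8859-%d" % n for n in range(1, 10)}


def classify_charset(value):
    """Classify a Character-Set: value as standard, private, or unknown."""
    name = value.strip().upper()
    if name in STANDARD:
        return "standard"
    if name.startswith("X-"):
        # Mutually agreed between cooperating sites; others may ignore it.
        return "private"
    return "unknown"
```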

  I am unenthusiastic about the ISO 2022 approach or using UNICODE.
Despite claims to the contrary, neither ISO DIS 10646 nor UNICODE is
sufficiently complete and unambiguous in encoding all conceivable
characters.  I have reviewed both documents, am uninvolved with
either the Unicode folks or the ISO DIS 10646 folks, and have talked
to other potential users; my impression is that the ISO DIS is closer
to being sufficiently complete and unambiguous than UNICODE.

%   2) If multiple character sets, define a mechanism for profiling
%      particular sets for particular communities.
%	The IETF traditionally has written protocols which do not
%	require information external to the specification to
%	interoperate.  If an external mechanism is needed, it needs to
%	be well defined.

  The enhanced SMTP mechanism should not need to know what the
character set encoding is as long as it is an 8-bit set and the
encoding of the CR LF pair is consistent with US ASCII.  The
character set standards themselves can be incorporated by reference
to the ISO documents or, if need be, restated as appendices.
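
  The point that the transport need not understand the character set
can be sketched in a few lines of Python.  The helper split_lines is
hypothetical; it works purely on octets and never decodes them, which
is all the property requires when CR LF keeps its US ASCII encoding.

```python
CRLF = b"\r\n"  # same two octets in US ASCII and every ISO 8859 part


def split_lines(payload):
    """Split a message body into lines without knowing its character set."""
    return payload.split(CRLF)


def join_lines(lines):
    """Reassemble the body; the octets of each line are left untouched."""
    return CRLF.join(lines)
```

  The same two functions handle a body in ISO 8859-1 and one in
ISO 8859-7 identically; only the end systems need to know which
character set to use when displaying the lines.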

  I do not see the need for an "enclave profiling mechanism" because
I don't think that sites outside an enclave should have to know about
local customs.  To do otherwise imposes severe implicit hardware and
software requirements on any site that wants to conform to the
enhanced RFCs.  (Also see further discussion below.)

%2) Enclave Issues
%
%   1) Transport conversion,  Character set/ information conversion.
%	Two types of Email conversion have been discussed in one or
%	more of the discussion lists, including transport encodings
%	between mixed transport environments, and character sets
%	between groups of users who have chosen different "local"
%	character sets.  The need for such conversions, and the realm
%	for which they should be used needs to be discussed, and if
%	necessary, an effort to engineer solutions needs to begin.

  Conversions amongst locally defined character sets and conversions
between RFC-supported character sets and a local character set should
be a locally solved issue (just as it is de facto now with EBCDIC
sites).  For example, if the Danes wish to use their ISO 646 variant
amongst themselves in lieu of ISO 8859-1, it should be permitted but
other folks shouldn't be required to convert to the local ISO 646
variant before sending mail via SMTP.  Any required conversions within
the set of formally-RFC-supported character sets should be defined in
the RFC.  It isn't clear to me that the RFC should require ANY
conversions.

  For example, it seems legitimate that a system only supporting
ISO-8859-1 and US ASCII might not be able to display a message in the
Arabic version of ISO 8859.  The alternative is to implicitly or
explicitly require that we all have multilingual terminals, and such a
hardware requirement is neither enforceable nor reasonable.  The key
thing is that if two users both have systems supporting ISO-8859-N,
they should be able to communicate effectively provided that the
intermediate nodes comply with the enhanced RFC requirements.
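
  The receiving side of that arrangement might look like the
following Python sketch.  The helper name render, the supported set,
and the wording of the fallback notice are all assumptions for
illustration; the codec names are Python's own.

```python
# Assumption: this system can display only US ASCII and ISO 8859-1.
SUPPORTED = {"us-ascii", "iso-8859-1"}


def render(body, charset, supported=SUPPORTED):
    """Decode a body for display, or say plainly that it cannot be shown."""
    name = charset.strip().lower()
    if name not in supported:
        # No conversion is attempted; displaying it is a local matter.
        return "[message in unsupported character set: %s]" % charset
    return body.decode(name)
```

  Two systems that both list ISO-8859-N in their supported sets can
exchange text in it; a system that does not simply reports that it
cannot display the message.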

%   2) How should enclaves be defined?  How can borders be enforced?
%	If enclaves of users are to be formalized, where an enclave
%	may share a "profile", the boundaries and identification
%	mechanism needs to be defined, whether this includes static
%	configuration, negotiation, or DNS lookup.

  I don't think that enclaves should be defined by the RFC.  If sets
of cooperating sites wish to use some other mutually agreed upon
encoding, I think that should be kept a local matter.  I
should not have to know that users in some locale encode things in
ISO 646-X rather than in US ASCII.  If I wish to communicate with a
user in some 8-bit or other character set, then that should be a concern
between the other user and myself.

  Throughout this note, I've tried to keep in mind the realities of
the situation and remain practical in what is/isn't required.  The
chief thing to recall is that there are still sites that use fixed
host tables rather than the DNS, so excessive requirements that all
hosts become capable of handling/displaying all character encoding
schemes are doomed to failure.  Designating preferred character
encodings (such as the ISO 8859 family) helps encourage folks to move
towards the same goal, but doesn't require them to get there
immediately.

  I would welcome constructive responses to the list, as I won't be
at the meeting.

Randall Atkinson
randall@Virginia.EDU