From: Russ Allbery (rra@stanford.edu)
Date: Wed Jul 31 2002 - 13:51:01 CDT
Bernie Cosell <bernie@fantasyfarm.com> writes:
> *SURELY* needs to embrace better-than-7-bit headers, beyond the obvious
> things like being able to properly put in their 'organization' and other
> such [including, as per the current discussion, 'newsgroups:'] but more
> broadly, for example, for email addresses [surely folks would like to
> be able to have things like "andré@myhost.in.fr" for their email
> addresses???]
Here's the thing to understand about the e-mail viewpoint here: They've
already made the conceptual jump in realizing that the bits that are sent
over the wire need not have a one-to-one correspondence with what people
see. Even the character set is, in essence, an encoding of a general
notion of "character."
That, combined with the number of legacy systems that the e-mail people
feel constrained to support, has led to a stance that basically says that
when you send anything other than pure ASCII over e-mail, you encode it at
the sender's end and decode it at the recipient's end.
E-mail by the standards *does* handle character sets other than 7-bit
ASCII, in a repeatable, standardized, and fully specified fashion. It
does so in a way that ensures that any time you encounter non-ASCII text
in an e-mail message, it comes with explicit character set information so
that you always know what character set that text is in.
It does this by using a wire transfer encoding that's 7-bit ASCII and that
e-mail user agents need to encode to and decode from.
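As a rough illustration of that encode-at-the-sender, decode-at-the-recipient model, here is a minimal sketch using the RFC 2047 support in Python's standard `email.header` module. The sample name is made up for the example, and this only shows the header case, not the body encodings.

```python
from email.header import Header, decode_header, make_header

# Hypothetical non-ASCII header value (not taken from any real message).
original = "André Dupont"

# Sender side: produce an RFC 2047 encoded-word that is pure 7-bit ASCII
# and carries its character set label with it.
wire = Header(original, charset="utf-8").encode()

# Recipient side: parse the encoded-word and rebuild the readable text.
decoded = str(make_header(decode_header(wire)))

print(wire)
print(decoded)
```

The point is that what travels over the wire is ASCII with the charset named inline, and the user agent on each end does the conversion.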
Now, when USEFOR started down this same path, there was an understandable
dislike of encodings, since encoding something adds additional complexity
and is easy to implement incorrectly. On top of that, e-mail had some
serious problems with picking encodings; there are not one but four
separate e-mail encodings (base64, quoted-printable, RFC 2047, and RFC
2231), all of which you have to implement to have a fully compliant mail
implementation. And to make matters worse, none of those encodings are
particularly suitable for encoding the Newsgroups header, which imposes
this additional requirement that all encodings of the same set of
newsgroups produce byte-for-byte identical text. (Most of the mail
encodings have annoying properties like being case-insensitive.)
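To make the byte-for-byte problem concrete, here is a small sketch, again with Python's standard library, showing several different RFC 2047 wire forms that all decode to the same text. The group name `fr.récréation` is invented for the example, and RFC 2047 is used only to illustrate the property, not as something anyone proposed for Newsgroups.

```python
import base64
from email.header import decode_header, make_header

# Hypothetical newsgroup name, invented for illustration.
name = "fr.récréation"
b64 = base64.b64encode(name.encode("utf-8")).decode("ascii")

# Three wire forms of the same name: Q encoding with upper-case hex,
# Q encoding with lower-case hex, and B (base64) encoding.
wire_forms = [
    "=?utf-8?q?fr.r=C3=A9cr=C3=A9ation?=",
    "=?utf-8?q?fr.r=c3=a9cr=c3=a9ation?=",
    "=?UTF-8?B?" + b64 + "?=",
]

decoded = [str(make_header(decode_header(w))) for w in wire_forms]

# Every form decodes to the same text, yet no two are identical as bytes,
# so software comparing the encoded strings would see three different values.
print(len(set(wire_forms)), "distinct wire forms")
print(len(set(decoded)), "distinct decoded value")
```

Anything that matches the Newsgroups header by comparing the encoded text directly would treat these as three different groups, which is why an encoding for that header needs exactly one valid output per input.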
There was therefore an early bias in the USEFOR discussions towards just
saying "all 8-bit characters are UTF-8" and not having to deal with all of
this. At the time, it was hoped that mail was also going in that same
direction after some transition period and getting rid of all of these
encodings.
After further discussions, it seems apparent now that mail is *not* going
in that direction, and that in fact mail is likely to stick with its set
of encodings for the indefinite future, which would make going to UTF-8 in
newsgroup messages a significant break between the news article format and
the mail message format. Some people don't consider this a problem. Some
people consider this to be a significant problem. (I'm in the latter
camp.)
One of the first places where this problem shows up is with handling
submissions to moderated groups, since this is the most frequent example
of a news message being sent via e-mail and then turned back into a news
message.
It is not actually *necessary* for news to use raw UTF-8 in headers. It
is more convenient for a lot of reasons, but also has other drawbacks.
There have been extensive debates on whether this is really the right
approach or not. Over time, I've become more and more convinced that
following mail is more important than doing something that's somewhat
cleaner, but I'm probably in the minority right now on USEFOR in holding
that position.
One significant drawback of not breaking compatibility with mail is that we
would have to invent yet a fifth encoding mechanism for the Newsgroups
header (possibly reusable by other news headers) because none of the four
that mail has come up with are suitable for that purpose for various
reasons. The current USEFOR proposal is to standardize that encoding but
to only apply it when mailing messages to moderators, and to otherwise use
raw UTF-8 in the Newsgroups header. The story concerning use of UTF-8 in
the other headers is rather more confused at the moment, in my opinion,
but other people may have other opinions about how confusing it is.
> my quandary, also. *GIVEN* that there is some email-rule for putting
> better than 8-bit stuff into email headers, why do we [as usenet folk]
> *CARE* what that mechanism is? Right now there's a lot of synergy on
> the client side by having news and email share a common format... it
> seems like it'd be a terrible shame to lose that.
Yes, that's also my opinion, but that opinion is not currently in the
majority on USEFOR so far as I can tell. (Although as mentioned above,
with Newsgroups we're going to have to make something up on our own one
way or another, whether that be another encoding or ways of handling raw
UTF-8.)
-- Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>