Re: Transformation of Non-ASCII headers

From: Bruce Lilly (blilly@erols.com)
Date: Sun Feb 16 2003 - 10:08:48 CST


J.B. Moreno wrote:
> On 2/14/03 12:07 PM, Bruce Lilly at <blilly@erols.com> wrote:

>>That is a necessary prerequisite from the current chaos to use of a single
>>8-bit untagged charset. Enforcement isn't necessary -- if there's a desire to
>>move to utf-8, use of untagged 8-bit content will have to cease first so that
>>when generation of untagged utf-8 is eventually permitted, one can be
>>absolutely assured that such untagged 8-bit content *is* utf-8 and not any of
>>a hundred other charsets. That's the carrot.
>
>
> And it's not enough of one -- something you can use today, that will be
> understood today, will (hopefully) be enough of one

RFC 2047/2231 can be used today and understood today.
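
As a rough illustration of what "tagged" means here, the sketch below uses
Python's standard email.header module to produce an RFC 2047 encoded-word
and read it back. The sample text "Café menu" is made up for the example;
it is not from this thread. The point is only that the charset label
travels with the data and the result is 7-bit clean.

```python
from email.header import Header, decode_header

# Hypothetical header text containing one non-ASCII character.
original = "Caf\u00e9 menu"

# Encode as an RFC 2047 encoded-word; the charset name is part of the result.
encoded = Header(original, charset="utf-8").encode()

# A reader recovers both the octets and the charset label, so no guessing is needed.
(raw, charset), = decode_header(encoded)
restored = raw.decode(charset)

print(encoded)
print(charset, restored)
```

The encoded form is plain ASCII, so it also passes through transport that is
not 8-bit clean.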

> There *will* be untagged 8-bit content -- the only question is whether we
> can convince people to use UTF8 or not. The idea that a current standard is
> being ignored, so we must write a new standard that says the exact same
> thing, is simply silly.

That is not the rationale. The rationale is that if one is going
to interpret a stream of octets such as 110xxxxx 10xxxxxx
as always being utf-8, then it is first necessary to ensure that
nobody is sending such a sequence in some other untagged
8-bit charset. A large number of sequences that conform to the
patterns utf-8 can generate are not only possible but not
particularly uncommon in other charsets, especially in the common
case of header fields, where only a few octets with the 8th bit
set appear.
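
A small sketch of the ambiguity, assuming Python and its built-in codecs.
The octet pair 0xC2 0xB0 and the candidate charset list are chosen for
illustration only; the helper names are hypothetical. The pair matches the
110xxxxx 10xxxxxx pattern and is valid utf-8, yet it is also ordinary text
in ISO-8859-5 and decodes without error as ISO-8859-1.

```python
def matches_two_octet_pattern(b0, b1):
    """True if the pair has the bit pattern 110xxxxx 10xxxxxx."""
    return (b0 & 0xE0) == 0xC0 and (b1 & 0xC0) == 0x80

# Illustrative candidate list; real traffic could use many more charsets.
CANDIDATES = ["utf-8", "iso8859_5", "latin-1"]

def decodings(data, charsets=CANDIDATES):
    """Return every candidate charset under which the octets decode cleanly."""
    result = {}
    for name in charsets:
        try:
            result[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # not valid in this charset, so it is ruled out
    return result

octets = bytes([0xC2, 0xB0])
pattern_ok = matches_two_octet_pattern(octets[0], octets[1])
readings = decodings(octets)

print(pattern_ok)
for name, text in readings.items():
    print(name, repr(text))
```

All three candidates succeed: utf-8 yields a degree sign, ISO-8859-5 yields
two uppercase Cyrillic letters, and ISO-8859-1 yields two Latin-1 characters.
Nothing in the octets themselves says which reading was intended.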

One cannot use any untagged 8-bit charset today with any reasonable
expectation of it being understood, not only because of lack of
support in readers and lack of compatibility with transport, but
also because there simply is no way for the reader to determine
*which* charset is in use for the typically short sequences that
are used in header fields. The only way that a reader can be
guaranteed that an untagged 8-bit sequence is utf-8 is if all
use of all other untagged 8-bit charsets has stopped, and the only
way to ensure that is a "cooling-off" period in which *no*
untagged 8-bit charsets are used.


