From: J.B. Moreno (planb@newsreaders.com)
Date: Fri Feb 14 2003 - 14:09:14 CST
On 2/14/03 12:55 PM, Bruce Lilly at <blilly@erols.com> wrote:
> J.B. Moreno wrote:
>> On 2/12/03 10:44 AM, Bruce Lilly at <blilly@erols.com> wrote:
>>
>>> Before utf-8 can be adopted, there needs to be a transition
>>> period where there is a moratorium on *all* untagged 8-bit
>>> header field content as a prerequisite to a state where
>>> the only untagged 8-bit content is utf-8. The current
>>> Usefor draft lacks such a transition plan.
>>
>> Because it doesn't need it -- UTF8 can be reliably inferred, and even if it
>> couldn't it wouldn't matter; as a last resort the user can always tell the
>> UA what it is.
>
> UTF-8 in the absence of tagging and in the presence of other
> untagged 8-bit charsets cannot be reliably detected, particularly
> for the short runs of text typical of text in header fields.
Andrew Gierth did an analysis for us last year (June 02); the result was
that it *could* be reliably detected in short runs of text typical of header
fields (specifically, the Subject). The rate of false positives was
under 0.012% (not 12%, not 1.2%, not 0.12%, but under 0.012%). By any measure
of "reliable" I've ever used, that fits the bill.
If you think that's changed since then, you could ask him to rerun his test,
but I, for one, am satisfied that UTF8 can be reliably inferred.
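
To make the idea concrete, here is a minimal sketch of this kind of inference
in Python. It is not Andrew Gierth's test (his method isn't described in this
message); the function name looks_like_utf8 is made up for illustration, and
the only rule applied is "contains 8-bit bytes and is well-formed UTF-8".

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Guess whether untagged 8-bit header bytes are UTF-8.

    Hypothetical helper for illustration only. Pure ASCII returns False
    because there is nothing 8-bit to infer a charset for.
    """
    if all(b < 0x80 for b in raw):
        return False  # 7-bit only: no inference needed
    try:
        # Strict decoding fails on stray lead or continuation bytes,
        # which is what most single-byte charset text looks like.
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(looks_like_utf8("Janßen".encode("utf-8")))    # UTF-8 bytes
    print(looks_like_utf8("Janßen".encode("latin-1")))  # untagged Latin-1 bytes
```

A false positive needs single-byte text that happens to form valid UTF-8
sequences, e.g. the two Latin-1 characters "Ã©" (bytes C3 A9), which is why
such cases are rare in ordinary header text.
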
(Which makes me think that a possibly better way to go for mail, if it
refuses to go 8 bits, would be to introduce a general-purpose encoding that
does *not* include a charset, but is simply a byte-value translation, e.g.
=?iso-bval?Q?Jan=DFen?=. It'd be no harder to get supported than Punycode
or any other encoding, and would deal with the fact that mail relayers often
*do* get untagged 8-bit bytes.)
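
A rough sketch of what such a byte-value translation might look like, in
Python. "iso-bval" is only the label from the example above, not a registered
charset or an existing standard; the function names are made up, and the
escaping rules are borrowed from the "Q" encoding of RFC 2047 as an
assumption.

```python
import string

LABEL = "iso-bval"  # label from the example above; not a registered name
# Characters left unescaped, following the RFC 2047 "Q" rules for phrases.
SAFE = set(string.ascii_letters + string.digits + "!*+-/")


def bval_encode(raw: bytes) -> str:
    """Wrap raw bytes in an encoded-word without claiming any charset."""
    out = []
    for b in raw:
        ch = chr(b)
        if ch in SAFE:
            out.append(ch)
        elif b == 0x20:
            out.append("_")          # space is written as underscore
        else:
            out.append(f"={b:02X}")  # everything else as =XX
    return f"=?{LABEL}?Q?{''.join(out)}?="


def bval_decode(word: str) -> bytes:
    """Recover the original bytes; interpreting them is left to the reader."""
    prefix = f"=?{LABEL}?Q?"
    if not (word.startswith(prefix) and word.endswith("?=")):
        raise ValueError("not an iso-bval encoded-word")
    payload = word[len(prefix):-2]
    out = bytearray()
    i = 0
    while i < len(payload):
        c = payload[i]
        if c == "=":
            out.append(int(payload[i + 1:i + 3], 16))
            i += 3
        elif c == "_":
            out.append(0x20)
            i += 1
        else:
            out.append(ord(c))
            i += 1
    return bytes(out)


if __name__ == "__main__":
    word = bval_encode(b"Jan\xdfen")
    print(word)  # =?iso-bval?Q?Jan=DFen?=
    print(bval_decode(word) == b"Jan\xdfen")
```

The point of the design is that a relay which receives untagged 8-bit bytes
can pass them on in 7-bit form without having to guess a charset; the guess
is deferred to the receiving UA or the user.
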
-- J.B. Moreno