Re: Transformation of Non-ASCII headers


From: Bruce Lilly (blilly@erols.com)
Date: Sun Feb 16 2003 - 21:51:42 CST


J.B. Moreno wrote:
> On 2/16/03 6:57 PM, Bruce Lilly at <blilly@erols.com> wrote:
>
> First off, please stop cc'ing me, it's annoying getting two identical
> messages.

I cannot keep track of who is subscribed to various lists; please
use a Reply-To header field if you want replies directed to
the list only.

> No, YOUR ratio is meaningless -- the question is "does the test correctly
> identify or reject UTF"

No, the question was about false positives; the number of negatives
is irrelevant to that question.

> The ratio you are focused on only shows that currently it is relatively
> useless to ask that question

No, the question is quite important, as it clearly shows that
correct identification of a mix of untagged charsets is not
as simple as "assume utf-8", as has been claimed; that
assumption is wrong as often as not. The only way a single
untagged raw charset can be identified is if it is known a
priori that it is the only untagged charset in use, and clearly
that is not now the case -- indeed actual use of raw utf-8 is
negligible.

> If by blessing UTF8 we cause more UTF8 to be posted, then that 9 will go up.
> And since any increase in the number of UTF8 articles will /decrease/ the
> ratio you seem to think is important

No, it is unlikely to change at all; the ratio of false
positives is due to the fact that coded utf-8 generates
octet sequences which are not markedly different from
other 8-bit charsets, especially on short texts.
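
To illustrate the mechanism (this sketch is not from the original
message): assume the "test" under discussion is a plain well-formedness
check, i.e. whether the 8-bit octets decode as utf-8. The helper name
looks_like_utf8 and the sample strings are hypothetical, chosen only to
show that the same octets can be legitimate text in two charsets.

```python
def looks_like_utf8(octets: bytes) -> bool:
    """Return True if the octets contain 8-bit data and are well-formed UTF-8.

    Assumption: this well-formedness check stands in for whatever
    detection test is actually being proposed.
    """
    if all(b < 0x80 for b in octets):
        return False  # pure ASCII: there is nothing to identify
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Genuine UTF-8: "cafe" with e-acute, encoded as octets 63 61 66 C3 A9.
genuine = "caf\u00e9".encode("utf-8")

# ISO-8859-1 text whose last two characters are A-tilde and the copyright
# sign. Its octets are also 63 61 66 C3 A9, so the check cannot tell it
# apart from the UTF-8 sample above.
latin1_collision = "caf\u00c3\u00a9".encode("iso-8859-1")

# ISO-8859-1 "cafe" with e-acute: a lone 0xE9 octet is not well-formed
# UTF-8, so this sample is rejected.
latin1_rejected = "caf\u00e9".encode("iso-8859-1")

for sample in (genuine, latin1_collision, latin1_rejected):
    print(sample, looks_like_utf8(sample))
```

The check sees only octets, so any text in another 8-bit charset that
happens to form valid utf-8 sequences passes it; the shorter the text,
the fewer octets there are to rule that out.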

> So, either the status quo continues, and any 8 bit char should be assumed to
> be in the current charset, or people start using UTF8, in which case
> checking for UTF8 won't MISLABEL "local" as UTF often enough to be a
> problem.

It will mislabel other charsets as utf-8 about 50% of the
time, based on the data presented.

What *should* happen is that compatible RFC 2047 tagging
and encoding should be used, in which case there are no
backwards compatibility problems, language tagging in the
protocol as required by RFC 2277 is provided, there are no
problems with IMAP or SMTP, and there is no need to guess
about charset as it will be explicitly labeled.
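
As a minimal sketch of what explicit labeling looks like (not from the
original message), the following uses Python's standard email.header
module to produce and read back an RFC 2047 encoded-word. The subject
text and variable names are hypothetical, and the sketch does not
attempt language tagging.

```python
from email.header import Header, decode_header

# Hypothetical header text containing two non-ASCII characters
# (u-umlaut and sharp s).
raw_subject = "Gr\u00fc\u00dfe"

# Encode as an RFC 2047 encoded-word; the charset travels with the text,
# and the result is plain ASCII on the wire.
encoded = Header(raw_subject, charset="iso-8859-1").encode()
print(encoded)

# A receiver reads the charset label instead of guessing.
[(octets, charset)] = decode_header(encoded)
decoded = octets.decode(charset)
print(charset, decoded == raw_subject)
```

Because the encoded form is 7-bit and carries its own charset label, the
receiving side never has to run a detection heuristic on raw octets.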

> Now, the nice thing about this is that we can say today "check for UTF8, if
> positive then treat it as UTF8" without that being bad advice

It will absolutely be bad advice -- it will be wrong half
the time.

>>And it's clear that those quoting < 0.1 % "false positive"
>>ratios are quoting the wrong numbers. Which is not surprising
>>given a) the religious fervor involved, and b) the reality of
>>small bits of text, where one would expect a high false positive
>>rate (anything under 20% would be suspect).
>
>
> Whether it's the wrong ratio or not depends upon the question, if the
> question is "can it tell the difference between the local charset and UTF8"
> the answer is a resounding "yes!", if the question is "is it currently
> worthwhile to make such a test for UTF8", the answer is an equally
> resounding "no!" (and the ratio of looks-like / actually-is, isn't the
> reason why, at most that's the icing on the cake, the ratio of utf8 to local
> charset is why).

But those weren't the questions; the question was about false
positives. And a 50% error rate is awful.

> But the answer to the second question depends upon how many UTF8 articles
> are posted, by blessing UTF8 we will (hopefully) cause it to change.

No, as long as there is a mix of untagged charsets, one
can expect approximately a 50% error rate on identification,
since the rate is determined by the nature of the charsets
and is independent of volume.




This archive was generated by hypermail 2b29.