Re: Transformation of Non-ASCII headers


From: Bruce Lilly (blilly@erols.com)
Date: Sat Feb 22 2003 - 00:40:31 CST


Andrew Gierth wrote:
>>>>>>"Mark" == Mark Crispin <mrc@CAC.Washington.EDU> writes:
>
>
> > On Sun, 16 Feb 2003, Bruce Lilly wrote:
> >> No, no, no; the rate of false positives (again assuming
> >> that one knows the real charset) is the ratio of the
> >> false matches to the total matching the utf-8 rule, or
> >> 17 / 26 which is greater than 65%.
>
> Mark> That's correct,
>
> on the contrary, it is statistical nonsense.
>
> The sample contained 151,991 strings of non-UTF-8 text containing
> 8-bit characters, and when these were fed to the "is this UTF-8"
> algorithm it incorrectly answered "yes" in 17 cases. (taking the
> original figures for the time being, in fact the real error rate was
> lower).

And several million additional non-UTF-8 strings could have
been fed in (the ones with no 8-bit content). That's irrelevant
to the question of whether or not the strings which matched utf-8-
compatible sequences were in fact utf-8 or false matches.
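To make the two competing ratios concrete, here is a minimal sketch using only the figures quoted above (17 false matches, 26 total matches, 151,991 non-UTF-8 strings with 8-bit content). The variable names and the helper with_extra_non_matching are hypothetical, chosen for illustration, and the sketch assumes the extra strings with no 8-bit content are simply not counted as matches by the rule.

```python
from fractions import Fraction

# Figures quoted in the thread.
false_matches = 17        # non-UTF-8 strings that matched the utf-8 rule
total_matches = 26        # all strings that matched the utf-8 rule
non_utf8_8bit = 151_991   # non-UTF-8 strings with 8-bit content that were tested

# Ratio 1: of the strings that matched, what fraction were false matches?
share_of_matches_false = Fraction(false_matches, total_matches)

# Ratio 2: of the non-UTF-8 strings tested, what fraction matched anyway?
share_of_non_utf8_matched = Fraction(false_matches, non_utf8_8bit)


def with_extra_non_matching(extra):
    """Recompute both ratios after feeding in `extra` more non-UTF-8
    strings that do not match (assumption: strings with no 8-bit
    content are not counted as matches)."""
    ratio_1 = Fraction(false_matches, total_matches)
    ratio_2 = Fraction(false_matches, non_utf8_8bit + extra)
    return ratio_1, ratio_2


if __name__ == "__main__":
    print(f"false share of matches:    {float(share_of_matches_false):.1%}")
    print(f"matched share of non-UTF-8: {float(share_of_non_utf8_matched):.4%}")
    r1, r2 = with_extra_non_matching(3_000_000)
    print(f"after 3,000,000 extra non-matching strings: {float(r1):.1%}, {float(r2):.6%}")
```

Under these assumptions the first ratio stays at 17/26 no matter how many non-matching strings are added, while the second shrinks as the denominator grows, which is why the size of the non-matching pool does not bear on whether a given match is genuine.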



