From: Andrew Gierth (andrew@erlenstar.demon.co.uk)
Date: Thu Feb 20 2003 - 07:31:26 CST
>>>>> "Mark" == Mark Crispin <mrc@CAC.Washington.EDU> writes:
> On Sun, 16 Feb 2003, Bruce Lilly wrote:
>> No, no, no; the rate of false positives (again assuming
>> that one knows the real charset) is the ratio of the
>> false matches to the total matching the utf-8 rule, or
>> 17 / 26 which is greater than 65%.
Mark> That's correct,
On the contrary, it is statistical nonsense.
The sample contained 151,991 strings of non-UTF-8 text containing
8-bit characters, and when these were fed to the "is this UTF-8"
algorithm it incorrectly answered "yes" in 17 cases. (I am taking the
original figures for the time being; in fact the real error rate was
lower.)
The number of _correct_ "yes" answers is dependent only on the
composition of the sample and not the error rate of the algorithm.
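
To put numbers on the distinction, here is a rough sketch in Python.
The figures 151,991 and 17 are the ones given above, and 26 is the
total from the quoted "17 / 26". The function names are mine, and the
second function assumes the algorithm accepts every genuine UTF-8
string, which is an illustrative assumption and not something measured
here. The hypothetical counts of 900 and 90,000 genuine UTF-8 strings
are made up purely to show how the quoted ratio moves with the sample.

```python
from fractions import Fraction

NON_UTF8_STRINGS = 151_991  # non-UTF-8 strings in the sample (from the post)
FALSE_YES = 17              # wrong "yes" answers (original figures)
TOTAL_YES = 26              # all "yes" answers, from the quoted 17 / 26


def false_positive_rate(false_yes, non_utf8_total):
    """Wrong 'yes' answers as a fraction of all non-UTF-8 inputs."""
    return Fraction(false_yes, non_utf8_total)


def wrong_share_of_yes(false_yes, genuine_utf8):
    """Wrong 'yes' answers as a fraction of all 'yes' answers.

    Assumption (not from the post): every genuine UTF-8 string is
    accepted, so correct 'yes' answers == genuine_utf8.
    """
    return Fraction(false_yes, false_yes + genuine_utf8)


if __name__ == "__main__":
    rate = false_positive_rate(FALSE_YES, NON_UTF8_STRINGS)
    print(f"error rate of the algorithm: {float(rate):.4%}")

    # Vary only the number of genuine UTF-8 strings in the sample.
    # The first value reproduces the quoted 17 / 26; the others are
    # hypothetical sample compositions.
    for genuine in (TOTAL_YES - FALSE_YES, 900, 90_000):
        share = wrong_share_of_yes(FALSE_YES, genuine)
        print(f"{genuine:>6} genuine UTF-8 strings: "
              f"{float(share):.2%} of 'yes' answers are wrong")
```

The first figure comes out at roughly 0.011% and stays fixed, since it
involves only the non-UTF-8 inputs. The second starts above 65% and
shrinks as more genuine UTF-8 is added to the sample, even though the
algorithm itself has not changed at all.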
-- Andrew.