From: Bruce Lilly (blilly@erols.com)
Date: Fri Feb 21 2003 - 16:11:39 CST
Charles Lindsey wrote:
> So here it is again, in words of one syllable.
>
> There is a test which, given a sequence of octets, will answer the
> question "does this appear to be valid UTF-8" with a straightforward "yes"
> or "no". The test is not perfect.
Evidently you either can't count past one or don't know what a syllable is...
> We are only interested in headers with at least one 8-bit octet, because
> all other cases are already known to be ASCII (that also includes cases
> properly encoded using RFC 2047).
We are only concerned with patterns which match the sequences
which could be generated by UTF-8, because all other cases are
known not to be utf-8.
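
A minimal sketch of the three-way split described above, assuming
Python's strict UTF-8 decoder is an acceptable stand-in for "matches
the sequences which could be generated by UTF-8". The function name
classify_header and the three labels are hypothetical, chosen only for
illustration; they do not come from any draft or implementation
discussed in this thread.

```python
def classify_header(octets: bytes) -> str:
    """Sort a raw header into one of three illustrative buckets.

    "ascii"      - no 8-bit octets at all, so nothing to guess
    "not-utf8"   - contains 8-bit octets that cannot be UTF-8
    "maybe-utf8" - matches UTF-8 patterns; the charset is still a guess
    """
    # 7-bit-only headers are known a priori and are not part of the question.
    if all(b < 0x80 for b in octets):
        return "ascii"
    try:
        # Strict decoding rejects stray lead bytes, missing continuation
        # bytes, overlong forms and surrogates.
        octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "not-utf8"
    # Passing the pattern test does not prove the sender used UTF-8.
    return "maybe-utf8"


if __name__ == "__main__":
    print(classify_header(b"Subject: hello"))
    print(classify_header("caf\u00e9".encode("latin-1")))   # lone 0xE9
    print(classify_header("caf\u00e9".encode("utf-8")))     # 0xC3 0xA9
    # ISO-8859-1 text that happens to form a valid UTF-8 sequence:
    print(classify_header("\u00c3\u00a9".encode("latin-1")))
```

Only the "maybe-utf8" bucket is relevant to the false-positive
question; the last call shows ISO-8859-1 octets landing in it.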
> Now it appears that, out of every 152,000 (or thereabouts) cases where it
> ought to report "no", it actually reports "yes" about 18 times. In other
> words, when it is _supposed_ to report "no", it falsely reports "yes"
> 0.012% of the time. Those cases are called the "false positives".
No, out of the cases which match possible utf-8 sequences, the
assumption that the charset *is* utf-8 is wrong approximately
half of the time. Those are the false positives. The ones which
are known a priori not to be utf-8 (both 7-bit-only and sequences
which cannot be utf-8) are irrelevant.
And with at most 0.2% of current articles containing utf-8
sequences, the problem will become larger in magnitude if
more untagged utf-8 is used (but the false positive rate is
unlikely to change much, as it is determined primarily by the
overlapping codes in the various charsets).
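
The two figures in dispute are ratios over different denominators, and
a short sketch may make that concrete. The 18 and 152,000 are the
numbers quoted above; the count of genuine utf-8 matches (18) is NOT a
measured value, only an assumption picked so that the second ratio
comes out at "approximately half". The function names are hypothetical.

```python
def rate_over_all_negatives(false_yes: int, should_be_no: int) -> float:
    """False "yes" answers divided by every case that should be "no"."""
    return false_yes / should_be_no


def rate_over_matches(false_yes: int, true_yes: int) -> float:
    """False "yes" answers divided by every case that answered "yes"."""
    return false_yes / (false_yes + true_yes)


if __name__ == "__main__":
    false_yes = 18           # quoted above
    should_be_no = 152_000   # quoted above
    true_yes = 18            # ASSUMPTION for illustration, not measured

    # Denominator: all non-utf-8 cases.
    print(f"{rate_over_all_negatives(false_yes, should_be_no):.3%}")  # 0.012%
    # Denominator: only cases that match utf-8 patterns.
    print(f"{rate_over_matches(false_yes, true_yes):.0%}")            # 50%
```

The same 18 wrong guesses give 0.012% or about half depending solely on
which cases are counted in the denominator.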