Re: Transformation of Non-ASCII headers


From: Bruce Lilly (blilly@erols.com)
Date: Fri Feb 21 2003 - 16:11:39 CST


Charles Lindsey wrote:

> So here it is again, in words of one syllable.
>
> There is a test which, given a sequence of octets, will answer the
> question "does this appear to be valid UTF-8" with a straightforward "yes"
> or "no". The test is not perfect.

Evidently you either can't count past one or don't know what a syllable is...

> We are only interested in headers with at least one 8-bit octet, because
> all other cases are already known to be ASCII (that also includes cases
> properly encoded using RFC 2047).

We are only concerned with patterns which match the sequences
which could be generated by UTF-8, because all other cases are
known not to be utf-8.
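
To make the three groups concrete, here is a minimal sketch of such a
test. It assumes Python's strict "utf-8" codec as the validity check, and
the function name classify is made up for illustration; it is not the
test anyone in this thread actually ran.

```python
def classify(octets: bytes) -> str:
    """Sort a raw header value into one of the three groups discussed here.

    Hypothetical helper: assumes Python's strict UTF-8 decoder is an
    acceptable stand-in for the "does this appear to be valid UTF-8" test.
    """
    # No octet has the high bit set: plain ASCII, so not of interest.
    if all(b < 0x80 for b in octets):
        return "ascii"
    try:
        octets.decode("utf-8")  # strict: rejects malformed sequences
    except UnicodeDecodeError:
        # Cannot have been generated by UTF-8: known not to be utf-8.
        return "not-utf8"
    # Matches a possible UTF-8 sequence; the real charset is still unknown.
    return "maybe-utf8"


print(classify(b"Subject: hello"))            # 7-bit only
print(classify("café".encode("latin-1")))     # lone 0xE9 octet
print(classify("Ã©".encode("latin-1")))       # octets 0xC3 0xA9
```

The last call is the interesting one: the octets 0xC3 0xA9 are a legitimate
ISO-8859-1 string and also a well-formed UTF-8 sequence, so a "yes" from
the test says nothing about which charset the sender meant.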

> Now it appears that, out of every 152,000 (or thereabouts) cases where it
> ought to report "no", it actually reports "yes" about 18 times. In other
> words, when it is _supposed_ to report "no", it falsely reports "yes"
> 0.012% of the time. Those cases are called the "false positives".

No, out of the cases which match possible utf-8 sequences, the
assumption that the charset *is* utf-8 is wrong approximately
half of the time. Those are the false positives. The ones which
are known a priori not to be utf-8 (both 7-bit-only and sequences
which cannot be utf-8) are irrelevant.
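
The disagreement is over the denominator, and the arithmetic can be
sketched as follows. The figures 18 and 152,000 come from the quoted
message; the count of matching cases (36) is a hypothetical value chosen
only so that "approximately half" comes out, not a measured number.

```python
def rate(false_yes: int, denominator: int) -> float:
    """Fraction of cases in `denominator` that were wrongly reported "yes"."""
    return false_yes / denominator


FALSE_YES = 18           # from the quoted message
ALL_NON_UTF8 = 152_000   # from the quoted message
MATCHING = 36            # HYPOTHETICAL: picked to illustrate "about half"

# Same numerator, two different denominators.
rate_over_all = rate(FALSE_YES, ALL_NON_UTF8)
rate_over_matching = rate(FALSE_YES, MATCHING)

print(f"{rate_over_all:.3%}")       # 0.012%
print(f"{rate_over_matching:.0%}")  # 50%
```

Both results describe the same 18 wrong answers; only the population they
are divided by differs.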

And with at most 0.2% of current articles containing utf-8
sequences, the problem will become larger in magnitude if
more untagged utf-8 is used (but the false positive rate is
unlikely to change much as it is primarily determined by the
overlapping codes in the various charsets).



