From: Bruce Lilly (blilly@erols.com)
Date: Sat Feb 22 2003 - 00:49:29 CST
Andrew Gierth wrote:
> Failing to pay attention when your statistical error is explained to
> you is a good sign that you're not really interested in the truth, and
> you only want to extract figures that support your preselected
> position.
I have in fact paid quite close attention; I simply disagree.
> Bruce> the ratio of false positives is due to the fact that coded
> Bruce> utf-8 generates octet sequences which are not markedly
> Bruce> different from other 8-bit charsets, especially on short
> Bruce> texts.
Detailed in another message recently posted; in summary, one expects
about a 50% error rate with iso-8859-x in the mix, and a bit more
with some other charsets as well.
> To quote RFC2279:
>
> UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of
> octets, where the number of octets, and the value of each, depend on
> the integer value assigned to the character in ISO/IEC 10646. This
> transformation format has the following characteristics (all values
> are in hexadecimal):
> [...]
> - UTF-8 strings can be fairly reliably recognized as such by a
> simple algorithm, i.e. the probability that a string of characters
> in any other encoding appears as valid UTF-8 is low, diminishing
> with increasing string length.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^
N.B. In many cases we're only talking about a few characters, which is
a different situation from the one described in that part of 2279,
namely an extended run of text. Also note that the "short texts" I
mentioned, which is what applies to the header field text strings under
discussion, are quite different from 2279's "increasing string length".
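
As a rough illustration of the short-versus-long distinction, here is a
minimal Python sketch of the kind of "simple algorithm" 2279 alludes to:
try a strict UTF-8 decode and see whether it succeeds. The function name
and the two sample octet strings are hypothetical, chosen only for
illustration; they are not taken from anybody's test data.

```python
def looks_like_utf8(octets: bytes) -> bool:
    """Return True if the octets decode as strict UTF-8, else False."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical samples, both meant as ISO-8859-1 text.
# Two characters (A-tilde, copyright sign): octets C3 A9.
short_text = b"\xc3\xa9"
# The same two characters followed by " caf" and e-acute (octet E9).
longer_text = b"\xc3\xa9 caf\xe9"

print(looks_like_utf8(short_text))   # the two-octet string passes as UTF-8
print(looks_like_utf8(longer_text))  # the trailing E9 has no continuation octets
```

The two-octet ISO-8859-1 string is indistinguishable from a valid UTF-8
encoding of a single character, while the longer one is rejected. One
pair of samples proves nothing about rates, of course; it only shows the
mechanism.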
> My figures solidly agree with this claim. You are nevertheless
> dismissing it based on _NO_ evidence.
The evidence is the *same* data, coupled with publicly available
information on the code sequences and code space.
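
For anyone who wants to look at the code space directly, the following
Python sketch enumerates every pair of octets with the high bit set and
counts how many pairs form a valid UTF-8 sequence. It is only a sketch
under stated assumptions: the function name is hypothetical, it uses
Python's strict decoder (which rejects overlong forms such as lead
octets C0 and C1), and it treats all octet pairs as equally likely,
which real text in any charset is not. It therefore describes the code
space only, not the error rate on actual header fields.

```python
def count_valid_high_pairs() -> tuple[int, int]:
    """Count pairs of high-bit octets that are valid strict UTF-8.

    Returns (valid, total) over all pairs with both octets in 0x80..0xFF.
    """
    valid = 0
    total = 0
    for first in range(0x80, 0x100):
        for second in range(0x80, 0x100):
            total += 1
            try:
                bytes([first, second]).decode("utf-8")
                valid += 1
            except UnicodeDecodeError:
                pass  # not a valid two-octet UTF-8 sequence
    return valid, total


valid, total = count_valid_high_pairs()
print(valid, total)  # lead octets C2..DF (30) times continuations 80..BF (64)
```

How the octets of a given charset are actually distributed over that
space in short strings is a separate question, and the enumeration above
does not answer it.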