From: Andrew Gierth (andrew@erlenstar.demon.co.uk)
Date: Thu Feb 20 2003 - 07:59:00 CST
>>>>> "Bruce" == Bruce Lilly <blilly@erols.com> writes:
>> No, YOUR ratio is meaningless -- the question is "does the test
>> correctly identify or reject UTF"
Bruce> No, the question was about false positives; the number of
Bruce> negatives is irrelevant to that question.
Misinterpreting the statistics is forgivable, since it's a very
non-intuitive field. Several other people misread them too when I
first posted them.
Failing to pay attention when your statistical error is explained to
you is a good sign that you're not really interested in the truth, and
you only want to extract figures that support your preselected
position.
Bruce> the ratio of false positives is due to the fact that coded
Bruce> utf-8 generates octet sequences which are not markedly
Bruce> different from other 8-bit charsets, especially on short
Bruce> texts.
To quote RFC 2279:
   UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of
   octets, where the number of octets, and the value of each, depend on
   the integer value assigned to the character in ISO/IEC 10646. This
   transformation format has the following characteristics (all values
   are in hexadecimal):
   [...]
   -  UTF-8 strings can be fairly reliably recognized as such by a
      simple algorithm, i.e. the probability that a string of characters
      in any other encoding appears as valid UTF-8 is low, diminishing
      with increasing string length.
My figures solidly agree with this claim. You are nevertheless
dismissing it based on _NO_ evidence. Please consult Mr. Crispin's
signature file for further enlightenment.
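For illustration, the kind of "simple algorithm" the RFC text refers to
might be sketched as below. This is a minimal sketch, not the test that
produced the figures discussed in this thread. The function name
is_valid_utf8 is hypothetical, and the sketch assumes the stricter
limits later adopted in RFC 3629 (at most 4 octets per character, no
overlong forms, no surrogates), whereas RFC 2279 itself allowed
sequences of up to 6 octets.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if every octet sequence in data is well-formed UTF-8.

    Hypothetical helper for illustration only. Assumes RFC 3629 limits
    (1 to 4 octets, code points up to U+10FFFF), which are stricter
    than RFC 2279.
    """
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # Plain US-ASCII octet: always valid on its own.
            i += 1
            continue
        # The lead octet fixes the number of continuation octets (need),
        # the payload bits it carries (cp), and the smallest code point
        # that may legally use this length (lo), to reject overlong forms.
        if 0xC2 <= b <= 0xDF:
            need, cp, lo = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, lo = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF4:
            need, cp, lo = 3, b & 0x07, 0x10000
        else:
            # Stray continuation octet or an octet never used as a lead.
            return False
        if i + need >= n:
            # Sequence is cut off by the end of the input.
            return False
        for j in range(i + 1, i + need + 1):
            c = data[j]
            if (c & 0xC0) != 0x80:
                # Continuation octets must look like 10xxxxxx.
                return False
            cp = (cp << 6) | (c & 0x3F)
        if cp < lo or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            # Overlong encoding, out of range, or a surrogate.
            return False
        i += need + 1
    return True
```

Typical ISO 8859-1 text fails this check as soon as an accented letter
is followed by an ordinary ASCII letter, because the accented octet
looks like a lead octet with no valid continuation. A false positive
needs every high octet to fall into a well-formed sequence, for example
the two Latin-1 characters 0xC3 0xA9 on their own, which is why the
chance of one shrinks as the text gets longer.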
-- Andrew.