Re: Transformation of Non-ASCII headers


From: Andrew Gierth (andrew@erlenstar.demon.co.uk)
Date: Thu Feb 20 2003 - 07:59:00 CST


>>>>> "Bruce" == Bruce Lilly <blilly@erols.com> writes:

>> No, YOUR ratio is meaningless -- the question is "does the test
>> correctly identify or reject UTF"

 Bruce> No, the question was about false positives; the number of
 Bruce> negatives is irrelevant to that question.

Misinterpreting the statistics is forgivable, since it's a very
non-intuitive field. Several other people made the same mistake when
I first posted the figures.

Failing to pay attention when your statistical error is explained to
you is a good sign that you're not really interested in the truth, and
you only want to extract figures that support your preselected
position.

 Bruce> the ratio of false positives is due to the fact that coded
 Bruce> utf-8 generates octet sequences which are not markedly
 Bruce> different from other 8-bit charsets, especially on short
 Bruce> texts.

To quote RFC 2279:

   UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of
   octets, where the number of octets, and the value of each, depend on
   the integer value assigned to the character in ISO/IEC 10646. This
   transformation format has the following characteristics (all values
   are in hexadecimal):
[...]
   -  UTF-8 strings can be fairly reliably recognized as such by a
      simple algorithm, i.e. the probability that a string of characters
      in any other encoding appears as valid UTF-8 is low, diminishing
      with increasing string length.

My figures solidly agree with this claim. You are nevertheless
dismissing it based on _NO_ evidence. Please consult Mr. Crispin's
signature file for further enlightenment.
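
For anyone who wants to try the "simple algorithm" themselves, here is
a minimal sketch in Python. It is not the test used to produce my
figures, only an illustration of the idea. The function name
looks_like_utf8 is made up for this example, and Python's strict
decoder follows the later, tighter UTF-8 definition (no overlong
forms, no surrogates, nothing above U+10FFFF), so it rejects slightly
more than RFC 2279 alone would.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data contains at least one
    non-ASCII octet and the whole string is well-formed UTF-8."""
    # Pure ASCII is valid in UTF-8 and in most 8-bit charsets alike,
    # so it gives no evidence either way; report it as "not identified".
    if all(octet < 0x80 for octet in data):
        return False
    try:
        # Strict decoding raises on stray continuation octets,
        # truncated sequences, and invalid lead octets.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    samples = {
        "UTF-8 text": "caf\u00e9".encode("utf-8"),
        "Latin-1 text": "caf\u00e9".encode("latin-1"),
        "ASCII only": b"plain ascii",
        # A two-octet Latin-1 string that happens to be valid UTF-8:
        # the kind of short-string false positive under discussion.
        "Latin-1 false positive": "\u00c3\u00a9".encode("latin-1"),
    }
    for label, octets in samples.items():
        print(label, octets, looks_like_utf8(octets))
```

The last sample shows that false positives do exist on very short
strings. Each additional non-ASCII octet has to fit the lead/continuation
pattern as well, which is why the chance of an accidental match falls
as the string gets longer.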

-- 
Andrew.

