Re: Transformation of Non-ASCII headers

From: Bruce Lilly (blilly@erols.com)
Date: Sat Feb 22 2003 - 00:49:29 CST


Andrew Gierth wrote:

> Failing to pay attention when your statistical error is explained to
> you is a good sign that you're not really interested in the truth, and
> you only want to extract figures that support your preselected
> position.

I have in fact paid quite close attention; I simply disagree.

> Bruce> the ratio of false positives is due to the fact that coded
> Bruce> utf-8 generates octet sequences which are not markedly
> Bruce> different from other 8-bit charsets, especially on short
> Bruce> texts.

This is detailed in another recently posted message. In summary, one
expects about a 50% error rate with iso-8859-x in the mix, and a bit
more with some other charsets as well.

> To quote RFC2279:
>
> UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of
> octets, where the number of octets, and the value of each, depend on
> the integer value assigned to the character in ISO/IEC 10646. This
> transformation format has the following characteristics (all values
> are in hexadecimal):
> [...]
> - UTF-8 strings can be fairly reliably recognized as such by a
> simple algorithm, i.e. the probability that a string of characters
> in any other encoding appears as valid UTF-8 is low, diminishing
> with increasing string length.

         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^
N.B. In many cases we are talking about only a few characters, which is
a different situation from the extended run of text described in that
part of 2279. Also note that the "short texts" I referred to, which is
what the header field text strings under discussion are, are quite
different from 2279's "increasing string length".
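
To make the short-versus-long distinction concrete, here is a minimal
sketch in Python. It is an illustration only, not the test behind
either set of figures in this thread. The function name
looks_like_utf8 and the three sample octet strings are made up for the
example, and Python's strict decoder follows the later RFC 3629 rules,
which are somewhat tighter than those of 2279.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data has at least one 8-bit octet and decodes as UTF-8.

    Hypothetical helper for illustration; pure ASCII is reported as False
    because it carries no information about the charset.
    """
    if all(octet < 0x80 for octet in data):
        return False
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Sample octet strings (assumed examples, all legal ISO-8859-1 text).
short_ambiguous = b"\xc3\xa9"         # ISO-8859-1 "A-tilde, copyright"; also UTF-8 "e-acute"
short_rejected = b"caf\xe9"           # 0xE9 is a 3-octet lead with nothing after it
longer = b"\xc3\xa9 d\xe9j\xe0 vu"    # starts ambiguously, later octets break UTF-8 rules

for sample in (short_ambiguous, short_rejected, longer):
    print(sample, looks_like_utf8(sample))
```

The two-octet sample passes as UTF-8 even though it is equally valid
ISO-8859-1, while the longer string is rejected once a later octet
violates the UTF-8 sequence rules. That is the sense in which a string
of only a few characters gives such a test much less to work with.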

> My figures solidly agree with this claim. You are nevertheless
> dismissing it based on _NO_ evidence.

The evidence is the *same* data, coupled with publicly available
information on the code sequences and code space.

