Re: Transformation of Non-ASCII headers

From: Andrew Gierth (andrew@erlenstar.demon.co.uk)
Date: Sat Feb 22 2003 - 03:30:30 CST


>>>>> "Bruce" == Bruce Lilly <blilly@erols.com> writes:

> Andrew Gierth wrote:
>> Failing to pay attention when your statistical error is explained
>> to you is a good sign that you're not really interested in the
>> truth, and you only want to extract figures that support your
>> preselected position.

 Bruce> I have in fact paid quite close attention; I simply disagree.

In that case I suggest you consult some references on the subject. The
figure that you are using is not the false positive rate, but what is
called the "positive predictive power" of the test - i.e. the
probability that the text was in fact UTF-8 _given that the test said
it was_. The PPP is well known to be highly dependent on the composition
of the original data (the "prevalence" - i.e. the proportion of true
positives in the original population). The false positive rate, on the
other hand, is the probability that the test will say that something is
UTF-8 when in fact it was not, and is not dependent on the prevalence.

While the positive predictive power of a test is very important in
some cases (such as medical screening), it's not really relevant to
the problem at hand, precisely because it is so dependent on the
prevalence. From the perspective of a client author deciding what to
do with untagged data, the question is "how often does applying this
rule cause the client to do the wrong thing" - which is related to the
false positive rate (more generally, to the correct classification
rate, but in this case the false negative rate is zero so the two are
almost the same), _not_ the predictive power.

Here is the results matrix for my original data (where "positive"
is taken as "is UTF-8"):

                 Actual +    Actual -
  Predicted +           9          17
  Predicted -           0      151974

This gives the following characteristics:
  Prevalence                    0.006% (i.e. genuine UTF-8 is vanishingly
                                rare in the sample)
  Correct Classification Rate   99.989%
  False Positive Rate           0.011%
  False Negative Rate           0
  Sensitivity                   100%
  Specificity                   99.989%
  Positive Predictive Power     34.6%
  Negative Predictive Power     100%

To show the difference that the prevalence makes to the results, let's
add back in to the original sample all the strings (817,947 of them)
which had no 8-bit characters at all and are therefore valid utf-8.
This makes the result matrix:

                 Actual +    Actual -
  Predicted +      817956          17
  Predicted -           0      151974
  
  Prevalence                    84.3%
  Correct Classification Rate   99.998%
  False Positive Rate           0.011%
  False Negative Rate           0
  Sensitivity                   100%
  Specificity                   99.989%
  Positive Predictive Power     99.998%
  Negative Predictive Power     100%

Notice that the false-positive rate did not change (but the predictive
power changed drastically).
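
For anyone who wants to check the arithmetic, the figures in both
matrices can be recomputed from the four cell counts. The following is
only a sketch: the function name `rates` and the dictionary keys are
made up for this illustration, and the definitions are the standard
ones for a 2x2 results matrix.

```python
def rates(tp, fp, fn, tn):
    """Standard 2x2 results-matrix statistics (names are illustrative)."""
    total = tp + fp + fn + tn
    return {
        "prevalence": (tp + fn) / total,
        "correct_classification": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),   # independent of prevalence
        "false_negative_rate": fn / (fn + tp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictive_power": tp / (tp + fp),  # depends on prevalence
        "negative_predictive_power": tn / (tn + fn),
    }

# Cell counts taken from the two matrices above.
original = rates(tp=9, fp=17, fn=0, tn=151974)
with_ascii = rates(tp=817956, fp=17, fn=0, tn=151974)

for name in original:
    print(f"{name:28s} {original[name]:11.4%} {with_ascii[name]:11.4%}")
```

Only the "Actual +" column differs between the two calls, so every
statistic computed purely from the "Actual -" column (false positive
rate, specificity) comes out identical, while the predictive power
moves with the prevalence.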

 Bruce> the ratio of false positives is due to the fact that coded
 Bruce> utf-8 generates octet sequences which are not markedly
 Bruce> different from other 8-bit charsets, especially on short
 Bruce> texts.

On the contrary, utf-8 octet sequences _are_ markedly different from
those of other charsets, as can be shown from the fact that a test
exists that selects them with high specificity (see figures above)
even on relatively short texts such as subject lines.
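
To make "a test exists" concrete, here is a minimal sketch of such a
test. It is not the script that produced the figures above; it simply
relies on Python's strict UTF-8 decoder, which is somewhat stricter than
RFC 2279 (it rejects overlong forms, surrogates and 5/6-octet
sequences). The function name `is_valid_utf8` and the sample strings
are invented for the illustration.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True if the octets form well-formed UTF-8 (pure ASCII included)."""
    try:
        raw.decode("utf-8")   # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True

# "cafe" with e-acute, and a German word with u-umlaut and sharp s.
cafe = "caf\u00e9"
gruesse = "Gr\u00fc\u00dfe"

print(is_valid_utf8(cafe.encode("utf-8")))       # genuine UTF-8
print(is_valid_utf8(cafe.encode("latin-1")))     # 0xE9 with no continuation octets
print(is_valid_utf8(gruesse.encode("latin-1")))  # 0xFC is not a valid lead octet here

# For iso-8859-1 text to pass, its accented characters must happen to form
# lead/continuation pairs, e.g. A-tilde followed by the copyright sign.
print(is_valid_utf8("\u00c3\u00a9".encode("latin-1")))
```

The last case shows what a misrecognised iso-8859-1 string would have
to look like: every run of 8-bit octets must consist of a valid lead
octet followed by exactly the right number of octets in the range
0x80-0xBF.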

 Bruce> Detailed in another message recently posted; summary one
 Bruce> expects about a 50% percent error rate with iso-8859-x in the
 Bruce> mix; a bit more with some other charsets as well.

This statement is a complete misinterpretation of the data (and of the
meaning of "error rate" as opposed to "predictive power").

>> To quote RFC2279:
>>
>> UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of
>> octets, where the number of octets, and the value of each, depend on
>> the integer value assigned to the character in ISO/IEC 10646. This
>> transformation format has the following characteristics (all values
>> are in hexadecimal):
>> [...]
>> - UTF-8 strings can be fairly reliably recognized as such by a
>> simple algorithm, i.e. the probability that a string of characters
>> in any other encoding appears as valid UTF-8 is low, diminishing
>> with increasing string length.

 Bruce> N.B. And in many cases, we're only talking about a few
 Bruce> characters, which is a different situation from that described
 Bruce> in that part of 2279, which is an extended run of text.

What RFC 2279 says (and what my data confirms) is that the probability
of mis-recognition is low EVEN FOR SHORT STRINGS, and decreases as the
string gets longer.

The probability is actually worst for short strings in Chinese (or
other double-byte charsets of similar structure) but even there, and
even for very short strings (only one or two non-ASCII characters) the
probability is still fairly low.
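
As a rough illustration of the double-byte case (a counting exercise,
not a measurement), one can enumerate every octet pair of a hypothetical
charset whose lead and trail octets both range over 0xA1-0xFE, roughly
the EUC layout, and count how many pairs happen to be well-formed UTF-8.
The range, the names, and the implicit assumption that all pairs are
equally likely are assumptions of this sketch; real text is far from
uniform.

```python
from itertools import product

def is_valid_utf8(raw: bytes) -> bool:
    """True if the octets form well-formed UTF-8 under Python's strict decoder."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Assumed octet range for both halves of a double-byte character.
OCTETS = range(0xA1, 0xFF)

pairs = [bytes(p) for p in product(OCTETS, repeat=2)]
valid = sum(is_valid_utf8(p) for p in pairs)

print(f"{valid} of {len(pairs)} pairs are valid UTF-8 "
      f"({valid / len(pairs):.1%})")
```

Under these assumptions 930 of the 8836 possible pairs pass, namely
those with a lead octet in 0xC2-0xDF and a trail octet in 0xA1-0xBF.
Each further double-byte character in the string has to line up in the
same way, which is why the chance of a whole string passing falls off
as it gets longer.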

 Bruce> Also note that "short texts" as I said, and which applies to
 Bruce> header field text strings under discussion, is quite different
 Bruce> from 2279's "increasing string length".

Please at least find me _one_ actual subject line taken from the real
world which is recognised as valid utf-8 despite being actually
iso-8859-1. If the probability of such a mis-recognition is so high,
surely you'll find this is easy?

>> My figures solidly agree with this claim. You are nevertheless
>> dismissing it based on _NO_ evidence.

 Bruce> The evidence is the *same* data,

_MY_ data, which you are apparently unable to interpret properly.

 Bruce> coupled with publicly available information on the code
 Bruce> sequences and code space.

Did you not notice that there was NOT ONE SINGLE CASE in my data where
an iso-8859-x string was incorrectly identified as UTF-8? Despite the
fact that significant proportions of the sample came from hierarchies
where untagged iso-8859-x is the norm?

How does this, then, support your claim that the error rate is
significant?

-- 
Andrew.



This archive was generated by hypermail 2b29.