From: J.B. Moreno (planb@newsreaders.com)
Date: Sun Feb 23 2003 - 19:07:27 CST
On 2/23/03 11:41 AM, Bruce Lilly at <blilly@erols.com> wrote:
> Andrew Gierth wrote:
>
>> While the positive predictive power of a test is very important in
>> some cases (such as medical screening), it's not really relevant to
>> the problem at hand, precisely because it is so dependent on the
>> prevalence.
>
> It is absolutely relevant to the problem, since it bears on
> whether treating all untagged 8-bit content which matches
> utf-8 octets as utf-8 is accurate.
You consistently ignore the fact that there are two questions involved, and
the answers to the individual questions depend upon two entirely different
things.
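To illustrate why the two answers depend on different things, here is a minimal Python sketch of how positive predictive value moves with prevalence while the error rate of the test itself stays fixed. The function name and the numbers are assumptions for illustration only: the false-positive rate of 1 in 10,000 is one reading of the figure discussed in this thread, and a sensitivity of 1.0 assumes that well-formed UTF-8 always passes a syntax check.

```python
def positive_predictive_value(sensitivity, false_positive_rate, prevalence):
    """Fraction of positive test results that are true positives (Bayes' rule).

    sensitivity: P(test says UTF-8 | message is UTF-8)
    false_positive_rate: P(test says UTF-8 | message is not UTF-8)
    prevalence: P(message is UTF-8) among untagged 8-bit messages
    """
    true_positives = sensitivity * prevalence
    false_positives = false_positive_rate * (1.0 - prevalence)
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


# Assumed figures, not measurements: the test's error rate is held fixed
# while only the prevalence of genuine UTF-8 changes.
FALSE_POSITIVE_RATE = 1e-4
for prevalence in (0.5, 0.01, 0.0001):
    ppv = positive_predictive_value(1.0, FALSE_POSITIVE_RATE, prevalence)
    print(f"prevalence={prevalence}: PPV={ppv:.4f}")
```

The false-positive rate is a property of the test and of non-UTF-8 data alone; the predictive value additionally depends on how much genuine UTF-8 is in circulation.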
Question number one: Is non-UTF8 going to be frequently misidentified as
UTF8? To answer that question we gathered a lot of real world data and
analyzed it (note that I said *real* world data: we didn't make up the data
set, it's what is actually floating around), and came to the conclusion that
misidentification would happen roughly once in every 10,000 messages with
8-bit characters.
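As a rough sketch of the kind of measurement involved (not the actual procedure or data used for the figure above), the check can be expressed in Python as follows. The function names and sample byte strings are hypothetical; the strict `utf-8` codec stands in for "matches the syntax of UTF8".

```python
def has_8bit(data: bytes) -> bool:
    """True if the message contains at least one octet with the high bit set."""
    return any(b >= 0x80 for b in data)


def matches_utf8_syntax(data: bytes) -> bool:
    """True if the octets decode as well-formed UTF-8 under strict decoding."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def misidentification_rate(non_utf8_messages) -> float:
    """Fraction of known non-UTF-8 messages with 8-bit octets that
    nevertheless pass the UTF-8 syntax check."""
    candidates = [m for m in non_utf8_messages if has_8bit(m)]
    if not candidates:
        return 0.0
    hits = sum(1 for m in candidates if matches_utf8_syntax(m))
    return hits / len(candidates)


# Hypothetical samples, all encoded as ISO-8859-1 rather than UTF-8.
samples = [
    "caf\u00e9".encode("latin-1"),    # lone 0xE9: not valid UTF-8
    "\u00c3\u00a9".encode("latin-1"),  # 0xC3 0xA9: happens to be valid UTF-8
    b"plain ASCII only",               # no 8-bit octets, so not counted
]
print(misidentification_rate(samples))
```

Pure ASCII messages are excluded from the denominator because the figure quoted above is stated per message with 8-bit characters.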
Question number two: Is it wise to /treat/ text that matches the syntax of
UTF8 as UTF8?
There's little point in talking to you about question number two until such
time as you admit that question number one is a valid question and that we
have correctly answered it.
-- J.B. Moreno