From: J.B. Moreno (planb@newsreaders.com)
Date: Sun Feb 16 2003 - 21:05:58 CST
On 2/16/03 6:57 PM, Bruce Lilly at <blilly@erols.com> wrote:
First off, please stop cc'ing me; it's annoying getting two identical
messages.
> J.B. Moreno wrote:
>
>>> 152,000 have at least one 8bit character, of which
>>> 26 match the utf-8 rule, of which
>>> 17 appear to be false matches
>>
>> So, 26/152,000 = .017% 17/152,000 = .011%
>
> No, no, no; the rate of false positives (again assuming that one knows the
> real charset) is the ratio of the false matches to the total matching the
> utf-8 rule, or 17 / 26 which is greater than 65%.
-snip-
>> So we have a range of between .011% all the way up to .019% for false
>> positives.
>
> No, that's a range of meaningless ratios, not a percentage of false positives,
> which for the two sets of figures given is substantial, around 50% or about as
> good as flipping a fair coin.
No, YOUR ratio is meaningless -- the question is "does the test correctly
identify or reject UTF8", and the answer is: 151,983 times out of 152,000 it
did, i.e. the test was correct about 99.989% of the time.
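For readers following the arithmetic, the competing ratios in this exchange can be recomputed from the quoted figures. This is a minimal Python sketch; the variable names are illustrative and not from the thread, and it assumes, as the 151,983 figure does, that the 17 false matches are the only errors the test made (no genuine UTF8 was rejected).

```python
# Figures quoted in the thread.
total_8bit = 152_000      # articles with at least one 8bit character
matched = 26              # articles matching the utf-8 rule
false_matches = 17        # matches that appear not to be UTF8

# Assumption: the false matches are the only errors the test made.
genuine_utf8 = matched - false_matches

accuracy = (total_8bit - false_matches) / total_8bit
false_share_of_matches = false_matches / matched
utf8_share_of_articles = genuine_utf8 / total_8bit

print(f"test correct overall:      {accuracy:.4%}")
print(f"false share of matches:    {false_share_of_matches:.1%}")
print(f"genuine UTF8 per article:  {utf8_share_of_articles:.4%}")
```

The first two numbers answer different questions, which is why the two sides of the argument arrive at such different percentages from the same data.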
The ratio you are focused on only shows that currently it is relatively
useless to ask that question -- and even then, you've got the wrong end of
the stick; instead of asking what percentage of valid-seeming UTF8 is really
valid, programmers will be asking themselves how often UTF8 appears instead
of the local charset (i.e. instead of 17/26, it's 9/152,000).
If by blessing UTF8 we cause more UTF8 to be posted, then that 9 will go up.
And since any increase in the number of UTF8 articles will /decrease/ the
ratio you seem to think is important, any significant usage of raw UTF8
will change that ratio enough so that it likewise isn't a problem (i.e. if
just 1% of the people using a local charset switch to UTF8 then the ratio you
are focused on would be roughly 17/1,520, or about 1.1%).
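That projection can be sketched in Python. The sketch assumes the number of false matches stays at 17 while genuine UTF8 grows, which is a simplification, and the function name is hypothetical. Counting the false matches in the denominator gives 17/1,537 rather than 17/1,520, which makes little difference at this scale.

```python
def false_share_of_matches(false_matches: int, genuine_utf8: int) -> float:
    """Fraction of articles matching the utf-8 rule that are not really UTF8."""
    return false_matches / (false_matches + genuine_utf8)

today = false_share_of_matches(17, 9)          # figures quoted in the thread
projected = false_share_of_matches(17, 1_520)  # 1% of 152,000 articles in UTF8

print(f"today:     {today:.1%}")
print(f"projected: {projected:.2%}")
```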
So, either the status quo continues, and any 8 bit char should be assumed to
be in the current charset, or people start using UTF8, in which case
checking for UTF8 won't MISLABEL "local" as UTF8 often enough to be a
problem.
Now, the nice thing about this is that we can say today "check for UTF8, if
positive then treat it as UTF8" without that being bad advice -- even if the
check *never* encounters genuine UTF8 (only local-charset text that happens
to look like it), it wouldn't /think/ it does often enough to be a problem.
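As a rough illustration of that advice, here is a Python sketch. It uses strict UTF-8 decoding as a stand-in for "the utf-8 rule" discussed in the thread, which may have been defined differently; the function name and the ISO-8859-1 default for the local charset are assumptions made for the example.

```python
from typing import Tuple

def decode_body(raw: bytes, local_charset: str = "iso-8859-1") -> Tuple[str, str]:
    """Return (text, charset), preferring UTF-8 when the bytes are valid UTF-8."""
    if raw.isascii():
        # No 8bit characters, so there is nothing to decide.
        return raw.decode("ascii"), "us-ascii"
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the local charset.
        return raw.decode(local_charset, errors="replace"), local_charset

print(decode_body(b"caf\xc3\xa9"))  # UTF-8 encoded text
print(decode_body(b"caf\xe9"))      # ISO-8859-1 encoded text
# A false match: ISO-8859-1 bytes that also happen to be valid UTF-8.
print(decode_body(b"\xc3\xa9"))
```

The last call is the kind of case counted as a false match above: two ISO-8859-1 characters whose bytes form one well-formed UTF-8 sequence.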
>> It's quite clear on two points: raw utf8 usage is extremely low, false
>> positives on checks for utf8 is likewise extremely low.
-snip-
> And it's clear that those quoting < 0.1 % "false positive"
> ratios are quoting the wrong numbers. Which is not surprising
> given a) the religious fervor involved, and b) the reality of
> small bits of text, where one would expect a high false positive
> rate (anything under 20% would be suspect).
Whether it's the wrong ratio or not depends upon the question. If the
question is "can it tell the difference between the local charset and UTF8",
the answer is a resounding "yes!"; if the question is "is it currently
worthwhile to make such a test for UTF8", the answer is an equally
resounding "no!" (and the ratio of looks-like / actually-is isn't the
reason why; at most that's the icing on the cake. The ratio of utf8 to local
charset is why).
But the answer to the second question depends upon how many UTF8 articles
are posted; by blessing UTF8 we will (hopefully) cause that number to change.
-- J.B. Moreno