From: Jean-Marc Desperrier (jean-marc.desperrier@certplus.com)
Date: Tue Jun 25 2002 - 03:55:24 CDT
J.B. Moreno wrote:
>On 6/24/02 11:56 AM, Charles Lindsey at <chl@clw.cs.man.ac.uk> wrote:
>
>
>>"J.B. Moreno" <planb@newsreaders.com> writes:
>>
>>
>>>Theoretically all of those should be instances where raw UTF8 is being used,
>>>so we have our baseline -- if more than 1-2% of them are not UTF8 we have a
>>>problem with just using raw UTF8.
>>>
1-2 % is way too much. The acceptable limit is closer to 0.1 %. Your
number works out to around 0.005 % false positives.
>Andrew took the time to do this, and out of 91610 only 49 appeared to be
>UTF8, of which only 4 were false positives. So, even 1% is off by several
>order of magnitudes. Given a group with steady traffic (say 300 articles a
>day) that's only one error every 10 months. Well within what I'd consider
>an acceptable failure rate for something the user can probably manually
>override with a single command.
>
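For reference, the arithmetic on the quoted figures can be sketched as below. Nothing here is measured independently: 91610, 4 and 300 articles a day all come from the quote, and the variable names are mine. If I read the figures right, the interval comes out nearer two and a half months than ten, which is still far under the 0.1 % mark.

```python
# Figures taken from the quoted message; nothing is measured independently.
total_articles = 91610
false_positives = 4
articles_per_day = 300  # the "steady traffic" example from the quote

rate_percent = 100 * false_positives / total_articles
articles_per_error = total_articles / false_positives
days_per_error = articles_per_error / articles_per_day

print(f"false-positive rate: {rate_percent:.4f} %")
print(f"one false positive per {articles_per_error:.0f} articles")
print(f"about {days_per_error:.0f} days between false positives")
```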
Out of the 91610, how many contained bytes of 0x80 or above?
I don't know what his feed is, or whether the non-US hierarchies are properly represented.
That said, I'm not surprised by the number.
European charsets usually use non-ASCII characters one at a time, with ASCII on either side.
So they cannot form a valid UTF-8 sequence.
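To illustrate the point, here is a minimal sketch of the kind of check being discussed, assuming the test is simply "the line has high bytes and decodes as strict UTF-8". The function name looks_like_utf8 and the sample byte strings are my own, not taken from any newsreader.

```python
def looks_like_utf8(line: bytes) -> bool:
    """True if the line has bytes >= 0x80 and the whole line is strict UTF-8."""
    if all(b < 0x80 for b in line):
        return False  # pure ASCII: nothing to detect
    try:
        line.decode("utf-8")  # Python's decoder is strict by default
    except UnicodeDecodeError:
        return False
    return True

# Latin-1 text: each accented letter is a lone high byte next to ASCII.
latin1_word = b"caf\xe9"          # "cafe" with e-acute, in Latin-1
latin1_phrase = b"d\xe9j\xe0 vu"  # two lone high bytes, each followed by ASCII

# The same word in real UTF-8: lead byte 0xC3, continuation byte 0xA9.
utf8_word = b"caf\xc3\xa9"

# The unlucky case: two adjacent Latin-1 characters (A-tilde, copyright sign)
# that happen to form a valid UTF-8 sequence.
unlucky = b"\xc3\xa9"

for sample in (latin1_word, latin1_phrase, utf8_word, unlucky):
    print(sample, looks_like_utf8(sample))
```

Only the last sample is a false-positive candidate, and it needs two high bytes side by side in exactly the right ranges.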
Asian charsets use a lot of bytes above 0x80.
But this does not raise the risk of a false positive, because to get one, *all* the byte pairs on the line must be valid UTF-8.
The more pairs of bytes over 0x80 you have, the lower the risk of this happening.
Also, a good number of Asian charsets are ISO-based and do not use the range between 0x80 and 0x9F, which is half of the UTF-8 continuation bytes (0x80 to 0xBF).
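A rough way to put numbers on this is to enumerate every pair of high bytes and count how many decode as UTF-8. This is only a sketch under loud assumptions: it treats each pair as independent and uniformly distributed, which real text is not, and it ignores three-byte and longer UTF-8 sequences. The helper names are mine.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def count_valid_pairs(lo: int, hi: int):
    """Count two-byte strings, both bytes in [lo, hi], that are valid UTF-8."""
    values = range(lo, hi + 1)
    valid = sum(is_valid_utf8(bytes([a, b])) for a in values for b in values)
    return valid, len(values) ** 2

# All high bytes, then only those outside 0x80-0x9F.
all_high = count_valid_pairs(0x80, 0xFF)
no_c1_range = count_valid_pairs(0xA0, 0xFF)

for label, (valid, total) in (("0x80-0xFF", all_high), ("0xA0-0xFF", no_c1_range)):
    p = valid / total
    print(f"{label}: {valid}/{total} pairs valid ({p:.1%})")
    # Assumption: pairs are independent, so n pairs all pass with chance p**n.
    for n in (1, 5, 10):
        print(f"  {n} pairs all valid: {p ** n:.2e}")
```

Under those assumptions, only about one random pair in ten passes, so a line with many pairs is very unlikely to pass as a whole.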