Re: Subject header statistics


From: Jean-Marc Desperrier (jean-marc.desperrier@certplus.com)
Date: Tue Jun 25 2002 - 03:55:24 CDT


J.B. Moreno wrote:

>On 6/24/02 11:56 AM, Charles Lindsey at <chl@clw.cs.man.ac.uk> wrote:
>
>
>>"J.B. Moreno" <planb@newsreaders.com> writes:
>>
>>
>>>Theoretically all of those should be instances where raw UTF8 is being used,
>>>so we have our baseline -- if more than 1-2% of them are not UTF8 we have a
>>>problem with just using raw UTF8.
>>>
1-2% is way too much. An acceptable limit is closer to 0.1%. Your
number works out to around 0.005% false positives.
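
As a quick check of that figure, here is the arithmetic using the counts quoted just below (4 false positives out of 91610 headers). The variable names are only for illustration, and the 0.1% limit is my own rule of thumb from above, not an established threshold.

```python
# Counts as reported in the quoted message; nothing here is measured by me.
false_positives = 4
total_headers = 91610

# False-positive rate expressed as a percentage of all headers examined.
rate_percent = 100.0 * false_positives / total_headers

# The limit suggested above (an assumption, not a standard).
acceptable_percent = 0.1

# How many times smaller the observed rate is than that limit.
margin = acceptable_percent / rate_percent

print(f"false-positive rate: {rate_percent:.4f}%")
print(f"below the {acceptable_percent}% limit by a factor of about {margin:.0f}")
```

So the observed rate is well under the 0.1% limit, provided the sample is representative.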

>Andrew took the time to do this, and out of 91610 only 49 appeared to be
>UTF8, of which only 4 were false positives. So, even 1% is off by several
>order of magnitudes. Given a group with steady traffic (say 300 articles a
>day) that's only one error every 10 months. Well within what I'd consider
>an acceptable failure rate for something the user can probably manually
>override with a single command.
>
Out of 91610, how many contained bytes of 0x80 or above?

I don't know what his feed is, or whether non-US hierarchies are properly represented.

That said, I'm not surprised by the number.
European charsets usually use non-ASCII characters one at a time, surrounded by ASCII.
A lone byte of 0x80 or above can never form a valid UTF-8 sequence, since every non-ASCII character in UTF-8 takes at least two such bytes.
Asian charsets use a lot of bytes above 0x80.
But this does not raise the risk of false positives, because to get a false positive, *all* the high-byte sequences on the line must be valid UTF-8.
The more pairs of bytes over 0x80 there are, the lower the risk of that happening.
Also, a good number of Asian charsets are ISO based and do not use the range between 0x80 and 0x9F, which is half of the range UTF-8 uses for continuation bytes (0x80 to 0xBF).
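
To make the heuristic concrete, here is a minimal sketch of the test being discussed: treat a header as UTF-8 only if it contains at least one byte of 0x80 or above and the whole line decodes as strict UTF-8. The function name `looks_like_utf8` is hypothetical, and this is not necessarily how Andrew ran his count.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Guess whether a raw header line is UTF-8 (illustrative helper only)."""
    # Pure ASCII carries no information about the charset, so do not count it.
    if all(b < 0x80 for b in raw):
        return False
    try:
        # Strict decoding fails on any stray lead or continuation byte.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Genuine UTF-8: "cafe" with e-acute encoded as the two bytes C3 A9.
print(looks_like_utf8("caf\u00e9".encode("utf-8")))    # True

# ISO-8859-1: the same word has a lone E9 byte, which is not valid UTF-8.
print(looks_like_utf8("caf\u00e9".encode("latin-1")))  # False

# A false positive: the ISO-8859-1 characters A-tilde and copyright sign
# happen to be the bytes C3 A9, which is also valid UTF-8.
print(looks_like_utf8(b"\xc3\xa9"))                    # True

# An EUC-style byte run where both bytes of each pair are A1 or above:
# the second byte of the first pair (FC) is not a continuation byte.
print(looks_like_utf8(b"\xc6\xfc\xcb\xdc\xb8\xec"))    # False
```

The third case is the kind of rare coincidence that produces a false positive. The fourth shows why text full of high bytes is rejected on the first bad pair.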




This archive was generated by hypermail 2b29.