Re: Transformation of Non-ASCII headers


From: J.B. Moreno (planb@newsreaders.com)
Date: Sun Feb 16 2003 - 14:47:39 CST


On 2/16/03 11:23 AM, Bruce Lilly at <blilly@erols.com> wrote:

> J.B. Moreno wrote:
>
>> Andrew Gierth did an analysis for us last year (June 02), the results were
>> that it *could* be reliably detected in short runs of text typical of header
>> fields (specifically, the Subject): the instances of false positives was
>
> One cannot identify false positives without some indication of
> the true charset, which is necessarily absent in untagged data.

Sounds like you didn't read Andrew's message, because (1) the hierarchy gives
some indication of the true charset, and (2) even ignoring that wouldn't make
that much of a difference.

> Subject: Subject Line Data
> From: Andrew Gierth <andrew@erlenstar.demon.co.uk>
> Message-ID: <87it4663qk.fsf@erlenstar.demon.co.uk>
>
> ok, this time I saved more info about the subject lines.
-snip-
> About 2.5 days of data:
>
> 3,471,366 subject lines, of which
> 969,947 are in non-binary groups, of which
> 152,000 have at least one 8bit character, of which
> 26 match the utf-8 rule, of which
> 17 appear to be false matches

So 26/152,000 = .017% matched the utf-8 rule, and 17/152,000 = .011% appear
to be false matches.

> Subject: Subject header statistics
> From: Andrew Gierth <andrew@erlenstar.demon.co.uk>
> Message-ID: <87u1nsbsgp.fsf@erlenstar.demon.co.uk>
>
> Out of 91610 subject headers containing 8-bit (just under a day's worth),
> only 49 matched this (Perl) regexp:
-snip-
> 31 of them were a binary series with an English subject
> line in which the word "Can't" had been spelled with U+00B4 (acute
> accent)

49-31=18 (I'll assume that you'll consider the above as sufficient
identification of those 31 at least) and 18/91610=.020%.

So we have a range from .011% all the way up to .020% for false positives.
This is based on less than 300 thousand headers at most. You may, if you
wish, say that is not sufficient for a conclusion, and request that Andrew
re-run his test (I wouldn't mind seeing it done daily for 2-4 weeks, if it's
not too much trouble for him), but to say the data isn't clear in what it
says is simply wrong.
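
As a rough check of the arithmetic, here is a minimal Python sketch. The
counts are taken straight from the two quoted messages; the function and
variable names are illustrative only and appear nowhere in Andrew's posts.

```python
def rate_percent(hits: int, total: int) -> float:
    """Return hits as a percentage of total."""
    return 100.0 * hits / total

# Sample 1 (about 2.5 days), figures from the first quoted message.
s1_eight_bit = 152_000   # non-binary-group subjects with an 8-bit character
s1_matches = 26          # matched the utf-8 rule
s1_false = 17            # appeared to be false matches

# Sample 2 (just under a day), figures from the second quoted message.
s2_eight_bit = 91_610    # subjects containing 8-bit
s2_matches = 49          # matched the regexp
s2_accent = 31           # the "Can't" series spelled with U+00B4
s2_remaining = s2_matches - s2_accent

rates = {
    "sample 1, all matches": rate_percent(s1_matches, s1_eight_bit),
    "sample 1, false matches": rate_percent(s1_false, s1_eight_bit),
    "sample 2, remaining matches": rate_percent(s2_remaining, s2_eight_bit),
}

for label, value in rates.items():
    print(f"{label}: {value:.4f}%")

print("8-bit headers examined:", s1_eight_bit + s2_eight_bit)
```

The three rates come out at roughly 0.0171%, 0.0112% and 0.0196%, over
243,610 8-bit headers in total.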

It's quite clear on two points: raw utf8 usage is extremely low, and false
positives on checks for utf8 are likewise extremely low.

(That the two rates are so close actually makes the analysis easier, not
harder.)
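
To make "a check for utf8" concrete, here is a hedged Python sketch. It is
NOT Andrew's rule (his Perl regexp is snipped above and is not reproduced
here); it assumes a simpler test: the header contains at least one 8-bit
byte and the whole byte string decodes as strict UTF-8. The name
looks_like_utf8 is hypothetical.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Guess whether an untagged 8-bit header is UTF-8 (assumed rule)."""
    if all(b < 0x80 for b in raw):
        return False  # pure ASCII: nothing to detect
    try:
        raw.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True

# "Can't" spelled with U+00B4 (acute accent), sent as UTF-8: detected.
accent_utf8 = "Can\u00b4t".encode("utf-8")

# The same text sent as ISO-8859-1: a lone 0xB4 byte is not valid UTF-8.
accent_latin1 = "Can\u00b4t".encode("latin-1")

# ISO-8859-1 text whose bytes happen to form a valid UTF-8 sequence
# (0xC3 0xA9): this is what a false positive looks like.
coincidence = "\u00c3\u00a9".encode("latin-1")

for sample in (accent_utf8, accent_latin1, coincidence):
    print(sample, looks_like_utf8(sample))
```

A false positive needs legacy-charset bytes that happen to line up as a
lead byte followed by the right number of continuation bytes, which is
uncommon in real text.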

-- 
J.B. Moreno




This archive was generated by hypermail 2b29.