From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Mon Feb 17 2003 - 09:26:46 CST
In <3E502557.7010804@Sonietta.blilly.com> Bruce Lilly <blilly@erols.com> writes:
>J.B. Moreno wrote:
>>>152,000 have at least one 8bit character, of which
>>>26 match the utf-8 rule, of which
>>>17 appear to be false matches
>>
>>
>> So, 26/152,000 = .017% 17/152,000 = .011%
>No, no, no; the rate of false positives (again assuming
>that one knows the real charset) is the ratio of the
>false matches to the total matching the utf-8 rule, or
>17 / 26 which is greater than 65%.
What utter rubbish! You do not strike me as being a particularly stupid
person, so I can only suppose that you are being deliberately obtuse.
So here it is again, in words of one syllable.
There is a test which, given a sequence of octets, will answer the
question "does this appear to be valid UTF-8" with a straightforward "yes"
or "no". The test is not perfect.
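
As a rough illustration of such a test, here is a minimal Python sketch. It
assumes that "appears to be valid UTF-8" simply means "decodes under a strict
UTF-8 decoder"; the names looks_like_utf8 and has_8bit and the sample octet
strings are made up for this example, and this is not necessarily the exact
rule that produced the figures quoted above.

```python
def has_8bit(octets: bytes) -> bool:
    """True if at least one octet has its high bit set (i.e. is not ASCII)."""
    return any(b >= 0x80 for b in octets)


def looks_like_utf8(octets: bytes) -> bool:
    """Answer "yes" (True) or "no" (False): do the octets decode as strict UTF-8?"""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample headers, for illustration only.
genuine = "Grüße".encode("utf-8")       # real UTF-8: the test must say "yes"
latin1 = "Grüße".encode("latin-1")      # ISO-8859-1 octets: the test says "no"
false_match = "Ã©".encode("latin-1")    # ISO-8859-1 text that happens to be valid UTF-8

for sample in (genuine, latin1, false_match):
    print(sample, has_8bit(sample), looks_like_utf8(sample))
```

The last sample shows how an imperfect "yes" can arise: the two ISO-8859-1
octets C3 A9 also form a well-formed UTF-8 sequence, so a reader who trusts
the test would see a single "é" instead of what the poster intended.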
It is proposed that people who have a desire to read certain headers that
are non-compliant with any current standard, and will still be
non-compliant with Usefor as currently proposed, might wish to use this
test to decide how best to interpret the sequence of octets in some
header that has arrived.
We are only interested in headers with at least one 8-bit octet, because
all other cases are already known to be ASCII (that also includes cases
properly encoded using RFC 2047).
Sometimes the test reports "no". (In fact, on the present Usenet, it will
nearly always report "no".)
No genuine UTF-8 header will ever report "no". I.e. the false negative
rate is zero.
Sometimes the test reports "yes" (not often on the present Usenet), in
which case the user will attempt to interpret it as UTF-8, and sometimes
that will lead to him seeing garbage.
Now it appears that, out of every 152,000 (or thereabouts) cases where it
ought to report "no", it actually reports "yes" about 18 times. In other
words, when it is _supposed_ to report "no", it falsely reports "yes"
0.012% of the time. Those cases are called the "false positives".
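
For anyone who wants to check the two competing ratios, here is a small
Python sketch. It uses the counts quoted at the top of this message (152,000
headers with an 8-bit character, 26 matches, 17 apparent false matches) and
the "about 18" figure used above; the function names are made up for the
example, and the 152,000 is used as an approximation to the number of cases
that ought to report "no".

```python
def false_positive_rate(false_yes: int, should_be_no: int) -> float:
    """Fraction of the cases that ought to report "no" but report "yes"."""
    return false_yes / should_be_no


def false_share_of_matches(false_yes: int, total_yes: int) -> float:
    """Fraction of all "yes" answers that turn out to be wrong."""
    return false_yes / total_yes


HEADERS_WITH_8BIT = 152_000   # approximates the cases that should say "no"
MATCHES = 26                  # headers matching the utf-8 rule
FALSE_MATCHES = 17            # matches that appear to be false

print(f"{false_positive_rate(FALSE_MATCHES, HEADERS_WITH_8BIT):.3%}")  # 0.011%
print(f"{false_positive_rate(18, HEADERS_WITH_8BIT):.3%}")             # 0.012%
print(f"{false_share_of_matches(FALSE_MATCHES, MATCHES):.1%}")         # 65.4%
```

The first two figures answer "how often does a header that is not UTF-8 get
a yes?"; the third answers a different question, namely "of the headers that
got a yes, how many were wrong?".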
Ain't simple arithmetic marvellous!
And, if a user (who, remember, has deliberately chosen to ignore the
standard) should ever see a Subject-header garbled for that reason, then
he will have to read, on average, roughly 8,400 further non-compliant
Subject headers (152,000/18) before he encounters another one.
-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133 Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5