From: Bruce Lilly (blilly@erols.com)
Date: Sun Feb 23 2003 - 10:41:12 CST
Andrew Gierth wrote:
> While the positive predictive power of a test is very important in
> some cases (such as medical screening), it's not really relevant to
> the problem at hand, precisely because it is so dependent on the
> prevalence.
It is absolutely relevant to the problem, since it bears on
whether treating all untagged 8-bit content which matches
utf-8 octets as utf-8 is accurate.
That may be a moot issue, as it is becoming quite clear that
untagged raw utf-8 isn't going to be legal for article
generation in any IETF Usenet article format standard at this
time.
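For reference, the dependence of positive predictive power on
prevalence follows from Bayes' rule. The following is only a minimal
sketch: the function name and the sensitivity, specificity and
prevalence values are hypothetical illustrations, not figures from
this thread.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that content flagged as utf-8 really is utf-8.

    All three inputs are fractions in [0, 1]; the values used below
    are hypothetical and chosen only to show the effect of prevalence.
    """
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Same hypothetical test, two hypothetical prevalences of utf-8.
    for prevalence in (0.5, 0.01):
        ppv = positive_predictive_value(0.99, 0.99, prevalence)
        print(f"prevalence={prevalence:.2f} ppv={ppv:.3f}")
```

With the same test, the predictive value falls as utf-8 becomes rarer
among the untagged content, which is why the prevalence matters to
whether the guess is accurate.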
> Bruce> the ratio of false positives is due to the fact that coded
> Bruce> utf-8 generates octet sequences which are not markedly
> Bruce> different from other 8-bit charsets, especially on short
> Bruce> texts.
>
> on the contrary, utf-8 octet sequences _are_ markedly different from
> those of other charsets, as can be shown from the fact that a test
> exists that selects them with high specificity (see figures above)
> even on relatively short texts such as subject lines.
All valid two-octet utf-8 sequences begin with one octet in the range
0xc2 - 0xdf, followed by one octet in the range 0x80 - 0xbf.
The widely used iso-8859 charsets also include all of 0xc0 - 0xdf
and half of the 0x80 - 0xbf range, viz 0xa0 - 0xbf. Therefore
fully half of all valid two-octet utf-8 sequences are also valid
iso-8859 sequences.
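The overlap for two-octet sequences can be checked by enumeration.
This is a rough sketch rather than a definitive test: it assumes that
an octet counts as plausible iso-8859 text when it lies in
0xa0 - 0xff (true of every position in iso-8859-1; some other parts
leave a few positions unassigned), and the function name is
hypothetical.

```python
def count_two_octet_overlap():
    """Return (total, overlap) for two-octet utf-8 sequences.

    total   = number of code points encoded as exactly two octets
    overlap = how many of those use only octets in 0xa0 - 0xff,
              the range assumed here to be iso-8859 graphic characters
    """
    total = 0
    overlap = 0
    # U+0080 .. U+07FF are exactly the code points that need two octets.
    for code_point in range(0x80, 0x800):
        encoded = chr(code_point).encode("utf-8")
        total += 1
        if all(0xA0 <= octet <= 0xFF for octet in encoded):
            overlap += 1
    return total, overlap


if __name__ == "__main__":
    total, overlap = count_two_octet_overlap()
    print(total, overlap, overlap / total)
```

The lead octet always falls in the iso-8859 graphic range, so the
result turns entirely on whether the trailing octet lands in
0xa0 - 0xbf or in 0x80 - 0x9f.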
Anyway, as raw utf-8 seems to be a moot issue, I suggest that we
move on to more productive topics.