From: J.B. Moreno (planb@newsreaders.com)
Date: Tue Feb 18 2003 - 00:38:44 CST
On 2/17/03 10:26 AM, Charles Lindsey at <chl@clw.cs.man.ac.uk> wrote:
> There is a test which, given a sequence of octets, will answer the
> question "does this appear to be valid UTF-8" with a straightforward "yes"
> or "no". The test is not perfect.
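As a rough illustration of such a yes/no test, here is a minimal sketch in Python. It assumes that a strict decode is an acceptable stand-in for the test being described, which may not be the one Charles has in mind; the function name looks_like_utf8 is made up for this example.

```python
def looks_like_utf8(octets: bytes) -> bool:
    """Answer "does this appear to be valid UTF-8" with yes or no.

    Assumption: a strict decode is used as the check. Pure ASCII
    passes trivially, since ASCII is a subset of UTF-8.
    """
    try:
        octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Polish text encoded as UTF-8 passes.
print(looks_like_utf8("za\u017c\u00f3\u0142\u0107".encode("utf-8")))

# The same text in ISO-8859-2 fails: 0xBF cannot start a sequence.
print(looks_like_utf8(b"za\xbf\xf3\xb3\xe6"))

# Why the test is not perfect: these two octets are valid UTF-8,
# but they are also a plausible two-character Latin-1 string.
print(looks_like_utf8(b"\xc3\xa9"))
```

The last call shows the kind of false positive that makes the answer "appears to be" rather than "is".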
Speaking of which -- on re-reading Andrew's last analysis, it seems that
there were relatively few articles that could even /possibly/ be confused
with UTF-8 (most had no bytes in the 0x80-0x9f range).
Out of 152,000 articles, 303 had bytes in that range -- it would be useful
to know if there are any groups where they regularly appear (and the number
of posters doing so).
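A per-group breakdown like that could be sketched as follows. This is only an illustration under stated assumptions: the articles are assumed to be available as (newsgroups, poster, body) tuples, and the names has_c1_bytes and tally_by_group are hypothetical, not part of Andrew's tooling.

```python
from collections import defaultdict


def has_c1_bytes(body: bytes) -> bool:
    """True if any octet of the body falls in the 0x80-0x9f range."""
    return any(0x80 <= b <= 0x9F for b in body)


def tally_by_group(articles):
    """Map each group to (matching articles, distinct posters).

    `articles` is an iterable of (newsgroups, poster, body) tuples,
    where newsgroups is a list of group names and body is bytes.
    This input shape is an assumption made for the sketch.
    """
    counts = defaultdict(int)
    posters = defaultdict(set)
    for groups, poster, body in articles:
        if not has_c1_bytes(body):
            continue
        # A crossposted article counts once in every group it was sent to.
        for group in groups:
            counts[group] += 1
            posters[group].add(poster)
    return {g: (counts[g], len(posters[g])) for g in counts}
```

Groups where the count is high but the number of distinct posters is low would point to a few individuals rather than a common local practice.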
Of course, it's probably a bit too late to ask this of the previous test;
it would have to be for a new run...
(For my own part, it seems that 12 articles matching the utf8 regex were
posted yesterday, all in the pl hierarchy, under 4 different subjects; I'm
not sure whether they're really UTF-8.)
-- J.B. Moreno