From: J.B. Moreno (planb@newsreaders.com)
Date: Sat Feb 22 2003 - 10:29:06 CST
On 2/21/03 9:08 PM, Bruce Lilly at <blilly@erols.com> wrote:
> J.B. Moreno wrote:
>> On 2/18/03 2:45 AM, Mark Crispin at <mrc@cac.washington.edu> wrote:
>>
>>
>>> That's correct, and that pretty much shoots down the use of a test to
>>> determine if something is UTF-8. The test can prove that text is not
>>> UTF-8 (assuming that the UTF-8 wasn't somehow damaged in transit), but it
>>> does not reliably prove that text is UTF-8.
>>
>>
>> No, it doesn't.
>
> So we're all agreed; the test doesn't reliably prove that text is utf-8.
I've been polite and have not intentionally misread your statements; please
extend the same courtesy to me. You know that when I said no, I was
referring to "pretty much shoots down the use of a test".
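For concreteness, the kind of test being argued about can be sketched roughly
as below. This is only a minimal Python illustration; the function name
looks_like_utf8 is made up for this example and is not taken from any draft
or implementation discussed in this thread. It shows both halves of the
point: a failure proves the text is not UTF-8, while a pass proves nothing.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8.

    False proves the bytes are not UTF-8; True only means they could be.
    (Hypothetical helper, named for illustration.)
    """
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# ISO-8859-1 "cafe" with an acute accent: the lone 0xE9 byte is an
# incomplete UTF-8 sequence, so the test correctly rejects it.
latin1_word = b"caf\xe9"

# ISO-8859-1 text consisting of 0xC3 0xA9 (A-tilde, copyright sign): these
# two bytes happen to form a valid UTF-8 sequence, so the test passes even
# though the text was not UTF-8. This is the false-positive case.
latin1_false_positive = b"\xc3\xa9"

print(looks_like_utf8(latin1_word))
print(looks_like_utf8(latin1_false_positive))
```

How often the second case occurs in real local-charset text is the open
question; the sketch only shows that it can occur.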
> <cooked numbers snipped>
Those numbers are not cooked -- since I was talking percentages, I did one
thing and one thing only with those numbers, and by no stretch of the
imagination could that be called "cooking"; I rounded it off to the nearest
1/2 of a percent (I meant to say nearest 10th of a percent, but either way
it's not "cooking").
>> That's the situation as it stands today. But we hope to change A2 by
>> encouraging people to switch over from the "local" charset to UTF8, *if*
>> that happens then it *will* help because there will actually *be* articles
>> identified as UTF8.
>
> There are some now.
In absolute terms, yes; percentage-wise they don't make up enough to worry
about.
>> Since the percentage of "incorrectly identified as
>> UTF8" will NOT go up, but instead go down
>
> No, about half of those identified as "utf-8" will in fact not be
> utf-8, over a wide range.
How do you get that percentage? If 1% of the people using the local
charset switched over to using UTF8, then instead of being "half" it'd be
10% or less. If 2% switch over it's 5%, if 4% it's 2.5%, if 8% then it's
1.25%, if 15% it's less than 1% mis-identified.
That "half" figure is complete and utter nonsense. Only if the absolute
number of raw utf8 posts stays *exactly* the same will it even be close to
50% (it'd have to go down to *be* 50%); any increase in the absolute number
of raw utf8 posts will result in a /dramatic/ decrease in that percentage.
Just *ONE* person participating in *one* non-English newsgroup who changes
the encoding of each of his messages to raw utf8 could reduce that
percentage to under 10% (e.g. I've seen several people on usenet who
routinely post 100+ messages a day, and I have no reason to suspect that
such people are limited to English speakers; anyone who did that in the
non-English groups and changed the charset would make the benefit of the
test quite clear).
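The arithmetic behind this can be sketched in a few lines of Python. The
model and every count in it are assumptions picked purely for illustration,
not measurements: it supposes that false positives come only from
local-charset posts, that they shrink in proportion as people switch, and
that today there are as many false positives as genuine raw UTF-8 posts
(the "half" case).

```python
def misidentified_share(local_posts, false_positives, genuine_utf8,
                        switch_fraction):
    """Share of posts passing the UTF-8 test that are not really UTF-8.

    All inputs are hypothetical counts; switch_fraction is the fraction of
    local-charset posts that move to raw UTF-8.
    """
    switched = local_posts * switch_fraction
    # Assumption: false positives scale with the remaining local-charset posts.
    remaining_false = false_positives * (1 - switch_fraction)
    passing = remaining_false + genuine_utf8 + switched
    return remaining_false / passing


# Illustrative starting point only: 100,000 local-charset posts, of which
# 100 happen to pass the test, plus 100 genuine raw UTF-8 posts.
LOCAL_POSTS = 100_000
FALSE_POSITIVES = 100
GENUINE_UTF8 = 100

for pct in (0, 1, 2, 4, 8, 15):
    share = misidentified_share(LOCAL_POSTS, FALSE_POSITIVES, GENUINE_UTF8,
                                pct / 100)
    print(f"{pct:>2}% switch -> {share:.2%} of 'UTF-8' hits are wrong")
```

With different starting counts the exact figures change, but under these
assumptions the share falls quickly as soon as the number of genuine UTF-8
posts grows.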
>> If what we do does NOT encourage people to switch (or let us say doesn't
>> encourage even 1% of the people to switch), then we're no worse off than we
>> are today -- we have a standard that says do X and it is ignored.
>
> Not true; today we do *not* have a standard that says send raw utf-8 --
> in fact the standard currently prohibits that.
And? We currently have a standard that clearly and absolutely prohibits
it, and as you point out above, it still happens. More importantly, it also
says that the local charset isn't to be used, and that is ignored about 10%
of the time. While the usage of raw utf8 isn't significant, the usage of
everything else certainly is -- enough so that various charsets make up
significant percentages of the /total/ text volume, and not just that
portion of it that uses 8-bit characters.
>> Basically, only today does it not make sense to do the test, any shift
>> towards it actually being used results in it being a good test.
>
> Not "any shift", only a complete 180 degree about-face (i.e. from very
> little utf-8 to exclusively utf-8) -- and if that were to happen a test
> would be unnecessary.
The following statement is sarcasm: And if we tell everyone to limit
themselves to English words that can be written using US-ASCII then there
won't be a need for a test either, and since we're closer to achieving that
goal than we are to anything else, we ought to just do that.
IOW -- bullshit; exclusive use is not necessary in order to make it a useful
test or a useful shift.
-- J.B. Moreno