Re: Transformation of Non-ASCII headers


From: Bruce Lilly (blilly@erols.com)
Date: Sun Feb 23 2003 - 11:55:22 CST


J.B. Moreno wrote:
> On 2/21/03 9:08 PM, Bruce Lilly at <blilly@erols.com> wrote:
>
>
>>J.B. Moreno wrote:
>>
>>>On 2/18/03 2:45 AM, Mark Crispin at <mrc@cac.washington.edu> wrote:
>>>
>>>
>>>
>>>>That's correct, and that pretty much shoots down the use of a test to
>>>>determine if something is UTF-8. The test can prove that text is not
>>>>UTF-8 (assuming that the UTF-8 wasn't somehow damaged in transit), but it
>>>>does not reliably prove that text is UTF-8.
>>>
>>>
>>>No, it doesn't.
>>
>>So we're all agreed; the test doesn't reliably prove that text is utf-8.
>
>
> I've been polite and not intentionally misread your statements, please
> extend the same kindness to myself: i.e. you know I was referring to the
> "pretty much shoots down the use of a test", when I said no.

I had assumed that it referred to the immediately preceding clause,
viz. "it does not reliably prove that text is UTF-8". We have
failed to communicate effectively. In any event, Mark is correct;
one can determine whether or not some sequence of octets
qualifies as a valid utf-8 sequence -- if it does not, it
cannot be utf-8, but even if it does qualify, one cannot be certain
that it *is* in fact utf-8. *If* one arbitrarily limits the
choice to utf-8 and the iso-8859 variants, there is still a
50% share (of the valid utf-8 sequences) which is also valid
iso-8859, and is therefore indeterminate (for a 2-octet utf-8-like
sequence, diminishing somewhat for longer sequences). N.B., the
reality is that the untagged charsets are not limited to iso-8859
variants, and some of the charsets in use do use a wider range
of the octets that can appear in utf-8 sequences.
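
To make the one-sided nature of the test concrete, here is a minimal
sketch in Python. The function names are hypothetical, and the count
rests on an assumption of mine about how the 50% figure arises: an
octet is treated as "graphic" in an iso-8859 variant when it lies in
0xA0-0xFF, with 0x80-0x9F taken as C1 control codes. Individual
iso-8859 parts leave a few positions unassigned, so the count is
approximate for any particular variant.

```python
def could_be_utf8(octets: bytes) -> bool:
    """Return True if octets form a valid utf-8 sequence.

    A False result proves the text is not utf-8 (barring damage in
    transit). A True result proves nothing: the same octets may be
    perfectly good text in another charset.
    """
    try:
        octets.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def two_octet_overlap() -> tuple[int, int]:
    """Count valid 2-octet utf-8 sequences, and how many of them consist
    solely of octets in 0xA0-0xFF (assumed graphic in iso-8859 variants).
    """
    total = 0
    also_graphic = 0
    for lead in range(0xC2, 0xE0):        # 2-octet lead octets
        for cont in range(0x80, 0xC0):    # continuation octets
            seq = bytes([lead, cont])
            if not could_be_utf8(seq):
                continue
            total += 1
            if all(octet >= 0xA0 for octet in seq):
                also_graphic += 1
    return total, also_graphic


if __name__ == "__main__":
    sample = b"\xc3\xa9"
    # The same two octets read as one character in utf-8 and as two
    # characters in iso-8859-1; the validity test cannot tell which
    # the sender meant.
    print(could_be_utf8(sample))
    print(repr(sample.decode("utf-8")), repr(sample.decode("iso-8859-1")))
    print(two_octet_overlap())
```

Under that assumption, half of the valid 2-octet sequences pass both
readings, which is consistent with the figure above. A lone 0xE9, by
contrast, fails the utf-8 test outright, which is the only case in
which the test settles anything.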

> Those numbers are not cooked -- since I was talking percentages, I did one
> thing and one thing only with those numbers, and by no stretch of the
> imagination could that be called "cooking"; I rounded it off to the nearest
> 1/2 of a percent (I meant to say nearest 10th of a percent, but either way
> it's not "cooking").

Throwing away significant digits amounts to cooking the numbers.

>>> there will actually *be* articles
>>>identified as UTF8.
>>
>>There are some now.
>
>
> In absolute terms, yes, percentage wise they don't make up enough to worry
> about.

Which is another reason that attempting to force utf-8 is a bad
idea. The current situation is that roughly 85% of articles are
compliant with RFCs 1036, 822, 2822, 2045-2049 as far as the
restriction on octets in header fields is concerned, roughly
15% are non-compliant in some unspecified charset which cannot
possibly be utf-8, and approximately 0.0060% are in some
unspecified charset which has utf-8-like sequences, and some
0.0060% are probably utf-8. On this issue, RFC 1036, the
Kohn draft, and the Lindsey draft differ as follows assuming
the same mix of actual generated content, and maintaining two
significant figures:

                 RFC 1036   Kohn draft   Lindsey draft
compliant        85%        85%          85%
non-compliant    15%        15%          15%

I.e. not significantly. However, what *is* significant is
that the Kohn draft (with some additional work to clarify
some news-specific issues) is compatible with the architecture
of the internet, with best current practice, with existing
injection agents, gateways, and user agents in a backwards-
compatible manner, and stands a good chance of being approved
by the IESG. The Lindsey draft is incompatible and has zero
chance of approval.

>>>If what we do does NOT encourage people to switch (or let us say doesn't
>>>encourage even 1% of the people to switch), then we're no worse off than we
>>>are today -- we have a standard that says do X and it is ignored.
>>
>>Not true; today we do *not* have a standard that says send raw utf-8 --
>>in fact the standard currently prohibits that.
>
>
> And? We currently have a standard that clearly and absolutely, prohibits
> it, and as you point out above, it still happens, more importantly, it also
> says that the local charset isn't to be used, and that is ignored about 10%
> of the time. While the usage of raw utf8 isn't significant, the usage of
> everything else certainly is -- enough so that various charsets make up
> significant percentages of the /total/ text volume, and not just that
> portion of it that uses 8 bit characters.

Whether it's 10% or 15% doesn't matter. And given that, as
you point out, current use of utf-8 is insignificant, and
given that there *are* documented interoperability and
backwards-compatibility issues with utf-8, and given that
any attempt to force raw utf-8 on other protocols is doomed,
and given that raw utf-8 for text in the absence of a
language-information-preserving protocol mechanism compatible
with RFCs 1958 and 2277 cannot be part of a Standards Track
RFC, what's the point of that 0.0060% tail trying to wag the
dog?




This archive was generated by hypermail 2b29.