From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Mon Feb 17 2003 - 08:19:37 CST
In <ylisvmg1w8.fsf@windlord.stanford.edu> Russ Allbery <rra@stanford.edu> writes:
>J B Moreno <planb@newsreaders.com> writes:
>> There *will* be untagged 8-bit content -- the only question is whether
>> we can convince people to use UTF8 or not.
>Again, this isn't the only question. There are two questions: whether we
>can convince people to use UTF-8 or not, and whether convincing some
>people to use UTF-8 will result in a net improvement in the current
>situation.
Clearly, every person who switches to UTF-8 is one fewer person sending
"guess-the-charset", so there is a net improvement (I cannot conceive of
a scenario where introducing UTF-8 would make the current mess _worse_).
In practical terms, I think users of most European languages can be
persuaded relatively easily. But their texts use mainly the Roman
alphabet, with a smaller proportion (<10%) of characters with assorted
strange accents. So if you see a text in which those characters are
garbled, you can usually make out what it was meant to say, or at least
determine whether you have missed anything important.
That would not apply to Greek, Arabic or Hebrew, but then the experiment
that Andrew tried did not show any examples of those AFAIR. The chief
European offenders seemed to be the Poles, and there were not all that
many of them.
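The contrast between accented Roman text and non-Roman scripts can be sketched in a few lines of Python. This is only an illustration: it assumes the sender used UTF-8 and the reader's agent guessed ISO-8859-1, and the function names and sample strings are invented here, not taken from Andrew's experiment.

```python
def garble(text, sent_as="utf-8", read_as="iso-8859-1"):
    """Decode text with the wrong charset, as a guessing reader might.

    The two default charsets are assumptions chosen for illustration.
    """
    return text.encode(sent_as).decode(read_as)


def ascii_fraction(text):
    """Share of characters that are plain ASCII.

    ASCII characters come through a mix-up between two ASCII-compatible
    charsets unchanged, so this approximates how much stays readable.
    """
    return sum(ch.isascii() for ch in text) / len(text)


# Hypothetical sample strings: mostly-Roman text versus a non-Roman script.
samples = {
    "French": "un café très cher",
    "Greek": "καλημέρα",
}

for name, text in samples.items():
    print(name, repr(garble(text)), round(ascii_fraction(text), 2))
```

In the French sample only the two accented letters are damaged, so the sentence stays legible; in the Greek sample no character survives.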
>(I assume that we can all stipulate to the fact that no matter what we do,
>we're not going to convince *everyone* to use UTF-8?)
The chief problem is, as I think everyone agrees, the Chinese. I think we
have to accept that they will continue what they are doing, but if they
are successfully communicating with each other, then we just have to let
them get on with it. They will form a "cooperating subnet"; anyone in the
outside world who wants to join will just have to learn to play it their
way (which is, after all, the current situation).
However, there ARE a couple of things which might help to move it along.
1. If we define that newsgroup-names are in UTF-8, then the Chinese
_might_ just be persuaded to adopt that; in which case their user agents
would need to acquire some UTF-8 capability. Note that all current Chinese
newsgroup-names are still in ASCII. But if we procrastinate for much
longer they may well start doing newsgroup-names their way. That would be
bad.
2. We are in any case under some pressure to introduce a header of the
form:
This-Message-Includes-8bit-Headers: [yes/no]
as an aid to IMAP, future interoperability with email, and so on. There is
no reason why that header should not include charset and language
parameters:
This-Message-Includes-8bit-Headers: Yes; charset=utf-8; language=cn_TW
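A reader could pick such a header value apart roughly as follows. This is a minimal Python sketch, not a specification: the header name is only a placeholder in this post, the function names `parse_8bit_header` and `charset_is_utf8` are hypothetical, and the parsing ignores quoting and comments that a real header grammar would have to handle. Treating a missing charset parameter as acceptable is also an assumption.

```python
from typing import Dict, Tuple


def parse_8bit_header(value: str) -> Tuple[bool, Dict[str, str]]:
    """Split 'Yes; charset=utf-8; language=cn_TW' into a flag and parameters.

    Hypothetical helper: handles only the simple 'flag; name=value' shape
    shown in the example, with no quoted strings or comments.
    """
    flag, *params = [part.strip() for part in value.split(";")]
    if flag.lower() not in ("yes", "no"):
        raise ValueError("expected yes or no, got %r" % flag)
    parsed = {}
    for param in params:
        if not param:
            continue  # tolerate a trailing semicolon
        name, sep, val = param.partition("=")
        if not sep:
            raise ValueError("malformed parameter %r" % param)
        parsed[name.strip().lower()] = val.strip()
    return flag.lower() == "yes", parsed


def charset_is_utf8(params: Dict[str, str]) -> bool:
    """Check for a profile that only permits utf-8.

    Assumption: an absent charset parameter is treated as utf-8.
    """
    return params.get("charset", "utf-8").lower() == "utf-8"


has_8bit, params = parse_8bit_header("Yes; charset=utf-8; language=cn_TW")
print(has_8bit, params, charset_is_utf8(params))
```

Keeping the charset check separate from the parser means an agent can still read a header that declares some other charset, even where a stricter profile would reject it.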
I think I would still want to say, within Usefor, "the charset parameter
MUST be utf-8", and for sure newsgroup-names would be UTF-8 regardless.
But it would at least give the Chinese the possibility of migration, and
"without loss of face", too.
-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5