From: Martin J. Dürst (mduerst@ifi.unizh.ch)
Date: Mon Nov 24 1997 - 06:51:58 CST
On Thu, 20 Nov 1997, Leonid Yegoshin wrote:
> Hi,
>
> >From: sommar@algonet.se (Erland Sommarskog)
> >
> >Leonid Yegoshin <egoshin@genesyslab.com> writes:
> >> I don't agree with it. With this requerement news process software
> >>MUST have UTF-8 understanding, and additional headache. SMTP/MIME allow _any_
> >>coding in any headers. And news-processing software should convert
> >>MIME-encoding Newsgroup: to UTF-8 ... Why ?
> >
> >Notice that a relayer does not need to know much at all about UTF-8. It
> >should just passed the octets around. The Newsgroup header is by
> >definition never encoded, which means that if the server sees a
> >Newsgroups line with se.test.?iso-8859-1?Q?r=E5ksm=F6rg=E6s it should
> >look for a group with this funny name in its active file, and nothing else.
> >
> If "a relayer does not need to know much" and "It
> should just passed the octets around", then we can expect sometime
> to see in active file the newsgroup named
>
> relcom.=?koi8-r?Q?=C0=CD=CF=D2?= (prev relcom.humor)
If we knew that it would always look like this, it might work in
some cases. But there are many other ways to express
the above (I'll gloss over the fact that the RFC 2047 encoding
would have to start before "relcom", because only "words" can be
encoded with it):
relcom.=?koi8-r?B?.......?=
relcom.=?koi8-ru?Q?=C0=CD=CF=D2?=
(by some software in Ukraine)
relcom.=?iso-8859-5?......?=
and many more. It takes a lot of work to make all of these work.
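To make this concrete, here is a minimal Python sketch using the standard email.header module. The helper names (q_word, b_word, decode_word) are made up for illustration, and the name is simply the four letters behind the octets in the KOI8-R example above. It builds three different header spellings that all decode to the same name:

```python
import base64
from email.header import decode_header

# The four Cyrillic letters that =C0=CD=CF=D2 stands for in KOI8-R.
NAME = "\u044e\u043c\u043e\u0440"

def q_word(text, charset):
    """Build an RFC 2047 Q-encoded word, writing every octet as =XX."""
    octets = text.encode(charset)
    return "=?%s?Q?%s?=" % (charset, "".join("=%02X" % b for b in octets))

def b_word(text, charset):
    """Build an RFC 2047 B-encoded (base64) word."""
    octets = text.encode(charset)
    return "=?%s?B?%s?=" % (charset, base64.b64encode(octets).decode("ascii"))

def decode_word(word):
    """Decode a single encoded-word back into a string."""
    octets, charset = decode_header(word)[0]
    return octets.decode(charset)

spellings = [
    q_word(NAME, "koi8-r"),      # same octets as the example above
    b_word(NAME, "koi8-r"),      # same charset, base64 instead of Q
    q_word(NAME, "iso-8859-5"),  # different charset, different octets
]

for word in spellings:
    print(word, "->", decode_word(word))
```

A server that only compares octets sees three unrelated group names here. Treating them as one group means decoding every variant first.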
> >I guess the argument is that at some point in future all other character
> >sets are hopefully eradicated.
> >
> ... Happiness in belief ... (I know it from Bible but I it is well-known
> phrase in Russia and I can't find translation to English).
It will take quite some time for all the old stuff to get eradicated.
But in places such as newsgroup names, where we want global interoperability,
UTF-8 is the way to go. It will not be completely painless at the start,
but very soon we will be happy to have done it.
> You would not have a problem with Latin1 countries like West Europe.
> The Oriental countries with hieroglyphs have many another problems
> and change of coding probably (I am not specialist in Chinese or so)
> has minor significance. But there are Russian and a lot of alphabet
> languages, which can (CAN !) fit the second half of ASCII.
Yes, but for many of the languages besides Russian, special extensions
are necessary. KOI-8 or ISO-8859-5 don't cover them.
> Hey, the transmision to variable-length code should have hard problems
> in word-processing software. In software which does not intended for
> word processing itself it should be especially painful. It is like "grep"
> and other rare pattern-related or text-positioned programs.
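General variable-length codes are indeed a big problem. But UTF-8
doesn't have these problems. A byte-level search never gives false
positives with UTF-8, because the byte sequence of one character never
occurs inside or across the encodings of other characters.
For example, please have a look at my paper at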
http://www.ifi.unizh.ch/mml/mduerst/papers.html#IUC11-UTF-8
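A small Python sketch of the difference, not taken from the paper: it compares a grep-style byte search in Shift_JIS and in UTF-8. The function name is hypothetical, and the example relies on one kanji whose second Shift_JIS byte is 0x5C, the same value as the backslash.

```python
def byte_search_false_positive(text, needle, encoding):
    """True if needle's bytes occur in text's bytes although needle is not in text."""
    found_in_bytes = needle.encode(encoding) in text.encode(encoding)
    return found_in_bytes and needle not in text

KANJI = "\u8868"   # encoded as the two bytes 0x95 0x5C in Shift_JIS
BACKSLASH = "\\"   # the single byte 0x5C in both encodings

# Shift_JIS: the trail byte of the kanji is mistaken for a backslash.
print(byte_search_false_positive(KANJI, BACKSLASH, "shift_jis"))

# UTF-8: every byte of a multi-byte character is 0x80 or above,
# so an ASCII needle can never match inside it.
print(byte_search_false_positive(KANJI, BACKSLASH, "utf-8"))
```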
> The price of it too high for this language-community. That community
> also want simple way to process bytes of native language as it can be done
> with ASCII/Latin1. MIME gives this - it is possible to extend ASCII and named
> this 8-bit code somehow for network transmission, but UTF-8 don't give it.
Have you ever tried to process an RFC 2047-encoded text character by character?
If you are decoding it to your local representation before processing,
you can do the same for UTF-8, which is considerably easier.
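A rough Python sketch of the two decoding paths, assuming the standard email.header module. The sample word and the function names are illustrative only:

```python
from email.header import decode_header

def chars_from_rfc2047(word):
    """Parse one encoded-word, undo the Q/B transfer encoding, then apply its charset."""
    octets, charset = decode_header(word)[0]
    return list(octets.decode(charset))

def chars_from_utf8(octets):
    """UTF-8 needs no parsing step: the octets are the text."""
    return list(octets.decode("utf-8"))

SAMPLE = "r\u00e4ksm\u00f6rg\u00e5s"                # illustrative name with three non-ASCII letters
ENCODED_WORD = "=?iso-8859-1?Q?r=E4ksm=F6rg=E5s?="  # the same name as an RFC 2047 word

print(chars_from_rfc2047(ENCODED_WORD))
print(chars_from_utf8(SAMPLE.encode("utf-8")))
```

Both calls give the same list of characters. The RFC 2047 path needs a parser and a codec for every charset a sender might name, while the UTF-8 path needs a single decoder.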
> In Russia in time of network establishing there are two groups of
> people who were fans of two different coding - one KOI8 and other Unicode.
> The first won due to second didn't write a working software.
Up to now, each country had to write its own software even for
very basic things. But Unicode software will be usable much more
widely. Russian programmers will be freed from having to spend their
time on Russian character encoding stuff, and can use their creativity
to write great software that can be used by the whole world!
> >I guess a good software implementor will have some fallback for the
> >case when there is no MIME headers, but the text is obviously not
> >UTF-8. For instance, he could opt to present the data as-is, and
> >hope that sender and receiver is using the same character set. This
> >should probably not be in the RFC, but only leave this case undefined.
>
> It can't work for exam for oriental languages in EUC - there are at least
> 2.7% legal words which looks like UTF-8 and can suffer from implicit
> conversion.
Why do you say "at least"? And in Japanese (where the 2.7% probably comes
from) there is the additional problem that most PCs (including Macs) use
Shift_JIS, and that EUC and Shift_JIS cannot both be used together.
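The overlap between EUC and UTF-8 can be explored with a short Python sketch. It assumes Python's euc_jp codec as the definition of legal EUC-JP, looks only at single two-byte characters rather than whole words, and does not try to reproduce the 2.7% figure. The function names are made up for illustration.

```python
def is_valid_utf8(octets):
    """True if the octets form well-formed UTF-8."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def ambiguous_euc_jp_pairs():
    """Collect two-byte sequences that decode both as EUC-JP and as UTF-8."""
    hits = []
    for lead in range(0xA1, 0xFF):
        for trail in range(0xA1, 0xFF):
            pair = bytes([lead, trail])
            try:
                pair.decode("euc_jp")
            except UnicodeDecodeError:
                continue  # not an assigned EUC-JP character
            if is_valid_utf8(pair):
                hits.append(pair)
    return hits

pairs = ambiguous_euc_jp_pairs()
print(len(pairs), "two-byte EUC-JP characters are also valid UTF-8")
```

A whole word is ambiguous only if every one of its characters falls into this set, so a fallback that guesses "obviously not UTF-8" will usually be right, but not always.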
Regards, Martin.