From: Bruce Lilly (blilly@erols.com)
Date: Mon Mar 10 2003 - 11:12:17 CST
Martin Duerst wrote:
> At 09:48 03/03/05 -0800, Russ Allbery wrote:
>
>> I'm saying that UTF-8 will still be, for that OS, an *encoding* that has
>> to be undone. You're not actually getting to "just put the bits on the
>> wire," which is the world that proponents of UTF-8 seem to believe that
>> they're going to be living in. You still have to encode and decode the
>> bits, at which point it's just as easy to use RFC 2047 as well.
>
>
> There is a huge difference between RFC 2047 with all its special
> rules and UTF-8, which is just a plain character encoding.
> As a very simple example, we can convert a whole file from
> iso-8859-1 to UTF-8. Converting iso-8859-1 to RFC 2047 isn't
> defined at all for a whole file.
But we're not discussing files, we're discussing a protocol. RFC 2047
intentionally and specifically applies only to text strings which are
part of message header fields and MIME-part header fields. Of course
it isn't defined "for a whole file" -- it doesn't apply to files at
all, only to specific parts of messages. And one cannot convert an
entire message between charsets, because the protocol elements are
required to be in specific subsets of characters; changing the code
values for the digits in a date-time spec. isn't going to work, for
example. When one is dealing with messages, one needs tools that
operate on messages, using the appropriate rules for messages.
Moreover, in general, one cannot perform a conversion on an entire
file, as some definitions of "file" involve multiple parts (e.g.
on Apple platforms), just as a simple 822/2822 message consists
of multiple parts (header and body).
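
To illustrate the point about protocol elements, here is a rough
Python sketch. It is only an illustration under assumptions: the
subject text, the "-0600" zone offset and the choice of EBCDIC
(cp037) as the "other" charset are made up for the example. It
encodes just the free-text part of a header field per RFC 2047,
then shows what a blind whole-message charset conversion does to
the digits of a date-time.

```python
from email.header import Header, decode_header

# Hypothetical free text for a Subject field (ISO-8859-1 repertoire).
subject_text = "Gr\u00fc\u00dfe"

# RFC 2047 touches only the text portion of the field.
encoded = Header(subject_text, charset="iso-8859-1").encode()
subject_line = "Subject: " + encoded

# A structured field; the zone offset is an assumption for this example.
date_line = "Date: Mon, 10 Mar 2003 11:12:17 -0600"

# The encoded-word decodes back to the original text.
raw, charset = decode_header(encoded)[0]
round_trip = raw.decode(charset)

# Blindly converting the whole line to another charset (EBCDIC here)
# changes the code values of the digits and the colon as well.
blind = date_line.encode("cp037")
```

After the RFC 2047 step the Subject line is still pure ASCII, so
the field name and delimiters are untouched; the blind conversion
leaves nothing a message parser expecting ASCII digits could read.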
And UTF-8 is far from simple; like RFC 2047, there are multiple
representations that map to the same text -- where 2047 has B
and Q encodings, ISO 10646 has precomposed characters and
combining marks -- a given ISO-8859-1 text string may have
multiple RFC 2047 representations and multiple UTF-8 representations.
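
A small Python sketch of both kinds of multiplicity, for anyone who
wants to see it concretely. The sample word is an arbitrary choice
for illustration, and decode_word is just a helper for this example:

```python
import base64
import unicodedata
from email.header import decode_header

text = "caf\u00e9"  # sample ISO-8859-1 text, chosen only for illustration

# Two RFC 2047 encoded-words for the same text: one B, one Q.
b64 = base64.b64encode(text.encode("iso-8859-1")).decode("ascii")
b_word = "=?iso-8859-1?B?" + b64 + "?="
q_word = "=?iso-8859-1?Q?caf=E9?="

def decode_word(word):
    """Decode a single encoded-word back to text (helper for this sketch)."""
    raw, charset = decode_header(word)[0]
    return raw.decode(charset)

b_text = decode_word(b_word)
q_text = decode_word(q_word)

# Two UTF-8 byte sequences for the same text: precomposed (NFC)
# versus base letter plus combining mark (NFD).
nfc_bytes = unicodedata.normalize("NFC", text).encode("utf-8")
nfd_bytes = unicodedata.normalize("NFD", text).encode("utf-8")
```

The two encoded-words differ as strings but decode to the same
text with no tables beyond the charset itself; the two UTF-8 byte
sequences differ too, and comparing them requires normalization.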
Indeed, RFC 2047 is much easier to handle; it requires no
complex normalization rules and does not require huge tables.
There are pros and cons either way; about the only thing that
we can probably all agree upon is that using utf-8 with RFC 2047
accumulates all of the complexities.