From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Wed Dec 11 2002 - 05:22:19 CST
In <Pine.BSI.3.91.1021210160636.14324B-100000@spsystems.net> Henry Spencer <he!
>On Tue, 10 Dec 2002, Charles Lindsey wrote:
>> >> No character will ever be mapped to the 5 and 6 byte form,
>> >> (they do not exist anymore in the UNICODE standard).
>>
>> It is my belief that those characters which in UTF-16 require the use of
>> surrogates (private use, Egyptian hieroglyphics, whatever) occupy code
>> points using at least 5 bytes...
>Nope. The highest Unicode character is 0x10ffff, 21 bits. A surrogate
>pair has 20 bits of character in it, and the count goes up to 21 because
>there is a one-plane offset built into the surrogate encoding (the encoded
>characters start with plane 1).
>There are 21 character bits in a 4-byte UTF-8 sequence. (In fact, there
>are 17 planes in today's Unicode and 10646, while a 4-byte UTF-8 sequence
>can handle 32 planes.) So you never need 5 bytes.
Yes, I realised I was wrong after I posted yesterday's message.
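For anyone who wants to check Henry's arithmetic, here is a minimal Python sketch. The function names (from_surrogates, utf8_length) are mine and purely illustrative, and the length table assumes the six-length scheme of RFC 2279 rather than anything taken from the Unicode text.

```python
def from_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits, giving 20 bits in total. Adding
    # 0x10000 is the one-plane offset: encoded characters start at plane 1.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf8_length(cp: int) -> int:
    """Octets needed for a code point under the RFC 2279 style scheme."""
    if cp < 0:
        raise ValueError("negative code point")
    # Largest code point representable in 1, 2, ... 6 octets.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for n, limit in enumerate(limits, start=1):
        if cp <= limit:
            return n
    raise ValueError("beyond the 31-bit code space")


highest = from_surrogates(0xDBFF, 0xDFFF)
print(hex(highest), highest.bit_length(), utf8_length(highest))
# prints: 0x10ffff 21 4
```

So the largest value reachable through surrogates needs 21 bits and fits in 4 octets; the 5 and 6 octet forms only come into play from 0x200000 upwards.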
>> And if that is not so, why does the Yergeau draft (and the RFC which it
>> replaces) still include all those possibilities?
>The RFC it is replacing pre-dates the 17-planes limit. The Yergeau draft
>inherited the longer sequences from the RFC, and may lose them before it
>becomes an RFC.
I think not. I emailed Yergeau, and here is his response:
> From: Francois Yergeau <FYergeau@alis.com>
> To: Charles Lindsey <chl@clw.cs.man.ac.uk>
> Subject: RE: draft-yergeau-rfc2279bis-02
> Date: Tue, 10 Dec 2002 16:59:04 -0500
>
> Charles Lindsey wrote:
> > I note that your latest draft still allows for the full UCS-4
> > character
> > set to be encoded (i.e. it allows UTF-8 sequences of up to 6 bytes).
> >
> > I see, however, that Unicode 3.2 restricts legal UTF-8 to
> > only 4 bytes,
> > since the largest allowable code point is now U+10FFFF. Would you care
> > to comment on this?
>
> This is one of the differences between Unicode and ISO 10646. Although kept
> in synchronism (all code points assignments are the same), the two standards
> are not identical. 10646 has had a 31-bit code space from day one. Unicode
> started with a 16-bit (0-FFFF) code-space and in 1996 moved to 0-10FFFF to
> accommodate UTF-16 after it was clear that 64 Kchars was not enough. Since
> then, they (Unicode) have been pressuring ISO SC2/WG2 to restrict 10646 to
> the same range; ISO has been reluctant, because it would be an incompatible
> change (in principle, not in practice) and they want to keep their options
> open in case the new upper limit in Unicode is busted like the earlier one
> was. But they have agreed to have a policy to not encode anything above
> 10FFFF and have actually rescinded the allocation of large swaths of code
> space above 10FFFF to Private Use Areas (an incompatible change!).
>
> So Unicode goes to 10FFFF while 10646 continues to go to 7FFFFFFF, at least
> officially. The IETF has an explicit preference for ISO standards, which is
> reflected in the RFC. That's the sole reason for the way it is, AFAIK.
>
> > I am including your syntax in the latest
> > draft-ietf-usefor-article-*.txt, and I am being hauled over the coals
> > for including UTF8-5 and UTF8-6 in it :-( .
>
> You could raise this as a comment against rfc2279bis, but some people would
> hate you for bringing this so late in the game; this is a bit of a religious
> issue!
>
> Another possibility is to say "this protocol uses UTF-8 [RFC2279] but does
> not allow code points above U+10FFFF" and then include the syntax without
> UTF8-5 and UTF8-6.
>
> Regards,
>
> --
> François Yergeau
>
>
>
>However, Charles has a point in that there is some risk in attempting to
>anticipate future standards -- we got burned on this with 8-bit characters
>in headers, remember.
I think that is Yergeau's point too.
What I propose is to leave the syntax as is, but I have added a final
sentence to the following paragraph:

    The syntax for UTF8-xtra-char excludes those redundant sequences of
    octets which cannot occur in UTF-8, as defined by [RFC 2279], either
    because they would not be the shortest possible encodings of some UCS
    character [ISO/IEC 10646], or they would represent one of the
    characters D800 through DFFF, disallowed in UCS because of their
    surrogate use in the UTF-16 encoding. These sequences MUST NOT be
    generated by posting agents. Where they occur inadvertently, they
    SHOULD be passed on untouched by other agents, but attempts to
    interpret them as malformed UTF-8 MUST NOT be made. However, if there
    is reason to suppose they are representations of some other character
    set they MAY, as suggested in section 4.4.1, be interpreted as such.
    The syntax also includes, for completeness, the cases UTF8-5 and
    UTF8-6 which cannot, in fact, arise in [UNICODE 3.2] (though they
    might conceivably arise in some future extension).

Is that acceptable to everybody?
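To make concrete what that paragraph excludes, here is a rough Python sketch of a decoder for a single sequence. It is not part of the draft: the names decode_utf8_sequence and allowed_in_unicode are hypothetical, and the lead-octet ranges and minimum values are my reading of the RFC 2279 layout. It rejects non-shortest encodings and the surrogates D800 through DFFF, but still accepts the 5 and 6 octet forms, leaving the U+10FFFF limit as a separate check.

```python
def decode_utf8_sequence(octets: bytes) -> int:
    """Decode one RFC 2279 style sequence (1 to 6 octets) to a code point.

    Raises ValueError for malformed, non-shortest, or surrogate sequences.
    """
    if not octets:
        raise ValueError("empty input")
    first = octets[0]
    # The lead octet fixes the sequence length and supplies the top bits.
    if first < 0x80:
        length, cp = 1, first
    elif 0xC0 <= first <= 0xDF:
        length, cp = 2, first & 0x1F
    elif 0xE0 <= first <= 0xEF:
        length, cp = 3, first & 0x0F
    elif 0xF0 <= first <= 0xF7:
        length, cp = 4, first & 0x07
    elif 0xF8 <= first <= 0xFB:
        length, cp = 5, first & 0x03
    elif 0xFC <= first <= 0xFD:
        length, cp = 6, first & 0x01
    else:
        raise ValueError("invalid lead octet")
    if len(octets) != length:
        raise ValueError("wrong number of octets")
    # Each continuation octet is 10xxxxxx and adds six bits.
    for octet in octets[1:]:
        if octet & 0xC0 != 0x80:
            raise ValueError("invalid continuation octet")
        cp = (cp << 6) | (octet & 0x3F)
    # Smallest code point that genuinely needs this many octets.
    minimum = [0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000][length - 1]
    if cp < minimum:
        raise ValueError("not the shortest encoding")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code point")
    return cp


def allowed_in_unicode(cp: int) -> bool:
    """True if the code point is within the Unicode range 0 to 10FFFF."""
    return 0 <= cp <= 0x10FFFF
```

For example, decode_utf8_sequence(b"\xc0\x80") and decode_utf8_sequence(b"\xed\xa0\x80") both raise ValueError, while the 5 octet sequence b"\xf8\x88\x80\x80\x80" decodes to 0x200000, for which allowed_in_unicode returns False.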
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133 Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5