From: Henry Spencer (henry@spsystems.net)
Date: Tue Dec 10 2002 - 15:17:48 CST
On Tue, 10 Dec 2002, Charles Lindsey wrote:
> >> No character will ever be mapped to the 5 and 6 byte form,
> >> (they do not exist anymore in the UNICODE standard).
>
> It is my belief that those characters which in UTF-16 require the use of
> surrogates (private use, Egyptian hieroglyphics, whatever) occupy code
> points using at least 5 bytes...
Nope. The highest Unicode character is 0x10ffff, 21 bits. A surrogate
pair has 20 bits of character in it, and the count goes up to 21 because
there is a one-plane offset built into the surrogate encoding (the encoded
characters start with plane 1).
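
The surrogate arithmetic can be checked with a short Python sketch. This is illustrative only: the function name decode_surrogate_pair is made up for the example, and the range constants are the standard UTF-16 surrogate ranges.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    # Each surrogate carries 10 bits, so the pair carries 20 bits.
    payload = ((high - 0xD800) << 10) | (low - 0xDC00)
    # The one-plane offset: encoded characters start at plane 1 (0x10000).
    return payload + 0x10000


lowest = decode_surrogate_pair(0xD800, 0xDC00)   # first code point of plane 1
highest = decode_surrogate_pair(0xDBFF, 0xDFFF)  # last code point of plane 16
```

The lowest pair decodes to 0x10000 and the highest to 0x10ffff, which is where the 21st bit comes from.
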
There are 21 character bits in a 4-byte UTF-8 sequence: 3 in the lead byte
and 6 in each of the three continuation bytes. (In fact, there are 17 planes
in today's Unicode and ISO 10646, while a 4-byte UTF-8 sequence can handle
32 planes.) So you never need 5 bytes.
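
A similar sketch for the 4-byte UTF-8 form, again in Python and again with made-up names (PAYLOAD_BITS, PLANES, encode_utf8_4byte). It deliberately accepts the full 21-bit range to show the capacity of the byte layout, so it is not a conforming encoder, which would stop at 0x10ffff.

```python
# Lead byte 11110xxx carries 3 bits; each of the three continuation
# bytes 10xxxxxx carries 6 bits.
PAYLOAD_BITS = 3 + 3 * 6            # 21
PLANES = (1 << PAYLOAD_BITS) >> 16  # 65536 code points per plane, so 32


def encode_utf8_4byte(cp: int) -> bytes:
    """Pack a value in 0x10000..0x1FFFFF into the 4-byte UTF-8 layout."""
    if not 0x10000 <= cp <= 0x1FFFFF:
        raise ValueError("value does not belong in the 4-byte form")
    return bytes([
        0xF0 | (cp >> 18),           # top 3 bits
        0x80 | ((cp >> 12) & 0x3F),  # next 6 bits
        0x80 | ((cp >> 6) & 0x3F),   # next 6 bits
        0x80 | (cp & 0x3F),          # low 6 bits
    ])


top_of_unicode = encode_utf8_4byte(0x10FFFF)
```

For the highest Unicode character this gives the bytes F4 8F BF BF, the same as Python's built-in encoder, with room to spare in the lead byte.
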
> And if that is not so, why does the Yergeau draft (and the RFC which it
> replaces) still include all those possibilities?
The RFC it is replacing pre-dates the 17-planes limit. The Yergeau draft
inherited the longer sequences from the RFC, and may lose them before it
becomes an RFC.
However, Charles has a point in that there is some risk in attempting to
anticipate future standards -- we got burned on this with 8-bit characters
in headers, remember.
Henry Spencer
henry@spsystems.net