Re: UTF-8 syntax

From: Henry Spencer (henry@spsystems.net)
Date: Tue Dec 10 2002 - 15:17:48 CST


On Tue, 10 Dec 2002, Charles Lindsey wrote:
> >> No character will ever be mapped to the 5 and 6 byte form,
> >> (they do not exist anymore in the UNICODE standard).
>
> It is my belief that those characters which in UTF-16 require the use of
> surrogates (private use, Egyptian hieroglyphics, whatever) occupy code
> points using at least 5 bytes...

Nope. The highest Unicode character is 0x10ffff, 21 bits. A surrogate
pair has 20 bits of character in it, and the count goes up to 21 because
there is a one-plane offset built into the surrogate encoding (the encoded
characters start with plane 1).
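
To make that arithmetic concrete, here is a rough sketch in Python. The
function name decode_surrogate_pair is made up for illustration; the
constants are the usual UTF-16 surrogate ranges.

```python
def decode_surrogate_pair(high, low):
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate contributes 10 bits, for 20 bits of payload.
    payload = ((high - 0xD800) << 10) | (low - 0xDC00)
    # The one-plane offset: encoded characters start at plane 1.
    return 0x10000 + payload

lowest = decode_surrogate_pair(0xD800, 0xDC00)
highest = decode_surrogate_pair(0xDBFF, 0xDFFF)
print(hex(lowest), hex(highest), highest.bit_length())
# prints: 0x10000 0x10ffff 21
```

The largest pair lands exactly on 0x10ffff, and it is the added offset that
pushes the result from 20 bits into the 21st.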

There are 21 character bits in a 4-byte UTF-8 sequence. (In fact, there
are 17 planes in today's Unicode and 10646, while a 4-byte UTF-8 sequence
can handle 32 planes.) So you never need 5 bytes.
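
The bit count can be checked by hand. The sketch below, in Python, packs a
code point into the 4-byte pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. The
name encode_utf8_4byte is made up for illustration, and the function accepts
the full 21-bit range only to show raw capacity; values above 0x10ffff are
not valid Unicode.

```python
def encode_utf8_4byte(cp):
    """Pack a code point into the 4-byte UTF-8 bit pattern."""
    # Illustrative only: accepts the whole 21-bit range, including
    # values above 0x10FFFF that real UTF-8 encoders must reject.
    if not 0x10000 <= cp <= 0x1FFFFF:
        raise ValueError("outside the 4-byte range")
    return bytes([
        0xF0 | (cp >> 18),           # 11110xxx: top 3 bits
        0x80 | ((cp >> 12) & 0x3F),  # 10xxxxxx: next 6 bits
        0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | (cp & 0x3F),          # 10xxxxxx: low 6 bits
    ])

payload_bits = 3 + 6 + 6 + 6           # character bits in 4 bytes
planes = (1 << payload_bits) >> 16     # 65536 code points per plane
top = encode_utf8_4byte(0x10FFFF)
print(payload_bits, planes, top.hex())
# prints: 21 32 f48fbfbf
```

The result for 0x10ffff matches what Python's own encoder produces for
chr(0x10FFFF).encode("utf-8").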

> And if that is not so, why does the Yergeau draft (and the RFC which it
> replaces) still include all those possibilities?

The RFC it is replacing predates the 17-plane limit. The Yergeau draft
inherited the longer sequences from the RFC, and may lose them before it
becomes an RFC.

However, Charles has a point in that there is some risk in attempting to
anticipate future standards -- we got burned on this with 8-bit characters
in headers, remember.

                                                          Henry Spencer
                                                       henry@spsystems.net

