Re: Review of sort
On Sat, 18 Nov 2006, Martin Duerst wrote:
> > The "i;unicode-casemap" collation is a simple collation which operates
> > on octet strings and attempts to Unicode characters case-insensitively.
> 'operates on octets and attempts ... Unicode' sounds very weird.
> Unless you say something about character encoding.
All strings are coerced to UTF-8, but subsequently are treated as octet
strings instead of variable-width character strings. The reference to
"attempts to compare Unicode characters case-insensitively" [I agree that
the verb "compare" needs to be included] refers to a higher-level effect
rather than what the algorithm actually does.
> I think I understand your use of octet here; I guess that what you are
> trying to say is that except for the provisions below, comparison is
> by codepoint (which means octet comparison in the case of UTF-8).
"Codepoint" to me would refer to a Unicode 20.5 bit codepoint. I really
mean 8-bit quantities, without regard to the character boundaries.
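
To make the "8-bit quantities" point concrete, here is a minimal sketch of an octet-wise comparison after coercion to UTF-8. The function name `octet_compare` is made up for illustration and is not taken from the draft.

```python
def octet_compare(a: str, b: str) -> int:
    """Compare two strings as UTF-8 octet strings.

    Returns -1, 0, or 1. Character boundaries are ignored: the
    comparison sees only the individual bytes of the encoding.
    """
    a_octets = a.encode("utf-8")
    b_octets = b.encode("utf-8")
    # bytes objects compare lexicographically, one octet at a time
    return (a_octets > b_octets) - (a_octets < b_octets)


# U+00E9 encodes as the two octets C3 A9, so it sorts after any ASCII letter.
print(octet_compare("\u00e9", "z"))
```

Because UTF-8 preserves codepoint order, this gives the same ordering as a codepoint-by-codepoint comparison, even though it never looks at character boundaries.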
> > except that first, all pre-composed Unicode characters are fully
> > decomposed;
> Does this mean NFD, or just any kind of decomposition?
I'm not familiar with the decomposition names. What I meant is "if a
decomposition for a codepoint is defined in UnicodeData.txt, substitute
that decomposition for the codepoint."
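
A rough sketch of that substitution rule, using Python's `unicodedata.decomposition()`, which returns the decomposition field from UnicodeData.txt as a string of hex codepoints. Two things here are assumptions rather than anything stated above: that the substitution is applied recursively (to get "fully decomposed"), and that tagged compatibility decompositions such as `<compat>` are included. The `compat` parameter and the function names are hypothetical.

```python
import unicodedata


def decompose_char(ch: str, compat: bool = True) -> str:
    """Substitute the UnicodeData.txt decomposition for ch, recursively."""
    field = unicodedata.decomposition(ch)
    if not field:
        return ch  # no decomposition defined for this codepoint
    parts = field.split()
    if parts[0].startswith("<"):
        # Tagged (compatibility) decomposition, e.g. "<compat> 0066 0069".
        if not compat:
            return ch
        parts = parts[1:]
    return "".join(decompose_char(chr(int(p, 16)), compat) for p in parts)


def decompose(s: str, compat: bool = True) -> str:
    """Apply decompose_char to every codepoint of s."""
    return "".join(decompose_char(ch, compat) for ch in s)


# U+00E9 becomes "e" followed by U+0301 COMBINING ACUTE ACCENT.
print([hex(ord(c)) for c in decompose("\u00e9")])
```

Note that this does no canonical reordering of combining marks, so its output is not guaranteed to match NFD or NFKD for every input.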
> > and second, all characters which have a title case are
> > changed to their title case.
> There are not many characters with title case. Maybe about 5.
> Given the name of the collation, I suspect that you want to
> define some mappings for characters that don't have title case,
> too.
AFAICT, all characters with an upper case defined also have a title case
defined in UnicodeData.txt. As you noted, only a handful of these differ.
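
For illustration, a small sketch of a per-character title-case mapping. Python does not expose the simple title-case field of UnicodeData.txt directly; `str.title()` applies the full mappings, which can expand one character into several. The helper below (hypothetical name `simple_title`) approximates the simple mapping by keeping the original character whenever the result is not a single codepoint. That approximation is an assumption of this sketch, not something the draft specifies.

```python
def simple_title(ch: str) -> str:
    """Approximate the simple title-case mapping of one character."""
    titled = ch.title()
    # Full mappings such as U+00DF -> "Ss" have no single-codepoint
    # equivalent, so leave those characters unchanged.
    return titled if len(titled) == 1 else ch


# One of the handful of characters whose title case differs from its
# upper case: U+01C6 (dz with caron) has upper case U+01C4 and title
# case U+01C5.
dz = "\u01c6"
print(hex(ord(dz.upper())), hex(ord(simple_title(dz))))
```

For ordinary letters the two mappings agree, e.g. `simple_title("a")` and `"a".upper()` both give `"A"`.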
> Hope this helps. Regards, Martin.
Thanks for your help! I hope that the people who decide these things
respond favorably to my request for an i;unicode-casemap as an alternative
to i;ascii-casemap.
i;unicode-casemap, to my thinking, fills in the gap between the legacy
i;ascii-casemap and full-fledged stringprep. Full-fledged stringprep is
the ideal, but I'm afraid that if we require stringprep without an
incremental approach many implementations will just punt and use
i;ascii-casemap.
The all-or-none approach was taken with SSL/TLS and proved to be very
painful. It was entirely too easy for developers to choose "none" and
make the people who followed the specification look like bad guys.
In this case, we don't have a security crisis. We are just trying to make
things better for non-ASCII character sets and non-English languages.
This will prove to be a long-term evolutionary process; nobody really has
solved the general comparison/collation problem.
[As you and I well know, anyone who says that their software can collate
Japanese text without yomi or contextual hints is a liar or fool; does
大和 collate after 山寺 or 対話? i;unicode-casemap won't collate this
correctly, but neither will anything else.]
So, I think that even though reasonable people may disagree on what is
"good enough", we can have general agreement on what is "better" and that
"better" is always preferable to "nothing". i;unicode-casemap should
always be better for any script whose collation can be inferred from
Unicode codepoint value; and for Latin-script languages that use
diacriticals it should be MUCH better.
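
As a rough end-to-end illustration of the diacriticals claim, the sketch below builds sort keys for both collations and sorts three words. All function names are hypothetical. The i;unicode-casemap key follows the steps described earlier in this message (decompose, title-case, compare as UTF-8 octets), with two assumptions: compatibility decompositions are included, and Python's `str.title()` stands in for the UnicodeData.txt title-case mapping.

```python
import unicodedata


def _decompose(ch: str) -> str:
    # Recursively substitute the UnicodeData.txt decomposition, if any.
    parts = unicodedata.decomposition(ch).split()
    if not parts:
        return ch
    if parts[0].startswith("<"):  # drop a tag such as <compat>
        parts = parts[1:]
    return "".join(_decompose(chr(int(p, 16))) for p in parts)


def _title(ch: str) -> str:
    # Approximate the simple title-case mapping (see assumptions above).
    titled = ch.title()
    return titled if len(titled) == 1 else ch


def unicode_casemap_key(s: str) -> bytes:
    decomposed = "".join(_decompose(ch) for ch in s)
    return "".join(_title(ch) for ch in decomposed).encode("utf-8")


def ascii_casemap_key(s: str) -> bytes:
    # Upper-case only the octets for ASCII a-z; leave the rest alone.
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in s.encode("utf-8"))


words = ["rz", "r\u00e9sum\u00e9", "resume"]
print(sorted(words, key=ascii_casemap_key))    # accented word sorts last
print(sorted(words, key=unicode_casemap_key))  # accented word sorts next to "resume"
```

With the ASCII-only key, the lead octet of the precomposed U+00E9 (C3) pushes the accented word past every ASCII letter; after decomposition its base letter "E" is compared first, so it lands beside its unaccented neighbor.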
-- Mark --
http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.