Re: Review of sort



On Sat, 18 Nov 2006, Martin Duerst wrote:
> >    The "i;unicode-casemap" collation is a simple collation which operates
> >    on octet strings and attempts to Unicode characters case-insensitively.
> 'operates on octets and attempts ... Unicode' sounds very weird.
> Unless you say something about character encoding.

All strings are coerced to UTF-8, and from then on are treated as octet strings rather than as variable-width character strings. The phrase "attempts to compare Unicode characters case-insensitively" [I agree that the verb "compare" needs to be included] describes the higher-level effect, not what the algorithm actually does.

> I think I understand your use of octet here; I guess that what you are
> trying to say is that except for the provisions below, comparison is
> by codepoint (which means octet comparison in the case of UTF-8).

"Codepoint" to me would refer to a Unicode 20.5-bit codepoint. I really mean 8-bit quantities, without regard to character boundaries.
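
To make the octet-versus-codepoint distinction concrete, here is a minimal Python sketch. The helper name `octet_key` and the sample strings are illustrative assumptions, not part of the draft. It coerces each string to UTF-8 and compares the resulting 8-bit values; because UTF-8 was designed so that byte order preserves codepoint order, the two views should yield the same ordering for well-formed text.

```python
def octet_key(text: str) -> bytes:
    """Coerce to UTF-8; the result is compared as raw 8-bit values."""
    return text.encode("utf-8")


# Sample data (hypothetical): ASCII, Latin with diacritic, CJK, non-BMP.
words = ["z", "é", "a", "大和", "\U0001F600"]

# Octet-wise ordering, ignoring character boundaries.
by_octets = sorted(words, key=octet_key)

# Python compares str values codepoint by codepoint.
by_codepoints = sorted(words)

print(list(octet_key("é")))        # the two octets that encode U+00E9
print(by_octets == by_codepoints)
```

The comparison itself never needs to know where one character ends and the next begins, which is the point of saying "octets" rather than "codepoints".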

> >    except that first, all pre-composed Unicode characters are fully
> >    decomposed;
> Does this mean NFD, or just any kind of decomposition?

I'm not familiar with the decomposition names. What I meant is "if a decomposition for a codepoint is defined in UnicodeData.txt, substitute that decomposition for the codepoint."
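
As a rough sketch of that substitution rule (not a definitive implementation), Python's standard `unicodedata.decomposition()` returns the decomposition-mapping field from UnicodeData.txt for a character. The function name `decompose` is hypothetical. Two assumptions are made here that the description above does not settle: compatibility mappings (those carrying a tag such as `<compat>`) are substituted just like canonical ones, and no canonical reordering of combining marks is performed.

```python
import unicodedata


def decompose(text: str) -> str:
    """Recursively substitute UnicodeData.txt decomposition mappings."""
    out = []
    for ch in text:
        mapping = unicodedata.decomposition(ch)
        if not mapping:
            # No decomposition defined for this codepoint: keep it.
            out.append(ch)
            continue
        fields = mapping.split()
        # Assumption: drop a leading tag such as "<compat>" and use the
        # mapping anyway; the text above does not say either way.
        if fields[0].startswith("<"):
            fields = fields[1:]
        expanded = "".join(chr(int(f, 16)) for f in fields)
        # "Fully decomposed": the replacement may itself decompose.
        out.append(decompose(expanded))
    return "".join(out)


print([hex(ord(c)) for c in decompose("\u00e9")])   # e with acute
print([hex(ord(c)) for c in decompose("\u01d6")])   # u with diaeresis and macron
```

Hangul syllables are decomposed algorithmically rather than through UnicodeData.txt entries, so this sketch leaves them untouched.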

> >    and second, all characters which have a title case are
> >    changed to their title case.
> There are not many characters with title case. Maybe about 5.
> Given the name of the collation, I suspect that you want to
> define some mappings for characters that don't have title case,
> too.

AFAICT, all characters with an upper case defined also have a title case defined in UnicodeData.txt. As you noted, only a handful of these differ.
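
A small Python sketch of the title-case step follows; `titlecase_map` is a hypothetical name. One caveat, stated as an assumption: Python's `str.title()` applies the full case mappings (including SpecialCasing.txt), not only the simple title-case field of UnicodeData.txt, so a few characters such as `ß` come out differently than a pure UnicodeData.txt lookup would give.

```python
def titlecase_map(text: str) -> str:
    """Map each character to its title case, one character at a time."""
    # Calling title() per character avoids word-based capitalization.
    return "".join(ch.title() for ch in text)


# For most cased letters, title case and upper case coincide.
print(titlecase_map("hello") == "hello".upper())

# U+01C6 (dz with caron digraph) is one of the few where they differ:
# upper case is U+01C4, title case is U+01C5.
digraph = "\u01c6"
print(hex(ord(digraph.upper())), hex(ord(titlecase_map(digraph))))
```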

> Hope this helps. Regards,     Martin.

Thanks for your help! I hope that the people who decide these things respond favorably to my request for an i;unicode-casemap as an alternative to i;ascii-casemap.

i;unicode-casemap, to my thinking, fills the gap between the legacy i;ascii-casemap and full-fledged stringprep. Full-fledged stringprep is the ideal, but I'm afraid that if we require stringprep without an incremental approach, many implementations will just punt and use i;ascii-casemap.

The all-or-none approach was taken with SSL/TLS and proved to be very painful. It was entirely too easy for developers to choose "none" and make the people who followed the specification look like bad guys.

In this case, we don't have a security crisis. We are just trying to make things better for non-ASCII character sets and non-English languages. This will prove to be a long-term evolutionary process; nobody really has solved the general comparison/collation problem.

[As you and I well know, anyone who says that their software can collate Japanese text without yomi or contextual hints is a liar or fool; does 大和 collate after 山寺 or 対話? i;unicode-casemap won't collate this correctly, but neither will anything else.]

So, I think that even though reasonable people may disagree on what is "good enough", we can have general agreement on what is "better" and that "better" is always preferable to "nothing". i;unicode-casemap should always be better for any script whose collation can be inferred from Unicode codepoint value; and for Latin-script languages that use diacriticals it should be MUCH better.

-- Mark --

http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.