Re: Review of sort
At 06:24 06/11/20, Mark Crispin wrote:
>On Sat, 18 Nov 2006, Martin Duerst wrote:
>> > The "i;unicode-casemap" collation is a simple collation which operates
>> > on octet strings and attempts to Unicode characters case-insensitively.
>> 'operates on octets and attempts ... Unicode' sounds very weird.
>> Unless you say something about character encoding.
>
>All strings are coerced to UTF-8, but subsequently are treated as octet strings instead of variable-width character strings. The reference to "attempts to compare Unicode characters case-insensitively" [I agree that the verb "compare" needs to be included] refers to a higher-level effect rather than what the algorithm actually does.
If that's all really crystal clear from the specification, it's probably
okay. It was just difficult to guess from the part you sent. Also,
please keep in mind that collations should be reusable, so there
may be a need to explain more in the collation definition than what's
needed in the context of the protocol you're looking at.
>> I think I understand your use of octet here; I guess that what you are
>> trying to say is that except for the provisions below, comparison is
>> by codepoint (which means octet comparison in the case of UTF-8).
>
>"Codepoint" to me would refer to a Unicode 20.5 bit codepoint. I really mean 8-bit quantities, without regard to the character boundaries.
Well, yes, but if you do octet-by-octet comparison on data encoded as
UTF-8, then this is exactly the same (assuming you don't have any
overlong encodings, which would be a security risk anyway) as
codepoint-by-codepoint.
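
To illustrate the equivalence claimed above, here is a minimal Python sketch. The helper names (three_way, orders_agree) and the sample strings are made up for this illustration and are not taken from the draft. It assumes well-formed UTF-8 with no overlong encodings, which is what Python's encoder produces.

```python
from itertools import product

def three_way(a, b):
    """Return -1, 0 or 1 depending on how a orders relative to b."""
    return (a > b) - (a < b)

# Hypothetical sample covering 1-, 2-, 3- and 4-octet UTF-8 sequences.
samples = ["", "a", "ab", "z", "\u00e9", "e\u0301",
           "\u65e5\u672c", "\uffff", "\U00010000", "\U0001F600"]

def orders_agree(strings):
    """Check that octet order on UTF-8 matches code point order."""
    for a, b in product(strings, repeat=2):
        by_codepoint = three_way([ord(c) for c in a], [ord(c) for c in b])
        by_octet = three_way(a.encode("utf-8"), b.encode("utf-8"))
        if by_codepoint != by_octet:
            return False
    return True

print(orders_agree(samples))
```

The check passes on every pair because UTF-8 was designed so that longer sequences start with larger lead octets; the same would not hold for octet comparison of UTF-16 data.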
>> > except that first, all pre-composed Unicode characters are fully
>> > decomposed;
>> Does this mean NDF, or just any kind of decomposition?
>
>I'm not familiar with the decomposition names. What I meant is "if a decomposition for a codepoint is defined in UnicodeData.txt, substitute that decomposition for the codepoint."
In that case, it's definitely not NFD (sorry, 'NDF' above was wrong).
Is this done recursively or not?
Given your arguments below about 'easy to implement', I'm wondering
whether this is worth the effort. It probably is, it gives close
to 'real' results for many languages such as French, German,...
The Nordic languages (Swedish, Norwegian, Danish) lose out, but
that's the same for i;basic. And the implementation complexity
is the same as for casing: take characters and replace them
by character strings. It can be done in one step by combining
both mappings together.
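
A rough Python sketch of such a combined per-character mapping follows. Several things here are assumptions rather than anything the draft says: the function names are invented, compatibility decompositions are applied along with canonical ones (the formatting tag is simply dropped), the recursive flag exists only because the question of recursion is still open, and Python's str.title() stands in for Simple_Titlecase_Mapping (results longer than one character are discarded, which is only an approximation).

```python
import unicodedata
from functools import lru_cache

def decompose(ch, recursive=True):
    """Substitute the decomposition field of UnicodeData.txt, if any."""
    field = unicodedata.decomposition(ch)
    if not field:
        return ch
    # Assumption: tags such as <compat> are ignored, not treated specially.
    parts = [p for p in field.split() if not p.startswith("<")]
    expanded = "".join(chr(int(p, 16)) for p in parts)
    if not recursive:
        return expanded
    return "".join(decompose(c, True) for c in expanded)

def simple_title(ch):
    """Approximate the simple titlecase mapping of one character."""
    t = ch.title()
    return t if len(t) == 1 else ch

@lru_cache(maxsize=None)
def map_char(ch, recursive=True):
    """One combined step: character in, replacement string out."""
    return "".join(simple_title(c) for c in decompose(ch, recursive))

def casemap_key(s, recursive=True):
    """Map every character, then hand back UTF-8 octets for comparison."""
    return "".join(map_char(c, recursive) for c in s).encode("utf-8")

# Precomposed and decomposed spellings end up with the same key.
print(casemap_key("\u00c9cole") == casemap_key("e\u0301COLE"))
# U+1EA5 shows why recursion matters: its decomposition contains U+00E2.
print(casemap_key("\u1ea5") == casemap_key("\u1ea5", recursive=False))
```

With the cache in place, each distinct character is mapped once, so in practice this behaves like a single lookup table that holds both mappings.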
>> > and second, all characters which have a title case are
>> > changed to their title case.
>> There are not many characters with title case. Maybe about 5.
>> Given the name of the collation, I suspect that you want to
>> define some mappings for characters that don't have title case,
>> too.
>
>AFAICT, all characters with an upper case defined also have a title case defined in UnicodeData.txt. As you noted, only a handful of these differ.
Ah, now I know what you mean. I think you should be very clear that you
mean the value of the Simple_Titlecase_Mapping property as given in
the Unicode Data file.
I still don't understand why you choose Titlecase. The output,
applied to individual characters, may look weird if you get
a Serbo-Croatian digraph, e.g. LjUBLjANA (yes, I know,
Ljubljana is in Slovenia, not in Serbia or Croatia).
Why not lower case (what's usually recommended) or upper case?
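
The digraph effect can be reproduced with a few lines of Python. This is only a sketch: the fold helper is a made-up name, no decomposition is applied, and Python's built-in case methods are used in place of the simple mappings from the Unicode Data file (they agree for the characters shown). The word is spelled with the single digraph character U+01C9.

```python
# "ljubljana" written with the digraph character U+01C9 (lj) twice.
word = "\u01c9ub\u01c9ana"

def fold(s, how):
    """Apply one case mapping character by character (how: title/upper/lower)."""
    return "".join(getattr(c, how)() for c in s)

# Title case maps U+01C9 to U+01C8 (Lj), which displays as LjUBLjANA.
print(fold(word, "title"))
# Upper case maps it to U+01C7 (LJ), lower case leaves U+01C9 (lj).
print(fold(word, "upper"))
print(fold(word, "lower"))

# For matching purposes all three choices group the same spellings together.
spellings = ["\u01c9ub\u01c9ana", "\u01c7UB\u01c7ANA", "\u01c8ub\u01c9ana"]
for how in ("title", "upper", "lower"):
    print(how, len({fold(s, how) for s in spellings}) == 1)
```

So the choice among the three does not change which strings compare equal in this example; it changes only what the mapped string looks like and where it lands in the ordering.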
>Thanks for your help! I hope that the people who decide these things respond favorably to my request for an i;unicode-casemap as an alternative to i;ascii-casemap.
>
>i;unicode-casemap, to my thinking, fills in the gap between the legacy i;ascii-casemap and full-fledged stringprep.
I'm not thinking about stringprep as a collation. In many ways, your proposal
is closer to i;basic than sorting on the output of stringprep, because stringprep
uses precomposition, not decomposition in most cases.
>Full-fledged stringprep is the ideal, but I'm afraid that if we require stringprep without an incremental approach many implementations will just punt and use i;ascii-casemap.
I'm afraid that if you say stringprep is the ideal, then you are either talking
about something other than sorting, or haven't understood that stringprep isn't
designed for sorting.
>The all-or-none approach was taken with SSL/TLS and proved to be very painful. It was entirely too easy for developers to choose "none" and make the people who followed the specification look like bad guys.
Oh, so for SSL/TLS, I guess it would only be used for equality comparison,
not for actual ordering.
>In this case, we don't have a security crisis. We are just trying to make things better for non-ASCII character sets and non-English languages. This will prove to be a long-term evolutionary process; nobody really has solved the general comparison/collation problem.
>
>[As you and I well know, anyone who says that their software can collate Japanese text without yomi or contextual hints is a liar or fool; does 大和 collate after 山寺 or 対話? i;unicode-casemap won't collate this correctly, but neither will anything else.]
>
>So, I think that even though reasonable people may disagree on what is "good enough", we can have general agreement on what is "better" and that "better" is always preferable to "nothing". i;unicode-casemap should always be better for any script whose collation can be inferred from Unicode codepoint value; and for Latin-script languages that use diacriticals it should be MUCH better.
Yes indeed, except for the Nordic languages that sort their 'diacriticized'
characters after the z.
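
A small Python sketch of that exception, under loud assumptions: NFD plus upper-casing is used here only as a convenient stand-in for the proposed decomposition and case mapping (it is not the collation as drafted), and approx_key and the Swedish word list are invented for the illustration.

```python
import unicodedata

def approx_key(s):
    """Stand-in key: decompose (NFD), upper-case, compare as UTF-8 octets."""
    return unicodedata.normalize("NFD", s).upper().encode("utf-8")

# apa, zebra, \u00e5l, \u00e4rta, \u00f6l is the order a Swedish reader expects.
words = ["zebra", "\u00e4rta", "apa", "\u00f6l", "\u00e5l"]

for w in sorted(words, key=approx_key):
    print(w)
```

After decomposition the accented letters start with a plain A or O, so they sort among the A and O words instead of after the z.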
Regards, Martin.
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@xxxxxxxxxxxxxxx