
Re: Review of sort




On Wed, 22 Nov 2006, Cyrus Daboo wrote:
I think treating this as an "upgrade" to i;ascii-casemap as opposed to an alternative to i;basic is a good approach, so I think your proposal is fine.

OK, I've just sent draft-crispin-comparator-unicode-00.txt to the I-D repository. It's definitely going to need some work, but I hope that the general idea survives more or less intact.

For your convenience, I'm including it below.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.






Network Working Group                                         M. Crispin
Internet-Draft                                  University of Washington
Document: internet-drafts/draft-crispin-comparator-unicode-00.txt
                                                           November 2006


       Internet Application Protocol Simple Unicode Comparator

Status of this Memo

   By submitting this Internet-Draft, each author represents that
   any applicable patent or other IPR claims of which he or she is
   aware have been or will be disclosed, and any of which he or she
   becomes aware will be disclosed, in accordance with Section 6 of
   BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   A revised version of this document will be submitted to the RFC
   editor as an Informational Document for the Internet Community.

   A revised version of this draft document will be submitted to the RFC
   editor as a Proposed Standard for the Internet Community.  Discussion
   and suggestions for improvement are requested, and should be sent to
   ietf-imapext@xxxxxxxx  This document will expire before 27 May 2007.
   Distribution of this memo is unlimited.


Abstract

   This document describes "i;unicode-casemap", a simple
   case-insensitive collation for Unicode strings.  It provides
   equality, substring and ordering operations.


Introduction

   The "i;ascii-casemap" collation described in [COMPARATOR] is quite
   simple to implement and provides case-independent comparisons for the
   26 Latin alphabetics.  It is specified as the default and/or baseline
   comparator in some application protocols, e.g., [IMAP-SORT].

   It is possible, with a modest extension, to provide a more
   sophisticated collation with greater multilingual applicability than
   "i;ascii-casemap".

   This collation, "i;unicode-casemap", is intended to be an alternative
   to, and preferred over, "i;ascii-casemap".  It does not replace the
   "i;basic" collation described in [BASIC].


1. Unicode Casemap Collation Description

   The "i;unicode-casemap" collation is a simple collation which
   operates on Unicode strings and treats characters case-insensitively.
   It provides equality, substring and ordering operations.  All input
   is valid.

   For the equality and ordering operations, each input string is
   prepared by converting it to "titlecased canonicalized UTF-8" as
   follows on a per-character basis:

      (1) If the string is in a non-Unicode character set, the codepoint
          is converted from that character set to the associated
          codepoint in Unicode.
      (2) If the codepoint has a titlecase property in UnicodeData.txt
          (this is normally the same as the uppercase property) the
          codepoint is converted to the titlecased codepoint.
      (3) If the codepoint has a decomposition property in
          UnicodeData.txt, the codepoint is converted to the decomposed
          codepoints.
      (4) The resulting codepoint(s) is/are appended to the titlecased
          canonicalized UTF-8 string.

   The resulting two titlecased canonicalized UTF-8 strings are then
   treated as in i;octet for equality and ordering.
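
   The preparation steps above can be sketched in Python as follows.
   This is only an illustration under stated assumptions, not a
   reference implementation.  The names prepare, equal and order are
   hypothetical.  Step (1) is assumed to have happened already, since a
   Python str is a sequence of Unicode codepoints.  Python does not
   expose the simple titlecase field of UnicodeData.txt directly, so the
   sketch approximates it with str.title() and keeps the original
   codepoint whenever the full mapping yields more than one codepoint.
   The decomposition is applied once (not recursively), and canonical
   and compatibility decompositions are treated alike; that is one
   reading of step (3), which the text does not settle.

```python
import unicodedata


def _titlecase(ch: str) -> str:
    # Assumption: approximate the simple titlecase mapping with str.title(),
    # ignoring full mappings that expand to several codepoints.
    mapped = ch.title()
    return mapped if len(mapped) == 1 else ch


def _decompose(ch: str) -> str:
    # unicodedata.decomposition() returns the raw UnicodeData.txt field,
    # e.g. '0045 0301' or '<compat> 0020 0308', or '' if there is none.
    fields = unicodedata.decomposition(ch).split()
    if not fields:
        return ch
    if fields[0].startswith("<"):
        # Assumption: drop the compatibility tag and use the mapping anyway.
        fields = fields[1:]
    return "".join(chr(int(f, 16)) for f in fields)


def prepare(s: str) -> bytes:
    """Build the 'titlecased canonicalized UTF-8' form, one codepoint at a time."""
    pieces = []
    for ch in s:
        pieces.append(_decompose(_titlecase(ch)))  # steps (2), (3), (4)
    return "".join(pieces).encode("utf-8")


def equal(a: str, b: str) -> bool:
    # Equality as in i;octet: compare the prepared octet strings.
    return prepare(a) == prepare(b)


def order(a: str, b: str) -> int:
    # Ordering as in i;octet: -1, 0 or 1 from octet-wise comparison.
    ka, kb = prepare(a), prepare(b)
    return (ka > kb) - (ka < kb)
```

   With this sketch, equal("hello", "HELLO") holds, and a precomposed
   U+00C9 compares equal to "e" followed by U+0301, because both
   prepare to "E" followed by U+0301.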

   Care should be taken when using OS-supplied functions to implement
   this collation as it is not locale sensitive.  Functions such as
   strcasecmp and toupper are sometimes locale sensitive and may
   inconsistently casemap letters.

   The i;unicode-casemap collation is well suited to use with many
   Internet protocols and computer languages.  Use with natural language
   is often inappropriate: even though the collation apparently supports
   languages such as Swahili and English, in real-world use it tends to
   mis-sort a number of types of string:

   o  people and place names containing scripts that are not collated
      according to "alphabetical order".
   o  words with characters that have diacriticals.  However,
      i;unicode-casemap generally does a better job than i;ascii-casemap
      for most (but not all) languages.  For example, German umlaut
      letters will sort correctly, but some Scandinavian letters will
      not.
   o  names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
      in English).
   o  strings containing other non-letter symbols; e.g., euro and pound
      sterling symbols, quotation marks other than '"', dashes/hyphens,
      etc.
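
   Two of the cases above can be reproduced with a small Python sketch.
   The helper sort_key is hypothetical and makes the same assumptions
   as any approximation built on Python's unicodedata module:
   str.title() stands in for the simple titlecase mapping, and the
   decomposition is applied once with any compatibility tag dropped.
   The sample names are illustrative only.

```python
import unicodedata


def sort_key(s: str) -> bytes:
    """Approximate i;unicode-casemap preparation (hypothetical helper)."""
    out = []
    for ch in s:
        t = ch.title()
        if len(t) != 1:
            t = ch  # assumption: ignore multi-codepoint full case mappings
        fields = [f for f in unicodedata.decomposition(t).split()
                  if not f.startswith("<")]
        out.append("".join(chr(int(f, 16)) for f in fields) if fields else t)
    return "".join(out).encode("utf-8")


names = ["Zebra", "\u00c5re", "Bo", "Lyon", "Lloyd"]

# Octet order of the prepared strings, as i;octet would compare them.
result = sorted(names, key=sort_key)
print(result)
```

   U+00C5 decomposes to "A" plus a combining ring, so the name starting
   with it sorts before "Bo", whereas Swedish places that letter after
   "Z".  Likewise "Lloyd" sorts before "Lyon", which is the English
   order rather than the Welsh one.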

2. Unicode Casemap Collation Registration

   <?xml version='1.0'?>
   <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
   <collation rfc="XXXX" scope="local" intendedUse="common">
     <identifier>i;unicode-casemap</identifier>
     <title>Unicode Casemap</title>
     <operations>equality order substring</operations>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submitter>mrc@xxxxxxxxxxxxxxxxxx</submitter>
   </collation>

3. Security Considerations

   Collations will normally be used with UTF-8 strings.  Thus the
   security considerations for [UTF-8], [STRINGPREP] and
   [UNICODE-SECURITY] also apply and are normative to this
   specification.


4. IANA Considerations

   The i;unicode-casemap collation should be added to the registry of
   collations defined in [COMPARATOR].


5. Normative References

   The following documents are normative to this document:

   [BASIC]               ???, Work in Progress.

   [COMPARATOR]          Newman, C., "Internet Application Protocol
                         Collation Registry", Work in Progress.

   [STRINGPREP]          Hoffman, P. and M. Blanchet, "Preparation of
                         Internationalized Strings ("stringprep")",
                         RFC 3454, December 2002.

   [UTF-8]               Yergeau, F., "UTF-8, a transformation format
                         of ISO 10646", STD 63, RFC 3629, November 2003.

   [UNICODE-SECURITY]    Davis, M. and M. Suignard, "Unicode Security
                         Considerations", February 2006,
                         <http://www.unicode.org/reports/tr36/>.


6. Informative References

   [IMAP-SORT]           Crispin, M., "Internet Message Access Protocol -
                         SORT and THREAD Extensions", Work in Progress.


Appendices

Author's Address

   Mark R. Crispin
   Networks and Distributed Computing
   University of Washington
   4545 15th Avenue NE
   Seattle, WA  98105-4527

   Phone: +1 (206) 543-5762

   EMail: MRC@xxxxxxxxxxxxxxxxxx


Full Copyright Statement

   Copyright (C) The Internet Society (2006).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at ietf-
   ipr@xxxxxxxxx


Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.