Re: Normalisation and matching
> From all I have read the best thing is if sender does normalisation, not
> receiver. It is often easy during input to normalise without overhead.
> To write code that can normalise every time you get data before it can
> be usable will cost a lot more.
The "will cost a lot more" is a bit overstated. Because most text, in
practice, is already normalized, the usual practice is to check
whether the text is already normalized, using a very fast algorithm
like that in http://www.unicode.org/reports/tr15/#Annex8, or
enhancements thereof. Only if unnormalized sequences are detected does
full normalization need to be invoked.
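
As a rough illustration of that check-first strategy, here is a minimal Python sketch. It is not the algorithm from the TR15 annex itself; it assumes the standard-library unicodedata module (is_normalized needs Python 3.8 or later), and the function name ensure_nfc is made up for this example.

```python
import unicodedata


def ensure_nfc(text: str) -> str:
    """Return text in NFC, doing the full normalization only when needed.

    ensure_nfc is a hypothetical helper name used for illustration.
    """
    # Fast path: most real-world text is already normalized, so a
    # check is usually all that is required.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Slow path: unnormalized sequences were detected, so invoke
    # full normalization.
    return unicodedata.normalize("NFC", text)


# "e" followed by U+0301 COMBINING ACUTE ACCENT is not in NFC;
# the precomposed U+00E9 is.
decomposed = "cafe\u0301"
composed = ensure_nfc(decomposed)
```

A receiver that needs to compare addresses could run both strings through such a helper first; already-normalized input only pays for the check.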
It does cost a bit more, since you do have to check each character
[only in House Republican fancy does increasing (government spending)
reduce size (of government)]. But depending on how much other
processing is going on, it may or may not be significant. In parsing
XML, for example, it is not.
For more descriptions of different strategies, see the end of
http://www.unicode.org/reports/tr15/#Canonical_Equivalence.
Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄
----- Original Message -----
From: "Dan Oscarsson" <Dan.Oscarsson@xxxxxxxxxxxxxxx>
To: <ietf-imaa@xxxxxxx>
Sent: Wednesday, August 13, 2003 05:41
Subject: Re: Normalisation and matching
>
> Adam M. Costello wrote:
>
> >Guess what recommendation I would argue for? Minimal restrictions. I
> >would recommend that IMA-aware protocols allow all valid IMAs (including
> >non-Nameprepped ones) in whatever charsets they want to support (and
> >they should at least support UTF-8, as recommended by BCP-whatever).
> >Applications should not apply Nameprep or NFC or anything before
> >putting the mail address into the slot; they should leave that to the
> >receiver to do if necessary. (It will be necessary if receiver wants
> >to compare the address, or relay it into a IMA-unaware slot, but not if
> >the receiver merely wants to relay it into an IMA-aware slot, or display
> >it).
>
> I am sure you will find that a lot of software capable of displaying
> UCS will fail when it is not normalised, and also for things like
> full width or circled forms.
>
> >
> >I see two advantages to this approach. First, it allows presentational
> >details to be preserved (like fullwidth characters, superscript
> >characters, sharp-s, etc).
>
> Normalised does not mean that that goes away. If you use NFC all is
> preserved. But to simplify character handling only ONE representation
> of a character should be allowed. This does not mean NFKC - it
> unfortunately does more than that. I want sharp-s and masculine ordinal
> indicator to be preserved.
> I do not want full width characters as not all letters can be full
> width and it is just a second encoding of the standard width letter,
> nor do I want ligatures. I prefer simple forms.
>
>
> >
> >Second, it reduces superfluous computation at the sending end in
> >two cases. Case 1: If the receiver doesn't need the string to be
> >Nameprepped, then it would be a waste for the sender to apply Nameprep.
>
> I do not want it to be nameprepped. I want it to be normalised and
> no multiple representations of the same character.
>
>
> >So the approach I would recommend is to let applications take
> >responsibility for applying Nameprep when they themselves need it, don't
> >depend on other applications to pre-apply it for you, and don't bother
> >trying to pre-apply it for someone else.
>
> From all I have read the best thing is if sender does normalisation, not
> receiver. It is often easy during input to normalise without overhead.
> To write code that can normalise every time you get data before it can
> be usable will cost a lot more.
>
> IRI uses NFC for this reason.
>
> Dan
>
>