
Re: Normalisation and matching



Dan Oscarsson <Dan.Oscarsson@xxxxxxxxxxxxxxx> wrote:

> If IMAA does not want to say anything on form of IMA, then it should
> NOT require full width @ to be recognised. That is up to those
> defining how IMAs are to be used to define.
>
> As IDNA I now understand does not define how IDNs should be encoded
> there is no reason to require anything about full dots.  That will
> be up to protocol implementors that create protocols with IDN aware
> slots.

IDNA/IMAA define the space of valid IDNs/IMAs and the equivalence
relation among them.  Neither requires that a slot allow all forms.
IMA-unaware slots allow only ASCII forms, and a future IMA-aware slot
could choose to allow only normalized forms, in which case fullwidth
at-sign would not be allowed, and therefore obeying the requirement
that fullwidth at-sign be recognized would be effortless and automatic,
because a receiver cannot fail to recognize something that it never has
an opportunity to see.  Similarly, a future IDN-aware slot might allow
only normalized forms devoid of ideographic full stop, in which case
it would be impossible for a receiver to violate the requirement that
ideographic full stops be recognized as dots.

Perhaps the specs could be clearer on this point.  Here is the
current wording:

    Whenever dots are used as label separators, the following characters
    MUST be recognized as dots:

    In an internationalized mail address, the following characters MUST
    be recognized as at-signs for separating the local part from the
    domain name:

Maybe adding "if they appear" would make them clearer:

    Whenever dots are used as label separators, the following characters
    MUST be recognized as dots if they appear in an IDN:

    For separating the local part from the domain name, the following
    characters MUST be recognized as at-signs if they appear in an
    internationalized mail address:

There is no requirement that these characters be allowed in any given
slot; the intention is merely to require that if they appear they must
be treated as equivalent to the ASCII delimiters.
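
To make "recognized if they appear" concrete, here is a rough sketch
in Python.  It is not taken from either spec.  The function name
split_address is made up, the delimiter sets are my reading of the
current drafts and should be checked against them, and splitting at
the last at-sign is a simplifying assumption.

```python
import re

# Assumed delimiter sets (verify against the IDNA/IMAA drafts):
# full stop, ideographic full stop, fullwidth full stop,
# halfwidth ideographic full stop.
DOTS = "\u002E\u3002\uFF0E\uFF61"
# Commercial at and fullwidth commercial at.
AT_SIGNS = "\u0040\uFF20"

_DOT_RE = re.compile("[" + re.escape(DOTS) + "]")


def split_address(address):
    """Return (local_part, labels), treating every recognized
    at-sign and dot as equivalent to its ASCII counterpart."""
    positions = [i for i, ch in enumerate(address) if ch in AT_SIGNS]
    if not positions:
        raise ValueError("no at-sign found")
    cut = positions[-1]  # assumption: the last at-sign separates
    local, domain = address[:cut], address[cut + 1:]
    return local, _DOT_RE.split(domain)
```

A slot that admits only ASCII delimiters never feeds the non-ASCII
ones to this code, so it satisfies the requirement without doing
anything extra.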

> I am sure you will find that a lot of software capable of displaying
> UCS will fail when it is not normalised, and also for things like full
> width or circled forms.

That's an argument for normalizing UCS text before displaying it.  That
doesn't imply that UCS text should be normalized any earlier than
that.  Deferring normalization until it's really necessary would allow
applications with non-buggy display systems (and applications that don't
display the address at all) to opt out of the normalization and save
that cost.

> If you use NFC all is preserved.  But to simplify character handling
> only ONE representation of a character should be allowed.  This does
> not mean NFKC - it unfortunately does more than that.  I want sharp-s
> and masculine ordinal indicator to be preserved.

NFKC preserves sharp-s; it's case-folding that destroys sharp-s.  But
NFKC does destroy the ordinal indicators.
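
This is easy to check with Python's standard unicodedata module.  The
helper names below are made up for illustration, and str.casefold()
is Unicode full case folding, used here only as a stand-in for the
case-folding step in Nameprep.

```python
import unicodedata


def nfc(s):
    return unicodedata.normalize("NFC", s)


def nfkc(s):
    return unicodedata.normalize("NFKC", s)


SHARP_S = "\u00DF"       # LATIN SMALL LETTER SHARP S
MASC_ORDINAL = "\u00BA"  # MASCULINE ORDINAL INDICATOR
FEM_ORDINAL = "\u00AA"   # FEMININE ORDINAL INDICATOR
FULLWIDTH_A = "\uFF21"   # FULLWIDTH LATIN CAPITAL LETTER A
FI_LIGATURE = "\uFB01"   # LATIN SMALL LIGATURE FI

# NFKC leaves sharp-s alone; case folding is what turns it into "ss".
sharp_s_after_nfkc = nfkc(SHARP_S)
sharp_s_after_folding = SHARP_S.casefold()

# NFKC replaces the ordinal indicators with plain letters,
# while NFC keeps them.
ordinals_after_nfkc = nfkc(MASC_ORDINAL + FEM_ORDINAL)
ordinals_after_nfc = nfc(MASC_ORDINAL + FEM_ORDINAL)

# NFC also keeps fullwidth letters and ligatures; NFKC removes them.
wide_and_ligature_after_nfc = nfc(FULLWIDTH_A + FI_LIGATURE)
wide_and_ligature_after_nfkc = nfkc(FULLWIDTH_A + FI_LIGATURE)
```

So neither NFC nor NFKC does exactly what you describe: NFC keeps the
fullwidth forms and ligatures you want removed, and NFKC removes the
ordinal indicators you want kept.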

> I do not want full width characters as not all letters can be full
> width and it is just a second encoding of the standard width letter,
> nor do I want ligatures.

So what you're saying is, you want to require pre-normalization, not
to NFC or NFKC, but to a new form, NF-Dan.  I guess your next step is
to write a spec for that.

Even assuming you write that spec and convince the Unicode Consortium or
this mailing list to consider NF-Dan, I still don't see the advantage
of requiring early normalization.  The receiver knows what it needs,
but the sender doesn't know what the receiver needs.  If a receiver
needs normalized strings for whatever reason, it can normalize them
itself, but if it doesn't need normalized strings, there's no need to
waste the sender's effort on pre-normalization.  How often would the
receiver benefit from pre-normalization anyway?  If the receiver is
going to display the address, there might be a benefit, if the display
system can't handle non-normalized strings (how common is that?).  But
if the receiver is going to compare the address, or resolve the address,
or gateway it to an ASCII slot, then there is no benefit, because the
receiver needs to apply Nameprep's case-folding and NFKC, which means
any prior normalization is redundant.
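
A small sketch of why the sender's work is wasted in the comparison
case.  The function approx_canonical is a made-up name and only
approximates Nameprep (NFKC plus Unicode case folding); it omits
Nameprep's other mapping tables, prohibited-character checks and bidi
checks, so treat it as an illustration rather than an implementation.

```python
import unicodedata


def approx_canonical(s):
    """Rough stand-in for Nameprep's folding: NFKC, case-fold, NFKC."""
    folded = unicodedata.normalize("NFKC", s).casefold()
    return unicodedata.normalize("NFKC", folded)


def same_name(a, b):
    """Compare two names the way a receiver would, with no
    assumption about what the sender did beforehand."""
    return approx_canonical(a) == approx_canonical(b)


# "Cafe" + combining acute, with a fullwidth "E" in the second label.
raw = "Cafe\u0301.\uFF25XAMPLE"
# The same string as a sender who pre-normalized to NFC would send it.
pre_normalized = unicodedata.normalize("NFC", raw)

# The receiver's result is identical either way.
raw_result = approx_canonical(raw)
pre_normalized_result = approx_canonical(pre_normalized)
```

The receiver runs the same steps on both inputs and gets the same
answer, so the sender's NFC pass bought nothing.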

> From all I have read the best thing is if sender does normalisation,
> not receiver.  It is often easy during input to normalise without
> overhead.  To write code that can normalise every time you get data
> before it can be usable will cost a lot more.
>
> IRI uses NFC for this reason.

IRI recommends NFKC when creating IRIs and requires NFC when converting
IRIs to URIs.  The reason given for having senders rather than receivers
perform normalization has nothing to do with performance or cost; it has
to do with correctness:

    Equivalence of IRIs MUST rely on the assumption that IRIs are
    appropriately pre-normalized, rather than applying normalization
    when comparing two IRIs.

    Because we do not know how a particular field is treated with
    respect to text normalization, it would be inappropriate to
    allow third parties to normalize an IRI arbitrarily.  This
    does not contradict the recommendation that if you create a
    resource, and an IRI for that resource, you try to be as normalized
    as possible (i.e. NFKC if possible).  This is similar to the
    upper-case/lower-case problems in URIs.  Some parts of an URI are
    case-insensitive (domain name).  For others, it is unclear whether
    they are case-sensitive or case-insensitive, or something in
    between (e.g. case-sensitive, but if you use the wrong case, may
    not directly get a result, but rather a 'Multiple choices').  The
    best recipe we have there is that the generator uses a reasonable
    capitalization, and when transfering the URI, you do not change
    capitalization.

(Section 5.3.)

In summary, IRIs need to use pre-normalization because there is no
single well-known equivalence relation; only the creator of an IRI knows
for sure if any other strings are equivalent to it, and knows how to
compare that IRI against others.

That's not a problem for IDNs and IMAs, because IDNA and IMAA define
a standard well-known equivalence relation for all non-ASCII IDNs
and IMAs.  Anyone can compute canonical forms and compare IDNs/IMAs,
not just the creator.  Therefore there is no need for the creator to
pre-normalize; if and when normalization is needed, it can be performed
by whoever needs it.

AMC