[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

*To*: IDN registration policy list <idn-reg-policy@xxxxxxx>, IDN registration policy list <idn-reg-policy@xxxxxxx>*Subject*: Re: model with overlapping variants*From*: Martin Duerst <duerst@xxxxxx>*Date*: Fri, 04 Apr 2003 16:43:49 -0500*In-reply-to*: <>*List-archive*: <http://www.imc.org/idn-reg-policy/mail-archive/>*List-id*: <idn-reg-policy.imc.org>*List-unsubscribe*: <mailto:idn-reg-policy-request@imc.org?body=unsubscribe>*Sender*: owner-idn-reg-policy@xxxxxxxxxxxx

Hello Adam,

Let's suppose for a moment that it turns out to be essential to use a relation that is not transitive, as Roozbeh has suggested, and as (I now think) Paul would agree. That is, even if labels X and Y are unrelated, there might exist a label V that is related to both.

But let's assume the relation can be symmetric; that is, there is no need to distinguish between "X is related to Y" versus "Y is related to X" versus "X and Y are related". Someone tell me if you think that's an unreasonable assumption.

I currently tend to think that it could be quite reasonable. The main thing I'm worried about is what happens if X or Y (or both) are strings rather than characters. There may be some weird cases where the 'longest prefix' match somehow destroys the symmetry.

For the moment I'll call the relation "confusability".

What about "similarity"?

Given any two labels (in no particular order), they are either confusable or not, and it is possible to compute that boolean value. The computation might enumerate all the labels that could be confused with one of the inputs and check whether the other input is among them, or it might be possible to be more efficient by using a clever algorithm. We'll worry about that later.

In this model, there are no atomic groups of labels. In a different model that used an equivalence relation, we could be sure that every group is either entirely registered or entirely available. But with this non-transitive relation, it's possible the set of labels confusable with a submitted label is partially registered and partially available.

What properties ought a registration policy to have in this model?

1. Two labels in the zone belonging to different registrants must not be confusable. Corollary: Two labels in the zone belonging to different registration bundles must not be confusable, even if they belong to the same registrant (because the registrant could sell them to different buyers).

2. Bundles must not tie together unrelated things in the zone. (That would cheat the registry out of fees for what should be separate bundles.) There are at least three ways to define the relatedness property that a bundle must satisfy:

a. loosely related: For every pair of labels X and Y in the zone that are in the same bundle, there must exist a sequence X, Z1, Z2, Z3, ..., Zk, Y such that every pair of adjacent labels in the sequence is confusable. In other words, X and Y must be related by the transitive closure of the confusability relation.

b. tightly related: Every pair of labels X and Y in the zone that are in the same bundle must be confusable. In other words, the labels in the zone in a particular bundle form a clique.

c. radially related: Among the labels in the zone that belong to a particular bundle, one is designated as the center, and all the others are confusable with it.

I think properties 1 and 2(a|b|c) are all we need. Am I overlooking anything?

Notice that the properties speak only of labels in the zone. They place no constraints on labels that are not in the zone. Nobody really cares what sort of behind-the-scenes bookkeeping the registry might be doing with the labels that are not in the zone, as long as the labels in the zone are well behaved.

2a looks difficult to enforce, and might lead to bundles that are too large, so let's put that one aside for now.

If we assume that registrants try to grab as much as possible, then the bundles could become very large. In fact, if everybody would try to grab as many entries as possible into a bundle, we would end up with an equivalence relation.

Using ~ to express confusability, 2c gives us situations where we can have X~Y and Y~Z, which we could call a confusability chain of length 3. The question is whether there are longer confusability chains in practice.

2b and 2c both look reasonable to me.

How could a registry enforce those properties? Here's one general approach:

Each bundle contains a set of related labels, all of which are in the zone. Bundles do not contain "blocked" labels. For 2c, one of the labels is flagged as the primary label and cannot be removed from the bundle.

When a registrant asks to create a new bundle, containing a single initial label, the request is denied if the label is confusable with any label already in the zone.

When a registrant asks to add a label to their existing bundle, the request is denied if the label is confusable with any label in any other bundle.

One way to visualize this is that there would be a one layer thick 'neutral zone' between any two bundles that might be transitively confusable.

Also, the request is denied if the label is not confusable with: every label in the bundle (2b) / the primary label (2c).

When a registrant asks to remove a label from their existing bundle, the request is denied if it is the only(2b)/primary(2c) label in the bundle.

The 'only(2b)' case seems irrelevant because of what you say in the next point.

A request by a registrant to remove one of their entire bundles is never denied.

We do not need to specify whether a registry keeps records regarding labels that are not in the zone. It might do so, as a precomputation to help with the required checks, or there might be clever pattern-matching algorithms and clever indexing data structures that allow the registry to perform the required checks without extra storage.

If this model looks promising, the big question is whether there is a confusability relation (parameterized by tables) that is both expressive enough to be useful and tractable enough to permit a feasible implementation of the model.

With the sort of tables that have been proposed so far, the checks to be performed are effectively regular expression matches on the zone file or on bundles. And not arbitrary regular expressions, but a restricted class of regular expressions built from rows of the mapping table. It wouldn't surprise me if there were tricks that could be played, but I have no expertise in this area.

I think it's easy to make matching a regular expression against an ordered zone file much more efficient than matching that regular expression against each of the entries in the zone in turn.

The main observation is that the state 'matched up to here' is represented not just by an index into the target string, but by an index and an interval from the first to the last string matched.

To show a very simple example, assume we have a label "1bm", and a similarity "1" ~ "i". This gives us a regular expression "(1|i)bm". Now assume we try to match this against a zone file with the following entries:

1) 0bm 2) 1bb 3) 1oo 4) abb 5) bbc 6) ibc 7) ibm 8) ibx 9) quu

Starting matching the "1", this gives us an interval of 2)-3) (i.e. it can match either string 2) or string 3). Trying with "1b", this reduces the interval to only 2), and trying with "1bm" reduces the interval to nothing, so we have to backtrack. Trying with "i" gives us an interval of 6)-8), and with "ib", we continue with the same interval, to then find a match with "ibm" at 7).

Regards, Martin.

**Follow-Ups**:**Re: model with overlapping variants***From:*Adam M. Costello

**References**:**model with overlapping variants***From:*Adam M. Costello

- Prev by Date:
**Re: New Internet Draft on registering IDNs** - Next by Date:
**Re: New Internet Draft on registering IDNs** - Previous by thread:
**model with overlapping variants** - Next by thread:
**Re: model with overlapping variants** - Index(es):