[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: model with overlapping variants
Thanks a lot for this very clear writeup.
At 08:16 03/04/04 +0000, Adam M. Costello wrote:
Let's suppose for a moment that it turns out to be essential to use a
relation that is not transitive, as Roozbeh has suggested, and as (I now
think) Paul would agree. That is, even if labels X and Y are unrelated,
there might exist a label V that is related to both.
But let's assume the relation can be symmetric; that is, there is no
need to distinguish between "X is related to Y" versus "Y is related to
X" versus "X and Y are related". Someone tell me if you think that's an
I currently tend to think that it could be quite reasonable.
The main thing I'm worried about is what happens if X or Y
(or both) are strings rather than characters. There may be
some weird cases where the 'longest prefix' match somehow
destroys the symmetry.
For the moment I'll call the relation "confusability".
What about "similarity"?
Given any two
labels (in no particular order), they are either confusable or not, and
it is possible to compute that boolean value. The computation might
enumerate all the labels that could be confused with one of the inputs
and check whether the other input is among them, or it might be possible
to be more efficient by using a clever algorithm. We'll worry about
In this model, there are no atomic groups of labels. In a different
model that used an equivalence relation, we could be sure that every
group is either entirely registered or entirely available. But with
this non-transitive relation, it's possible the set of labels confusable
with a submitted label is partially registered and partially available.
What properties ought a registration policy to have in this model?
1. Two labels in the zone belonging to different registrants must not
be confusable. Corollary: Two labels in the zone belonging to
different registration bundles must not be confusable, even if they
belong to the same registrant (because the registrant could sell
them to different buyers).
2. Bundles must not tie together unrelated things in the zone. (That
would cheat the registry out of fees for what should be separate
bundles.) There are at least three ways to define the relatedness
property that a bundle must satisfy:
a. loosely related: For every pair of labels X and Y in the zone
that are in the same bundle, there must exist a sequence X, Z1,
Z2, Z3, ..., Zk, Y such that every pair of adjacent labels in
the sequence is confusable. In other words, X and Y must be
related by the transitive closure of the confusability relation.
b. tightly related: Every pair of labels X and Y in the zone that
are in the same bundle must be confusable. In other words, the
labels in the zone in a particular bundle form a clique.
c. radially related: Among the labels in the zone that belong to a
particular bundle, one is designated as the center, and all the
others are confusable with it.
I think properties 1 and 2(a|b|c) are all we need. Am I overlooking
Notice that the properties speak only of labels in the zone. They place
no constraints on labels that are not in the zone. Nobody really cares
what sort of behind-the-scenes bookkeeping the registry might be doing
with the labels that are not in the zone, as long as the labels in the
zone are well behaved.
2a looks difficult to enforce, and might lead to bundles that are too
large, so let's put that one aside for now.
If we assume that registrants try to grab as much as possible, then
the bundles could become very large. In fact, if everybody would try
to grab as many entries as possible into a bundle, we would end up
with an equivalence relation.
Using ~ to express confusability, 2c gives us situations where we
can have X~Y and Y~Z, which we could call a confusability chain
of length 3. The question is whether there are longer confusability
chains in practice.
2b and 2c both look reasonable to me.
How could a registry enforce those properties? Here's one general
Each bundle contains a set of related labels, all of which are in the
zone. Bundles do not contain "blocked" labels. For 2c, one of the
labels is flagged as the primary label and cannot be removed from the
When a registrant asks to create a new bundle, containing a single
initial label, the request is denied if the label is confusable with any
label already in the zone.
When a registrant asks to add a label to their existing bundle, the
request is denied if the label is confusable with any label in any other
One way to visualize this is that there would be a one layer thick
'neutral zone' between any two bundles that might be transitively
Also, the request is denied if the label is not confusable
with: every label in the bundle (2b) / the primary label (2c).
When a registrant asks to remove a label from their existing bundle, the
request is denied if it is the only(2b)/primary(2c) label in the bundle.
The 'only(2b)' case seems irrelevant because of what you say in the
A request by a registrant to remove one of their entire bundles is never
We do not need to specify whether a registry keeps records regarding
labels that are not in the zone. It might do so, as a precomputation to
help with the required checks, or there might be clever pattern-matching
algorithms and clever indexing data structures that allow the registry
to perform the required checks without extra storage.
If this model looks promising, the big question is whether there
is a confusability relation (parameterized by tables) that is both
expressive enough to be useful and tractable enough to permit a feasible
implementation of the model.
With the sort of tables that have been proposed so far, the checks to be
performed are effectively regular expression matches on the zone file
or on bundles. And not arbitrary regular expressions, but a restricted
class of regular expressions built from rows of the mapping table. It
wouldn't surprise me if there were tricks that could be played, but I
have no expertise in this area.
I think it's easy to make matching a regular expression against
an ordered zone file much more efficient than matching that regular
expression against each of the entries in the zone in turn.
The main observation is that the state 'matched up to here' is
represented not just by an index into the target string, but by
an index and an interval from the first to the last string matched.
To show a very simple example, assume we have a label "1bm", and
a similarity "1" ~ "i". This gives us a regular expression
"(1|i)bm". Now assume we try to match this against a zone file
with the following entries:
Starting matching the "1", this gives us an interval of 2)-3)
(i.e. it can match either string 2) or string 3). Trying with
"1b", this reduces the interval to only 2), and trying with
"1bm" reduces the interval to nothing, so we have to backtrack.
Trying with "i" gives us an interval of 6)-8), and with "ib",
we continue with the same interval, to then find a match with
"ibm" at 7).