[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: model with overlapping variants



> > I think it's easy to make matching a regular expression against an
> > ordered zone file much more efficient than matching that regular
> > expression against each of the entries in the zone in turn.
>
> Thanks for the illustration.  I didn't realize that so much could be
> gained with such a simple data structure (a sorted list).  I think

1. Any equivalence relation on strings corresponds to a partition of all
strings. By far the most efficient matching of strings is where one member
of each partition is chosen as the 'skeleton' or representative for that
partition. All you need is an efficient mapping function toSkeleton() that
takes each member of each partition onto its skeleton. The toSkeleton()
function can be any idempotent string function.

Once you have that, for each partition, you only need to store the skeleton,
*not* all the members of the partition. And to test any probe string, you
only need to convert that probe to its skeleton, and do a straight, normal
lookup in a SkeletonSet (typically a HashSet of some sort; could also be
sorted if needed).

Since the skeleton might not be the preferred, most-readable form, you need
to store the original string as well. So the SkeletonSet can actually be a
HashMap (or SortedMap), one that maps from the skeleton to the preferred
form. So you end up storing 2N strings -- you *don't* have to store all the
possible strings in each partition.

2. One can generalize the process in #1, if there are multiple equivalence
relations and some kind of registration process, as long as the
registrations are serial. For example, one could support both German and
French equivalence relations. It's a bit more work:

You have to keep a set of skeletons for all registrations, for each
equivalence relation, e.g.
FrenchSkeletonSet
GermanSkeletonSet
SwedishSkeletonSet
...

What you do is when each name N is registered, you have to say which
equivalence relation is to be used with it (e.g. German). To make sure that
N doesn't collide with a pre-existing registered name, look N up against
each of the above sets using *that* set's isSkeleton function. If it ever
collides, don't register it. If it doesn't collide in any of the sets, add
its skeleton to the appropriate Skeletons set (e.g. GermanSkeletonSet).

To lookup an incoming string, probe each of the skeleton sets according to
the set's isSkeleton function until you find a match. Since these are
equivalence relations, the order of lookup in the sets doesn't matter. You
end up with M hash lookups per test.

Note that each SkeletonSet can have its own preferred forms, so the
preferred form for simplified Chinese can be different than the preferred
form for traditional, for example.

Mark
(مرقص بن داود)
________
mark.davis@xxxxxxxxx
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

----- Original Message -----
From: "Adam M. Costello" <idn-reg-policy.amc+0@xxxxxxxxxxxxxxxxxxxxxxxxxxx>
To: "IDN registration policy list" <idn-reg-policy@xxxxxxx>
Sent: Friday, April 04, 2003 15:25
Subject: Re: model with overlapping variants


>
> Martin Duerst <duerst@xxxxxx> wrote:
>
> > The main thing I'm worried about is what happens if X or Y (or both)
> > are strings rather than characters.  There may be some weird cases
> > where the 'longest prefix' match somehow destroys the symmetry.
>
> Could be.  If so, we could force the symmetry by defining the relation
> this way:  X and Y are neighbors iff table(X) matches Y or table(Y)
> matches X, where table() and "matches" are yet-to-be-defined operations
> that might be asymmetric.
>
> > > For the moment I'll call the relation "confusability".
> >
> > What about "similarity"?
>
> I think I like "neighbors" better.  "Similarity" sounds like a
> real-valued function (how similar are X and Y?), while "neighbors"
> sounds more boolean (are X and Y are neighbors?).
>
> > > 2a looks difficult to enforce, and might lead to bundles that are
> > > too large, so let's put that one aside for now.
> >
> > If we assume that registrants try to grab as much as possible, then
> > the bundles could become very large.  In fact, if everybody would try
> > to grab as many entries as possible into a bundle, we would end up
> > with an equivalence relation.
>
> If the registrants performed their grabbing sequentially (each bundle is
> maximally expanded before the next bundle is created), then we'd end up
> with an equivalence relation.  If the bundles expand concurrently, then
> the equivalence classes would not form, because labels destined to be in
> the same class could start off in separate bundles, and bundles never
> merge.
>
> > Using ~ to express confusability, 2c gives us situations where we can
> > have X~Y and Y~Z, which we could call a confusability chain of length
> > 3.  The question is whether there are longer confusability chains in
> > practice.
>
> Character-chains of length 2 can give rise to string-chains of greater
> length.  For example, suppose we have characters related as follows:
>
>     A~B B~C
>     X~Y Y~Z
>
> Now suppose a zone contains AX, BX, CY, CZ.  The shortest path (using
> labels in the zone) from AX to CZ goes through all four labels.  There
> is a shorter path involving the label BY, but BY might not exist in the
> zone.
>
> > One way to visualize this is that there would be a one layer thick
> > 'neutral zone' between any two bundles that might be transitively
> > confusable.
>
> Right, any labels that neighbor multiple bundles are unavailable to
> everyone.  (A label neighbors a bundle if it neighbors any label in the
> bundle.)
>
> We need to remember that we're speaking in the context of a model where
> bundles contain only labels that are in the zone, not any "blocked"
> labels.  Maybe I shouldn't have used the term "bundle" for this purpose,
> since Paul introduced it to include blocked labels.
>
> > > When a registrant asks to remove a label from their existing bundle,
> > > the request is denied if it is the only(2b)/primary(2c) label in the
> > > bundle.
> >
> > The 'only(2b)' case seems irrelevant because of what you say in the
> > next point.
> >
> > > A request by a registrant to remove one of their entire bundles is
> > > never denied.
>
> It is relevant because it means that bundles cannot be empty.  A
> registrant might imagine that they could remove the last label from
> a bundle, and still have an empty bundle until it expires normally,
> and before then they could add a label back into the bundle.  But I
> want to assert that this is not allowed, because it could be used to
> disguise the registration of a completely new & different name as a mere
> expansion of an existing bundle (the fees for those two operations are
> likely to be different).
>
> > I think it's easy to make matching a regular expression against an
> > ordered zone file much more efficient than matching that regular
> > expression against each of the entries in the zone in turn.
>
> Thanks for the illustration.  I didn't realize that so much could be
> gained with such a simple data structure (a sorted list).  I think
> there are still pathological cases where the time spent backtracking is
> exponential, but maybe a more sophisticated data structure could help
> with that (if needed).
>
> AMC
>