Re: initial thoughts

Paul Hoffman / IMC <phoffman@xxxxxxx> wrote:

> I believe that it would be really, really hard to describe how one
> would use strings (sequences of characters) as input to the table.

Couldn't you simply say that when more than one key matches, use the
longest match?  It would be trickier to implement, but that's exactly
how routing tables work.  It seems pretty easy to describe.  Whether it
would be worth the additional complexity, I don't know.
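
To make the longest-match rule concrete, here is a minimal sketch in
Python.  The table contents and the function names (longest_match,
map_string) are hypothetical, invented only for illustration; nothing
here comes from the draft.

```python
def longest_match(table, text, pos):
    """Return (key, value) for the longest key of `table` that matches
    `text` starting at `pos`, or None if no key matches there."""
    max_len = max(map(len, table), default=0)
    # Try the longest possible candidate first, then shorter ones.
    for length in range(min(max_len, len(text) - pos), 0, -1):
        key = text[pos:pos + length]
        if key in table:
            return key, table[key]
    return None


def map_string(table, text):
    """Rewrite `text` left to right, always consuming the longest key."""
    out = []
    pos = 0
    while pos < len(text):
        hit = longest_match(table, text, pos)
        if hit is None:
            # No key matches here: copy one character through unchanged.
            out.append(text[pos])
            pos += 1
        else:
            key, value = hit
            out.append(value)
            pos += len(key)
    return "".join(out)


# Hypothetical table in which several keys share a prefix.
TABLE = {"a": "1", "ab": "2", "abc": "3"}
result = map_string(TABLE, "abcabx")
```

The extra cost over single-character keys is the inner loop over
candidate lengths, which is bounded by the length of the longest key in
the table.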

> > We both propose partitioning the set of admissible labels into
> > groups (bundles).  Paul imagines a function that constructs the
> > entire group given any member of the group, while I imagine a
> > function that computes a group identifier given any member of the
> > group, where the group identifier could be a single member of the
> > group arbitrarily chosen to stand for the whole group.  Either
> > kind of function implies the same partition of the space, and both
> > functions would use the same tables, but I think the group-id
> > function might be easier to describe and understand.
> This seems like a somewhat academic difference

Yes, it is academic, because it's possible to define the exact same
bundles either way.  But people might find it difficult to wrap their
heads around the bundle-generating function and reason about it.  I
know I do.  I can't really follow the description in the draft, and
even if I could, I think I would have difficulty answering this
fundamental question:

Is it possible that bundle(labelX) and bundle(labelY) are neither equal
nor disjoint (that is, they overlap but are not exactly the same)?  I've
been assuming that you intended for the bundles to form a partition;
that is, any two bundles are either equal or disjoint.  Was that indeed
your intention?

The approach I proposed is to define the bundles implicitly, like so:
labelX and labelY belong to the same bundle iff bundleID(labelX) ==
bundleID(labelY).  This obviously forms a partition, no matter how the
bundleID() function behaves.
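
As a rough sketch of that implicit definition (in Python, with a toy
table that is purely hypothetical: it pretends that "0" is a variant of
"o" and "1" a variant of "l"), the bundles fall out of grouping labels
by their ID:

```python
from collections import defaultdict

# Hypothetical table: each character maps to the representative of its
# class.  Characters not listed stand for themselves.
CANONICAL = {"0": "o", "1": "l"}


def bundle_id(label):
    """Map a label to its bundle ID, one character at a time."""
    return "".join(CANONICAL.get(ch, ch) for ch in label)


def same_bundle(label_x, label_y):
    """Two labels share a bundle iff their bundle IDs are equal."""
    return bundle_id(label_x) == bundle_id(label_y)


def bundles(labels):
    """Group labels by bundle ID.  Each label lands in exactly one group,
    so the groups are pairwise disjoint whatever bundle_id() does."""
    groups = defaultdict(set)
    for label in labels:
        groups[bundle_id(label)].add(label)
    return dict(groups)


groups = bundles(["cool", "c00l", "coo1", "cold"])
```

Because equality of IDs is an equivalence relation, no two groups can
overlap without being identical.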

With this approach, you wouldn't need to generate all the variants when
the label is registered (or ever).  You could just create one entry
under the bundle ID, containing the registrant info and a list of active
variants (which could be just a handful).  Whenever someone wants to
register a label, or activate/deactivate a variant, you would compute
bundleID(label) to see which bundle it belongs to.
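
Here is one way such a registry might look, sketched in Python.  The
Registry class, its method names, and the toy bundle function are all
assumptions made for illustration; a real registry would have its own
policies for conflicts and authorization.

```python
class Registry:
    """Stores one entry per bundle ID instead of one per variant."""

    def __init__(self, bundle_id):
        self._bundle_id = bundle_id
        # bundle ID -> {"registrant": str, "active": set of labels}
        self.entries = {}

    def register(self, label, registrant):
        bid = self._bundle_id(label)
        if bid in self.entries:
            raise ValueError("bundle already registered: " + bid)
        self.entries[bid] = {"registrant": registrant, "active": {label}}

    def _entry_for(self, label, registrant):
        entry = self.entries.get(self._bundle_id(label))
        if entry is None or entry["registrant"] != registrant:
            raise PermissionError("bundle not held by " + registrant)
        return entry

    def activate(self, label, registrant):
        self._entry_for(label, registrant)["active"].add(label)

    def deactivate(self, label, registrant):
        self._entry_for(label, registrant)["active"].discard(label)

    def is_active(self, label):
        entry = self.entries.get(self._bundle_id(label))
        return entry is not None and label in entry["active"]


def toy_bundle_id(label):
    # Hypothetical rule: "0" is a variant of "o", "1" a variant of "l".
    return label.translate(str.maketrans("01", "ol"))


reg = Registry(toy_bundle_id)
reg.register("cool", "alice")
reg.activate("c00l", "alice")
```

Storage grows with the number of registrations and active variants, not
with the size of each bundle.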

I find this approach easier to conceptualize, perhaps because it goes
in the same direction as Stringprep's mapping and normalization steps
(many input strings map to the same output string) and would use tables
in a similar way.  The function you propose uses tables in the other
direction, so that one input string maps to many different output
strings.  I have no practice thinking that way.  :)

> > Will it scale?
> Absolutely.

Suppose JPNIC decides that hiragana and katakana should block each
other.  [Background for readers not familiar with Japanese writing:
In addition to the ideographic script (kanji), there are two parallel
phonetic scripts: hiragana (for normal Japanese words) and katakana (for
words recently imported from other languages, and sometimes merely for
emphasis, like italics).]

Now imagine a label consisting of 39 hiragana (for example, the Japanese
translation of "good morning good day good evening good night thank you
very much" can be written using 39 hiragana and fits in a single label
with one byte to spare).

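Assuming each of the 39 characters can independently be hiragana or
katakana, the number of variants is 2^39, over half a million million
(about 5.5e11).  At a storage cost of 64 bytes per variant, that's 32
terabytes (2^45 bytes), just for this one registration.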
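
The arithmetic can be checked with a few lines of Python.  The
assumption (mine, not stated anywhere in the draft) is that every one of
the 39 positions has exactly two interchangeable forms, one hiragana and
one katakana:

```python
KANA_PER_LABEL = 39      # characters in the example label
FORMS_PER_POSITION = 2   # assumed: hiragana or katakana at each position
BYTES_PER_VARIANT = 64   # storage cost per variant

variants = FORMS_PER_POSITION ** KANA_PER_LABEL
total_bytes = variants * BYTES_PER_VARIANT
tebibytes = total_bytes / 2**40

print(variants)     # 549755813888
print(total_bytes)  # 35184372088832
print(tebibytes)    # 32.0
```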

> > You could also imagine that the registry, rather than the
> > registrant, chooses which member(s) will be visible, but I think it
> > would be difficult for registries to come up with rules that would
> > please everyone; it would probably be easier to let the registrant
> > choose, and simply store the list.
> Speaking of "scaling", your suggestion makes scaling harder.  That
> is, the registration process would have to include a step where
> the registrant chooses some things from a list.  This is much more
> difficult than the registry saying "here's what you get, you can
> contact us if you don't like it".  Well, of course they won't like it,
> but at least the process isn't blocked.

I would still call that "registrant chooses", even though the registry
is offering a default choice, because the registry still needs to be
able to store lists of visible names for registrants who ask to deviate
from the default.  When I said "registry chooses", I meant that the
registrant has zero input, so that the registry could define the choice
algorithmically and omit any capability of storing lists.

> It is up to the registry to decide what makes more sense to their
> customers.