
Re: confusability


Thanks for the correction. I figured that someone would nail me on something I said. For some reason I thought the limit was FFFF, but now I see it's considerably larger than that.

In any event, was there anything else that I was mistaken about? From my limited perspective the "confusability" issue appears very large, even larger than I originally thought.


> I don't know the actual number of additional characters added thus
> far, but the upward limit is 65,535. So, as I see it, you will have

A small correction: there are currently over 95,000 characters in Unicode 3.2; in Unicode 4.0 (very soon to be released) there will be an additional thousand-odd characters. In addition, there are 131,068 possible private use characters, and there are 871,758 reserved positions still available for future characters.
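As a rough check on the figures above, here is a minimal Python sketch of the code space arithmetic. It assumes the standard Unicode code space of U+0000 through U+10FFFF and the private use ranges in planes 15 and 16; the constant names are illustrative only. The 131,068 figure appears to correspond to those two supplementary planes alone.

```python
# Unicode code space runs from U+0000 through U+10FFFF.
CODE_SPACE = 0x10FFFF + 1

# 0xFFFF is the 16-bit ceiling that the quoted message assumed.
BMP_LIMIT = 0xFFFF

# Planes 15 and 16 each reserve xx0000..xxFFFD for private use
# (the last two code points of every plane are noncharacters).
SUPPLEMENTARY_PRIVATE_USE = 2 * (0xFFFD + 1)

# The BMP private use area, U+E000..U+F8FF, is counted separately here.
BMP_PRIVATE_USE = 0xF8FF - 0xE000 + 1

print(f"code space:                {CODE_SPACE:,}")
print(f"16-bit ceiling:            {BMP_LIMIT:,}")
print(f"supplementary private use: {SUPPLEMENTARY_PRIVATE_USE:,}")
print(f"BMP private use:           {BMP_PRIVATE_USE:,}")
```

The code space is about seventeen times the 16-bit ceiling, which is why the 65,535 estimate falls so far short.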

IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

----- Original Message -----
From: "tedd" <tedd@xxxxxxxxxxxx>
To: "IDN registration policy list" <idn-reg-policy@xxxxxxx>
Sent: Saturday, April 05, 2003 07:13
Subject: Re: confusability

>tedd <tedd@xxxxxxxxxxxx> wrote:
>
>> > For the moment I'll call the relation "confusability". Given any
>> > two labels (in no particular order), they are either confusable or
>> > not, and it is possible to compute that boolean value.
>>
>> From an earlier post, someone talked about IBM.com vs 1BM.com -- which
>> should have been ibm.com vs 1bm.com, but none the less this type of
>> similar-looking-glyph use can be confusing. It can be even more
>> confusing if one uses a Greek small letter iota with tonos (U03AF) to
>> produce an ibm.com. Is this the type of confusion you are talking
>> about?
>
>Could be. A registry would define its confusability relation as it
>sees fit. It doesn't want to define confusability so narrowly that
>not enough things are considered confusable, because then it would be
>swamped by disputes about name ownership. But it doesn't want to define
>confusability so broadly that it drastically curtails the number of
>registrations (and hence revenue).
>
>Maybe "confusable" is not the best term. Maybe "neighboring" would be
>better. It's got some of the right intuition: If you are my neighbor,
>then I am your neighbor (symmetry), but my neighbor's neighbor is
>not necessarily my neighbor (intransitivity). You can speak of the
>neighborhood centered around a particular label. Neighborhoods centered
>around different labels can partially overlap. A bundle would be either
>a set of labels that are all neighbors of each other, or a subset of the
>neighborhood centered around the bundle's primary label, depending on
>which version of property 2 we use. Property 1 says that neighboring
>labels in a zone must not belong to distinct bundles.
>
>I just noticed that I forgot to state an assumption, which we can call
>property 0: Every label in a zone belongs to exactly one bundle.
>
>AMC
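
To make the "neighboring" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the look-alike table, the function names, and the rule that two labels are neighbors when they have the same length and look alike position by position. A real registry would define its own relation. Property 2 is not modeled, since its wording is not given in this message.

```python
from itertools import combinations

# Hypothetical look-alike pairs; a registry would supply its own table.
# "\u03af" is GREEK SMALL LETTER IOTA WITH TONOS.
LOOKALIKE_PAIRS = {("i", "1"), ("i", "\u03af"), ("1", "l")}


def chars_neighbor(a, b):
    """Symmetric by construction: each pair is checked in both orders."""
    return a == b or (a, b) in LOOKALIKE_PAIRS or (b, a) in LOOKALIKE_PAIRS


def neighbors(x, y):
    """Labels neighbor each other if they match position by position."""
    return len(x) == len(y) and all(chars_neighbor(a, b) for a, b in zip(x, y))


def property_1_violations(zone):
    """Return neighboring label pairs that sit in distinct bundles.

    `zone` maps each label to one bundle id, so property 0 (every label
    belongs to exactly one bundle) holds by construction.
    """
    return [
        (x, y)
        for x, y in combinations(sorted(zone), 2)
        if neighbors(x, y) and zone[x] != zone[y]
    ]


# Symmetry holds, but transitivity does not:
print(neighbors("ibm", "1bm"), neighbors("1bm", "ibm"))  # True True
print(neighbors("1bm", "lbm"))                           # True
print(neighbors("ibm", "lbm"))                           # False

# "1bm" and "lbm" are neighbors but sit in different bundles.
zone = {"ibm": "A", "1bm": "A", "lbm": "B"}
print(property_1_violations(zone))  # [('1bm', 'lbm')]
```

Because the relation is not transitive, bundles cannot simply be computed as equivalence classes. That is why the choice of bundle definition matters.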


 I understand -- but I cannot see how "confusability" avoidance
 can be applied across the entire Unicode database.

 It appears to me (perhaps I'm wrong) that this group is trying to
 predict and solve all possible problems that may arise from IDN
 registrations because of look-alike possibilities within the Unicode

 I don't know the actual number of additional characters added thus
 far, but the upward limit is 65,535. So, as I see it, you will have
 some 65,000 different possibilities of character confusion at a
 single-character domain level (e.g., a.com). Now, move to two
 characters (aa.com) and the figure becomes much larger -- something
 on the order of 65,000 x 65,000.
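
The growth described here can be sketched in a few lines of Python. This is only an illustration under the message's own assumption of roughly 65,000 usable characters; the function name is hypothetical. The number of distinct labels of a given length is the alphabet size raised to that length.

```python
def label_count(alphabet_size: int, length: int) -> int:
    """Number of distinct strings of `length` drawn from `alphabet_size` characters."""
    return alphabet_size ** length


# Assumed alphabet of 65,000 characters, as estimated in the message.
for length in (1, 2, 3):
    print(f"{length} character(s): {label_count(65_000, length):,}")
# 1 character(s): 65,000
# 2 character(s): 4,225,000,000
# 3 character(s): 274,625,000,000,000
```

These are counts of possible labels, not of confusable pairs. How many of them actually look alike depends on the relation a registry chooses.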

 Now, what's the upper limit to the number of characters allowed in a
 domain name and what's its factorial?  Do you honestly believe that
 you can solve this confusability problem for all possible
 combinations -- even if your interpretation is the correct one for
 each situation? Be reasonable, you're approaching a number that
 rivals the US national debt. Plus, no offense, you're making
 decisions about glyphs in other languages that are not your own.

 I think this group has made some significant progress in that some
 characters have been already mapped to others -- such as all
 occurrences of glyphs looking like "A" having been mapped to "a"
 and so on. But, you have done that primarily because you are familiar
 with the Latin character set and its use.

 Now, to map all occurrences of everything that looks similar to one
 character may do more harm than good in ways not apparent to you
 presently. Plus, considering the sheer number of combinations and
 thoughtful considerations required for each one -- I don't think this
 group has enough time or resources to accomplish the task.

It might be best, for all concerned, to let the market and courts work it out.

tedd -- http://sperling.com/