
Re: New Internet Draft on registering IDNs




Hello Paul,


Below are some comments on your draft.


At 09:31 03/03/25 -0800, Paul Hoffman / IMC wrote:


Greetings. I have just submitted a new Internet Draft that gives suggestions on how to register IDNs. You can find a link to the draft at the web site for this mailing list at <http://www.imc.org/idn-reg-policy/>. Comments are, of course, welcome.

small aside: it would be helpful to have a direct link, and the name of the draft.


This document is different than the JET document in many ways.

I often see 'JET document' in this discussion. Is this draft-jseng-idn-admin, or something else?


It is meant to be generic and usable by anyone, not just registries using CJK characters. It also attempts to make the registry policy easier and more predictable from the outside. I have heard that other folks will also be preparing Internet Drafts on this issue, so it will be good to see what the differences are.




Internet Draft                                        Paul Hoffman
draft-hoffman-idn-reg-00.txt                            IMC & VPNC
March 25, 2003
Expires in six months
Intended status: Best Current Practice (BCP)


Framework for Registering Internationalized Domain Names

I have been thinking about 'Framework' quite a bit. Is this draft a framework? It seems to be a definition of a table format with some associated algorithm(s), and a recommendation to use this table format/algorithms. And the recommendation isn't very clear: Should all registries use this table format/algorithm? Or is it just one of potentially many formats/algorithms, and registries can choose?


Abstract

This document describes a framework for registering internationalized
domain names (IDNs) in a zone. Before accepting registrations of domain
names into a zone, the zone's registry should decide which codepoints in
the Unicode character set the zone will accept. The registry should also
decide whether particular characters in a registered domain name should
cause registration of multiple equivalent domain names. With those
decisions, the registry can safely register names using the steps
described here.

This does not mention the table format or the algorithm. It gives rather detailed instructions to registries; just saying 'this memo gives advice to registries on ...' should be enough. Also, if the abstract mentions 'registration of multiple equivalent' names, which is 'mapping', it should probably also mention 'blocking'.


1. Introduction

The intro does not mention the table format and algorithm; it just speaks about a 'mechanism'. That had me wondering for too long what was going on.


IDNA [IDNA] specifies an encoding of characters in the Unicode character
set [UNICODE]

The 'in' here confused me: on first reading, I read it as 'into'. Maybe change it to 'from'. Also, IDNA isn't about encoding [single] characters, but character strings.


which is backwards-compatible with the current definition
of hostnames. This implies that domain names encoded according to IDNA
will be able to be transported between peers using any existing
protocol, including DNS.

IDNA, through its requirement of Nameprep [NAMEPREP], uses equivalence
tables that are based only on the characters themselves; no attention is
paid to the intended language (if any) for the domain name. However, for
many domain names, the intended language of one or more parts of the
domain name actually does matter to the registry for the names and to
users.

If there are no constraints on registration in a zone, people can
register characters that increases

increases -> increase


the risk of misunderstandings,
cybersquatting, and other forms of confusion. A similar situation
existed before

existed before -> exists independently of (it didn't go away with IDNA)


the introduction of IDNA exemplified by domain names such
as example.com and examp1e.com (note that the latter domain has

has -> contains (this is just a small stylistic issue)


the
digit "1" instead of the letter "l").

For some human languages, there are characters and/or strings that have
equivalent or near-equivalent meanings.

I would change 'meanings' to 'usages'. Because this is mostly about single characters, and these in general don't have meanings.


If someone is allowed to
register a name with such a character or string, the registry might want
to automatically register all the names that have the same meaning in
that language. Further, some registries might want to restrict the set
of characters to be registered for language-based reasons. In addition,
IDNA allows the use of thousands of non-alphanumeric characters, and
some zone administrators will want to prohibit some or all of these
characters.

This paragraph may look much clearer if it is changed to a list of bullet points.

The need for documenting what is done should also be listed here.


The intent of this document is that checking whether a label
can be approved can be a mathematical, objective inspection of the
codepoints in the label with no human intervention, and that all
applications of a particular table will yield identical results.

The mechanism

see above about 'mechanism'.



described here does not require a registry to know the
"intended language" of a label. It is impossible to describe the
"intended language" of names that include numbers or acronyms.

It is in many cases impossible to know the 'intended language' of names even without numbers or acronyms.


Proposals
that have this requirement require human intervention to validate the
assertion from the registrant and are therefore susceptible to fraud
from the registrant. Further, such a requirement prevents

I don't think this is true. It would not prevent, but it would make things more difficult.

the
registration of labels that have two languages, some of which are common
in countries with multiple languages.

[IDN-ADMIN] shows a different proposal to the problem of registration
policy. That document uses a more complex algorithm and a different
registration philosophy that what is described here.

that -> than



It is suggested that a registry act conservatively when starting
accepting IDNA-based domain names.

This should also say that this means starting with a small set of base characters, and maybe adding more later. It should probably also say something about informing all current registrants about changes in policy.


Equivalences are very hard (if not
impossible) to define after registration has started. Assume that the
labels "x" and "y" at first are different, but later the tables for the
registry are changed so that "x" and "y" are then treated as being the
same. If x.example.com and y.example.com both were already registered to
different registrants, it is unclear which of them has to withdraw the
registration, how that selection process

insert 'is'



done, and so on. Thus, having
complete, publicly-stated policies before accepting registration will
lead to a much more stable registration process.

The phrase 'act conservatively' suggests to some extent that policies could be relaxed over time, e.g. by reducing the set of blocked variants. Is this a good idea or not? The draft should probably state this explicitly.


This document does not deal with how to handle whois data for multiple
registrations, and does not deal with regitrar-registry protocols.

regitrar -> registrar



This document also only deals only with variants of single characters,

remove one 'only'.


not variants of strings.

Add something like 'although variants can be strings'.



1.1 Terminology

Say that this is terminology used in this memo, not necessarily of general use.


A "string" is an ordered set of one or more characters.

'ordered set' -> 'sequence' (an ordered set does not allow duplicates, but a sequence allows repetitions).


This document discusses characters that have equivalent or
near-equivalent characters or strings. The "base character" is the
character that has one or more equivalents; the "variant(s)" are the
character(s) and/or string(s) that are equivalent to the base character.

'Base character' is already used in the context of combining characters, so it would be better to find another term.

It would also be good to clearly state what the purpose of the base is.
My understanding is as follows:
- The base must be a single character; variants need not be.
- If blocking is used, variants are blocked, but not the base.
- The base, and not any variant, must be used in a registration request.

Say that characters are Unicode codepoints.

A "registration bundle" is the set of all labels that comes from
expanding all base characters for a single name into their variants.

A registry is the administrative authority for a DNS zone. That is, the
registry is the body that makes and enforces policies that are used in a
particular zone in the DNS.

add quotes around 'registry'. Is this the same as the general use of the term, or is this specific to the discussion here?


The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].

2. Language-based tables

The registration strategy described in this document uses a table that
lists all characters allowed for input and any variants of those
characters.

If there are two equivalent characters, and I as a registry want to allow a registrant to use either of these when registering, do I need two entries in the table?


Note that the table lists all characters allowed, not only
the ones that have variants.

It is widely expected that there will be different tables for the same
language created by different people. Many languages are spoken in many
different countries, and each country might have a different view of
which characters should or should not be considered needed for that
language. For example, some people would say that the Latin characters
are needed for various Indic languages, while others would say that
they are not.

I think the question of whether ASCII is allowed or not is a very special one, which should be considered separately. There may be cases where ASCII is already allowed; there is a good argument for always allowing ASCII, in particular if the higher-level domains are all ASCII; there is a good argument to not allow ASCII if the higher-level domains are all non-ASCII. These arguments are rather different from arguments about allowing a few more or less characters.


A zone needs to have exactly one table; having more than one table can
lead to unpredictable results because the variants in the different
tables may conflict. The table must be carefully composed

The word 'composed' here suggests that it is composed from several different tables. But I think the general statement is that a table must be carefully checked, in all cases, even for single languages.

so that all
expected variants will be created, and no unexpected variants are
created.

The registry's table MUST NOT have more than one entry for a particular
base character. A table with more than one variant rule

add 'for the same base character'.


requires that
some names be evaluated by humans and will open the registration process
to dispute.

what about tables that don't have the same base character twice, but may map to the same character? E.g.:

U+00E8|U+0065
U+00E9|U+0065

Or where a base character also appears as a variant

U+00E8|U+0065
U+0065

Or where a base character appears as part of a variant:

U+00FC|U+0075U+0065
U+0075
U+0065
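
To make these three cases concrete, here is a minimal Python sketch of a check that a registry could run over its table. It assumes the table has already been read into a dict that maps each base character to a list of variant strings; the function name table_conflicts and the layout of the result tuples are made up for illustration. Whether each case should actually be forbidden is exactly the open question, so the sketch only reports them.

```python
def table_conflicts(table):
    """Report overlaps between entries of a variant table.

    `table` maps each base character to a list of variant strings
    (an assumed in-memory form, not something the draft defines).
    Returns a list of (kind, variant, base) tuples.
    """
    problems = []
    first_owner = {}  # variant string -> base character that listed it first
    for base, variants in table.items():
        for variant in variants:
            # Case 1: two different base characters share a variant.
            if variant in first_owner:
                problems.append(("shared-variant", variant, base))
            else:
                first_owner[variant] = base
            # Case 2: the variant is itself a base character.
            if variant in table:
                problems.append(("base-as-variant", variant, base))
            # Case 3: a base character appears inside a variant string.
            elif any(ch in table for ch in variant):
                problems.append(("base-inside-variant", variant, base))
    return problems


if __name__ == "__main__":
    # The third example above: U+00FC|U+0075U+0065 plus entries for u and e.
    print(table_conflicts({"\u00fc": ["ue"], "u": [], "e": []}))
```

A table for which this returns an empty list at least avoids the three situations above; it says nothing about whether the chosen variants are sensible.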


The tables are language-specific, although it is possible to create a
single table that covers multiple languages. The following three
sub-sections describe the use of tables in three scenarios.

2.1 Table for a zone that uses names from one language

A zone that has a single language has a significant advantage over
zones that cover multiple languages. Its table can be constructed
without concern for variants that appear in other languages for the
base characters of the language used in the zone.

2.2 Table for a zone that uses names from a small number of languages

If a zone covers more than one language, the registry must create its
registration table from multiple language tables. Creating a table from
many languages is easy if none of the languages have overlapping
character variants for any single base character.

A registry MUST NOT blindly combine multiple tables which have
overlapping equivalences. Instead, the registry MUST carefully analyze
every instance in the combined table where a base character has one or
more different variants and select the desired set of variants for the
base character.

2.3 Table for a zone that has no language restrictions

A registry that does not restrict the number of languages will probably
allow a much wider range of characters to be used in names. At the same
time, that registry cannot easily use character variants because
variants for one language will be different from the variants used in a
different language. To handle conflicting variants among languages, the
registry can choose to have no variants for any base characters, or can
choose to have variants for a subset of the languages that are
expressible in the characters allowed.


3. Table processing rules


The input to the process is called the "input label".

The input is a label and a table, I guess.



The output of the
process is either failure (the input label cannot be registered at all),
or a registration bundle that contains one or more labels that have been
processed with ToASCII.

It doesn't seem necessary for the algorithm to end with ToASCII. It would be more straightforward to have one Unicode string as input and several as output. From a registrant's and a user's point of view, and in the rest of this memo, the Unicode strings are the things that are mapped/blocked.


Processing the input label requires two versions of ToASCII: "standard
ToASCII" and "enhanced ToASCII". Standard ToASCII is exactly the same as
the ToASCII in [IDNA]. Enhanced ToASCII is standard ToASCII with the
steps from section 3.1 added.

Note that the process MUST be executed only once. The process MUST NOT
be run on any output of the process, only on the new label that was
input.


3.1 Creating enhanced ToASCII.

probably no need for a '.' at the end of the title.


It would be clearer if there was a description for 'checking the
input label' (preparation and checking against base characters in
table) and another description for 'creating variant strings'
(iteration through string and creation of combinations).
This would avoid the term 'enhanced ToASCII', which is
bound to create some confusion.


During the processing, a "temporary bundle" contains partial labels,
that is, labels that are being built and are not complete labels. The
partial labels in the temporary bundle consist of Unicode characters.

the partial labels are strings, not necessarily single characters.



The following steps are inserted after step 2 but before step 3 of ToASCII.

This implies that we continue with ToASCII after 2e). But the continuation for the input label is in 2b), and for the variants, in 2da).


2a) Split the input label into individual characters, called "candidate
characters". Compare each candidate character against the base
characters in the table. If any candidate character does not exist in
the set of base characters, the system MUST stop and not register any
names (that is, it MUST not register either the base name or any labels
that would have come from character variants).

2b) Continue the steps in standard ToASCII for the input label. If
ToASCII fails for the input label, the system MUST stop and not register
any of the labels (even if the other labels would have passed ToASCII).
If ToASCII succeeds, add the result to the registration bundle.

2c) For each candidate character in the input label, do the following:

This is confusing. To show you why I think it is confusing, let's assume we have the following input label: U+0064U+0064U+0065 (dde)

and the following table:

U+0064 (d)
U+0065|U+0066:U+0067 (e|f:g)

If we start out with a 'temporary bundle' containing a single
empty string, i.e. [""], then after 2c1), we have ["d"], then we
go to 2c3) which (probably) does nothing because there are no
variants, then back to 2c1), we get ["dd"], and the next time
round, we get ["dde"]. Then at 2c2), for variant 'f', we get
at 2c2a): ["dde","dde"], and then for variant 'g' we get
["dde","dde","dde","dde"] again by duplication at 2c2a).
Then at 2c3), we get ["ddeg","ddeg","ddeg","ddeg"].

Of course what was intended was to get ["dde","ddf","ddg"],
but that needs a different description.
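
A minimal Python sketch of what I assume was intended: for each candidate character, offer the base character itself plus each of its variants, and take every combination. The dict form of the table and the name expand_label are illustrative only, and the sketch works purely on Unicode strings, leaving ToASCII aside.

```python
from itertools import product


def expand_label(label, table):
    """Return all variant labels for `label`, with the input label first.

    `table` maps each base character to a list of variant strings
    (an assumed in-memory form of the draft's table).
    """
    choices = []
    for ch in label:
        if ch not in table:
            # Corresponds to step 2a: a character outside the table
            # means nothing at all is registered.
            raise ValueError(f"character {ch!r} is not a base character")
        # The base character itself, followed by each of its variants.
        choices.append([ch] + list(table[ch]))
    # One pick per position, in every possible combination.
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    table = {"d": [], "e": ["f", "g"]}
    print(expand_label("dde", table))  # ['dde', 'ddf', 'ddg']
```

Note that the number of labels is the product of (1 + number of variants) over all positions, so it grows quickly for longer labels with many variants.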


   2c1) Copy the candidate character into every partial label in the
   temporary bundle. If the base character that matches the candidate
   character has no variants, go to step 2c3.

   2c2) For each variant of the base character, do the following:

      2c2a) Duplicate all of the current partial labels in the
      temporary bundle.

      2c2b) If this is the last variant, go to step 2c3; otherwise,
      select the next variant, and go to step 2c2a.

   2c3) Copy the variant into each partial label.

   2c4) If there are more candidate characters, select the next
   candidate character and go to step 2c1. Otherwise, go to step 2d.

2d) The temporary bundle now contains zero or more labels that consist
of Unicode characters. For each label in the temporary bundle:

   2da) Process the label with standard ToASCII.

   2db) If ToASCII succeeds, put the result in the registration bundle.
   Otherwise, do not put anything into the registration bundle.

   2dc) Select the next label and go to step 2da.

2e) The resulting registration bundle has all the labels in ToASCII
encoding. Finish.

What if some labels in this bundle conflict with already existing registrations? What if the same label appears more than once in the bundle?
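
For the second question, one possible answer is simply to drop duplicates while building the bundle. The Python sketch below does that for step 2d, using Python's built-in "idna" codec (which implements the RFC 3490 ToASCII operation) as a stand-in for standard ToASCII; the function name to_ascii_bundle is made up, and conflicts with existing registrations are not handled at all.

```python
def to_ascii_bundle(unicode_labels):
    """Run each label through ToASCII and collect the distinct results.

    Labels for which ToASCII fails are skipped, as in step 2db.
    The order of first appearance is kept.
    """
    bundle = []
    for label in unicode_labels:
        try:
            # Python's "idna" codec applies Nameprep and Punycode.
            ascii_label = label.encode("idna").decode("ascii")
        except UnicodeError:
            continue  # ToASCII failed: put nothing into the bundle
        if ascii_label not in bundle:
            bundle.append(ascii_label)
    return bundle


if __name__ == "__main__":
    print(to_ascii_bundle(["b\u00fccher", "b\u00fccher", "bucher"]))
```

A real registry would presumably also have to check each resulting label against its database of existing and blocked names before accepting the bundle.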



4. Table format

The format of the table is meant to be machine-readable but not
human-readable. It is fairly trivial

For some people, writing a C program or a perl script is 'fairly trivial'. For others, it's not. It is easy to change the format to make it even more trivial.


to convert the table into one
that can be read by people.

Each character in the table is given in the "U+" notation for Unicode
characters. The lines of the table are terminated with either a carriage
return character (ASCII 0x0D), a linefeed character (ASCII 0x0A), or a
sequence of carriage return followed by linefeed (ASCII 0x0D 0x0A). The
order of the lines in the table do not matter.

Each line in the table starts with the character that is allowed in the
registry. If that character has any variants, the base character

is 'the character'/'that character' the 'base character'?



is
followed by a vertical bar character ("|", ASCII 0x7C) and the variant
string. If the base character has more than one variant, the variants
are separated by a colon (":", ASCII 0x3A). Strings are given without
any intervening spaces.

The following is an example of how a table might look. The entries in
this table are purposely silly and should not be used by any registry as
the basis for choosing variants. For the example, assume that the
registry:
- allows the FOR ALL character (U+2200) with no variants
- allows the COMPLEMENT character (U+2201) which has a single variant
  of LATIN CAPITAL LETTER C (U+0043)
- allows the PROPORTION character (U+2237) which has one variant which
  is the string COLON (U+003A) COLON (U+003A)
- allows the PARTIAL DIFFERENTIAL character (U+2202) which has two
  variants: LATIN SMALL LETTER D (U+0064) and GREEK SMALL LETTER DELTA
  (U+03B4)

The table would look like:
U+2200
U+2201|U+0043
U+2237|U+003AU+003A
U+2202|U+0064:U+03B4

The registry's table MUST NOT have more than one entry for a particular
base character.

What about other restrictions?



Implementors of table processors should remember that there are tens of
thousands of characters whose codepoints are greater than 0xFFFF. Thus,
any program that assumes that each character in the table is represented
in exactly six octets ("U", "+", and exactly four octets representing
the character value) will fail with tables that use characters whose
value is greater than 0xFFFF.
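
As an illustration of how little is needed, here is a minimal Python sketch of a reader for this format. It follows the prose description (vertical bar after the base character, colon between variants, any of the three line terminators) and accepts four to six hexadecimal digits after "U+", so characters above 0xFFFF are handled. The names parse_string and parse_table are made up for this sketch, and the error handling is an assumption rather than something the draft specifies.

```python
import re

# "U+" followed by 4 to 6 hex digits; since "U" is not a hex digit,
# codepoints in a string such as "U+003AU+003A" separate unambiguously.
CODEPOINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def parse_string(text):
    """Convert a run such as 'U+003AU+003A' into the string '::'."""
    pos, chars = 0, []
    while pos < len(text):
        match = CODEPOINT.match(text, pos)
        if match is None:
            raise ValueError(f"bad codepoint notation in {text!r}")
        chars.append(chr(int(match.group(1), 16)))
        pos = match.end()
    if not chars:
        raise ValueError("empty character string")
    return "".join(chars)


def parse_table(data):
    """Parse table text into a dict: base character -> list of variants."""
    table = {}
    for line in re.split(r"\r\n|\r|\n", data):
        if not line:
            continue  # e.g. the empty piece after a final line terminator
        base_text, bar, variants_text = line.partition("|")
        base = parse_string(base_text)
        if len(base) != 1:
            raise ValueError(f"base must be a single character: {line!r}")
        if base in table:
            raise ValueError(f"more than one entry for base: {line!r}")
        table[base] = (
            [parse_string(v) for v in variants_text.split(":")] if bar else []
        )
    return table


if __name__ == "__main__":
    sample = "U+2200\nU+2201|U+0043\nU+2237|U+003AU+003A\nU+2202|U+0064:U+03B4\n"
    for base, variants in parse_table(sample).items():
        print(f"U+{ord(base):04X}", variants)
```

The duplicate-entry check enforces the one rule the draft states for tables; any further restrictions would need additional checks.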


5. Steps after registering an input label


A registry has three options for how to handle the case where
the registration bundle has more than one label. The policy options are:

1) Allocate all labels to the same registrant, making
the zone information identical to that of the input label.

2) Block all labels so they cannot be registered in the
future.

Does 'all labels' include the input label?



3) Allocate some labels and block some other labels.

Option 1 will cause end users to be able to find names with variants
more easily, but will result in larger zone files. For some
language tables, the zone file could become so large that it
could negatively affect the ability of the registry to perform name
resolution.

Option 2 does not increase the size of the zone file, but it
may cause end users to not be able to find names with variants
that they would expect.

Option 3 is likely to cause the most confusion with users because
including some variants will cause a name to be found, but using
other variants will cause the name to be not found.

With any of these three options, the registry MUST keep a database that
links each label in the registration bundle to the input label. This link
needs to be maintained so that changes in the non-DNS registration
information (such as the label's owner name and address) is reflected in
every member of the registration bundle as well.

If the registry chose option 1, when the zone information for the input
label changes, the zone information for all the members of the
registration bundle MUST change in exactly the same way. The zone
information for every member of the registration bundle MUST remain
identical as long as any of the members of the registration bundle
remain in the zone. A registry can keep the zone information for the
registration bundle identical using a database, or using DNAME records,
or using a combination of the two.

If the registry chose option 2, when the zone information for the input
label changes, the blocked information for all the members of the
registration bundle MUST be identical to that of the input label, and
MUST remain identical as long as the input label remains in the zone. A
registry can keep the zone and blocked name information for the
registration bundle identical using a database.

If the registry chose option 3, it must use an unspecified method to
keep the elements in the registration bundle cohesive. This option
SHOULD NOT be used except under carefully-controlled circumstances.


7. Owner implications of multiple labels

The creation of a registration bundle for equivalent or near-equivalent
labels in a zone at the time of registration leads to many delegations.
This leads to records in parallel zones which MUST be synchronized. That
is, the owner of a registration bundle MUST keep the same information in the
zone for each label in the bundle.

Using the examples from section 6, assume that the owner of the label
"pale" and "pa1e" creates a subdomain, "www". If the owner of
"example.com" used multiple delegations for the labels, the owner of
"pale" and "pa1e" would use two records:

  $ORIGIN pale.example.com.
  www IN A 1.2.3.4

  $ORIGIN pa1e.example.com.
  www IN A 1.2.3.4

An alternative for these two records, which helps the registrant
keep their names in synch, would be:

  $ORIGIN pale.example.com.
  www IN A 1.2.3.4

  $ORIGIN pa1e.example.com.
  www IN CNAME www.pale.example.com.

If the owner of "example.com" used a DNAME

CNAME or DNAME?



record to make "pale" and
"pa1e" equivalent, the owner of "pale" and "pa1e" could instead use one
record:

There are lots of 'if' and 'would' here, suggesting that this is not necessarily a good thing to do. But looking at the result, CNAME is much easier to handle than anything else. Relying on an arbitrary registrant to keep their (potentially many) variants straight doesn't sound reliable. So I would propose that we make a strong recommendation for CNAME.


  $ORIGIN pale.example.com.
  www IN A 1.2.3.4


8. Security considerations


Apart from considerations listed in the IDNA specification, this
document explicitly talks about equivalences that a registry can define
as part of the policy which can be applied in a zone. A registry can
apply an equivalence table which solves some problems with homographs
already outlined in the security consideration section of IDNA. This
might be considered good for security because it will reduce the
possible confusion for the user, and lower the risk that the user will
"connect" to a service which was not intended.

This should mention a) the potential of security problems created by badly designed tables, and b) the potential for user confusion and related security problems created by different tables for different zones (e.g. in .com, certain equivalences are valid, and users get used to these, but then in .new, these equivalences are not used, and users can get spoofed).


Regards, Martin.