
Re: Bidi issues



 > Well, it increases the number of user-visible effects of the
 > segmentation mechanism.

Ok, I was working on the basis that the ACE encoded form is not
intended to be user-visible.
 > 
 > The model suggested by the IRI draft is that complex identifiers (like
 > IRIs and IMAs) be thought of as made up of components, which may in turn
 > be thought of as made up of smaller components, and so on, eventually
 > bottoming out.  The bidi restrictions are applied to the bottom-level
 > components.

You're right, I misremembered slightly what the IRI document said.
But in IRI it's a SHOULD, not a MUST, so it is essentially still my
option 3.

 > It's possible that there is a compelling reason to use a different
 > segmentation rule for the bidi check than for the encoding.

I'm not sure there is a compelling reason, but there are many other
possibilities.  Two that immediately spring to mind:

 * Segment only on dot.  This means that the rules for what is allowed
   on the LHS and RHS are the same.

 * Segment on _all_ punctuation, not just ASCII punctuation.
   (Specifically, segment on bidi categories ET, ES, CS and ON)
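
As a rough sketch, the two alternatives might look like this in
Python.  The function names are made up for illustration, and the
bidi classes come from whatever Unicode version Python's unicodedata
module ships with, so treat this as an approximation rather than a
proposed specification.

```python
import unicodedata

# Bidi classes named in the second alternative: ET, ES, CS and ON.
SEPARATOR_CLASSES = {"ET", "ES", "CS", "ON"}


def segment_on_dot(name):
    """First alternative: split only on U+002E FULL STOP."""
    return name.split(".")


def segment_on_punctuation(name):
    """Second alternative: split on every character whose bidi class
    is ET, ES, CS or ON (this covers non-ASCII punctuation too)."""
    segments, current = [], []
    for ch in name:
        if unicodedata.bidirectional(ch) in SEPARATOR_CLASSES:
            # Separator found: close the current segment.
            segments.append("".join(current))
            current = []
        else:
            current.append(ch)
    segments.append("".join(current))
    return segments
```

With the first rule, "A-123,456B.com" stays as two segments; with the
second it is cut at the hyphen, the comma and the dot.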

 > For the sake of argument, do you see a simple tweak to the bidi
 > restrictions that would improve these corner cases?  For example,
 > expanding LCat and/or RandALCat to include more bidi classes?

 > (I'm just talking about fixing the cases where an individual component
 > gets split apart.  For the cases where whole components get ordered
 > ambiguously, Martin explained why we're stuck with them.)

This could be fixed by disallowing labels such as 3com.  If LTR labels
were required to begin and end with a LTR character (as RTL labels
are) then most of these ugly cases would go away.  But whilst these
cases might cause confusion, they don't actually create any ambiguity,
so we can leave it up to users to avoid creating incredibly ugly
domain names.
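
To make the stricter rule concrete, here is a minimal sketch in
Python.  It assumes "LTR character" means LCat (stringprep table D.2),
and the function name is hypothetical; it is not part of any draft.

```python
import stringprep


def ltr_label_ok(label):
    """Hypothetical stricter rule: a label that contains any LCat
    character must also begin and end with an LCat character."""
    if not any(stringprep.in_table_d2(ch) for ch in label):
        # No LCat characters at all, so the rule does not apply.
        return True
    return (stringprep.in_table_d2(label[0])
            and stringprep.in_table_d2(label[-1]))
```

Under this rule "com" passes and "3com" is rejected, because its
first character is a digit rather than an LCat character.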

I think I have rules which would prevent the A-123,456B case and the
123.ABC.com case, but they're not very pretty.  I can tidy them up a
bit and post them if there's any interest.  I think it's probably
worth documenting these cases somewhere, since it would obviously be
foolish to attempt to use them in real life.

 > characters" and "right-to-left characters" without referring to any
 > specific bidi classes.  Are those phrases precisely defined anywhere?

I don't think so.  I presume it means LCat and RandALCat, in
stringprep speak.
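
If that reading is right, the two phrases can be tested mechanically.
The sketch below uses Python's stringprep module, where in_table_d1
checks RandALCat (table D.1) and in_table_d2 checks LCat (table D.2).
The classify function and its return values are my own invention for
illustration.

```python
import stringprep


def classify(component):
    """Report which directional characters a component contains,
    assuming LTR means LCat and RTL means RandALCat."""
    has_ltr = any(stringprep.in_table_d2(ch) for ch in component)
    has_rtl = any(stringprep.in_table_d1(ch) for ch in component)
    if has_ltr and has_rtl:
        return "mixed"
    if has_ltr:
        return "ltr"
    if has_rtl:
        return "rtl"
    # Neither LCat nor RandALCat characters, e.g. all digits.
    return "none"
```

For example, classify("com") gives "ltr", a label made of Hebrew
letters gives "rtl", and classify("123") gives "none".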

 > If we do this, then we should likewise add a warning to IDNA about
 > 123.ABC.com and ABC.123.com.

I think there should be a warning in IDNA.  Probably more than a
warning; domains such as these SHOULD NOT be used in contexts where
they need to be identified or manipulated by humans, IMHO.

But is there any scope for adding such a warning to IDNA in the near
future, now that the IDN WG has been disbanded?

 > As for the "other cases", do they all involve non-directional
 > components (components that contain no left-to-right characters and
 > no right-to-left characters)?

As far as domain names go, I think so.  As for e-mail addresses, it
really depends on what restrictions we end up with on the LHS, I
think, and whether they are mandatory or only recommendations.

	-roy