Re: Bidi issues
Sorry for the much delayed response...
I think I am starting to get a handle on at least part of the
STRINGPREP bidi restrictions. (Though with the disclaimer that I'm no
bidi expert, so the examples below could be wrong.)
Assume for the moment that it's taken as a given that the address is
rendered by the bidi algorithm in an LTR context. (I think this
assumption must have been present in IDNA, otherwise the bidi
restrictions as they stand wouldn't make sense; unfortunately it's not
stated explicitly in the RFCs.)
If you allow labels to mix LTR and RTL characters, you get effects
such as the following:
abCD.EFgh.com (logical order)
abFE.DCgh.com (display order)
Note that CD.EF has been recognized by bidi as a single run of RTL
characters, and rendered accordingly. The result is that you've lost
the integrity of a label. The text of a label can be split across a
separator, and can abut the text of another label. I think it was
reasonable for the authors of IDNA to conclude that this is just too
Now, if you allow mixing of LTR and RTL text in the localpart, then
you end up allowing the localpart to be split across the @ sign:
owner-LIST@xxxxxxxxxx (logical order)
owner-NIAMOD@xxxxxxxx (display order)
or, even worse, of course:
abcDEF@xxxxxxxxxx (logical order)
abcNIAMOD@xxxxxxx (display order)
The latter is clearly bad; is the former OK, or does it need to be
avoided? I.e., should the localpart always render contiguously?
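
The reorderings above can be reproduced with a toy model of the bidi
algorithm. This is only a sketch under loud assumptions: uppercase ASCII
letters stand in for RTL characters and lowercase for LTR (the same
convention as the examples), everything else is treated as a neutral, the
base direction is LTR, and digits, explicit embeddings and mirroring are
ignored. The function name toy_display_order is made up for illustration,
and the second address is hypothetical (its domain is not taken from
anything above).

```python
def toy_display_order(logical: str) -> str:
    """Toy LTR-context bidi reordering: uppercase = RTL, lowercase = LTR."""

    def strong(ch):
        # Strong direction of a character under the toy convention,
        # or None for neutrals such as '.', '-' and '@'.
        if ch.isalpha():
            return "R" if ch.isupper() else "L"
        return None

    n = len(logical)
    dirs = [strong(ch) for ch in logical]
    resolved = list(dirs)

    # Resolve neutrals: a run of neutrals becomes RTL only when the strong
    # characters on both sides are RTL; otherwise it takes the base
    # direction (LTR). The text boundaries count as LTR.
    i = 0
    while i < n:
        if dirs[i] is None:
            j = i
            while j < n and dirs[j] is None:
                j += 1
            before = dirs[i - 1] if i > 0 else "L"
            after = dirs[j] if j < n else "L"
            fill = "R" if before == after == "R" else "L"
            for k in range(i, j):
                resolved[k] = fill
            i = j
        else:
            i += 1

    # Reverse every maximal RTL run; LTR runs are left as they are.
    out = []
    i = 0
    while i < n:
        j = i
        while j < n and resolved[j] == resolved[i]:
            j += 1
        chunk = logical[i:j]
        out.append(chunk[::-1] if resolved[i] == "R" else chunk)
        i = j
    return "".join(out)


print(toy_display_order("abCD.EFgh.com"))              # abFE.DCgh.com
print(toy_display_order("owner-LIST@DOMAIN.example"))  # hypothetical address
```

In this model the "@" between two RTL runs is absorbed into a single RTL
run, which is exactly why the localpart and the domain swap sides in the
display order.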
Requiring the localpart to be contiguous would mean either disallowing
owner-LIST or changing the display model to something other than
simple application of the bidi algorithm. (And the IRI folks seem to
have already decided against the latter approach.)
Adam M. Costello wrote:
> > If we don't want the bidi check to apply to the whole local part, but
> > rather to individual segments, then I think we'll need to break Nameprep
> > into two halves. The order of processing would be:
> > Nameprep first half (mapping, normalization, prohibition)
> > Segmentation
> > Nameprep second half (bidi check) applied to each segment
> The prohibition step works just as well before or after the
> segmentation, and it seems cleaner to put it after:
I think this is the wrong place to approach the problem, though it
might end up being the right solution.
The problem with this is that it elevates the segmentation mechanism
from an internal mechanism of ToASCII/ToUNICODE to a user-visible
feature of IMAs.
I think there are three basic choices here (from most restrictive to
least restrictive):
1. Apply the stringprep bidi check to the entire localpart.
2. Apply a different (presumably less restrictive) bidi check to the
3. Use stringprep without the bidi restrictions (nameprep would still
prohibit the directional formatting codes, though). Make a
recommendation that care be taken when mixing LTR and RTL
characters in e-mail addresses intended for human use.
Option 1 is the status quo in the draft. If option 2 were chosen, we
may well end up with what Adam proposes, but I don't think the nature
of the bidi restrictions should be driven by the current design of the
ToASCII algorithm. Option 3 is, as I understand it, essentially what
IRI currently does. Nameprep already
I'm not sure I have a strong opinion as to what is the right thing to do.
Option 1 worries me a bit, because if legacy software is going to
construct addresses of the form
indiscriminately, that's an open invitation to MUA authors to violate
the standard and produce a non-conformant ToUnicode in an attempt to be
'helpful' to users by displaying such domains in a comprehensible form.
Option 2 has a number of possibilities that could be debated. But note
that applying the stringprep bidi restrictions to individual segments
is not strictly less restrictive than option 1. Consider, for
instance, the localpart A1-2B (where A and B are RTL characters). This
is valid under option 1 (and indeed as a domain label under IDNA).
The entire localpart satisfies the stringprep rules, but the
individual segments don't (because they don't each begin and end with
an RTL character).
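
To make the A1-2B case concrete, here is a rough sketch of the two
RandALCat rules from the stringprep bidi check, applied first to the
whole localpart and then to each segment. Several things here are
assumptions: the bidi classes come from Python's unicodedata (a newer
Unicode version than the tables stringprep is based on), "R"/"AL" and "L"
are used as stand-ins for the RandALCat and LCat tables, the separate
prohibition of directional formatting characters is left out, and
segments are assumed to be split on "-". The name passes_bidi_check is
made up for illustration.

```python
import unicodedata

RANDAL = ("R", "AL")  # stand-in for the stringprep RandALCat table


def passes_bidi_check(text: str) -> bool:
    """Approximate stringprep bidi rules for a single string."""
    classes = [unicodedata.bidirectional(ch) for ch in text]

    # Strings without any RTL character are not constrained by these rules.
    if not any(c in RANDAL for c in classes):
        return True

    # A string containing an RTL character must not contain any strongly
    # LTR character.
    if "L" in classes:
        return False

    # A string containing an RTL character must begin and end with one.
    return classes[0] in RANDAL and classes[-1] in RANDAL


# A and B are played by two Hebrew letters (alef and bet).
ALEF, BET = "\u05d0", "\u05d1"
localpart = f"{ALEF}1-2{BET}"

# Assumption: segments are delimited by "-".
segments = localpart.split("-")

print(passes_bidi_check(localpart))                # True
print([passes_bidi_check(s) for s in segments])    # [False, False]
```

The whole string passes because it starts and ends with an RTL letter and
contains no LTR letters, while "A1" ends with a digit and "2B" begins with
one.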