Re: Bidi issues
Roy Badami <roy@xxxxxxxxxxxxx> wrote:
> http://www.gnomon.org.uk/bidi-ambiguities.txt
>
> Not sure how useful it is, and it's definitely unfinished
I have now read it, and found it very helpful in building some intuition
about bidi issues. Thanks!
Here's an attempt to distill it down to a few simple easy-to-remember
rules of thumb. I am assuming that an identifier is a sequence of
components and delimiters, and that strong characters and numbers
can appear in components, but not between components. I am also
assuming that if the direction of the context of an identifier is
not standardized, it is at least known to the person looking at the
identifier.
Imagine that strong characters are opaque and all other characters are
transparent, so that we can talk about "visibility".
1) All ambiguities involve numbers. If a component contains no
numbers, then don't worry about it.
2) If a number can see the beginning or end of a component, that is
asking for trouble, unless you know what will/won't appear in the
surrounding components and can therefore apply rules 3 and 4.
3) If a number can see both strong LTR and strong RTL characters, that
is asking for trouble.
4) If a number can see class R before itself, and if its field of view
also contains number separators, that is asking for trouble. In
Unicode 3.2 class R is just the Hebrew letters/ligatures/punctuation
and the right-to-left mark. The six number separators are / , . :
no_break_space arabic_comma (and their compatibility equivalents,
which are removed by NFKC).
5) If an identifier is not asking for trouble as described in 2-4,
then it has no bidi ambiguities. Otherwise, it might or might not
be ambiguous; more complex scrutiny would be needed to make the
determination.
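For anyone who wants to experiment, here is a rough Python sketch of
rules 1-5 applied to a single component. It is only an illustration
under some assumptions of mine: "number" is taken to mean bidi class EN
or AN, "strong" means class L, R, or AL, the six separators are
hard-coded from the list in rule 4, and the function name trouble_rules
is made up. It uses Python's current Unicode tables, not Unicode 3.2,
and it cannot look into surrounding components, so rule 2 is reported
whenever a number can see either edge.

```python
import unicodedata

STRONG_LTR = {"L"}
STRONG_RTL = {"R", "AL"}
STRONG = STRONG_LTR | STRONG_RTL
NUMBER = {"EN", "AN"}  # assumption: "number" means European or Arabic digits
# The six separators listed in rule 4: / , . : no-break space, Arabic comma.
SEPARATORS = set("/,.:\u00a0\u060c")


def trouble_rules(component):
    """Return the subset of {2, 3, 4} naming the rules the component trips.

    An empty set means "no bidi ambiguity" in the sense of rule 5.
    A component with no numbers always yields an empty set (rule 1).
    """
    classes = [unicodedata.bidirectional(ch) for ch in component]
    hits = set()
    for i, cls in enumerate(classes):
        if cls not in NUMBER:
            continue
        # Scan outward until an opaque (strong) character blocks the view.
        left = i - 1
        while left >= 0 and classes[left] not in STRONG:
            left -= 1
        right = i + 1
        while right < len(classes) and classes[right] not in STRONG:
            right += 1
        # Rule 2: the number can see the beginning or end of the component.
        if left < 0 or right >= len(classes):
            hits.add(2)
        # Rule 3: the number can see both strong LTR and strong RTL.
        seen = {classes[j] for j in (left, right) if 0 <= j < len(classes)}
        if seen & STRONG_LTR and seen & STRONG_RTL:
            hits.add(3)
        # Rule 4: class R before the number, and a separator in view.
        view = component[left + 1:right]
        if left >= 0 and classes[left] == "R" and any(c in SEPARATORS for c in view):
            hits.add(4)
    return hits
```

For example, trouble_rules("a1b") gives an empty set, while
trouble_rules("a1") gives {2} because the digit can see the end of the
component.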
Did I make any mistakes? (It's quite likely.)
This raises a question about the Stringprep bidi check, which says:
* The string must not contain both strong LTR and strong RTL
characters.
* The string that contains a strong RTL character must begin and end
with a strong RTL character.
This is not completely effective at avoiding ambiguity because there
were other constraints (simplicity and backward compatibility). But if
avoiding ambiguity had been the only goal, then I think the following
much less restrictive (and simpler) check would have been as effective:
* The string must not contain all three of: strong LTR characters,
strong RTL characters, and number characters.
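To make the comparison concrete, here is a small Python sketch of the
two checks side by side. It covers only the two Stringprep bullets as
paraphrased above, not the rest of the RFC 3454 bidi section. The
function names are mine, and treating "number" as bidi class EN or AN
is my assumption. The stringprep module's D.1 and D.2 tables and
unicodedata.ucd_3_2_0 are standard-library Python and use Unicode 3.2
data.

```python
import stringprep
import unicodedata


def has_strong_rtl(s):
    # Table D.1: characters with bidi class R or AL.
    return any(stringprep.in_table_d1(ch) for ch in s)


def has_strong_ltr(s):
    # Table D.2: characters with bidi class L.
    return any(stringprep.in_table_d2(ch) for ch in s)


def has_number(s):
    # Assumption: "number" means bidi class EN or AN in Unicode 3.2.
    ucd = unicodedata.ucd_3_2_0
    return any(ucd.bidirectional(ch) in ("EN", "AN") for ch in s)


def stringprep_bidi_ok(s):
    """The two Stringprep bullets as paraphrased in this message."""
    rtl = has_strong_rtl(s)
    if rtl and has_strong_ltr(s):
        return False
    if rtl:
        # Must begin and end with a strong RTL character.
        return stringprep.in_table_d1(s[0]) and stringprep.in_table_d1(s[-1])
    return True


def proposed_bidi_ok(s):
    """The looser check: reject only if all three kinds are present."""
    return not (has_strong_ltr(s) and has_strong_rtl(s) and has_number(s))
```

With these definitions, a string such as "abc" followed by a Hebrew
letter fails stringprep_bidi_ok but passes proposed_bidi_ok, since it
contains no numbers.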
I'm still assuming that the direction of the context is known to the
person looking at the string. If we drop that assumption, would that
motivate the more restrictive rules of the Stringprep bidi check?
Another possible motivation might be an unstated second goal, like
avoiding having a component get displayed in disjoint pieces when part
of a larger identifier.
AMC