Re: Implementing encoded-character
On Thu, 2007-04-05 at 21:10 +0000, Aaron Stone wrote:
> On Thu, Apr 5, 2007, Ned Freed <ned.freed@xxxxxxxxxxx> said:
>
> >> > > > "${unicode:200000}" -> error
> >> > > > "${unicode:2000000}" -> "${unicode:2000000}"
> >
> >> Ugh, if it looks like encoded-char and walks like encoded-char...
> >
> >> My test implementation left-shifts the current value of the encoded
> >> character, then adds the next hex digit. When it hits whitespace, it
> >> checks if the value is within appropriate bounds; if so, stores the
> >> character then loops, if not, stores '?' then loops. Would we really
> >> rather be very strict about this? I'm in favor of some flexibility.
> >
> > You need to strictly implement the grammar in the specification, whatever
> > that ends up being. Any flexibility will allow someone to write one of
> > these things that works in your implementation but silently fails and causes
> > weird results elsewhere.
> >
> > Past experience with RFC 2047 encoded-words has shown that allowing leeway in
> > these situations is a curse, not a blessing.
>
> Indeed, point taken!
It's not strict yet (I'll cross that bridge when we agree on where it is ;-),
it just translates the hex values to utf-8. And now, counting from 0-9 in
Western Arabic, Eastern Arabic and Amharic (thanks unicode.org/charts!)...
Converting [${unicode:30 31 32 33 34 35 36 37 38 39}]
to [0123456789] length 11
Converting [${unicode:06f0 06f1 06f2 06f3 06f4 06f5 06f6 06f7 06f8 06f9}]
to [۰۱۲۳۴۵۶۷۸۹] length 21
Converting [${unicode:1369 136a 136b 136c 136d 136e 136f 1370 1371 1372}]
to [፩፪፫፬፭፮፯፰፱፲] length 31
(Are there any number systems up in the four bytes per symbol ranges?)
If anybody would like to use my code, I'd be happy to make it available
without restriction. It's all of 100 lines, and most of the fun was
generating utf-8 by hand.
Aaron
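
For readers following the thread, here is a minimal sketch of the lenient
approach described above: shift the running value left by one hex digit, add
the next digit, and on whitespace either emit the character as UTF-8 or store
'?'. It is not the code offered in the message. The names encode_utf8,
is_valid and convert_hex_list are hypothetical, the validity range (0 to
10FFFF, excluding the surrogates D800 to DFFF) is an assumption about what
"appropriate bounds" means, and rejecting non-hex characters with an
exception is likewise a choice made for this sketch.

```python
def encode_utf8(cp):
    """Encode one code point as UTF-8 bytes by hand (no codec calls)."""
    if cp < 0x80:                      # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 4 bytes: 11110xxx then three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def is_valid(cp):
    """Assumed bounds: at most U+10FFFF and not a surrogate."""
    return cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def convert_hex_list(body):
    """Convert the text between '${unicode:' and '}' to UTF-8 bytes.

    Out-of-range values become '?', mirroring the lenient behaviour
    described in the thread rather than a strict grammar.
    """
    out = bytearray()
    value = 0
    in_number = False
    for ch in body + " ":              # trailing space flushes the last value
        if ch in "0123456789abcdefABCDEF":
            value = (value << 4) | int(ch, 16)   # shift left one hex digit, add
            in_number = True
        elif ch.isspace():
            if in_number:
                out += encode_utf8(value) if is_valid(value) else b"?"
            value = 0
            in_number = False
        else:
            raise ValueError("not a hex digit or whitespace: %r" % ch)
    return bytes(out)
```

Fed the three digit sequences from the message, this sketch produces 10, 20
and 30 bytes of UTF-8 respectively (one, two and three bytes per digit). A
strict implementation, as argued for earlier in the thread, would reject an
out-of-range value instead of substituting '?'.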