Re: First strawman for UTF-8 headers proposal
On Jan 2, 2004, at 5:20 PM, Martin Duerst wrote:
Hello Keith,
At 20:35 03/11/30 -0500, Keith Moore wrote:
One simple example. Bernstein and others have pointed out that it's
easier to parse header fields with address lists from the right to
the left rather than from the left to the right, because this
requires less lookahead. It's still possible to do this with UTF-8
(particularly if you do lexical analysis left-to-right and parsing
right-to-left), but it's probably not a trivial change to existing
code.
Can you give more details?
yes. when parsing ASCII you can look at one octet at a time. so when parsing
To: Martin Duerst <duerst@xxxxxx>
right to left the parser sees ">" then "g", then "r", etc. as soon as
the parser sees ">" it knows that this is a production of the form
[ phrase ] "<" addr-spec ">"
(forgive me for using 822 rather than 2822 - I have never memorized the
latter)
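A minimal sketch of that right-to-left recognition, for the ASCII-only case. The helper name trailing_angle_addr and its parameters are hypothetical, not taken from any existing parser. It assumes a single address with no comments, quoted strings, or address lists, all of which a real 822 parser has to handle:

```c
#include <stddef.h>

/* Hypothetical helper: scan a header field body from right to left.
 * If the last non-whitespace octet is '>', the field must be of the form
 *     [ phrase ] "<" addr-spec ">"
 * so keep scanning left for the matching '<'.
 * Returns 1 and sets *start / *len to the addr-spec, or 0 otherwise.
 * Assumption: no comments, quoted strings, or address lists. */
static int trailing_angle_addr(const char *buf, size_t n,
                               const char **start, size_t *len)
{
    const char *bufstart = buf;
    const char *ptr = buf + n;          /* one past the last octet */
    const char *end;

    /* Skip trailing whitespace, one octet at a time. */
    while (ptr > bufstart &&
           (ptr[-1] == ' ' || ptr[-1] == '\t' ||
            ptr[-1] == '\r' || ptr[-1] == '\n'))
        ptr--;

    /* The first octet seen decides which production applies. */
    if (ptr == bufstart || ptr[-1] != '>')
        return 0;

    end = ptr - 1;                      /* points at '>' */
    ptr = end;
    while (ptr > bufstart) {
        ptr--;
        if (*ptr == '<') {
            *start = ptr + 1;
            *len = (size_t)(end - ptr - 1);
            return 1;
        }
    }
    return 0;                           /* '>' without a matching '<' */
}
```

Every decision here is made on a single octet, which is the property the rest of this message is about.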
if you're parsing utf-8 then you can't look at one octet at a time -
you first have to parse octets into characters. you can do it, but
it's more of a pain - for instance, you have to do more checking for
boundary conditions. it's certainly not as simple as something like
    if (ptr <= bufstart)
        break;
    c = *ptr--;
i.e. it's not a trivial change to code written to assume that a
character has a fixed width and fits into a single octet.
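For comparison, here is a rough sketch of what the backward step can look like once a character may span several octets. The name utf8_prev and the choice to report malformed input as U+FFFD are assumptions made for illustration. The sketch does not reject overlong forms or surrogates, which a careful decoder would:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: ptr points one past the character to read.
 * Returns a pointer to the first octet of that character and stores its
 * code point in *cp, or returns NULL when ptr is already at bufstart.
 * Malformed input yields U+FFFD and consumes exactly one octet.
 * Assumption: overlong forms and surrogates are not rejected. */
static const unsigned char *utf8_prev(const unsigned char *bufstart,
                                      const unsigned char *ptr,
                                      uint32_t *cp)
{
    const unsigned char *p;
    uint32_t v = 0;
    int n = 0;                          /* continuation octets skipped */
    int need = -1;                      /* continuation octets the lead expects */
    int i;

    if (ptr <= bufstart)
        return NULL;

    p = ptr - 1;
    if (*p < 0x80) {                    /* ASCII: the old one-octet case */
        *cp = *p;
        return p;
    }

    /* Walk left over at most three continuation octets (10xxxxxx),
     * checking the buffer boundary at every step. */
    while (p > bufstart && (*p & 0xC0) == 0x80 && n < 3) {
        p--;
        n++;
    }

    /* Classify the candidate lead octet. */
    if ((*p & 0xE0) == 0xC0)      { need = 1; v = *p & 0x1F; }
    else if ((*p & 0xF0) == 0xE0) { need = 2; v = *p & 0x0F; }
    else if ((*p & 0xF8) == 0xF0) { need = 3; v = *p & 0x07; }

    if (need != n) {                    /* truncated or stray octets */
        *cp = 0xFFFD;
        return ptr - 1;
    }

    for (i = 1; i <= n; i++)
        v = (v << 6) | (uint32_t)(p[i] & 0x3F);
    *cp = v;
    return p;
}
```

The three-line loop body above turns into a boundary check per octet, a lead-octet classification, and an error path.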
As long as lexing or parsing treats anything
non-ascii the same, things shouldn't change at all (as long as the code
is 8-bit clean). If different non-ASCII characters have to lex or parse
differently, then you have to use tables, do some conversion, or do some
hand-coding with a byte-by-byte approach, and the complexity of this is
virtually the same whether you go one way or the other. If you already
have the UTF-8 forward code, then that's not trivial to change to
reverse scanning code. But if you only have ASCII, the changes to move
to UTF-8 are about the same for both directions, except that you
probably have a bigger chance to find already existing code that
goes forward.
uh, no. not even close. and experience with 2047 indicates that
people don't want to make large changes to their existing codebases.